Wednesday, July 21, 2010

Why Use a NoSQL Database?

NoSQL databases were born mostly from web-centric companies needing to solve problems out of the reach of modern relational databases. For those of us who are not running web-centric companies, are these NoSQL databases useful? I think they could be. Although the problems that motivated the NoSQL database movement may be quite different from ours, these databases offer benefits that may be a good fit for several situations. From my perspective, there are three features of NoSQL databases that are attractive for internal and even commercial applications:

  • Cost: Most of the NoSQL databases are open source and can be used freely. Moreover, they are designed to run on commodity x86 hardware with locally-attached disk. Compared to scaling a modern commercial relational database, which typically requires expensive "big iron" boxes, SAN storage, and other expensive components, NoSQL databases can manage a large set of data at a fraction of the hardware and software costs.
  • Performance: Because of their special purpose, NoSQL databases tend to be "lean and mean", often supporting read/write rates an order of magnitude faster or more than relational databases. Many NoSQL databases can be classified as "persistent hash tables", optimized for performance and availability at the expense of other features.
  • Availability: NoSQL databases use replication and sophisticated fault tolerance algorithms such as Phi Accrual Failure Detection to maximize availability when a node fails. Many NoSQL databases allow new nodes to be dynamically added, after which data--and work--are automatically redistributed. Since field/column semantics are enforced by applications, NoSQL databases don't require table-level schema changes. Even upgrades to new software versions are typically supported via "rolling restarts", which keep the overall database available.
Before you get too excited about a new class of database that is cheap, fast, and highly available, be aware that, compared to relational databases, you're going to give up a lot. Keeping in mind that features of each NoSQL database vary and that this new genre is evolving quickly, consider some things we take for granted from relational databases but we'll mostly find absent with NoSQL databases:

  • SQL: Some NoSQL databases provide a subset of SQL or an SQL-like query language, but these languages are very limited compared to full ANSI SQL. NoSQL databases favor small update transactions and quick queries with little or no support for ACID transactions (see below). Many NoSQL databases provide a way to use MapReduce-style processing for data analysis, but this bulk scan approach facilitates long-running, read-only applications.
  • Data integrity: NoSQL databases don't have a detailed schema - you declare only the major data pools (tables) that will be used. Each row can have a large, variable number of columns, each storing any type of value. NoSQLs have no foreign key support or referential integrity, and they don't have triggers or similar features that help enforce data rules. Instead, you must enforce all data integrity rules within your applications.
  • ACID transactions: Because of their eventual consistency model, NoSQL databases have little or no locking, which means you can't perform multi-record, atomic updates. (Some NoSQL databases allow a single transaction to atomically update a limited number of records in the same partition, but not across partitions.) As a result, updates require strategies such as application-level locking, "soft" transactions, and idempotent algorithms (a subject for another post).
  • Stability: Most of these NoSQL databases are shipping an 0.x release, meaning they haven't yet reached 1.0. They are essentially experimental, which means lots of things: they aren't as stable as production databases, upgrade compatibility isn't guaranteed, and, well, they crash more often than commercial databases. Among other things, this means they're not ready for mission-critical, production databases.
  • An ecosystem: Similarly, because NoSQL databases are so new, you won't find a rich array of tools, add-ons, and even consulting expertise for them. (Though companies such as Riptano are starting to emerge for professional support.) As with all new technologies, it will take time for this field to mature.
Despite their current limitations, NoSQL databases may be a suitable match for certain applications. Especially with the continued decline in storage costs and our end users' growing appetite for Google-like access to information, we are increasingly compelled to make more data available at sub-second speed. This makes the speed and availability factors of NoSQL enticing. Here are some example applications that I think might be good candidates:

  1. Capture and report: Consider applications that capture data from their environment such as events or log records and then provide various forms of reporting. For these applications, records are typically stored once and not updated thereafter. Their challenge is making an ever-expanding data pool accessible for online searching and data analysis. Using a NoSQL database for storage may enable these apps to store more data, more quickly, and more cheaply than with a relational database.
  2. Distributed storage: Today, we would typically use a file system to manage a large number of unstructured objects: documents, images, logs, etc. A large SAN with RAID is one option, but as an alternate solution, a NoSQL database's sharding and replication features may provide dynamically expandable storage at lower cost.
  3. Single updater/copious readers: NoSQL databases were born from social web applications such as Facebook and LinkedIn. These applications experience high update volumes (and even higher fetch volumes), but typically any given object is updated by a single user. In other words, ACID transactions are not a strong requirement while massive scalability and availability are. Correspondingly, collaboration and similar applications that store a large number of objects that are seldom updated by more than one user at a time are also good fits for NoSQL databases.
  4. Hybrid scenarios: Some relational database applications manage a large number of tables, but only a few tables experience large growth. Often, the number of records in the high-growth tables is limited by the performance that can be reasonably provided. That is, records are purged or off-loaded above a certain threshold to avoid response degradation. NoSQL databases offer a new choice where off-loaded records can be stored. In some cases, it may make sense to move the high-volume tables entirely to a NoSQL database, allowing the application to use a hybrid storage approach.
Most companies have a wide range of applications that warrant a database for persistent storage. For a vast majority of these applications, relational databases will continue to be the right solution. Any application whose data fits on a single computer or requires complex transaction processing has little reason to consider a NoSQL database. But for many applications such as the examples above, a NoSQL database may be a good fit. And, as the NoSQL field continues to expand and mature, the range of suitable applications will undoubtedly expand.

Tuesday, July 13, 2010

Where did NoSQL Come From?

Relational databases have reigned for so long that it is truly amazing to see a whole new class of databases emerging. These so-called NoSQL databases are decidedly non-relational in almost every regard: architecture, schema capabilities, APIs ­– even support for ACID transactions. (Carlo Strozzi first used the term NoSQL over 10 years ago to describe a non-relational database. The term was recently recycled to describe an emerging class of distributed database systems. Because absence of SQL is not really a requirement, Strozzi claims the term NoREL would be more apt [though not as catchy].) Given the maturity of today’s open source and commercial relational databases, why do we need a new kind of database? What motivated these new databases and what can they do for us? In this post, I’ll examine these questions from the perspective of a long-time database developer.

One of the first computers I used (late 70’s) was a Burroughs B5700 mainframe. It supported a first-generation database system called DMS, which was really just an index-sequential access method (ISAM) system–barely a database. When Burroughs released the B6700 mainframe, it replaced DMS with a hybrid network/hierarchical system called DMSII. (Similar databases also appeared on mainframes from IBM, Univac, Honeywell, and others.) The network features provided unidirectional pointers between records; hierarchical features allowed a master record to have related child records. Coupled with ACID transactions, these early mainframe databases were the first heavy lifters of large data sets. (Of course, back then we considered a gigabyte to be a lot of data. Today, my phone has 16GB of storage!)

Through the mid-80’s, I wrote a lot of applications for DMSII databases. Even without something for comparison, I considered these early databases tedious to program, largely due to the navigational access model: the unit of access was a record, and you could only fetch or update one record at a time. Complex operations such as producing a bill of materials (also called “parts explosion”) required a deep knowledge of cursors, indexes, and efficient search techniques. And, even simple schema changes often broke existing applications. Back then, database programming was tedious work.

As the need for greater flexibility and programmability became apparent, several new database models began to receive attention in the 80’s. For several years, I helped develop a mainframe database system that used the semantic network model. Later I was project leader for a relational/object database system for Windows and Unix. But the relational model became the clear winner, largely due to the power of SQL. I’ve spent over a decade now writing applications for various relational databases.

SQL not only elevates programmers from navigational to set-based access, but SQL query optimizers largely free programmers from the details of navigating tables and indexes. In fact, mature optimizers can choose sophisticated execution strategies that most programmers would never consider. Today we have a large set of relational database choices supported by a huge ecosystem of tools, applications, and expertise.

So, are these NoSQL databases an evolution of relational databases? Not exactly. NoSQL databases are mostly brand new, developed from scratch. Are they the next database generation, poised to toss relational technology from its throne? No way. They typically support low-level programming interfaces, they provide little in the way of schema evolution, they offer no referential integrity, and they don’t have the ecosystem of relational databases. OK, so what prompted the NoSQL movement and what do these new kids on the block offer?

To understand where the NoSQL movement came from, imagine what you might do if you started and grew a modern Web-centric company:

  • At first, you’re running your new web site out of your dorm room, so you put your data in a MySQL database and run a web server on the same box.
  • When the machine can no longer support the load, you move the web server to its own machine. Then you deploy multiple web servers with a load-balancing front end.
  • Eventually, the database becomes a bottleneck, so you use a distributed caching tool such as Memcached, and you upgrade to bigger boxes.
  • But even on a big machine, your single database instance runs out of gas, so you start partitioning your data into multiple database instances. Your front-end diverts queries to the appropriate shard as needed.
  • Needing yet more performance, you begin to limit your application’s SQL queries to the fastest operations: small transactions, fetch-by-primary-key only, no joins, store multiple values in a single Blob, etc. In other words, you start to use your SQL database mostly as a key/value store.
  • Even with sharding, contention for each shard eventually becomes a bottleneck. So, you use replication to store each record on multiple nodes, yielding greater concurrency, and you develop sophisticated techniques for fault tolerance and maintaining consistency between copies.
At this point, you might realize that you’re not using the relational database for most of the features for which it was designed: SQL, ACID transactions, query optimization, etc. Furthermore, things like complex replication, dynamic expandability, and consistency without expensive locking are not easy with modern relational databases. If you deploy hundreds of database nodes, you soon learn about new problems such as network partition failures. In this new world, relational databases are increasingly unsuitable.

Scenarios analogous to this one have occurred at numerous web companies over the last 10 years. The need for massive scalability, dynamic expansion, lower cost, and new fault tolerance models drove web-centric companies to find new solutions for their unique problems. Amazon developed Dynamo, Google developed BigTable, LinkedIn developed Voldemort, Facebook developed Cassandra, and so forth. The NoSQL movement was born.

It’s exciting to see the Web spawn a whole new class of database technology, despite the presence of the highly mature relational database field. But for those of us who don’t have the scale and access problems of a web-centric company, what can NoSQL databases offer?

In my next post, I’ll explore some use cases where NoSQL databases can complement relational databases and perhaps open-up whole new application possibilities.

Welcome to Guck Tales!

Welcome to my blog site, Guck Tales! This is a software technology blog, intended to chronicle things I find interesting as I research, use, and develop computer software technologies. See my profile for more details about me, but here's a quick background: I've been working with computers and developing software for over 35 years. For reasons unknown to me, I seem to always be involved with databases, either developing DBMS systems or writing database applications. I've worked on many client/server, peer-to-peer, and high performance transaction processing applications, poking every technology I can find with a stick to see if it bites. I love this field. If I hadn't landed in computers, I'd probably be programming on the side anyway. I pinch myself all the time thinking: "They pay me to do this!"

I probably should have started blogging long ago. Not that I'm a great writer, but I love to share things I find: cool tools, new technologies, insights into solving hard problems. What made me decide to finally start a software technology blog is my current area of research: NoSQL databases. What are these? See the next blog entries! I'll start writing about my experiments and findings with this new breed of database, hopefully providing some valuable links and tips to others who may need to walk a similar path. As I stray into other areas of software development, I'll try to chronicle interesting things in other areas as well. If you follow this blog, I hope you find something useful.

Welcome!