Tuesday, July 13, 2010

Where did NoSQL Come From?

Relational databases have reigned for so long that it is truly amazing to see a whole new class of databases emerging. These so-called NoSQL databases are decidedly non-relational in almost every regard: architecture, schema capabilities, APIs, and even support for ACID transactions. (Carlo Strozzi first used the term NoSQL over 10 years ago to describe a non-relational database. The term was recently recycled to describe an emerging class of distributed database systems. Because absence of SQL is not really a requirement, Strozzi claims the term NoREL would be more apt [though not as catchy].) Given the maturity of today’s open source and commercial relational databases, why do we need a new kind of database? What motivated these new databases and what can they do for us? In this post, I’ll examine these questions from the perspective of a long-time database developer.

One of the first computers I used (late 70’s) was a Burroughs B5700 mainframe. It supported a first-generation database system called DMS, which was really just an indexed sequential access method (ISAM) system, barely a database. When Burroughs released the B6700 mainframe, it replaced DMS with a hybrid network/hierarchical system called DMSII. (Similar databases also appeared on mainframes from IBM, Univac, Honeywell, and others.) The network features provided unidirectional pointers between records; hierarchical features allowed a master record to have related child records. Coupled with ACID transactions, these early mainframe databases were the first heavy lifters of large data sets. (Of course, back then we considered a gigabyte to be a lot of data. Today, my phone has 16GB of storage!)

Through the mid-80’s, I wrote a lot of applications for DMSII databases. Even with nothing to compare them to, I considered these early databases tedious to program, largely due to the navigational access model: the unit of access was a record, and you could only fetch or update one record at a time. Complex operations such as producing a bill of materials (also called “parts explosion”) required a deep knowledge of cursors, indexes, and efficient search techniques. And even simple schema changes often broke existing applications. Back then, database programming was painstaking work.

As the need for greater flexibility and programmability became apparent, several new database models began to receive attention in the 80’s. For several years, I helped develop a mainframe database system that used the semantic network model. Later I was project leader for a relational/object database system for Windows and Unix. But the relational model became the clear winner, largely due to the power of SQL. I’ve spent over a decade now writing applications for various relational databases.

SQL not only elevates programmers from navigational to set-based access, but SQL query optimizers largely free programmers from the details of navigating tables and indexes. In fact, mature optimizers can choose sophisticated execution strategies that most programmers would never consider. Today we have a large set of relational database choices supported by a huge ecosystem of tools, applications, and expertise.
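To make the navigational versus set-based contrast concrete, here is a minimal sketch of the “parts explosion” mentioned earlier, written both ways. It uses Python’s built-in sqlite3 module only because it is easy to run; the bom table, its columns, and the sample parts are assumptions made up for illustration, and the record-at-a-time version merely imitates what navigational code on a system like DMSII felt like.

```python
import sqlite3

# Hypothetical schema: each row says `parent` needs `qty` units of `child`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bom (parent TEXT, child TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO bom VALUES (?, ?, ?)",
    [
        ("bike", "wheel", 2),
        ("bike", "frame", 1),
        ("wheel", "spoke", 32),
        ("wheel", "rim", 1),
    ],
)


def explode_navigational(part, multiplier=1, totals=None):
    """Record-at-a-time style: the program decides how to walk the data."""
    if totals is None:
        totals = {}
    rows = conn.execute(
        "SELECT child, qty FROM bom WHERE parent = ?", (part,)
    ).fetchall()
    for child, qty in rows:
        # Accumulate the quantity, then descend into the child's own parts.
        totals[child] = totals.get(child, 0) + qty * multiplier
        explode_navigational(child, qty * multiplier, totals)
    return totals


def explode_set_based(part):
    """Set-based style: one declarative query; the engine picks the plan."""
    rows = conn.execute(
        """
        WITH RECURSIVE parts(child, total) AS (
            SELECT child, qty FROM bom WHERE parent = ?
            UNION ALL
            SELECT b.child, b.qty * p.total
            FROM bom AS b JOIN parts AS p ON b.parent = p.child
        )
        SELECT child, SUM(total) FROM parts GROUP BY child
        """,
        (part,),
    ).fetchall()
    return dict(rows)
```

Both functions return the same totals for explode("bike"), but only the first one hard-codes the traversal order. In the second, the database is free to choose indexes and join strategies on its own.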

So, are these NoSQL databases an evolution of relational databases? Not exactly. NoSQL databases are mostly brand new, developed from scratch. Are they the next database generation, poised to toss relational technology from its throne? No way. They typically support low-level programming interfaces, they provide little in the way of schema evolution, they offer no referential integrity, and they don’t have the ecosystem of relational databases. OK, so what prompted the NoSQL movement and what do these new kids on the block offer?

To understand where the NoSQL movement came from, imagine what you might do if you started and grew a modern Web-centric company:

  • At first, you’re running your new web site out of your dorm room, so you put your data in a MySQL database and run a web server on the same box.
  • When the machine can no longer support the load, you move the web server to its own machine. Then you deploy multiple web servers with a load-balancing front end.
  • Eventually, the database becomes a bottleneck, so you use a distributed caching tool such as Memcached, and you upgrade to bigger boxes.
  • But even on a big machine, your single database instance runs out of gas, so you start partitioning your data into multiple database instances. Your front-end diverts queries to the appropriate shard as needed.
  • Needing yet more performance, you begin to limit your application’s SQL queries to the fastest operations: small transactions, fetch-by-primary-key only, no joins, store multiple values in a single Blob, etc. In other words, you start to use your SQL database mostly as a key/value store.
  • Even with sharding, contention for each shard eventually becomes a bottleneck. So, you use replication to store each record on multiple nodes, yielding greater concurrency, and you develop sophisticated techniques for fault tolerance and maintaining consistency between copies.

At this point, you might realize that you’re not using the relational database for most of the features for which it was designed: SQL, ACID transactions, query optimization, etc. Furthermore, things like complex replication, dynamic expandability, and consistency without expensive locking are not easy with modern relational databases. If you deploy hundreds of database nodes, you soon learn about new problems such as network partition failures. In this new world, relational databases are increasingly unsuitable.
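The last few steps of that progression (sharding, fetch-by-key with values packed into a blob, and storing each record on more than one node) can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular product works: the ShardedStore class, its method names, and the use of in-memory dictionaries in place of real database instances are all hypothetical.

```python
import hashlib
import json


class ShardedStore:
    """Toy key/value store spread over several shards (plain dicts here)."""

    def __init__(self, num_shards=4, replicas=2):
        if not 1 <= replicas <= num_shards:
            raise ValueError("replicas must be between 1 and num_shards")
        self.shards = [{} for _ in range(num_shards)]
        self.replicas = replicas

    def _shard_ids(self, key):
        # Use a stable hash so every front end routes a key the same way.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.shards)
        # Replicas live on the next shards in sequence, wrapping around.
        return [(first + i) % len(self.shards) for i in range(self.replicas)]

    def put(self, key, record):
        # Pack the whole record into one blob, as in the key/value step.
        blob = json.dumps(record)
        for sid in self._shard_ids(key):
            self.shards[sid][key] = blob

    def get(self, key):
        # Try each replica in turn; a missing copy falls through to the next.
        for sid in self._shard_ids(key):
            blob = self.shards[sid].get(key)
            if blob is not None:
                return json.loads(blob)
        return None
```

For example, after store = ShardedStore() and store.put("user:42", {"name": "Ada"}), a later store.get("user:42") still succeeds if one of the two copies is lost. Notice what the sketch leaves out: nothing keeps the copies consistent when a write reaches only one of them, and because routing is a simple modulo, changing the number of shards remaps most keys. Those gaps are exactly the replication and dynamic-expansion problems described above.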

Scenarios analogous to this one have occurred at numerous web companies over the last 10 years. The need for massive scalability, dynamic expansion, lower cost, and new fault tolerance models drove web-centric companies to find new solutions for their unique problems. Amazon developed Dynamo, Google developed BigTable, LinkedIn developed Voldemort, Facebook developed Cassandra, and so forth. The NoSQL movement was born.

It’s exciting to see the Web spawn a whole new class of database technology, despite the presence of the highly mature relational database field. But for those of us who don’t have the scale and access problems of a web-centric company, what can NoSQL databases offer?

In my next post, I’ll explore some use cases where NoSQL databases can complement relational databases and perhaps open up whole new application possibilities.
