Thursday, November 4, 2010

Idempotent Updates with Cassandra

One of the things we don’t get with NoSQL databases is ACID transactions. We get Durability, which means data is persistently stored, but we don’t get:
  • Atomicity: If we update multiple records in a single transaction, there’s no guarantee that all or none of them will be applied when a failure occurs.
  • Consistency: NoSQL databases don’t use locking or data integrity rules to guarantee that every transaction moves the database from one consistent state to another.
  • Isolation: There’s no guarantee that the records updated on one transaction won’t be read–or even updated–by another transaction before the first transaction is complete.
Some NoSQL databases offer some A, C, or I features in a limited context such as for “single partition” updates. We could emulate ACID features by supplementing the database with a synchronization framework such as ZooKeeper, but this reduces performance and scalability. If we’re using a NoSQL database, it’s no doubt largely because we want its high throughput. So, we have to design our updates to be compatible with its eventually consistent model.

With Cassandra, the closest we get to a transaction is a “batch mutate”, which allows us to submit multiple updates in one request. But while those updates are being applied, they can be immediately accessed by other applications. If our application (or Cassandra) crashes, there’s no guarantee about which updates in the batch will succeed. Without ACID transactions, how do we prevent conflicts between applications? How do we recover if a crash occurs?

One strategy is to design updates to be idempotent, which means that applying the same update twice produces the same results as applying the update once. In other words, if an update is successfully processed once, applying the same update again is essentially a no-op. As we’ll see in a moment, this strategy not only simplifies restarts after a crash, it can prevent conflicts between multiple updaters. But given our long dependency on ACID transactions, it might not be obvious how to do this.

To start, we must shed some baggage from SQL programming. First, consider that in Cassandra there is no distinction between inserting a new row and modifying an existing row; we merely update a row. Second, each row can have an arbitrary number of columns, and we can add new columns to an existing row at any time. Here’s an example: for a specific table (which Cassandra calls a ColumnFamily), a typical update looks something like this:

    “Update the row with key k with columns: C1=V1, C2=V2, …”

Cassandra interprets this update as follows:
  • If no row exists in the specified table with the key k, a new row is created with the specified column names and values.
  • If a row already exists with key k, its columns are updated as follows:
            o If a given column Cn does not exist, it is added to the row with value Vn.
            o If column Cn already exists, its value is replaced value Vn.

So, the first time we say “update the row with key k…”, the row is created. But if we process the same update again, since the row already exists and no columns are actually changed, nothing happens. (What happens under the hood is not so simple, but this is effectively what an application sees.) Batch mutates can also delete columns and remove rows and, as we’d expect, deleting a column or row that is already deleted has no effect.

If every batch merely created a single row, it’s pretty obvious that if our application (or Cassandra itself) crashed, we could just figure out where we left off and reprocess the most recent batches. But of course, not all batches are this simple. What happens, for example, if batches from separate applications need to update the same row?

Let’s take a simple example. Suppose we have a database that captures log records and maintains an index of values found within specific log fields. Say we want to easily find all log records with a given value for “User” or “Request” fields. A reasonable approach is to create one table to store log records and a second table as a field-specific index into the log records. For example:


ColumnFamily: LogRecords
Key (timestamp)
Columns
1265040242124
User=Bob
Request=1
1265040242678
User=Mary
Request=1
1265043781000
User=Bob
Request=5



ColumnFamily: Fields (problematic!)
Key (field)
Columns
User
Bob=[1265040242124,1265043781000]
Mary=[1265040242678]
Request
1=[1265040242124,1265040242678]
5=[1265043781000]


In this approach, each log record is stored as a row in LogRecords using its timestamp (in milliseconds) as the key and its fields stored as individual columns. For each field we want to index, we create or update a row within Fields whose key is the indexed field name. For each value found for that field, we create a column whose name is the field value and whose value is a list of log record keys that contained that value (perhaps stored as a long[]). Hence, to find all the log records where “Bob” is mentioned in the “User” field, we fetch the Fields row with key=User, find the column whose name=Bob, and the column value leads us to the appropriate log records. Likewise, we can quickly find all log records where the Request value is “1”. (This approach is a little naïve for a scalable application, but it will help make our point.)

Although this approach basically works, it is problematic when we consider how we might update these tables. In a concurrent environment, suppose two instances simultaneously attempt to update the Fields row with key=User and both attempt to set the column whose name is “Bob”. Even if they first attempted to read the column named “Bob” to see if it already exists, they may both believe they are the first. One will set “Bob” to one value; the other will set “Bob” to another value. When such conflicts occur, because there is no locking, Cassandra will effectively set the “Bob” column to the last update, thereby wiping out the first update. If the updates arrive in the reverse order, we get a different result.

Notice that multiple applications can add columns to the same row without conflicting–they only conflict if they try to update a column with the same name. Hence, we can resolve the conflict by “moving” the column name in the Fields table to the row key, creating a compound key: <field name>/<field value>. (For this example, we’re using a slash ‘/’ to separate the two parts of the key.) This will yield one row per field name/field value pair. As a result, our improved Fields table looks like this:


ColumnFamily: Fields (improved)
Key (Field)
Columns
User/Bob
1265040242124=null
1265043781000=null
User/Mary
1265040242678=null
Request/1
1265040242124=null
1265040242678=null
Request/5
1265043781000=null

If two applications update the Fields row for the “User” field corresponding to “Bob”, there’s no conflict at the row level. It doesn’t matter which order the updates are applied since a single “User/Bob" row will be created regardless. At the column level, each application only sets the column whose name matches the timestamp of the log record it is processing. (In this example, we don’t use the column value, so we just set it to null.) It also doesn’t matter which order the column updates occur since they will all be added to the same row. The only rule we have to follow is that each log record is processed by a single update application.

A side benefit of this approach is that a potentially large set of columns is now distributed over many rows: one per user. If we configure our database schema with an “order preserving partitioner”, our rows will be sorted, and we can perform range queries. For example, we can ask for all Fields rows corresponding to users whose name starts “A” by searching for keys in the range “User/A” (inclusive) and “User/B” (exclusive). Similarly, we can ask for all references to a specific range of “Request” values. (This technique is similar to the one used by the Lucene/Cassandra hybrid called Lucandra.)

There are more complex scenarios in which multiple applications may need to update the same row in the same “transaction”. One example is statistical aggregates, such as maintaining a sum or average of some data value. Even within the confines of the eventually consistent model used by NoSQL databases such as Cassandra, there are ways to accommodate these needs. The key is to use an idempotent update strategy that prevents update conflicts and facilitates application restarts.

Thursday, October 28, 2010

Engineering: Maybe we have it all wrong

[Note: I wrote this in 2009 but never posted it anywhere. Just re-found it and thought I'd post it here.- RG]


When I was in college, the recommended software "engineering" processes were basically waterfall approaches using structured analysis and structured design. Later came the "object" paradigm and OO versions of analysis, design, and even "OO requirements gathering" (which made no sense to me at all.) Then came a myriad of new tools and processes for improving software "engineering": CASE, 4GLs, UML, Spiral, Crystal, Scrum, XP, etc. But these fads rarely seemed to make projects go faster or produce better products. After reading SEI reports, Steve McConnell's books, and countless articles and combining others' views with my own software projects experiences, it seems that most software projects have used – and continue to use – "hero development". Put together a smart group of developers, give them an idea and a keyboard, and let 'em go. Lots of useful software has been developed this way, some of it even great. But lots of software produced this way has been less than satisfactory, and much of it completely crashed and burned.

For years, I thought the problem with software development is that it is rarely conducted as a true engineering discipline. I kept thinking: we should build software like other companies build bridges, jet engines, and lawnmowers. But over the years, I've talked with engineers who work for companies that build these kinds of things, and I've found they have the same complaints we do: innovation stifled, lots of drudgery "patch" work, burdensome bureaucracies, the Peter Principle, etc. I hear stories from Motorola, Xerox, and, more recently, Boeing, which make me believe other engineers have the same challenges and frustrations as software developers despite their more structured engineering practices. Many of their products perform poorly or fail outright. The successful ones succeed mainly because of years of tight design constraints and exhaustive testing, which also make them very expensive.

It seems that, if software teams applied the same structured engineering processes as our lawnmower brethren, there would be no guarantee that we'd be any more successful than we are now.

So this got me to thinking: maybe we've got it all wrong about how to do "engineering". Perhaps because people are social animals that depend heavily on relationships and ad-hoc communication, structured processes are completely against the DNA that allows us to invent. If hero development often works, perhaps we should embrace the best parts of it and add structure and process only for the non-inventive activities. For example:

  • Collaboration: This should be as high-velocity and high-fidelity as possible. This is the electricity that runs the inventive process. This is what makes Web 2.0 work. We should have lots of avenues and few rules for communication.
  • Design and coding: These are core inventive activities and should be as unstructured (process-wise) as possible. Of course good software design principles should be used: modular design, efficient algorithms, proven data structures, etc. But significant innovation occurs in these efforts, so process constraints should be at a minimum.
  • Feature definition: Deciding what features a product should have is the most important activity of software development. Building a 100% perfect product that no one wants is a 100% failure. We've tried "requirements specifications", "user stories", "feature backlogs", etc., but for commercial software products, feature development remains difficult. Because it cries for rapid prototyping, feedback, and refinement, feature definition is an activity ripe for innovation and creativity. Participants should be able to use UI mock-ups, cardboard posters, hand puppets, or anything else that helps distill the success factors of the project at hand.
  • Manufacturing: When a really good car or toaster has been designed, we want the manufacturing tolerances to be as tight as possible. In software, the "manufacturing" activities are things like source control, builds, test execution, and deployment. These should be highly structured and measured. Deviations from allowable tolerances should set off alarms.
  • Documentation and testing: It seems to me these often under-emphasized activities have both a design part, which lends itself to the same innovation as software design, and an execution part, which should be treated like any other manufacturing activity. Both should also honor proven techniques and patterns, but every product should have the leeway for innovative documentation and testing. (Imagine a product for which you could right-click any UI widget and select learn, and it would demonstrate how and when to use that widget.)

In other words, let the design activities be fluid, unstructured, and collaborative. Let the manufacturing processes be structured, automated, and measured. We would still need plans, milestones, schedules, and the like to allow us to operate in a corporate environment. But maybe we could invent a new way to do engineering that works better for the way people really are.

Wednesday, July 21, 2010

Why Use a NoSQL Database?

NoSQL databases were born mostly from web-centric companies needing to solve problems out of the reach of modern relational databases. For those of us who are not running web-centric companies, are these NoSQL databases useful? I think they could be. Although the problems that motivated the NoSQL database movement may be quite different from ours, these databases offer benefits that may be a good fit for several situations. From my perspective, there are three features of NoSQL databases that are attractive for internal and even commercial applications:

  • Cost: Most of the NoSQL databases are open source and can be used freely. Moreover, they are designed to run on commodity x86 hardware with locally-attached disk. Compared to scaling a modern commercial relational database, which typically requires expensive "big iron" boxes, SAN storage, and other expensive components, NoSQL databases can manage a large set of data at a fraction of the hardware and software costs.
  • Performance: Because of their special purpose, NoSQL databases tend to be "lean and mean", often supporting read/write rates an order of magnitude faster or more than relational databases. Many NoSQL databases can be classified as "persistent hash tables", optimized for performance and availability at the expense of other features.
  • Availability: NoSQL databases use replication and sophisticated fault tolerance algorithms such as Phi Accrual Failure Detection to maximize availability when a node fails. Many NoSQL databases allow new nodes to be dynamically added, after which data--and work--are automatically redistributed. Since field/column semantics are enforced by applications, NoSQL databases don't require table-level schema changes. Even upgrades to new software versions are typically supported via "rolling restarts", which keep the overall database available.
Before you get too excited about a new class of database that is cheap, fast, and highly available, be aware that, compared to relational databases, you're going to give up a lot. Keeping in mind that features of each NoSQL database vary and that this new genre is evolving quickly, consider some things we take for granted from relational databases but we'll mostly find absent with NoSQL databases:

  • SQL: Some NoSQL databases provide a subset of SQL or an SQL-like query language, but these languages are very limited compared to full ANSI SQL. NoSQL databases favor small update transactions and quick queries with little or no support for ACID transactions (see below). Many NoSQL databases provide a way to use MapReduce-style processing for data analysis, but this bulk scan approach facilitates long-running, read-only applications.
  • Data integrity: NoSQL databases don't have a detailed schema - you declare only the major data pools (tables) that will be used. Each row can have a large, variable number of columns, each storing any type of value. NoSQLs have no foreign key support or referential integrity, and they don't have triggers or similar features that help enforce data rules. Instead, you must enforce all data integrity rules within your applications.
  • ACID transactions: Because of their eventual consistency model, NoSQL databases have little or no locking, which means you can't perform multi-record, atomic updates. (Some NoSQL databases allow a single transaction to atomically update a limited number of records in the same partition, but not across partitions.) As a result, updates require strategies such as application-level locking, "soft" transactions, and idempotent algorithms (a subject for another post).
  • Stability: Most of these NoSQL databases are shipping an 0.x release, meaning they haven't yet reached 1.0. They are essentially experimental, which means lots of things: they aren't as stable as production databases, upgrade compatibility isn't guaranteed, and, well, they crash more often than commercial databases. Among other things, this means they're not ready for mission-critical, production databases.
  • An ecosystem: Similarly, because NoSQL databases are so new, you won't find a rich array of tools, add-ons, and even consulting expertise for them. (Though companies such as Riptano are starting to emerge for professional support.) As with all new technologies, it will take time for this field to mature.
Despite their current limitations, NoSQL databases may be a suitable match for certain applications. Especially with the continued decline in storage costs and our end users' growing appetite for Google-like access to information, we are increasingly compelled to make more data available at sub-second speed. This makes the speed and availability factors of NoSQL enticing. Here are some example applications that I think might be good candidates:

  1. Capture and report: Consider applications that capture data from their environment such as events or log records and then provide various forms of reporting. For these applications, records are typically stored once and not updated thereafter. Their challenge is making an ever-expanding data pool accessible for online searching and data analysis. Using a NoSQL database for storage may enable these apps to store more data, more quickly, and more cheaply than with a relational database.
  2. Distributed storage: Today, we would typically use a file system to manage a large number of unstructured objects: documents, images, logs, etc. A large SAN with RAID is one option, but as an alternate solution, a NoSQL database's sharding and replication features may provide dynamically expandable storage at lower cost.
  3. Single updater/copious readers: NoSQL databases were born from social web applications such as Facebook and LinkedIn. These applications experience high update volumes (and even higher fetch volumes), but typically any given object is updated by a single user. In other words, ACID transactions are not a strong requirement while massive scalability and availability are. Correspondingly, collaboration and similar applications that store a large number of objects that are seldom updated by more than one user at a time are also good fits for NoSQL databases.
  4. Hybrid scenarios: Some relational database applications manage a large number of tables, but only a few tables experience large growth. Often, the number of records in the high-growth tables is limited by the performance that can be reasonably provided. That is, records are purged or off-loaded above a certain threshold to avoid response degradation. NoSQL databases offer a new choice where off-loaded records can be stored. In some cases, it may make sense to move the high-volume tables entirely to a NoSQL database, allowing the application to use a hybrid storage approach.
Most companies have a wide range of applications that warrant a database for persistent storage. For a vast majority of these applications, relational databases will continue to be the right solution. Any application whose data fits on a single computer or requires complex transaction processing has little reason to consider a NoSQL database. But for many applications such as the examples above, a NoSQL database may be a good fit. And, as the NoSQL field continues to expand and mature, the range of suitable applications will undoubtedly expand.

Tuesday, July 13, 2010

Where did NoSQL Come From?

Relational databases have reigned for so long that it is truly amazing to see a whole new class of databases emerging. These so-called NoSQL databases are decidedly non-relational in almost every regard: architecture, schema capabilities, APIs ­– even support for ACID transactions. (Carlo Strozzi first used the term NoSQL over 10 years ago to describe a non-relational database. The term was recently recycled to describe an emerging class of distributed database systems. Because absence of SQL is not really a requirement, Strozzi claims the term NoREL would be more apt [though not as catchy].) Given the maturity of today’s open source and commercial relational databases, why do we need a new kind of database? What motivated these new databases and what can they do for us? In this post, I’ll examine these questions from the perspective of a long-time database developer.

One of the first computers I used (late 70’s) was a Burroughs B5700 mainframe. It supported a first-generation database system called DMS, which was really just an index-sequential access method (ISAM) system–barely a database. When Burroughs released the B6700 mainframe, it replaced DMS with a hybrid network/hierarchical system called DMSII. (Similar databases also appeared on mainframes from IBM, Univac, Honeywell, and others.) The network features provided unidirectional pointers between records; hierarchical features allowed a master record to have related child records. Coupled with ACID transactions, these early mainframe databases were the first heavy lifters of large data sets. (Of course, back then we considered a gigabyte to be a lot of data. Today, my phone has 16GB of storage!)

Through the mid-80’s, I wrote a lot of applications for DMSII databases. Even without something for comparison, I considered these early databases tedious to program, largely due to the navigational access model: the unit of access was a record, and you could only fetch or update one record at a time. Complex operations such as producing a bill of materials (also called “parts explosion”) required a deep knowledge of cursors, indexes, and efficient search techniques. And, even simple schema changes often broke existing applications. Back then, database programming was tedious work.

As the need for greater flexibility and programmability became apparent, several new database models began to receive attention in the 80’s. For several years, I helped develop a mainframe database system that used the semantic network model. Later I was project leader for a relational/object database system for Windows and Unix. But the relational model became the clear winner, largely due to the power of SQL. I’ve spent over a decade now writing applications for various relational databases.

SQL not only elevates programmers from navigational to set-based access, but SQL query optimizers largely free programmers from the details of navigating tables and indexes. In fact, mature optimizers can choose sophisticated execution strategies that most programmers would never consider. Today we have a large set of relational database choices supported by a huge ecosystem of tools, applications, and expertise.

So, are these NoSQL databases an evolution of relational databases? Not exactly. NoSQL databases are mostly brand new, developed from scratch. Are they the next database generation, poised to toss relational technology from its throne? No way. They typically support low-level programming interfaces, they provide little in the way of schema evolution, they offer no referential integrity, and they don’t have the ecosystem of relational databases. OK, so what prompted the NoSQL movement and what do these new kids on the block offer?

To understand where the NoSQL movement came from, imagine what you might do if you started and grew a modern Web-centric company:

  • At first, you’re running your new web site out of your dorm room, so you put your data in a MySQL database and run a web server on the same box.
  • When the machine can no longer support the load, you move the web server to its own machine. Then you deploy multiple web servers with a load-balancing front end.
  • Eventually, the database becomes a bottleneck, so you use a distributed caching tool such as Memcached, and you upgrade to bigger boxes.
  • But even on a big machine, your single database instance runs out of gas, so you start partitioning your data into multiple database instances. Your front-end diverts queries to the appropriate shard as needed.
  • Needing yet more performance, you begin to limit your application’s SQL queries to the fastest operations: small transactions, fetch-by-primary-key only, no joins, store multiple values in a single Blob, etc. In other words, you start to use your SQL database mostly as a key/value store.
  • Even with sharding, contention for each shard eventually becomes a bottleneck. So, you use replication to store each record on multiple nodes, yielding greater concurrency, and you develop sophisticated techniques for fault tolerance and maintaining consistency between copies.
At this point, you might realize that you’re not using the relational database for most of the features for which it was designed: SQL, ACID transactions, query optimization, etc. Furthermore, things like complex replication, dynamic expandability, and consistency without expensive locking are not easy with modern relational databases. If you deploy hundreds of database nodes, you soon learn about new problems such as network partition failures. In this new world, relational databases are increasingly unsuitable.

Scenarios analogous to this one have occurred at numerous web companies over the last 10 years. The need for massive scalability, dynamic expansion, lower cost, and new fault tolerance models drove web-centric companies to find new solutions for their unique problems. Amazon developed Dynamo, Google developed BigTable, LinkedIn developed Voldemort, Facebook developed Cassandra, and so forth. The NoSQL movement was born.

It’s exciting to see the Web spawn a whole new class of database technology, despite the presence of the highly mature relational database field. But for those of us who don’t have the scale and access problems of a web-centric company, what can NoSQL databases offer?

In my next post, I’ll explore some use cases where NoSQL databases can complement relational databases and perhaps open-up whole new application possibilities.

Welcome to Guck Tales!

Welcome to my blog site, Guck Tales! This is a software technology blog, intended to chronicle things I find interesting as I research, use, and develop computer software technologies. See my profile for more details about me, but here's a quick background: I've been working with computers and developing software for over 35 years. For reasons unknown to me, I seem to always be involved with databases, either developing DBMS systems or writing database applications. I've worked on many client/server, peer-to-peer, and high performance transaction processing applications, poking every technology I can find with a stick to see if it bites. I love this field. If I hadn't landed in computers, I'd probably be programming on the side anyway. I pinch myself all the time thinking: "They pay me to do this!"

I probably should have started blogging long ago. Not that I'm a great writer, but I love to share things I find: cool tools, new technologies, insights into solving hard problems. What made me decide to finally start a software technology blog is my current area of research: NoSQL databases. What are these? See the next blog entries! I'll start writing about my experiments and findings with this new breed of database, hopefully providing some valuable links and tips to others who may need to walk a similar path. As I stray into other areas of software development, I'll try to chronicle interesting things in other areas as well. If you follow this blog, I hope you find something useful.

Welcome!