A Quick Introduction to Apache Cassandra
Cassandra, used by NetFlix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL-databases in use. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines.
Cassandra provides a scalable, high-availability datastore with no single point of failure. Originally developed by Avinash Lakshman (the author of Amazon Dynamo) and Prashant Malik at Facebook, to solve their Inbox-search problem, the code was published in July 2008 as free software, under the Apache V2 license. The development has continued at an amazing pace, driven in part by contributions from IBM, Twitter and Rackspace. Since February 2010, Cassandra has been an “Apache top-level project”.
Interestingly, Cassandra forgoes the widely used Master-Slave setup, in favor of a peer-to-peer cluster. This contributes to Cassandra having no single-point-of-failure, as there is no master-server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster.
This architecture is a lot more complex to implement behind the scenes, but we won’t have to deal with that. The nice folks working at the Cassandra core bust their heads against the quirks of distributed systems (the interested reader might start here, here or here, to learn more about how such a cluster design can be implemented).
Not having to distinguish between a Master and a Slave node allows you to add any number of machines to any cluster in any datacenter, without having to worry about what type of machine you need at the moment. Every server accepts requests from any client. Every server is equal.
The CAP-Theorem & Tunable Consistency
Possibly the best-known peculiarity of distributed systems is the CAP-Theorem by Dr. E. A. Brewer. It states that of the three attributes, Consistency, Availability and Partition Tolerance, any such system can only fulfill two at a time (take this with a grain of salt).
Without going into too many details, RDBMS have for years focused on the first two, consistency and availability, allowing for such great things as transactions. The whole NoSQL movement (from a 10,000ft view) is essentially about choosing partition tolerance instead of (strong) consistency. This has led to the popular belief that NoSQL databases are completely unsuitable for applications requiring just that. And while this might be true for some, it isn’t for Cassandra.
One of Cassandra’s stand-out features is called “Tunable Consistency”. This means that the programmer can decide if performance or accuracy is more important, on a per-query level. For write-requests, Cassandra will either replicate to any available (replication) node, a quorum of nodes or to all nodes, even providing options how to deal with multi-datacenter setups.
For read-requests, you can instruct Cassandra to either wait for any available node (which might return stale data), a quorum of machines (thereby reducing the probability to get stale data) or to wait for every node, which will always return the latest data and provide us with our long-sought, strong consistency.
Learn more about Tunable Consistency here.
On the surface, Cassandra’s data model seems to be quite relational. With this in mind, diving deeper into ColumnFamilies, SuperColumns and the likes, will make Cassandra look like an unfinished RDBMS, lacking features like JOINS and most rich-query capabilities.
To understand why databases like Cassandra, HBase and BigTable (I’ll call them DSS, Distributed Storage Services, from now on) were designed the way they are, we’ll first have to understand what they were built to be used for.
DSS were designed to handle enormous amounts of data, stored in billions of rows on large clusters. Relational databases incorporate a lot of things that make it hard to efficiently distribute them over multiple machines. DSS simply remove some or all of these ties. No operations are allowed, that require scanning extensive parts of the dataset, meaning no JOINS or rich-queries.
There are only two ways to query, by key or by key-range. The reason DSS keep their data model to the bare minimum is the fact, that a single table is far easier to distribute over multiple machines, than several, normalized relations or graphs.
Think of the ColumnFamily model as a (distributed Hash-)Map with up to three dimensions. The two-dimensional setup consists of just a ColumnFamily with some columns in it, “some” meaning a couple of billion if you so wish. So a ColumnFamily is just a map of columns.
I have yet to figure out why, but it seems as if all these terms are just names for different dimensions of a map. A three-dimensional Cassandra “table” would be achieved by putting SuperColumns into a ColumnFamily, thus making it a SuperColumnFamily (please hold back any cries of astonishment), a map of a map of columns.
In this setup, the SuperColumnFamily would represent the highest dimension and the SuperColumn would represent the two remaining dimensions, taking the place of the ColumnFamily in the previous example. This multi-dimensional map contains columns, triplets consisting of a name, a value and a timestamp.
Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its unique key. Each row can hold up to 2 billion columns [²]. Furthermore, each row must fit onto a single server, because data is partitioned solely by row-key. As discussed in greater detail here, some other limitations apply, that in most cases should not concern you.
Is Cassandra right for me?
Cassandra draws heavily from BigTable (data model) and Dynamo (architecture), two of the most well-known and powerful databases today. This alone might be enough to consider it, but I’ll go ahead and provide list of questions you may ask yourself:
1 Do you require exceptional performance on large datasets?
Cassandra provides fantastic write- and very good read-throughputs (beating most popular competitors), only comparable to HBase. Deciding between the two is mostly a matter of personal preference. If you don’t want the whole Hadoop stack with all its moving parts and increased architectural complexity, choose Cassandra, if Hadoop is fine, HBase may be the better integrated fit. If you’re dealing with small to medium data volumes, relational solutions become a lot more interesting, because of the flexibility they provide.
2 Do you need rich, flexible queries and a high-level data model?
Cassandra reduces its data model to the absolute minimum, in order to keep the dataset as partition-able as possible. This pays back when talking about linear scalability, but leaves you with “nothing more” than a distributed hash-map. But don’t judge too early, as even one of Google’s core technologies, BigTable, does exactly the same. Like C provides a programmer with fantastic performance at the cost of some comfort features like garbage collection, Cassandra provides you with a core piece of Google-scale technology, at the cost of dealing with indexes yourself or thinking about queries you’ll need beforehand.
3 Do you require application transparency?
Because of Cassandra’s very low-level data model, applications require extensive knowledge about the dataset. If application transparency is what you need, Cassandra is not for you.
If you’ve come to the decision that Cassandra might be worth a try, I’ll list a few starting points here. DataStax, a company providing Cassandra tools, support and solutions to enterprises, hosts a community edition to simplify things for you. The package includes smart installers for all major OS, DataStax’s utility to manage your Cassandra setup visually and some sample applications/databases.
Careful though, the package comes with python included, which might mess up your default python interpreter. In this case, just remove the DataStax python directory from your (PYTHON) PATH variable. Now pick a client (recommended) or use Thrift, a framework for language independent services.