A Quick Introduction to Apache Cassandra

Cassandra, used by NetFlix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL-databases in use. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, high-availability datastore with no single point of failure. Originally developed by Avinash Lakshman (the author of Amazon Dynamo) and Prashant Malik at Facebook, to solve their Inbox-search problem, the code was published in July 2008 as free software, under the Apache V2 license. The development has continued at an amazing pace, driven in part by contributions from IBM, Twitter and Rackspace. Since February 2010, Cassandra has been an “Apache top-level project”.

Architecture

Interestingly, Cassandra forgoes the widely used Master-Slave setup, in favor of a peer-to-peer cluster. This contributes to Cassandra having no single-point-of-failure, as there is no master-server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster. This architecture is a lot more complex to implement behind the scenes, but we won’t have to deal with that. The nice folks working at the Cassandra core bust their heads against the quirks of distributed systems (the interested reader might start here, here or here, to learn more about how such a cluster design can be implemented). Not having to distinguish between a Master and a Slave node allows you to add any number of machines to any cluster in any datacenter, without having to worry about what type of machine you need at the moment. Every server accepts requests from any client. Every server is equal.

The CAP-Theorem & Tunable Consistency

Possibly the best-known peculiarity of distributed systems is the CAP-Theorem by Dr. E. A. Brewer. It states that of the three attributes, Consistency, Availability and Partition Tolerance, any such system can only fulfill two at a time (take this with a grain of salt). Without going into too many details, RDBMS have for years focused on the first two, consistency and availability, allowing for such great things as transactions. The whole NoSQL movement (from a 10,000ft view) is essentially about choosing partition tolerance instead of (strong) consistency. This has led to the popular belief that NoSQL databases are completely unsuitable for applications requiring just that. And while this might be true for some, it isn’t for Cassandra. One of Cassandra’s stand-out features is called “Tunable Consistency”. This means that the programmer can decide if performance or accuracy is more important, on a per-query level. For write-requests, Cassandra will either replicate to any available (replication) node, a quorum of nodes or to all nodes, even providing options how to deal with multi-datacenter setups. For read-requests, you can instruct Cassandra to either wait for any available node (which might return stale data), a quorum of machines (thereby reducing the probability to get stale data) or to wait for every node, which will always return the latest data and provide us with our long-sought, strong consistency. Learn more about Tunable Consistency here.

Data model

On the surface, Cassandra’s data model seems to be quite relational. With this in mind, diving deeper into ColumnFamilies, SuperColumns and the likes, will make Cassandra look like an unfinished RDBMS, lacking features like JOINS and most rich-query capabilities. To understand why databases like Cassandra, HBase and BigTable (I’ll call them DSS, Distributed Storage Services, from now on) were designed the way they are, we’ll first have to understand what they were built to be used for. DSS were designed to handle enormous amounts of data, stored in billions of rows on large clusters. Relational databases incorporate a lot of things that make it hard to efficiently distribute them over multiple machines. DSS simply remove some or all of these ties. No operations are allowed, that require scanning extensive parts of the dataset, meaning no JOINS or rich-queries. There are only two ways to query, by key or by key-range. The reason DSS keep their data model to the bare minimum is the fact, that a single table is far easier to distribute over multiple machines, than several, normalized relations or graphs. Think of the ColumnFamily model as a (distributed Hash-)Map with up to three dimensions. The two-dimensional setup consists of just a ColumnFamily with some columns in it, “some” meaning a couple of billion if you so wish. So a ColumnFamily is just a map of columns. I have yet to figure out why, but it seems as if all these terms are just names for different dimensions of a map. A three-dimensional Cassandra “table” would be achieved by putting SuperColumns into a ColumnFamily, thus making it a SuperColumnFamily (please hold back any cries of astonishment), a map of a map of columns. In this setup, the SuperColumnFamily would represent the highest dimension and the SuperColumn would represent the two remaining dimensions, taking the place of the ColumnFamily in the previous example. This multi-dimensional map contains columns, triplets consisting of a name, a value and a timestamp. Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its unique key. Each row can hold up to 2 billion columns [²]. Furthermore, each row must fit onto a single server, because data is partitioned solely by row-key. As discussed in greater detail here, some other limitations apply, that in most cases should not concern you.

Is Cassandra right for me?

Cassandra draws heavily from BigTable (data model) and Dynamo (architecture), two of the most well-known and powerful databases today. This alone might be enough to consider it, but I’ll go ahead and provide list of questions you may ask yourself: 1 Do you require exceptional performance on large datasets? Cassandra provides fantastic write- and very good read-throughputs (beating most popular competitors), only comparable to HBase. Deciding between the two is mostly a matter of personal preference. If you don’t want the whole Hadoop stack with all its moving parts and increased architectural complexity, choose Cassandra, if Hadoop is fine, HBase may be the better integrated fit. If you’re dealing with small to medium data volumes, relational solutions become a lot more interesting, because of the flexibility they provide. 2 Do you need rich, flexible queries and a high-level data model? Cassandra reduces its data model to the absolute minimum, in order to keep the dataset as partition-able as possible. This pays back when talking about linear scalability, but leaves you with “nothing more” than a distributed hash-map. But don’t judge too early, as even one of Google’s core technologies, BigTable, does exactly the same. Like C provides a programmer with fantastic performance at the cost of some comfort features like garbage collection, Cassandra provides you with a core piece of Google-scale technology, at the cost of dealing with indexes yourself or thinking about queries you’ll need beforehand. 3

Do you require application transparency? Because of Cassandra’s very low-level data model, applications require extensive knowledge about the dataset. If application transparency is what you need, Cassandra is not for you.

Getting started

If you’ve come to the decision that Cassandra might be worth a try, I’ll list a few starting points here. DataStax, a company providing Cassandra tools, support and solutions to enterprises, hosts a community edition to simplify things for you. The package includes smart installers for all major OS, DataStax’s utility to manage your Cassandra setup visually and some sample applications/databases. Careful though, the package comes with python included, which might mess up your default python interpreter. In this case, just remove the DataStax python directory from your (PYTHON) PATH variable. Now pick a client (recommended) or use Thrift, a framework for language independent services. From then, you might want to check out this reading list, start experimenting immediately by following this tutorial or try Cassandra by example.

Frequently Asked Questions about Apache Cassandra

What is the architecture of Apache Cassandra?

Apache Cassandra is built on a distributed architecture that allows for high availability, fault tolerance, and scalability. It uses a ring design instead of using a master-slave architecture. Each node in the cluster has the same role and communicates with each other equally. Data is distributed among all the nodes in a cluster, and each node is responsible for a certain set of data. This architecture ensures that there is no single point of failure and the system can continue to operate even if a node fails.

How does Apache Cassandra handle data replication?

Apache Cassandra provides robust support for data replication, which is a critical feature for any distributed system. It replicates data across multiple nodes to ensure data safety and availability. The replication factor can be configured based on the requirements. If a node fails, the data can be retrieved from another node where the data has been replicated.

What is the write process in Apache Cassandra?

In Apache Cassandra, writes are handled in a way that ensures high performance and consistency. When a write operation occurs, the data is first written to a commit log for durability and then to an in-memory structure called the memtable. Once the memtable is full, the data is flushed to an SSTable on disk. This process ensures that write operations are fast and data is not lost in case of a system failure.

How does Apache Cassandra ensure data consistency?

Apache Cassandra uses a protocol called tunable consistency to ensure data consistency. This allows users to choose the level of consistency they need for their application. For example, a user can choose to have a consistency level of one, which means that a write or read operation needs to be confirmed by only one node. On the other hand, a consistency level of quorum requires a majority of nodes to confirm a write or read operation.

What is the read process in Apache Cassandra?

When a read request is made in Apache Cassandra, the system first checks the memtable and if the data is not found there, it checks the SSTables. If the data is still not found, it checks the bloom filter, which is a data structure that can tell us if the data is definitely not in the SSTable. This process ensures that read operations are efficient and fast.

What are the key features of Apache Cassandra?

Apache Cassandra has several key features that make it a popular choice for handling large amounts of data. These include its distributed architecture, support for data replication, tunable consistency, and high performance for write and read operations. It also provides support for data partitioning and multi-data center replication, which allows for geographical distribution of data.

How does Apache Cassandra handle data partitioning?

Apache Cassandra uses a method called consistent hashing for data partitioning. In this method, data is distributed across the nodes in the cluster based on the hash value of the keys. This ensures that the data is evenly distributed and allows for easy scalability as new nodes can be added without disrupting the existing data distribution.

What is the role of Apache Cassandra in big data analytics?

Apache Cassandra plays a crucial role in big data analytics as it can handle large amounts of data spread across many commodity servers. It provides high availability and fault tolerance along with powerful capabilities to handle complex data structures. It is widely used in real-time analytics, where it can process large volumes of data in real-time.

How does Apache Cassandra support multi-data center replication?

Apache Cassandra supports multi-data center replication, which allows for geographical distribution of data. This means that you can have your data replicated in different data centers around the world. This feature is crucial for ensuring high availability and disaster recovery.

What are the use cases of Apache Cassandra?

Apache Cassandra is used in a variety of applications that require scalability and high availability. Some of the common use cases include real-time analytics, messaging systems, recommendation engines, product catalogs and playlists in web and mobile applications, and sensor data in IoT applications. It is also used in fraud detection systems where it can process large volumes of data in real-time.