Originally published at: http://www.sitepoint.com/rethinkdb-ruby-map-reduce-joins/
Last time around we covered some of the basics of RethinkDB: installing it, querying it, inserting documents – all with the Ruby driver. In this article, we’ll take a deeper look at some of the features of RethinkDB. Before we get started, you should have a copy of RethinkDB installed and running.
Say we have a bunch of data spread across a group of nodes. We’re trying to run some kind of computation on this data; how should we do it? One obvious solution (that’s also obviously bad) is to move all the data to a single node and run the algorithm there. There are many reasons why this is a horrible idea:
- We might not have enough storage on the single node
- It would take a lot of bandwidth, etc.
So, how do we take advantage of all these nodes? Well, write the algorithm so that it can run in a parallelized fashion across the cluster. Unfortunately, without any “guidelines”, this can be a pretty difficult task. For example, what if one of the nodes fails while we’re running our algorithm? How do we divide up the dataset so that all the nodes pull their own weight?
Back in the day, when researchers/developers at Google came across these problems, they developed a framework called MapReduce. Essentially, MapReduce forces you to structure your algorithm in a certain way and, in return, it can handle system failures, etc. for you. With MapReduce, your code remains unchanged regardless of whether it’s running on one node or a thousand. It turns out that RethinkDB includes an implementation of MapReduce that allows you to apply computations to your data set in an efficient way.
So, how does MapReduce work? Say we’re trying to operate on many pieces of data and we put them all in a list. The MapReduce paradigm consists of two steps, map and reduce (RethinkDB introduces a separate step called grouping, which we’ll discuss later). Map, just like Ruby’s map method, takes a list, applies an operation to each element, and spits out another list. Reduce, just like reduce in Ruby, takes a list and “boils” it down to a single value. If you write your algorithm using these “map” and “reduce” pieces, RethinkDB will figure out how to efficiently split up the computation (over tables and shards).
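To make the two steps concrete, here is a minimal sketch in plain Ruby. It does not use the RethinkDB driver, and the word list and the chunk size are made up for illustration. It computes a total with map and reduce, then repeats the computation on separate chunks to hint at how the work could be divided among nodes.

```ruby
# Plain Ruby only: no RethinkDB connection is involved.
# The word list is made-up sample data.
words = ["rethinkdb", "ruby", "map", "reduce"]

# Map: turn each element into a new element (here, its length).
lengths = words.map { |word| word.length }

# Reduce: combine the elements pairwise until one value is left.
total = lengths.reduce { |sum, length| sum + length }

# The same computation split into chunks, roughly the way a cluster
# could split it across nodes (the chunk size of 2 is arbitrary).
partial_sums = words.each_slice(2).map do |chunk|
  chunk.map { |word| word.length }.reduce { |sum, length| sum + length }
end

# Combine the per-chunk results with the same reduce operation.
chunked_total = partial_sums.reduce { |sum, partial| sum + partial }

puts total
puts chunked_total
```

Both totals come out the same because addition gives the same answer no matter how the list is grouped. That property is what lets a map/reduce computation be divided up and the partial results combined afterwards.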