RethinkDB in Ruby: Map Reduce and Joins
Last time around we covered some of the basics of RethinkDB: installing it, querying it, inserting documents – all with the Ruby driver. In this article, we’ll take a deeper look at some of the features of RethinkDB. Before we get started, you should have a copy of RethinkDB installed and running.
MapReduce
Say we have a bunch of data spread across a group of nodes. We’re trying to run some kind of computation on this data; how should we do it? One obvious solution (that’s also obviously bad) is trying to move all the data to one node where an algorithm is run in order to get the information we need. There are many reasons why this is a horrible idea:
- We might not have enough storage on the single node
- It would take a lot of bandwidth, etc.
So, how do we take advantage of all these nodes? Well, write the algorithm so that it can run in a parallelized fashion across the cluster. Unfortunately, without any “guidelines”, this can be a pretty difficult task. For example, what if one of the nodes fails while we’re running our algorithm? How do we divide up the dataset so that all the nodes pull their own weight?
Back in the day, when researchers/developers at Google came across these problems, they developed a framework called MapReduce. Essentially, MapReduce forces you to structure your algorithm in a certain way and, in return, it can handle system failures, etc. for you. With MapReduce, your code remains unchanged regardless of whether it’s running on one node or a thousand. It turns out that RethinkDB includes an implementation of MapReduce that allows you to apply computations to your data set in an efficient way.
So, how does MapReduce work? Say we’re trying to operate on many pieces of data and we put them all in a list. The MapReduce paradigm consists of two steps (RethinkDB introduces a separate step called grouping which we’ll discuss later): map and reduce. Map, just like the map
Ruby method, takes a list, operates on it, and then spits out another list. Reduce, just like reduce
in Ruby, takes a list and “boils” it down a value. If you write your algorithm using these “map” and “reduce” pieces, RethinkDB will figure out how to efficiently split up the computation (over tables and shards).
Counting
Let’s take a dead-simple example, pretty similar to the documentation, which (unfortunately) focuses on Python. Remember the “people” table we made last time around? Of course you don’t. So, let’s put that together again. Remember, for any irb
commands noted here, the following has to be in place:
require 'rethinkdb'
include RethinkDB::Shortcuts
conn = r.connect(:host => "localhost", :port => 28015)
Alright, let’s make the table and put some stuff in it:
r.db_create("testdb").run(conn)
r.db("testdb").table_create("people").run(conn)
r.db("testdb").table("people").insert({:name => "Dhaivat", :age => 18,
:gender => "M"}).run(conn)
r.db("testdb").table("people").insert({:name => "John",
:age => 27, :gender => "M"}).run(conn)
r.db("testdb").table("people").insert({:name => "Jane",
:age => 94, :gender => "F"}).run(conn)
Some of these calls might fail if you already have the right structure setup inside RethinkDB.
Alright, let’s say we want to count the number of documents in the people
table. How can we do this with map-reduce (the RethinkDB implementation is referred to in lower case; Google’s implementation is “MapReduce”)? Our map operation will get, as input, each element in the list containing each document in our table. We’ll take that and transform into a list of 1’s and then the reduce operation will sum them up and produce one number. It’s pretty simple in code:
map_op = Proc.new { |person| 1 }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").map(map_op).reduce(reduce_op).run(conn)
That should nicely spit out “3” (if, of course, you have three documents in the table). The first two lines are just plain old Ruby in terms of syntax but there are a couple things to take note. As written, map_op
takes only one argument: the map operation is meant to map each element in the sequence to one element in the resulting sequence. reduce_op
takes two of the 1’s generated by map
and adds them together. We then construct a query by passing the procs to map
and reduce
, respectively.
What if we wanted to figure out the sum of the ages instead? Simple:
map_op = Proc.new { |person| person["age"] }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").map(map_op).reduce(reduce_op).run(conn)
Basically, the only difference is map
returns the age instead of 1
. It turns out that there’s a significantly easier way to accomplish both of these tasks: the count
and sum
methods. Take a look:
r.db('testdb').table('people').count().run(conn)
r.db('testdb').table('people').sum("age").run(conn)
We’re supposed to use map-reduce queries when we can’t find something equivalent in the existing API. The map-reduce paradigm is incredibly powerful, but somewhat of an unnecessary pain if we can find something that already does the job – you’ll be surprised how far just count
, sum
, avg
, and filter
can get you.
Grouping
Let’s rewind a little bit. The map step gets each element in the list/sequence. What if we want to group the elements in a special way? For example, maybe we want to count the number of females and the number of males. That’s where we use grouping. The best way to learn it is to see it in action:
group_op = Proc.new { |person| person['gender'] }
r.db('testdb').table('people').group(group_op).count.run(conn)
The proc group_op
returns a key for each element in the sequence and each element is grouped according to its key. In this case, we are creating groups based on gender so we use person['gender']
as the key. Then, we use the count function in order to count the elements within each grouping. The result should look like this:
{"F"=>1, "M"=>2}
In case you don’t want to use the count
utility function, we can use the same map_op
and reduce_op
procs we defined earlier:
map_op = Proc.new { |person| 1 }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").group(group_op).
map(map_op).reduce(reduce_op).run(conn)
Instead of operating on each element within the sequence/list, map_op
now operates on each element in each group. Then, reduce_op
reduces each group to one value which we see in the output.
Hadoop
In case you’re a Hadoop user or have heard about Hadoop, you might be wondering: how is map-reduce on RethinkDB any different from that in Hadoop? Well, one of the core reasons is that Hadoop is meant for unstructured “blob” data. The Hadoop Distributed File System (HDFS) is setup for reads of 64MB blocks at a time and focuses on file-based access. On the other hand, RethinkDB is a document store and “understands” the structure of the documents. Generally, RethinkDB should be used as the backend you’re constantly putting stuff into and querying out of (e.g. directly connected to a web app) and Hadoop can be used for batch processing of the data saved in RethinkDB.
Joins
Let’s switch gears completely. If you’ve used any sort of SQL database before, you’re probably familiar with “joins”. They take information from two separate tables and combine them to produce a set of information which has components from both tables. We see them most often with relationships between things we are storing. For example, if each person can have many pets, that’s a one-to-many relationship where we might use a table join. That’s as good an example as any so we’ll take a look at how to model it in RethinkDB. Let’s first create the right table:
r.db('testdb').table_create('pets').run(conn)
Get the IDs of the documents in the “people” table:
itr = r.db('testdb').table('people').run(conn)
itr.each do |document|
p document["id"]
end
Pick one of those IDs. We’ll create a pet with an owner from the people table:
person_id = <picked id goes here>
r.db("testdb").table("pets").
insert({:name => "Bob", :animal => "dog",
:person_id => person_id}).run(conn)
Each person can have one or more pets. So, each pet must have an owner and can store the id
for that owner. We end up having a field called “person_id” inside each pet document. We can create the join as follows:
r.db('testdb').table("pets").
eq_join("person_id", r.db("testdb").table("people")).run(conn)
Here, we are tacking on eq_join("person_id", r.db("testdb").table("people"))
. That tells RethinkDB that we’re trying to connect person_id
in the “pets” table to the primary key in the “people” table. That should spit something out that looks like this:
[{"left"=>
{"animal"=>"dog",
"id"=>"bfc9d1c1-442a-40d0-a58c-93641fe45ffb",
"name"=>"Bob",
"person_id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded"},
"right"=>
{"age"=>94, "id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded", "name"=>"Jane"}}]>
The “left” side represents the side that contains the relating ID (in this case, person_id
) and “right” contains the other side. We can flatten the representation pretty easily:
r.db('testdb').table("pets").
eq_join("person_id", r.db("testdb").table("people")).
zip.run(conn)
All we’ve done is tack on a call to zip
. Now, we get an iterator representing:
[{"age"=>94,
"animal"=>"dog",
"id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded",
"name"=>"Jane",
"person_id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded"}]
Awesome! We’ve basically combined the information from two tables as we’d hoped to with the join.
Wrapping It Up
I hope you’ve enjoyed this overview of some of the features of RethinkDB (and a couple of comparisons with other systems). In the next post, we will look at how to build a web app using RethinkDB as a backend.
Please drop questions in the comments section below.