Ruby
Article

RethinkDB in Ruby: Map Reduce and Joins

By Dhaivat Pandya

rethinkdb

Last time around we covered some of the basics of RethinkDB: installing it, querying it, inserting documents – all with the Ruby driver. In this article, we’ll take a deeper look at some of the features of RethinkDB. Before we get started, you should have a copy of RethinkDB installed and running.

MapReduce

Say we have a bunch of data spread across a group of nodes. We’re trying to run some kind of computation on this data; how should we do it? One obvious solution (that’s also obviously bad) is trying to move all the data to one node where an algorithm is run in order to get the information we need. There are many reasons why this is a horrible idea:

  • We might not have enough storage on the single node
  • It would take a lot of bandwidth, etc.

So, how do we take advantage of all these nodes? Well, write the algorithm so that it can run in a parallelized fashion across the cluster. Unfortunately, without any “guidelines”, this can be a pretty difficult task. For example, what if one of the nodes fails while we’re running our algorithm? How do we divide up the dataset so that all the nodes pull their own weight?

Back in the day, when researchers/developers at Google came across these problems, they developed a framework called MapReduce. Essentially, MapReduce forces you to structure your algorithm in a certain way and, in return, it can handle system failures, etc. for you. With MapReduce, your code remains unchanged regardless of whether it’s running on one node or a thousand. It turns out that RethinkDB includes an implementation of MapReduce that allows you to apply computations to your data set in an efficient way.

So, how does MapReduce work? Say we’re trying to operate on many pieces of data and we put them all in a list. The MapReduce paradigm consists of two steps (RethinkDB introduces a separate step called grouping which we’ll discuss later): map and reduce. Map, just like the map Ruby method, takes a list, operates on it, and then spits out another list. Reduce, just like reduce in Ruby, takes a list and “boils” it down a value. If you write your algorithm using these “map” and “reduce” pieces, RethinkDB will figure out how to efficiently split up the computation (over tables and shards).

Counting

Let’s take a dead-simple example, pretty similar to the documentation, which (unfortunately) focuses on Python. Remember the “people” table we made last time around? Of course you don’t. So, let’s put that together again. Remember, for any irb commands noted here, the following has to be in place:

require 'rethinkdb'
include RethinkDB::Shortcuts
conn = r.connect(:host => "localhost", :port => 28015)

Alright, let’s make the table and put some stuff in it:

r.db_create("testdb").run(conn)
r.db("testdb").table_create("people").run(conn)
r.db("testdb").table("people").insert({:name => "Dhaivat", :age => 18, 
    :gender => "M"}).run(conn)
r.db("testdb").table("people").insert({:name => "John", 
    :age => 27, :gender => "M"}).run(conn)
r.db("testdb").table("people").insert({:name => "Jane", 
    :age => 94, :gender => "F"}).run(conn)

Some of these calls might fail if you already have the right structure setup inside RethinkDB.

Alright, let’s say we want to count the number of documents in the people table. How can we do this with map-reduce (the RethinkDB implementation is referred to in lower case; Google’s implementation is “MapReduce”)? Our map operation will get, as input, each element in the list containing each document in our table. We’ll take that and transform into a list of 1’s and then the reduce operation will sum them up and produce one number. It’s pretty simple in code:

map_op = Proc.new { |person| 1 }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").map(map_op).reduce(reduce_op).run(conn)

That should nicely spit out “3” (if, of course, you have three documents in the table). The first two lines are just plain old Ruby in terms of syntax but there are a couple things to take note. As written, map_op takes only one argument: the map operation is meant to map each element in the sequence to one element in the resulting sequence. reduce_op takes two of the 1’s generated by map and adds them together. We then construct a query by passing the procs to map and reduce, respectively.

What if we wanted to figure out the sum of the ages instead? Simple:

map_op = Proc.new { |person| person["age"] }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").map(map_op).reduce(reduce_op).run(conn)

Basically, the only difference is map returns the age instead of 1. It turns out that there’s a significantly easier way to accomplish both of these tasks: the count and sum methods. Take a look:

r.db('testdb').table('people').count().run(conn)
r.db('testdb').table('people').sum("age").run(conn)

We’re supposed to use map-reduce queries when we can’t find something equivalent in the existing API. The map-reduce paradigm is incredibly powerful, but somewhat of an unnecessary pain if we can find something that already does the job – you’ll be surprised how far just count, sum, avg, and filter can get you.

Grouping

Let’s rewind a little bit. The map step gets each element in the list/sequence. What if we want to group the elements in a special way? For example, maybe we want to count the number of females and the number of males. That’s where we use grouping. The best way to learn it is to see it in action:

group_op = Proc.new { |person| person['gender'] }
r.db('testdb').table('people').group(group_op).count.run(conn)

The proc group_op returns a key for each element in the sequence and each element is grouped according to its key. In this case, we are creating groups based on gender so we use person['gender'] as the key. Then, we use the count function in order to count the elements within each grouping. The result should look like this:

{"F"=>1, "M"=>2}

In case you don’t want to use the count utility function, we can use the same map_op and reduce_op procs we defined earlier:

map_op = Proc.new { |person| 1 }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").group(group_op).
    map(map_op).reduce(reduce_op).run(conn)

Instead of operating on each element within the sequence/list, map_op now operates on each element in each group. Then, reduce_op reduces each group to one value which we see in the output.

Hadoop

In case you’re a Hadoop user or have heard about Hadoop, you might be wondering: how is map-reduce on RethinkDB any different from that in Hadoop? Well, one of the core reasons is that Hadoop is meant for unstructured “blob” data. The Hadoop Distributed File System (HDFS) is setup for reads of 64MB blocks at a time and focuses on file-based access. On the other hand, RethinkDB is a document store and “understands” the structure of the documents. Generally, RethinkDB should be used as the backend you’re constantly putting stuff into and querying out of (e.g. directly connected to a web app) and Hadoop can be used for batch processing of the data saved in RethinkDB.

Joins

Let’s switch gears completely. If you’ve used any sort of SQL database before, you’re probably familiar with “joins”. They take information from two separate tables and combine them to produce a set of information which has components from both tables. We see them most often with relationships between things we are storing. For example, if each person can have many pets, that’s a one-to-many relationship where we might use a table join. That’s as good an example as any so we’ll take a look at how to model it in RethinkDB. Let’s first create the right table:

r.db('testdb').table_create('pets').run(conn)

Get the IDs of the documents in the “people” table:

itr = r.db('testdb').table('people').run(conn)
itr.each do |document|
    p document["id"]
end

Pick one of those IDs. We’ll create a pet with an owner from the people table:

person_id = <picked id goes here>
r.db("testdb").table("pets").
    insert({:name => "Bob", :animal => "dog",
    :person_id => person_id}).run(conn)

Each person can have one or more pets. So, each pet must have an owner and can store the id for that owner. We end up having a field called “person_id” inside each pet document. We can create the join as follows:

r.db('testdb').table("pets").
    eq_join("person_id", r.db("testdb").table("people")).run(conn)

Here, we are tacking on eq_join("person_id", r.db("testdb").table("people")). That tells RethinkDB that we’re trying to connect person_id in the “pets” table to the primary key in the “people” table. That should spit something out that looks like this:

[{"left"=>
   {"animal"=>"dog",
    "id"=>"bfc9d1c1-442a-40d0-a58c-93641fe45ffb",
    "name"=>"Bob",
    "person_id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded"},
  "right"=>
   {"age"=>94, "id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded", "name"=>"Jane"}}]>

The “left” side represents the side that contains the relating ID (in this case, person_id) and “right” contains the other side. We can flatten the representation pretty easily:

r.db('testdb').table("pets").
    eq_join("person_id", r.db("testdb").table("people")).
    zip.run(conn)

All we’ve done is tack on a call to zip. Now, we get an iterator representing:

[{"age"=>94,
  "animal"=>"dog",
  "id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded",
  "name"=>"Jane",
  "person_id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded"}]

Awesome! We’ve basically combined the information from two tables as we’d hoped to with the join.

Wrapping It Up

I hope you’ve enjoyed this overview of some of the features of RethinkDB (and a couple of comparisons with other systems). In the next post, we will look at how to build a web app using RethinkDB as a backend.

Please drop questions in the comments section below.

Comments
BabyCthulhu

I am a bit confused about the results of the zip operation.
In the example it seems as if the result would not include the id column of table pets, but the one from people.
Does it always overwrite existing keys or is it a typo?

Recommended
Sponsors
Because We Like You
Free Ebooks!

Grab SitePoint's top 10 web dev and design ebooks, completely free!

Get the latest in Ruby, once a week, for free.