RethinkDB in Ruby: Map Reduce and Joins

rethinkdb

Last time around we covered some of the basics of RethinkDB: installing it, querying it, inserting documents – all with the Ruby driver. In this article, we’ll take a deeper look at some of the features of RethinkDB. Before we get started, you should have a copy of RethinkDB installed and running.

Key Takeaways

RethinkDB includes an implementation of MapReduce that allows efficient computation on data spread across a group of nodes. The MapReduce paradigm consists of two steps: map and reduce. Map operates on a list and spits out another list, while reduce takes a list and boils it down to a value.
RethinkDB’s ‘grouping’ feature allows elements to be grouped in a special way. For instance, if you want to count the number of females and males in a data set, you can use ‘grouping’. The function creates groups based on a key and then counts the elements within each group.
Hadoop and RethinkDB use map-reduce differently. Hadoop is designed for unstructured “blob” data and focuses on file-based access, while RethinkDB is a document store that understands the structure of the documents. RethinkDB is typically used as the backend for constant data input and output, while Hadoop is used for batch processing of stored data.
RethinkDB supports ‘joins’, which combine information from two separate tables to produce a set of information with components from both tables. Joins are often used to model relationships between stored entities. For example, if each person can have many pets, that’s a one-to-many relationship where a table join might be used.

MapReduce

Say we have a bunch of data spread across a group of nodes. We’re trying to run some kind of computation on this data; how should we do it? One obvious solution (that’s also obviously bad) is trying to move all the data to one node where an algorithm is run in order to get the information we need. There are many reasons why this is a horrible idea:

We might not have enough storage on the single node
It would take a lot of bandwidth, etc.

So, how do we take advantage of all these nodes? Well, write the algorithm so that it can run in a parallelized fashion across the cluster. Unfortunately, without any “guidelines”, this can be a pretty difficult task. For example, what if one of the nodes fails while we’re running our algorithm? How do we divide up the dataset so that all the nodes pull their own weight?

Back in the day, when researchers/developers at Google came across these problems, they developed a framework called MapReduce. Essentially, MapReduce forces you to structure your algorithm in a certain way and, in return, it can handle system failures, etc. for you. With MapReduce, your code remains unchanged regardless of whether it’s running on one node or a thousand. It turns out that RethinkDB includes an implementation of MapReduce that allows you to apply computations to your data set in an efficient way.

So, how does MapReduce work? Say we’re trying to operate on many pieces of data and we put them all in a list. The MapReduce paradigm consists of two steps (RethinkDB introduces a separate step called grouping which we’ll discuss later): map and reduce. Map, just like the map Ruby method, takes a list, operates on it, and then spits out another list. Reduce, just like reduce in Ruby, takes a list and “boils” it down a value. If you write your algorithm using these “map” and “reduce” pieces, RethinkDB will figure out how to efficiently split up the computation (over tables and shards).

Counting

Let’s take a dead-simple example, pretty similar to the documentation, which (unfortunately) focuses on Python. Remember the “people” table we made last time around? Of course you don’t. So, let’s put that together again. Remember, for any irb commands noted here, the following has to be in place:


require 'rethinkdb'
include RethinkDB::Shortcuts
conn = r.connect(:host => "localhost", :port => 28015)

Alright, let’s make the table and put some stuff in it:


r.db_create("testdb").run(conn)
r.db("testdb").table_create("people").run(conn)
r.db("testdb").table("people").insert({:name => "Dhaivat", :age => 18,
    :gender => "M"}).run(conn)
r.db("testdb").table("people").insert({:name => "John",
    :age => 27, :gender => "M"}).run(conn)
r.db("testdb").table("people").insert({:name => "Jane",
    :age => 94, :gender => "F"}).run(conn)

Some of these calls might fail if you already have the right structure setup inside RethinkDB.

Alright, let’s say we want to count the number of documents in the people table. How can we do this with map-reduce (the RethinkDB implementation is referred to in lower case; Google’s implementation is “MapReduce”)? Our map operation will get, as input, each element in the list containing each document in our table. We’ll take that and transform into a list of 1’s and then the reduce operation will sum them up and produce one number. It’s pretty simple in code:


map_op = Proc.new { |person| 1 }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").map(map_op).reduce(reduce_op).run(conn)

That should nicely spit out “3” (if, of course, you have three documents in the table). The first two lines are just plain old Ruby in terms of syntax but there are a couple things to take note. As written, map_op takes only one argument: the map operation is meant to map each element in the sequence to one element in the resulting sequence. reduce_op takes two of the 1’s generated by map and adds them together. We then construct a query by passing the procs to map and reduce, respectively.

What if we wanted to figure out the sum of the ages instead? Simple:


map_op = Proc.new { |person| person["age"] }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").map(map_op).reduce(reduce_op).run(conn)

Basically, the only difference is map returns the age instead of 1. It turns out that there’s a significantly easier way to accomplish both of these tasks: the count and sum methods. Take a look:


r.db('testdb').table('people').count().run(conn)
r.db('testdb').table('people').sum("age").run(conn)

We’re supposed to use map-reduce queries when we can’t find something equivalent in the existing API. The map-reduce paradigm is incredibly powerful, but somewhat of an unnecessary pain if we can find something that already does the job – you’ll be surprised how far just count, sum, avg, and filter can get you.

Grouping

Let’s rewind a little bit. The map step gets each element in the list/sequence. What if we want to group the elements in a special way? For example, maybe we want to count the number of females and the number of males. That’s where we use grouping. The best way to learn it is to see it in action:


group_op = Proc.new { |person| person['gender'] }
r.db('testdb').table('people').group(group_op).count.run(conn)

The proc group_op returns a key for each element in the sequence and each element is grouped according to its key. In this case, we are creating groups based on gender so we use person['gender'] as the key. Then, we use the count function in order to count the elements within each grouping. The result should look like this:


{"F"=>1, "M"=>2}

In case you don’t want to use the count utility function, we can use the same map_op and reduce_op procs we defined earlier:


map_op = Proc.new { |person| 1 }
reduce_op = Proc.new { |val_a, val_b| val_a + val_b }
r.db("testdb").table("people").group(group_op).
    map(map_op).reduce(reduce_op).run(conn)

Instead of operating on each element within the sequence/list, map_op now operates on each element in each group. Then, reduce_op reduces each group to one value which we see in the output.

Hadoop

In case you’re a Hadoop user or have heard about Hadoop, you might be wondering: how is map-reduce on RethinkDB any different from that in Hadoop? Well, one of the core reasons is that Hadoop is meant for unstructured “blob” data. The Hadoop Distributed File System (HDFS) is setup for reads of 64MB blocks at a time and focuses on file-based access. On the other hand, RethinkDB is a document store and “understands” the structure of the documents. Generally, RethinkDB should be used as the backend you’re constantly putting stuff into and querying out of (e.g. directly connected to a web app) and Hadoop can be used for batch processing of the data saved in RethinkDB.

Joins

Let’s switch gears completely. If you’ve used any sort of SQL database before, you’re probably familiar with “joins”. They take information from two separate tables and combine them to produce a set of information which has components from both tables. We see them most often with relationships between things we are storing. For example, if each person can have many pets, that’s a one-to-many relationship where we might use a table join. That’s as good an example as any so we’ll take a look at how to model it in RethinkDB. Let’s first create the right table:


r.db('testdb').table_create('pets').run(conn)

Get the IDs of the documents in the “people” table:


itr = r.db('testdb').table('people').run(conn)
itr.each do |document|
    p document["id"]
end

Pick one of those IDs. We’ll create a pet with an owner from the people table:


person_id = <picked id goes here>
r.db("testdb").table("pets").
    insert({:name => "Bob", :animal => "dog",
    :person_id => person_id}).run(conn)

Each person can have one or more pets. So, each pet must have an owner and can store the id for that owner. We end up having a field called “person_id” inside each pet document. We can create the join as follows:


r.db('testdb').table("pets").
    eq_join("person_id", r.db("testdb").table("people")).run(conn)

Here, we are tacking on eq_join("person_id", r.db("testdb").table("people")). That tells RethinkDB that we’re trying to connect person_id in the “pets” table to the primary key in the “people” table. That should spit something out that looks like this:


[{"left"=>
   {"animal"=>"dog",
    "id"=>"bfc9d1c1-442a-40d0-a58c-93641fe45ffb",
    "name"=>"Bob",
    "person_id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded"},
  "right"=>
   {"age"=>94, "id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded", "name"=>"Jane"}}]>

The “left” side represents the side that contains the relating ID (in this case, person_id) and “right” contains the other side. We can flatten the representation pretty easily:


r.db('testdb').table("pets").
    eq_join("person_id", r.db("testdb").table("people")).
    zip.run(conn)

All we’ve done is tack on a call to zip. Now, we get an iterator representing:


[{"age"=>94,
  "animal"=>"dog",
  "id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded",
  "name"=>"Jane",
  "person_id"=>"6cf24b3d-aa75-4e8c-be88-868c74099ded"}]

Awesome! We’ve basically combined the information from two tables as we’d hoped to with the join.

Wrapping It Up

I hope you’ve enjoyed this overview of some of the features of RethinkDB (and a couple of comparisons with other systems). In the next post, we will look at how to build a web app using RethinkDB as a backend.

Please drop questions in the comments section below.

Frequently Asked Questions about RethinkDB, Ruby, Map, Reduce, and Joins

What is the difference between Map and Reduce in Ruby?

In Ruby, Map and Reduce are two different methods used for handling arrays or hashes. The Map method is used to transform the elements of an array or hash based on a given block of code. It returns a new array containing the results of the transformation. On the other hand, Reduce is used to combine all elements of an array or hash based on a binary operation defined in a block of code. It returns a single value that is the result of the combination.

How does RethinkDB work with Ruby?

RethinkDB is a NoSQL database that stores JSON documents. It can be integrated with Ruby using the ‘rethinkdb’ gem. This allows you to perform CRUD operations, run complex queries, and even use real-time push notifications. The ‘rethinkdb’ gem provides a Ruby-like syntax for interacting with the database, making it easy to use for Ruby developers.

How can I use Map and Reduce for joins in RethinkDB with Ruby?

In RethinkDB, you can use Map and Reduce functions to perform joins between tables. The Map function is used to transform each document in a table into a format that can be joined with another table. The Reduce function is then used to combine the results into a single output. This can be done using the ‘map’ and ‘reduce’ methods provided by the ‘rethinkdb’ gem in Ruby.

What is the difference between ‘each’, ‘map’, and ‘reduce’ in Ruby?

Each’, ‘map’, and ‘reduce’ are all methods used for iterating over collections in Ruby. ‘Each’ simply executes a block of code for each element in the collection, without changing the collection itself. ‘Map’ transforms each element in the collection based on a block of code and returns a new collection with the transformed elements. ‘Reduce’ combines all elements in the collection into a single value based on a binary operation defined in a block of code.

How can I use the ‘unless’ and ‘until’ keywords in Ruby?

Unless’ and ‘until’ are two keywords in Ruby that provide alternative ways to control the flow of your program. ‘Unless’ is the opposite of ‘if’. It executes a block of code only if a certain condition is false. ‘Until’ is the opposite of ‘while’. It repeatedly executes a block of code until a certain condition becomes true.

How can I use the Enumerable module in Ruby?

The Enumerable module in Ruby provides a collection of methods for working with enumerable objects, such as arrays and hashes. These methods include ‘map’, ‘reduce’, ‘select’, ‘reject’, and many others. To use these methods, you simply need to include the Enumerable module in your class and define an ‘each’ method for the class.

How can I solve problems using Map and Reduce in Ruby?

Map and Reduce are powerful methods in Ruby that can be used to solve a wide range of problems. Map is used to transform each element in a collection based on a block of code, while Reduce is used to combine all elements in a collection into a single value. By combining these two methods, you can perform complex transformations and calculations on your data.

What are some common use cases for Map and Reduce in Ruby?

Map and Reduce are commonly used in Ruby for data transformation and aggregation. For example, you can use Map to convert all strings in an array to integers, or to calculate the square of each number in an array. You can use Reduce to calculate the sum or product of all numbers in an array, or to find the maximum or minimum value in an array.

How can I optimize my use of Map and Reduce in Ruby?

To optimize your use of Map and Reduce in Ruby, you should avoid unnecessary transformations and calculations. For example, if you only need to calculate the sum of an array, you can use the ‘reduce’ method directly without using ‘map’. Also, you should use the ‘lazy’ method to create lazy enumerators, which can significantly improve performance for large collections.

How can I debug my Map and Reduce code in Ruby?

To debug your Map and Reduce code in Ruby, you can use the ‘p’ method to print the value of variables or expressions at different points in your code. You can also use the ‘debugger’ method to start a debugging session, which allows you to step through your code line by line and inspect the value of variables at each step.