“How is this going to scale?”
If you’ve worked in web application development, you’ve heard that question more than once. You have a great feature in hand that works fine for a couple of thousand customers, but how does it perform for millions?
Scaling problems are not simple, and there are no “one size fits all” solutions, but we can look at some of the ways that the web behemoths have addressed their scaling issues and glean a few ideas of how to solve them. We might even learn of a handy tool to add to our tool kit along the way.
Most tough scaling issues in web applications can have an element of processing done in the background. For example, if you run a dating site, finding new, compatible matches for each of your customers can be processed in the background. There’s no need to try and immediately find the right ten potential partners for the customer while they’re waiting for a page load; instead, you can serve it up to them the next time they hit the home page having already done all the required processing.
The advantage of being able to background process means that you can bring more machine power to bear on the problem. You can take a predicament, like finding a person’s soul mate, and distribute it among a cluster of machines that chew on the pieces and then return the results you need.
Doing massive background processing is exactly how the big guns, such as Facebook and Google, have cracked their scalability problems. Google’s results come from massive prebuilt indexes generated with background jobs; it certainly doesn’t search the entire Web at the point when you hit the search button.
One design pattern that both Google and Facebook share is the ability to distribute computations among large clusters of machines that all share a common data source. The pattern is called Map/Reduce, and Hadoop is an open source implementation of this. This article is an introduction to Hadoop. Even if you don’t currently have a massive scaling issue, it can be worthwhile to become familiar with Map/Reduce as a concept, and playing with Hadoop is a good way to do that.
Hadoop runs in a Unix-type shell. If you’re running Linux or Mac OS X, you’re all set, but if you’re on Windows, the easiest way to play along is to run Linux in a virtual machine. For instruction on setting this up using the free VirtualBox virtualization software, have a look at this SitePoint tutorial. It’s also possible to run Hadoop under Windows using the Cygwin shell emulator.
You’ll need to have SSH installed. On OS X, SSH is installed by default, while on Linux, installing it is as simple as grabbing it from your package manager. For example, on Ubuntu:
$ sudo apt-get install ssh
Next, you’ll need to be able to SSH into your localhost without a passphrase. On OS X, you’ll first have to enable remote access. Go to System Preferences > Sharing and check the box next to Remote Login. To test SSH, try running this command:
$ ssh localhost
If you’re prompted for a passphrase, you’ll need to run these commands to generate a key for passphraseless SSH:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Finally, you’ll need to have Java installed. Often this is already the case; otherwise head over to Java.com and grab it.
What exactly is Map/Reduce? The central idea behind Map/Reduce is distributed processing. You have a cluster of machines that host a shared file system where the data is stored, and allow for distributed job management.
Processes run across the cluster are called jobs. These jobs have two phases: a map phase where the data is collected, and a reduce phase where the data is further processed to create a final result set.
Let’s take the example of finding a soul mate on a dating site. All the profiles for millions of users are stored in the shared file system, which is spread across a cluster of machines.
We submit a job to the cluster with the profile pattern that we’re seeking. In the map phase, the cluster finds all the profiles that match our requirements. In the reduce phase these profiles are sorted to find the top matches, among which only the top ten are returned.
Hadoop is a Map/Reduce framework that’s broken into two large pieces. The first is the Hadoop File System (HDFS). This is a distributed file system written in Java that works much like a standard Unix file system. On top of that is the Hadoop job execution system. This system coordinates the jobs, prioritizes and monitors them, and provides a framework that you can use to develop your own types of jobs. It even has a handy web page where you can monitor the progress of your jobs.
From a high level the Hadoop cluster system looks like Figure 1, “The Hadoop cluster architecture”.
Clients connect to the cluster and submit jobs. Those jobs then go to MapReduce agents, which process the jobs on the data in the local HDFS portion of the file system. All of the HDFS nodes talk to the NameNode to register themselves within the cluster.
Now, if you’re looking at all this and saying, “That’s all well and good, but it’s Java and we use .NET,” there’s no need to worry. Hadoop is a platform, and you can have clients to this platform in whatever language you want. And because Hadoop is so good at being a distributed processing platform, it’s garnering support across all languages.
In addition to the Hadoop core, I’m going to introduce the Hive system. Hive is a distributed SQL database that sits on top of Hadoop and HDFS. There are three reasons why it’s worth knowing about Hive. First, because it makes it much easier to use Hadoop, to introduce it as a technology that you can use in your projects today. Second, because it uses SQL, a technology that most of use are familiar with. Third, because it’s in production use by Facebook, which means that it’s robust, stable, and well-maintained.
With Hadoop and Hive you’ll be able to host databases that hold billions of records and run queries on them (which are implemented as Hadoop jobs) in a reasonable amount of time. That time will be adjustable by changing the size and performance characteristics of the machines in the cluster.
Just a word now to put you at ease about the whole cluster issue. Hadoop allows you to maintain and deploy a cluster of machines, but it’s unnecessary to have a cluster of machines just to play with it. In fact, any old single machine will do. You can run HDFS, Hadoop, and Hive all locally, and it works just fine.
Let’s now dig into each of the technologies in more depth.
Installing and Configuring Hadoop
Go to the Hadoop Releases page and download the stable Hadoop release. The Hadoop release has both the job execution framework and the HDFS system built into it. Unzip the archive into a directory in your file system, and cd into it. For the purposes of this tutorial, we’ll be running Hadoop in what’s called “pseudo-distributed mode.” That is, Hadoop will simulate a distributed cluster by running each node in a separate Java process. To configure Hadoop for this mode, we’ll need to edit three configuration files, all of which are located in the
Just replace the configuration element in each file with the following information:
<configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property></configuration>
<configuration> <property> <name>dfs.replication</name> <value>1</value> </property></configuration>
<configuration> <property> <name>mapred.job.tracker</name> <value>localhost:9001</value> </property></configuration>
The other configuration setting that’s required is in the
conf/hadoop-env.sh file. Open that file and set JAVA_HOME to point to the root of your Java installation. For example, on Mac OS X:
With all this in place, you should be able to run the following command from the Hadoop directory:
$ bin/hadoop version
Hadoop 0.20.2Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010
The Hadoop File System (HDFS)
HDFS is a valuable resource in its own right. With HDFS you can take a set of commodity machines and build out a single federated file system. If more disk space is required, you can either add more machines or add more disks to each machine, or both.
You can access your HDFS through several methods. There’s an API that your programs can use to connect to the file system and retrieve, add, remove, or update files and directories. A command line interface can also be used, as well as a web interface to browse the directories and access the files. There’s even a Fuse implementation (MountableHDFS) where you can mount files right on your desktop.
There are two primary elements to the HDFS installation. One is the NameNode server. This is a single server process that’s located on one machine. Each of the HDFS machines connects to the NameNode server to coordinate its identity and manage its piece of the file system.
The other element is the HDFS node running on the each machine. It handles the storing of data on that particular machine, and connecting to the NameNode.
To get started, let’s first format the NameNode:
$ bin/hadoop namenode -format
Now we’ll start all the Hadoop daemons (this includes the HDFS name node and data node, as well as the Hadoop job and task trackers):
With everything up and running, you should now be able to execute file system commands using the hadoop executable. For example:
$ bin/hadoop fs -ls hdfs:/
Found 1 items drwxr-xr-x - jherr supergroup 0 2010-06-25 15:10 /tmp
-fs (meaning file system) parameter has many commands including the usual ls, mv, rm, tail, head, and so forth. There are also commands to move files into the HDFS file system from the local file system, moveToLocal, and vice versa, moveFromLocal.
We can add a file to the file system like this:
$ bin/hadoop fs -put test.xml hdfs:/$ bin/hadoop fs -ls hdfs:/Found 2 items-rw-r--r-- 1 jherr supergroup 17 2010-06-25 15:13 /test.xmldrwxr-xr-x - jherr supergroup 0 2010-06-25 15:10 /tmp
You can also view this file system in your browser. Go to
http://localhost:50070, which is your NameNode’s dashboard. Click on the Browse the filesystem link and you’ll see your file system, as shown in Figure 2, “The HDFS in the browser”.
Figure 2. The HDFS in the browser
As you can see, a distributed file system based in Java that runs across Unix, Windows, and Mac, as well as working on commodity hardware, is a valuable resource all on its own. Having said that, it’s worth noting that HDFS offers no panacea. It’s fairly inefficient time-wise for random access, and when storing lots of small files. HDFS is optimized for its central purpose, which is to provide a shared data store for Hadoop jobs. If you’re looking for an extensible file system for images, HTML files, or similar, you might look at NFS, or using a hosted system like Amazon’s S3.
Configuring and Running Hadoop
Now that the underlying HDFS is configured and running, it’s time to do the same for the JobTracker and MapReduce portions of Hadoop.
If the JobTracker is running, you should be able to navigate to port 50030 on the machine that’s running the tracker. The dashboard is shown in Figure 3, “The Hadoop JobTracker”.
Figure 3. The Hadoop JobTracker
With the JobTracker running and the HDFS set up underneath, you’re ready to install Hive and do some real work with distributed SQL.
Setting Up Hive
Hive sits on top of Hadoop, so once your cluster is set up, it’s a snap for Hive to be up and running. The process starts with downloading and installing Hive. You’ll need to check out Hive from the Subversion repository and compile it with ant, as described in that document.
Once it’s installed and the Hive environment variables are set up, you can run the Hive command line client like so:
Hive history file=/tmp/jherr/hive_job_log_jherr_201006251643_880032913.txt
hive> show tables;
OK Time taken: 5.508 seconds
The show tables command shows that there are no tables currently in the distributed database. From here you can follow the instructions on the Getting Started page to add a large movie database to the Hive installation.
This database provides a good starting point to experiment with queries, and to view how Hive creates Hadoop jobs on the fly to satisfy the query. Take, for example, a very simple query on the movie database, as shown below:
hive> SELECT movieid, AVG( rating ) FROM u_data GROUP BY movieid;
Total MapReduce jobs = 1 ... Starting Job = job_201006251641_0002, Tracking URL = http://localhost: 50030/jobdetails.jsp?jobid=job_201006251641_0002 Kill Command = /Users/jherr/hadoop/bin/../bin/hadoop job - Dmapred.job.tracker=localhost:8021 -kill job_201006251641_0002 2010-06-25 04:51:40,648 map = 0%, reduce =0% ... 2010-06-25 04:52:04,895 map = 100%, reduce =100% Ended Job = job_201006251641_0002 OK 1 3.8783185840707963 2 3.2061068702290076 ... 1682 3.0 Time taken: 29.449 seconds
I’ve removed some of the details but you can see the general flow. The Hive client creates a job to satisfy the query, which you can monitor from the JobTracker web application. The Hive client then monitors the job and produces the result set.
You can see an example of the job page when it’s completed in Figure 4, “The completed job”.
Figure 4. The completed job
There’s even a cool graphic section at the bottom of the page that shows the progress of the Hadoop cluster nodes as they map all the data, then reduce it down to the expected query results. This is shown in Figure 5, “The job status graph”.
Figure 5. The job status graph
In practice you can use the command line client to connect to Hive or one of the drivers, such as HiveODBC. In addition you can provide your own code to Hive to add custom query functions and filtering that’s then distributed into the cluster. The very handy Hive Getting Started guide shows an example of this in Python.
Where to Go Next
It’s important to note that there are several ways to make use of Hadoop; Hive is just one tool that utilizes Hadoop’s resources. There are lots of projects that use Hadoop in innovative ways, such as:
Pig—a powerful data flow language that’s based on the Hadoop framework
HBase—similar to Hive but with more of an object persistence bent
Hypertable—a high-performance distributed object storage system
ZooKeeper—an easy-to-use, distributed process management system based on Hadoop
Of course, you can also develop your own project to leverage the distributed power of Hadoop and HDFS.
Hadoop is a solution for your scaling problems today, as well as a framework for developing your own highly scalable solutions in the future. It’s a well-written and documented solution that’s open source and very popular. There are even a number of books out on Hadoop that can give you an in-depth look into this technology.
Ultimately, Hadoop is an emerging technology you should become familiar with, as it’s most likely to come in handy for your future projects.