Scaling Big Time with Hadoop

Key Takeaways

Hadoop is an open-source framework that utilizes Map/Reduce to solve massive scaling issues by distributing computations across clusters of machines.
Hadoop’s architecture includes the Hadoop Distributed File System (HDFS), which supports the storage of large data sets across a network of machines to facilitate rapid data transfer rates between compute nodes.
Hive, a distributed SQL database that sits on top of Hadoop, simplifies data querying and can handle queries on databases with billions of records by translating SQL queries into MapReduce jobs.
Hadoop can run on a single machine in a pseudo-distributed mode for development and testing, which simulates a distributed environment.
Hadoop is highly extensible, supporting various programming languages and capable of integrating with different software, making it a robust solution for handling and analyzing big data in various organizational contexts.

“How is this going to scale?”

If you’ve worked in web application development, you’ve heard that question more than once. You have a great feature in hand that works fine for a couple of thousand customers, but how does it perform for millions?

Scaling problems are not simple, and there are no “one size fits all” solutions, but we can look at some of the ways that the web behemoths have addressed their scaling issues and glean a few ideas of how to solve them. We might even learn of a handy tool to add to our tool kit along the way.

Most tough scaling issues in web applications can have an element of processing done in the background. For example, if you run a dating site, finding new, compatible matches for each of your customers can be processed in the background. There’s no need to try and immediately find the right ten potential partners for the customer while they’re waiting for a page load; instead, you can serve it up to them the next time they hit the home page having already done all the required processing.

The advantage of being able to background process means that you can bring more machine power to bear on the problem. You can take a predicament, like finding a person’s soul mate, and distribute it among a cluster of machines that chew on the pieces and then return the results you need.

Doing massive background processing is exactly how the big guns, such as Facebook and Google, have cracked their scalability problems. Google’s results come from massive prebuilt indexes generated with background jobs; it certainly doesn’t search the entire Web at the point when you hit the search button.

One design pattern that both Google and Facebook share is the ability to distribute computations among large clusters of machines that all share a common data source. The pattern is called Map/Reduce, and Hadoop is an open source implementation of this. This article is an introduction to Hadoop. Even if you don’t currently have a massive scaling issue, it can be worthwhile to become familiar with Map/Reduce as a concept, and playing with Hadoop is a good way to do that.

Requirements

Hadoop runs in a Unix-type shell. If you’re running Linux or Mac OS X, you’re all set, but if you’re on Windows, the easiest way to play along is to run Linux in a virtual machine. For instruction on setting this up using the free VirtualBox virtualization software, have a look at this SitePoint tutorial. It’s also possible to run Hadoop under Windows using the Cygwin shell emulator.

You’ll need to have SSH installed. On OS X, SSH is installed by default, while on Linux, installing it is as simple as grabbing it from your package manager. For example, on Ubuntu:

$ sudo apt-get install ssh

Next, you’ll need to be able to SSH into your localhost without a passphrase. On OS X, you’ll first have to enable remote access. Go to System Preferences > Sharing and check the box next to Remote Login. To test SSH, try running this command:

$ ssh localhost

If you’re prompted for a passphrase, you’ll need to run these commands to generate a key for passphraseless SSH:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Finally, you’ll need to have Java installed. Often this is already the case; otherwise head over to Java.com and grab it.

Introducing Map/Reduce

What exactly is Map/Reduce? The central idea behind Map/Reduce is distributed processing. You have a cluster of machines that host a shared file system where the data is stored, and allow for distributed job management.

Processes run across the cluster are called jobs. These jobs have two phases: a map phase where the data is collected, and a reduce phase where the data is further processed to create a final result set.

Let’s take the example of finding a soul mate on a dating site. All the profiles for millions of users are stored in the shared file system, which is spread across a cluster of machines.

We submit a job to the cluster with the profile pattern that we’re seeking. In the map phase, the cluster finds all the profiles that match our requirements. In the reduce phase these profiles are sorted to find the top matches, among which only the top ten are returned.

Hadoop is a Map/Reduce framework that’s broken into two large pieces. The first is the Hadoop File System (HDFS). This is a distributed file system written in Java that works much like a standard Unix file system. On top of that is the Hadoop job execution system. This system coordinates the jobs, prioritizes and monitors them, and provides a framework that you can use to develop your own types of jobs. It even has a handy web page where you can monitor the progress of your jobs.

From a high level the Hadoop cluster system looks like Figure 1, “The Hadoop cluster architecture”.

Figure 1. The Hadoop cluster architecture

Clients connect to the cluster and submit jobs. Those jobs then go to MapReduce agents, which process the jobs on the data in the local HDFS portion of the file system. All of the HDFS nodes talk to the NameNode to register themselves within the cluster.

Now, if you’re looking at all this and saying, “That’s all well and good, but it’s Java and we use .NET,” there’s no need to worry. Hadoop is a platform, and you can have clients to this platform in whatever language you want. And because Hadoop is so good at being a distributed processing platform, it’s garnering support across all languages.

In addition to the Hadoop core, I’m going to introduce the Hive system. Hive is a distributed SQL database that sits on top of Hadoop and HDFS. There are three reasons why it’s worth knowing about Hive. First, because it makes it much easier to use Hadoop, to introduce it as a technology that you can use in your projects today. Second, because it uses SQL, a technology that most of use are familiar with. Third, because it’s in production use by Facebook, which means that it’s robust, stable, and well-maintained.

With Hadoop and Hive you’ll be able to host databases that hold billions of records and run queries on them (which are implemented as Hadoop jobs) in a reasonable amount of time. That time will be adjustable by changing the size and performance characteristics of the machines in the cluster.

Just a word now to put you at ease about the whole cluster issue. Hadoop allows you to maintain and deploy a cluster of machines, but it’s unnecessary to have a cluster of machines just to play with it. In fact, any old single machine will do. You can run HDFS, Hadoop, and Hive all locally, and it works just fine.

Let’s now dig into each of the technologies in more depth.

Installing and Configuring Hadoop

Go to the Hadoop Releases page and download the stable Hadoop release. The Hadoop release has both the job execution framework and the HDFS system built into it. Unzip the archive into a directory in your file system, and cd into it. For the purposes of this tutorial, we’ll be running Hadoop in what’s called “pseudo-distributed mode.” That is, Hadoop will simulate a distributed cluster by running each node in a separate Java process. To configure Hadoop for this mode, we’ll need to edit three configuration files, all of which are located in the conf directory.

Just replace the configuration element in each file with the following information:

Example 1. conf/core-site.xml

<configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property></configuration>

Example 2. conf/hdfs-site.xml

<configuration> <property> <name>dfs.replication</name> <value>1</value> </property></configuration>

Example 3. mapred-site.xml

<configuration> <property> <name>mapred.job.tracker</name> <value>localhost:9001</value> </property></configuration>

The other configuration setting that’s required is in the conf/hadoop-env.sh file. Open that file and set JAVA_HOME to point to the root of your Java installation. For example, on Mac OS X:

Example 4. conf/hadoop-env.sh (excerpt)

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home

With all this in place, you should be able to run the following command from the Hadoop directory:

$ bin/hadoop versionHadoop 0.20.2Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010

The Hadoop File System (HDFS)

HDFS is a valuable resource in its own right. With HDFS you can take a set of commodity machines and build out a single federated file system. If more disk space is required, you can either add more machines or add more disks to each machine, or both.

You can access your HDFS through several methods. There’s an API that your programs can use to connect to the file system and retrieve, add, remove, or update files and directories. A command line interface can also be used, as well as a web interface to browse the directories and access the files. There’s even a Fuse implementation (MountableHDFS) where you can mount files right on your desktop.

There are two primary elements to the HDFS installation. One is the NameNode server. This is a single server process that’s located on one machine. Each of the HDFS machines connects to the NameNode server to coordinate its identity and manage its piece of the file system.

The other element is the HDFS node running on the each machine. It handles the storing of data on that particular machine, and connecting to the NameNode.

To get started, let’s first format the NameNode:

$ bin/hadoop namenode -format

Now we’ll start all the Hadoop daemons (this includes the HDFS name node and data node, as well as the Hadoop job and task trackers):

$ bin/start-all.sh

With everything up and running, you should now be able to execute file system commands using the hadoop executable. For example:

$ bin/hadoop fs -ls hdfs:/Found 1 items drwxr-xr-x - jherr supergroup 0 2010-06-25 15:10 /tmp

The -fs (meaning file system) parameter has many commands including the usual ls, mv, rm, tail, head, and so forth. There are also commands to move files into the HDFS file system from the local file system, moveToLocal, and vice versa, moveFromLocal.

We can add a file to the file system like this:

$ bin/hadoop fs -put test.xml hdfs:/$ bin/hadoop fs -ls hdfs:/Found 2 items-rw-r--r-- 1 jherr supergroup 17 2010-06-25 15:13 /test.xmldrwxr-xr-x - jherr supergroup 0 2010-06-25 15:10 /tmp

You can also view this file system in your browser. Go to http://localhost:50070, which is your NameNode’s dashboard. Click on the Browse the filesystem link and you’ll see your file system, as shown in Figure 2, “The HDFS in the browser”.

Figure 2. The HDFS in the browser

The HDFS in the browser

As you can see, a distributed file system based in Java that runs across Unix, Windows, and Mac, as well as working on commodity hardware, is a valuable resource all on its own. Having said that, it’s worth noting that HDFS offers no panacea. It’s fairly inefficient time-wise for random access, and when storing lots of small files. HDFS is optimized for its central purpose, which is to provide a shared data store for Hadoop jobs. If you’re looking for an extensible file system for images, HTML files, or similar, you might look at NFS, or using a hosted system like Amazon’s S3.

Configuring and Running Hadoop

Now that the underlying HDFS is configured and running, it’s time to do the same for the JobTracker and MapReduce portions of Hadoop.

If the JobTracker is running, you should be able to navigate to port 50030 on the machine that’s running the tracker. The dashboard is shown in Figure 3, “The Hadoop JobTracker”.

Figure 3. The Hadoop JobTracker

The Hadoop JobTracker

With the JobTracker running and the HDFS set up underneath, you’re ready to install Hive and do some real work with distributed SQL.

Setting Up Hive

Hive sits on top of Hadoop, so once your cluster is set up, it’s a snap for Hive to be up and running. The process starts with downloading and installing Hive. You’ll need to check out Hive from the Subversion repository and compile it with ant, as described in that document.

Once it’s installed and the Hive environment variables are set up, you can run the Hive command line client like so:

$ ./bin/hive Hive history file=/tmp/jherr/hive_job_log_jherr_201006251643_880032913.txt hive> show tables; OK Time taken: 5.508 seconds hive>

The show tables command shows that there are no tables currently in the distributed database. From here you can follow the instructions on the Getting Started page to add a large movie database to the Hive installation.

This database provides a good starting point to experiment with queries, and to view how Hive creates Hadoop jobs on the fly to satisfy the query. Take, for example, a very simple query on the movie database, as shown below:

hive> SELECT movieid, AVG( rating ) FROM u_data GROUP BY movieid; Total MapReduce jobs = 1 ... Starting Job = job_201006251641_0002, Tracking URL = http://localhost: 50030/jobdetails.jsp?jobid=job_201006251641_0002 Kill Command = /Users/jherr/hadoop/bin/../bin/hadoop job - Dmapred.job.tracker=localhost:8021 -kill job_201006251641_0002 2010-06-25 04:51:40,648 map = 0%, reduce =0% ... 2010-06-25 04:52:04,895 map = 100%, reduce =100% Ended Job = job_201006251641_0002 OK 1 3.8783185840707963 2 3.2061068702290076 ... 1682 3.0 Time taken: 29.449 seconds hive>

I’ve removed some of the details but you can see the general flow. The Hive client creates a job to satisfy the query, which you can monitor from the JobTracker web application. The Hive client then monitors the job and produces the result set.

You can see an example of the job page when it’s completed in Figure 4, “The completed job”.

Figure 4. The completed job

The completed job

There’s even a cool graphic section at the bottom of the page that shows the progress of the Hadoop cluster nodes as they map all the data, then reduce it down to the expected query results. This is shown in Figure 5, “The job status graph”.

Figure 5. The job status graph

The job status graph

In practice you can use the command line client to connect to Hive or one of the drivers, such as HiveODBC. In addition you can provide your own code to Hive to add custom query functions and filtering that’s then distributed into the cluster. The very handy Hive Getting Started guide shows an example of this in Python.

Where to Go Next

It’s important to note that there are several ways to make use of Hadoop; Hive is just one tool that utilizes Hadoop’s resources. There are lots of projects that use Hadoop in innovative ways, such as:

Pig—a powerful data flow language that’s based on the Hadoop framework
HBase—similar to Hive but with more of an object persistence bent
Hypertable—a high-performance distributed object storage system
ZooKeeper—an easy-to-use, distributed process management system based on Hadoop

Of course, you can also develop your own project to leverage the distributed power of Hadoop and HDFS.

Hadoop is a solution for your scaling problems today, as well as a framework for developing your own highly scalable solutions in the future. It’s a well-written and documented solution that’s open source and very popular. There are even a number of books out on Hadoop that can give you an in-depth look into this technology.

Ultimately, Hadoop is an emerging technology you should become familiar with, as it’s most likely to come in handy for your future projects.

Frequently Asked Questions about Scaling with Hadoop

What is the concept of scaling out in Hadoop?

Scaling out in Hadoop refers to the process of adding more nodes to the system to handle increased data. This is a horizontal scaling approach where the workload is distributed across multiple servers, thus improving the system’s performance and capacity. Hadoop’s distributed file system allows it to scale out across many low-cost machines, making it a cost-effective solution for handling big data.

How does HDFS help Namenode in scaling?

Hadoop Distributed File System (HDFS) plays a crucial role in scaling. It stores data across multiple nodes, thus reducing the load on the Namenode. The Namenode only stores metadata, while the actual data is stored in Datanodes. This architecture allows Hadoop to scale horizontally by adding more Datanodes as the data grows.

How does Hadoop compare in terms of scalability and fault tolerance?

Hadoop is highly scalable as it can store and distribute large data sets across hundreds of inexpensive servers operating in parallel. Moreover, it provides fault tolerance. When data is sent to a node, Hadoop creates replicas of the data and stores them in other nodes. This means that in the event of a failure, there is another copy available, making Hadoop a reliable system.

What is the difference between scale-out and scale-up?

Scale-up refers to adding more resources such as CPU or memory to a single node, while scale-out involves adding more nodes to a system. Hadoop uses a scale-out architecture, which allows it to handle more data by simply adding more nodes. This makes it a cost-effective solution for processing large volumes of data.

How does Hadoop handle scaling in the cloud?

Hadoop can effectively handle scaling in the cloud. Cloud platforms provide the flexibility to add or remove resources as per the requirement, which aligns well with Hadoop’s scale-out architecture. Hadoop clusters in the cloud can be resized based on the workload, providing an efficient way to handle big data.

What is the role of YARN in Hadoop’s scalability?

Yet Another Resource Negotiator (YARN) is a key component of Hadoop that manages resources and schedules tasks. YARN allows multiple data processing engines to handle data stored in a single platform, which enhances the scalability of Hadoop.

How does data locality in Hadoop contribute to its scalability?

Data locality in Hadoop refers to the ability to move the computation to the data rather than the other way around. This reduces the network congestion and increases the overall throughput of the system, thereby enhancing its scalability.

How does Hadoop’s distributed file system contribute to its scalability?

Hadoop’s Distributed File System (HDFS) stores data across multiple nodes in a cluster. This allows Hadoop to distribute the data and computation across many servers, thereby enhancing its scalability.

How does Hadoop ensure data integrity while scaling?

Hadoop ensures data integrity through its replication mechanism. When data is stored, Hadoop creates multiple copies and stores them in different nodes. This ensures that even if a node fails, the data is not lost.

How does Hadoop’s MapReduce programming model aid in scaling?

MapReduce is a programming model used by Hadoop for processing large data sets. It divides the task into small parts, which can be processed in parallel. This parallel processing capability aids in scaling as it allows Hadoop to process large volumes of data efficiently.