Code Safari: Threading Ruby


During the week I had a problem: given a list of possible URLs, which ones are real? Seemed like a perfect job for a computer to solve. I reached into my Ruby toolbox, and found a couple of neat tricks along the way. In this week’s code safari we’ll be doing something slightly different—rather than investigating a popular gem, we will be discovering the Ruby standard library itself.

There are a few distinct parts to the URL problem. The first and most obvious is being able to tell whether a URL is active. I didn't need to be 100% thorough, so I went with the simplest option: try to fetch something from the URL and see what happens. If no exception is raised, it is probably OK. Ruby has a few methods in the standard library for fetching content from URLs. Net::HTTP is the most common and flexible, but for simple cases I prefer open-uri, which sits on top of Net::HTTP and provides a really simple interface, especially for this problem:

require 'open-uri'

def link_valid?(url)
  open(url)
  true
rescue
  false
end

Note here that you can use a rescue clause directly on a method body, rather than having to open up a begin block. This is a neat trick that can tidy up your code by an indentation level. I have also used a bare rescue (not specifying which exceptions to catch), which is usually bad form; you should be explicit about which errors you expect. In this case, however, the possible exceptions are numerous and poorly documented (such as SocketError, ArgumentError, and errors from the Errno family), and given we are only covering a single Ruby call, a blanket rescue will be sufficient for now.
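If you did want to be explicit, a variant might look like the following sketch. The exception list here is my assumption about the common failure modes, not an exhaustive survey, and would need tuning for your Ruby version and platform:

```ruby
require 'open-uri'
require 'socket' # defines SocketError

# A sketch of an explicit-rescue variant; the exception list is a guess at
# the common failure modes, not an exhaustive set. (Note that from Ruby 3.0,
# Kernel#open no longer fetches URLs; you would call URI.open instead.)
def link_valid?(url)
  open(url)
  true
rescue OpenURI::HTTPError, SocketError, ArgumentError, SystemCallError
  # SystemCallError is the parent class of the Errno family of errors
  false
end
```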

Now that we have a method for checking link validity, let’s throw some data against it.

urls = %w(
  http://www.sitepoint.com
  http://xaviershay.com
  http://twitter.com
  not_a_link
  http://bogus.bogus
)
 
invalid_links = urls.reject {|x| link_valid?(x) }
 
puts invalid_links

As expected, this code prints out the last two fake URLs.

Progress Please

That would be the end of the story, except it took a fair while to check each link, and there were no progress updates along the way. For five URLs this might be fine, but my list was thousands long. How did I know the program hadn't just hung? It would be easy to just throw a puts inside the reject block, but that's not very flexible. Let's move the code into a method that gives us some room to play around with it.

def validate_urls(urls)
  urls.each do |url|
    yield url, link_valid?(url)
  end
end
 
validate_urls(urls) do |url, valid|
  puts "#{url} was #{valid ? "valid" : "invalid"}"
end

We now get updates as each URL is checked.

More Speed

This is nice, but we haven't solved the underlying problem; we have just made visible how slow this code is. The main issue is that a URL may take a while to open for any number of reasons, be it DNS resolution, a slow server, or just a slow internet connection. Since we are processing the URLs sequentially, while we are waiting for one URL we can't do any other work, and our code just sits there idle. Since each URL check is independent of the others, this is a perfect case for parallelization: checking more than one link at the same time. There are a few techniques for doing this kind of thing (I covered a Fiber-based approach a few weeks back), but the simplest in this case is to use Thread.

If you are really interested in Ruby concurrency, or are going to use it heavily, it is worth reading up on how your version of Ruby implements its threading model, since there are important caveats and differences between implementations (MRI's global interpreter lock, for example, prevents threads from executing Ruby code in parallel, though threads still help when waiting on I/O, as we are here). For this casual usage, though, it is sufficient to know simply that they allow us to do more than one thing at the same time.

Our first iteration looks like this:

def validate_urls(urls)
  number_per_thread = (urls.length / 3.0).ceil # round up so we get at most 3 threads
 
  urls.each_slice(number_per_thread).map {|urls_subset|
    Thread.new do
      urls_subset.each do |url|
        yield url, link_valid?(url)
      end
    end
  }.each {|thread| thread.join }
end

There are a few things going on here. First I'll quickly explain the threading code. Thread.new takes a block that will be executed in the new thread. This block will now be executed in parallel with the rest of your code. If we just left it at that we would have a problem, since the validate_urls method would return and our program would terminate, without ever giving the code inside the thread block a chance to execute! That is where the join method comes in handy: it suspends execution of the main program until the thread has completed running its code. Calling join on every thread we create (after we have created all of them) prevents validate_urls from returning until every thread has completed.
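The lifecycle can be seen in miniature in the following sketch (the values are toy examples of mine, not part of the link checker):

```ruby
# Each Thread.new block runs concurrently with the rest of the program.
threads = 3.times.map do |i|
  Thread.new { i * 10 }
end

# join waits for a thread to finish; value joins and then also returns the
# last expression evaluated in the thread's block.
results = threads.map {|thread| thread.value }
# results is [0, 10, 20], in thread-creation order
```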

The other important method is each_slice. The Enumerable module provides a rich suite of methods that are available on Array and Hash, and it is well worth reading over the list regularly, as you will usually discover a method that makes your life easier. For any kind of algorithmic challenge there is usually a pre-built solution you can use. In this case each_slice yields the collection in chunks of n elements, effectively dividing up the array for us (note there is also a chunk method on Enumerable, which is sometimes useful). We just need a small amount of math to figure out how big each slice should be, which is dependent on the number of URLs and the number of desired threads (here three, but play around with this number to find the best value for your data and machine).
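A quick sketch of each_slice in isolation, with toy values of mine:

```ruby
# each_slice(n) yields the collection in chunks of at most n elements;
# the final chunk is smaller when the length doesn't divide evenly.
chunks = [1, 2, 3, 4, 5, 6, 7].each_slice(3).to_a
# chunks is now [[1, 2, 3], [4, 5, 6], [7]]
```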

This code is an improvement, but it still has some downsides. It does not make full use of the available threads: it is quite possible for all the "slow" URLs to end up in the first thread's chunk, while another thread quickly speeds through a list of clearly invalid ones, only to sit idle while the first thread does all the hard work.

Pooling the Threads

We need a better way of scheduling work for the threads. Rather than divide up all the work at the start, before we know how long each task is going to take, we can have threads continually take URLs from a common pool as they become free, until the pool is empty.

A naive implementation looks like this:

def validate_urls(urls)
  urls = urls.dup
   
  3.times.map {
    Thread.new do
      while !urls.empty?
        url = urls.pop
        yield url, link_valid?(url)
      end
    end
  }.each {|thread| thread.join }
end

First we create a copy of the passed array, since it is not polite to change the arguments passed to your method. Then we spin up three threads that each continually take URLs from the list until it is empty. But all is not roses in the threaded kingdom! We are entering the quagmire of concurrency, where foul stenches belch from the mud and things are not as they first may seem. This code looks fine conceptually, and indeed most of the time it will run fine too, but wherever threads share access to data we must ensure that access is what is called thread-safe. Allowing two threads to touch the same memory willy-nilly can have disastrous and unpredictable consequences far beyond anything you or I could predict.

You should never assume that classes are thread-safe unless they are explicitly documented as such. (Here is a great but nerdy discussion on the thread-safety of Ruby core classes.) Better safe than sorry, my Mum always used to say. We have two options to fix our code: either we use a thread-safe data structure, or we synchronize access to our array (meaning we only allow one thread to access it at a time). The former option should be preferred, since it is easier and less prone to error. Searching for “Ruby thread-safe array” led me to the Queue class in the standard library, which provides exactly the behaviour we are after in a thread-safe manner—the example in the documentation is practically the same as our use case.
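Queue's behaviour is easy to see in isolation (a minimal sketch with example URLs of mine):

```ruby
# Queue is built into modern Ruby; older versions need: require 'thread'
queue = Queue.new
queue << "http://example.com"
queue << "http://example.org"

first = queue.pop # removes and returns entries first-in first-out, thread-safely
# first is "http://example.com", and one entry remains on the queue
```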

require 'thread'

def validate_urls(urls)
  queue     = Queue.new
  urls.each {|url| queue << url }
  3.times.map {
    Thread.new do
      # pop(true) raises ThreadError instead of blocking, so a thread that loses
      # the race for the last URL bails out cleanly rather than hanging forever
      while !queue.empty?
        url = queue.pop(true) rescue break
        yield url, link_valid?(url)
      end
    end
  }.each {|thread| thread.join }
end

Bam, thread-safe. As one last nicety, let's make sure only one yield happens at any one time, so that the user of this method does not need to bother themselves with any thoughts of concurrency. As far as they are concerned, they just have a very fast validate_urls method. For this we will use the second of our two options above: synchronization. It relies on another built-in Ruby class, Mutex, which allows us to ensure that only one thread is executing a given block of code at any one time. I'll include it here in a full listing of the entire program:

require 'open-uri'
require 'thread'
def link_valid?(url)
  open(url)
  true
rescue
  false
end
def validate_urls(urls)
  queue     = Queue.new
  semaphore = Mutex.new
  urls.each {|url| queue << url }
  3.times.map {
    Thread.new do
      # pop(true) raises ThreadError instead of blocking, so a thread that loses
      # the race for the last URL bails out cleanly rather than hanging forever
      while !queue.empty?
        url = queue.pop(true) rescue break
        valid = link_valid?(url)
        semaphore.synchronize {
          yield url, valid
        }
      end
    end
  }.each {|thread| thread.join }
end
urls = %w(
  http://www.sitepoint.com
  http://xaviershay.com
  http://twitter.com
  not_a_link
  http://bogus.bogus
)
validate_urls(urls) do |url, valid|
  puts "#{url} was #{valid ? "valid" : "invalid"}"
end

The synchronize method will block if another thread is currently inside the protected block, guaranteeing that only one thread runs it at a time. Everything combines to give us a fast and robust link checker.
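To see what synchronize buys us, here is the classic shared-counter sketch (a toy example of mine, not part of the link checker):

```ruby
counter = 0
mutex   = Mutex.new

threads = 10.times.map do
  Thread.new do
    1_000.times do
      # Without the mutex, two threads could read the same value of counter
      # and each write back value + 1, losing an increment.
      mutex.synchronize { counter += 1 }
    end
  end
end
threads.each {|thread| thread.join }
# counter is now exactly 10_000
```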

Wrapping Up

We took a simple bit of code and made it quite complicated. How far you tune your own code will depend on your specific requirements, but knowing what the ends of the spectrum look like allows you to make the best choice for your circumstances.

Here are some questions for further study:

  • What would this code look like using an evented concurrency model, such as EventMachine?
  • How would you extend this technique to write a spider program, one that discovers links on a page and follows them?

Let us know how you go in the comments, and tune in next week for more exciting adventures in the code jungle.

Xavier Shay

Xavier Shay is a DataMapper committer who has contributed to Ruby on Rails. He wrote the Enki blog platform, as well as other widely used libraries, including TufteGraph for JavaScript graphing and Kronic for date formatting and parsing. In 2010, he traveled the world running Ruby on Rails workshops titled "Your Database Is Your Friend." He regularly organizes and speaks at the Melbourne Ruby on Rails Meetup, and also blogs on personal development at TwoShay, and on more esoteric coding topics at Robot Has No Heart.
