Giant Killing with Beanstalkd
If you have ever dabbled in Service-Oriented Architecture (SOA), or even read a few interesting articles about it, you have probably come across the term “Message Queue”.
The really terse explanation of a Message Queue, or MQ, is that it allows services within your architecture to adopt a “fire and forget” approach to interacting with other services. By placing a queue in the system, non-time-sensitive operations can be carried out at the leisure of the services that care about them, regardless of technology or programming language.
As an example, let’s take a “send to a friend” feature within a Job Board application. Once the user has completed the form and clicked “Send”, do we really want the nitty-gritty of sending an email to a friend to live in our Job Board application?
Background Jobs
A common approach to this problem is to use background workers like Resque or Sidekiq. For the problem at hand, these are fine, and arguably even more suitable. The only problems I have with that approach are:
- The logic of sending email lives in our application, which does not necessarily care about email.
- I will probably duplicate the process of communicating with my SMTP server across a few applications within the architecture.
- Background workers know a little too much about their origin, i.e. which models they came from and what they can access (my whole app stack).
If your architecture is growing, it may be worth considering moving some background workers to an MQ. For me, MQs just work. You drop in some data, and a daemon or application that cares about that message picks it up some time later and acts on it. Meanwhile, the originator has carried on focusing on its core business.
As the architecture grows and you add more services, some of those services may need to send email as well. At that point, you have an established, trusted method of sending email: simply drop some data in the email queue and it will get sent.
Beanstalkd
Hopefully by now you are getting the gist of why MQs are awesome. There are a few open source MQs available; the most notable are RabbitMQ (there is a nice article on RubySource with details) and my personal favourite, which we will be using today: Beanstalkd.
Getting started with Beanstalkd really couldn’t be simpler. On OS X, you want to use Homebrew (brew install beanstalkd), or for a Debian flavour of Linux you can use sudo apt-get install beanstalkd. It seems pretty well supported by most package managers across platforms. You can see the details on the Beanstalkd download docs.
Once installed, you can open the terminal and execute beanstalkd. This will start up a Beanstalkd instance on localhost using its default port, 11300, in the foreground. It is not always ideal to run it in the foreground, so my typical command looks something like:
beanstalkd -b ~/beanstore &
This simply persists the queue data in a binlog under the directory ~/beanstore instead of keeping it only in memory, and runs the process in the background (the ampersand). For development, these settings are fine. When it comes to production, I would suggest you have a read of the docs pertaining to the admin tool that ships with Beanstalkd.
Beanstalkd Lingo
Beanstalkd has some nice vocabulary for describing the main players and operations. Let’s walk through them.
Tubes
A tube is a namespace for your messages. A Beanstalkd instance can have multiple tubes. On a vanilla boot, Beanstalkd will have a single tube named default.
The idea is that you have a certain process listening for messages coming in on a specific tube. As mentioned, tubes simply act as namespaces for the consumers of the queue.
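To make that concrete, here is a minimal sketch using the Beaneater gem (which we will get to properly in a moment); the emails and invoices tube names are purely illustrative:

require 'beaneater'

beanstalk = Beaneater::Pool.new(['localhost:11300'])

# Producers address a tube by name; Beanstalkd creates it on first use.
beanstalk.tubes['emails'].put('{"to":"friend@example.com"}')
beanstalk.tubes['invoices'].put('{"invoice_id":42}')

# A consumer watches only the tube(s) it cares about.
beanstalk.tubes.watch!('emails')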
Jobs
The Jobs are what we place in a tube. It’s common for me to place JSON in a tube and marshal it back at the other end.
Beanstalkd doesn’t really care about the content of the job, so things like YAML, plain text or Thrift would be just fine.
In a normal, happy path operation, jobs have 2 states:
- Ready – Waiting to be processed.
- Reserved – Being processed.
If all goes well, the job is deleted. If there is a problem with the job, say our SMTP server is down, the job is put into the “Buried” state. It will remain “Buried” until the tube is “kicked”, which simply places the job back into the “Ready” state. So, with the SMTP server back up, we kick the tube and the world keeps spinning.
One other state we haven’t covered is “Delayed”. This simply means the job does not enter the state of “Ready” until some pre-determined interval has elapsed. I personally have not used this state much, so won’t cover it any more than mentioning that it exists.
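For reference, here is a rough sketch of how those states map onto the Beaneater gem (covered properly below); the tube name, payload and numbers are just placeholders:

require 'beaneater'

beanstalk = Beaneater::Pool.new(['localhost:11300'])
tube = beanstalk.tubes['my-tube']

# A delayed job: it sits in the "Delayed" state for 60 seconds before becoming "Ready".
tube.put('{"some":"data"}', delay: 60)

# Once the SMTP server is back up, kick up to 10 buried jobs back to "Ready".
tube.kick(10)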
OM NOM NOM
Now that we have Beanstalkd running on our development boxes, we want to get some jobs into the queue. To achieve that, my usual weapon of choice is the Beaneater gem. Getting a job into a tube is as simple as:
require 'beaneater'
require 'json'
beanstalk = Beaneater::Pool.new(['localhost:11300'])
tube = beanstalk.tubes['my-tube']
job = {some: 'key', value: 'object'}.to_json
tube.put job
And that is it. Now we get to the interesting bit: consuming the tube and all the jobs that live there.
I am a big fan of a daemon process handling that. If the tubes start getting too full, we can spin up more daemons to help clear the backlog of jobs. Of course, we can also kill them off as required.
So far I have used the Dante gem for wrapping scripts into daemons. It seemed a bit lighter than Daemon Kit, and I like to keep my daemons from getting bloated. The benefit of using Dante over something like ruby script/my_mailer_script.rb is, for me, nothing more than Dante giving you Process ID (PID) file generation out of the box. With that, I can keep the daemons in check with monit.
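As a rough sketch (and only a sketch; check Dante’s own docs for the exact options), wrapping a consumer might look something like the following, where my_mailer is a made-up process name:

require 'dante'
require 'beaneater'

# Dante takes care of daemonizing and writing a PID file via its
# command line flags, which is what lets monit keep an eye on the process.
Dante.run('my_mailer') do |opts|
  beanstalk = Beaneater::Pool.new(['localhost:11300'])
  beanstalk.tubes.watch!('emails')
  # ... consume jobs here, as shown below
end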
Beaneater provides a really nice API for consuming jobs in two ways. The first is manually stepping through the process of reserving a job, working on it, then deleting it if it completes correctly or burying it if an exception is raised. It looks something like this:
beanstalk.tubes.watch!('my-tube')

loop do
  job = beanstalk.tubes.reserve
  begin
    # ... process the job
    job.delete
  rescue Exception => e
    job.bury
  end
end
A couple of things here are worth mentioning. Yes, I’m using an infinite loop, and the reserve method on the tube will actually sit and wait for a job to be “Ready”, reserve it, and continue.
Beaneater provides a better interface for long-running tasks, and the above can simply be condensed into:
beanstalk.jobs.register('my-tube') do |job|
  # ... process the job
end

beanstalk.jobs.process!
This method wraps the behaviour of the previous example (albeit in a much better way): reserving, processing, then deleting or burying based on the outcome.
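While your consumers are chewing through jobs, it can be handy to see how a tube is doing. As a small sketch (the tube name is again just a placeholder), Beaneater exposes the per-tube stats that Beanstalkd keeps:

require 'beaneater'

beanstalk = Beaneater::Pool.new(['localhost:11300'])
stats = beanstalk.tubes['my-tube'].stats

# A few of the counters Beanstalkd tracks for each tube.
puts stats.current_jobs_ready    # jobs waiting to be reserved
puts stats.current_jobs_reserved # jobs currently being worked on
puts stats.current_jobs_buried   # jobs that will need a kick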
No Magic Beans
The beauty of Beanstalkd is its absolute simplicity. There is really not much more I would be willing to dive into as an introduction. In terms of getting things running quickly, it is no more complicated than any of the background worker solutions discussed earlier.
It does make sense to be pragmatic in your adoption of MQs, to be honest. Resque, Sidekiq etc. all have their place and work very well, but Beanstalkd addresses a few more problems, namely, interfacing between services which may or may not be written in Ruby (.NET clients for Beanstalkd are available).
In fact, the entire thing is completely language agnostic. The neckbeard way of communicating with Beanstalkd is via its own protocol over TCP. The Beaneater gem, as you have probably gathered, abstracts all that protocol stuff into a well-packaged API for us. It is safe to say I’ll be leaning on the Beaneater gem when using Beanstalkd for some time to come.
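If you are curious, here is a minimal sketch of speaking that protocol by hand from Ruby; it simply asks a local server for its stats, which come back as a YAML blob:

require 'socket'

socket = TCPSocket.new('localhost', 11300)
socket.write("stats\r\n")                  # commands are plain text, terminated by CRLF
header = socket.gets                       # e.g. "OK 940\r\n" - the byte length of the body
puts socket.read(header.split.last.to_i)   # the stats themselves, as YAML
socket.close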
If I had one piece of advice on designing and composing tube consumers, it would be to stick to the Single Responsibility Principle (SRP) as much as possible. There will come a time when you have to kick a buried job. If that job writes to a database AND sends an email, what happens when the sending of the email blows up? Replaying said message will result in a duplicate database entry. The more you split the processing of a job into the smallest reasonable responsibilities, the less you have to worry about performing duplicate actions.
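One way to apply that (purely an illustrative sketch; the tube names and record handling are made up) is to have the consumer that writes to the database enqueue a separate email job once its own work is done, so replaying the email job never touches the database again:

require 'beaneater'
require 'json'

beanstalk = Beaneater::Pool.new(['localhost:11300'])

beanstalk.jobs.register('save-application') do |job|
  data = JSON.parse(job.body)
  # ... write the record to the database here ...
  # Only then hand the email off to its own tube.
  beanstalk.tubes['send-email'].put({email: data['friend_email']}.to_json)
end

beanstalk.jobs.register('send-email') do |job|
  # ... talk to the SMTP server here; if this blows up and the job gets
  # buried, replaying it will not create a duplicate database entry.
end

beanstalk.jobs.process!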
I really urge you to look to Beanstalkd as your application architecture grows. In my personal experience, I have found it simple to get running, straightforward to manage and maintain, and the Ruby client, Beaneater, is one of the better interfaces I have used.