Code Safari: Spelunking Selenium in Search of Sockets
At The Conversation, we recently got our hands on a beautiful new eight core machine to sit in the corner and act as our build server. It’s a top of the line machine: fast processors, SSD—no cruft. We called it “Tank”. By parallelizing our Selenium test suite we took our build time from twenty minutes down to four. It is a sight to behold, eight Firefox windows firing up at the same time and clicking through your app. This five fold improvement seemed too good to be true, and indeed very quickly the devil revealed himself:
Errno::EBADF:
Bad file descriptor
# .../lib/selenium/webdriver/firefox/socket_lock.rb:57:in `initialize'
# .../lib/selenium/webdriver/firefox/socket_lock.rb:57:in `new'
# .../lib/selenium/webdriver/firefox/socket_lock.rb:57:in `can_lock?'
# .../lib/selenium/webdriver/firefox/socket_lock.rb:43:in `lock'
# .../lib/selenium/webdriver/firefox/socket_lock.rb:29:in `locked'
# .../lib/selenium/webdriver/firefox/launcher.rb:32:in `launch'
# .../:21:in `initialize'
Randomly failing specs. The bane of any developer’s existence. We would get this error about 30% of the time. Not cool. The stack trace points to selenium web driver as the culprit; let’s dive in and see if we can figure out what is going on.
Diving In
This is a thorny problem to tackle since there is no obvious reproducible test case. Re-running the build won’t cut it—if it works we won’t know if it was our change or just one of the times it happens to work. So our first task will be to narrow down the offending code and extract a test case that will allow us to experiment. Let’s look at the code at the source of the stack trace.
We were running an older version of the gem that what is currently released—the problem hasn’t been fixed in the newer versions either, but for the sake of focus I began my investigation with the same old version. You can usually find a git tag or a handy commit message (“Bump to 0.1.2”) to locate the right version of code, but in this case it was easier to simply unpack the gem.
gem unpack selenium-webdriver -v 0.1.2
cd selenium-webdriver-0.1.2
From there we can open up the offending method:
# lib/selenium/webdriver/firefox/socket_lock.rb
def can_lock?
@server = TCPServer.new(Platform.localhost, @port)
ChildProcess.close_on_exec @server
true
rescue SocketError, Errno::EADDRINUSE => ex
$stderr.puts "#{self}: #{ex.message}" if $DEBUG
false
end
The Errno
class of exceptions bubble up from the operating system, in this case from opening a new server listening on a given port. Given this code will likely be executed at the same time by multiple processes (as we start up one for each of the eight cores), and concurrent load seems a likely candidate for causing Weird Errorstm, this seems a good spot to start. Let’s tighten up our feedback loop by shifting this code out into a standalone Ruby script that we can use to investigate concurrent behaviour in a concentrated fashion, rather than re-running our build every time we want to test a change. This is an important first step in any debugging endeavour: minimize the time between making a change and being able to see the result of that change.
require 'socket'
50.times do
fork do
1000.times do
sleep (rand / 1000.0)
begin
server = TCPServer.new('127.0.0.1', 5678)
server.close
rescue SocketError, Errno::EADDRINUSE => e
end
end
end
end
Fifty processes, all trying to use the same port at the same time. If that doesn’t expose the error, nothing will. Unfortunately, it ran fine on my local OSX machine. Dispirited, I remembered the important second step of debugging: always check assumptions in as close to the target environment as possible. My local machine runs OSX, but Tank is running Ubuntu Linux, which likely would have different behaviours for socket level operations. I ssh’ed into Tank and ran the above script. Sure enough, bad file descriptors were thrown all over the place.
But what does it mean? Is this error “expected” in the same way that Errno::EADDRINUSE
is, or will it cause failures down the line if we silently rescue it inside the can_lock?
method. To answer this question, we need some more context from the surrounding code. A few more methods from the SocketLock
class:
# lib/selenium/webdriver/firefox/socket_lock.rb
def locked(&blk)
lock
begin
yield
ensure
release
end
end
private
def lock
max_time = Time.now + @timeout
until can_lock? || Time.now >= max_time
sleep 0.1
end
unless did_lock?
raise Error::WebDriverError, "unable to bind to locking port #{@port} within #{@timeout} seconds"
end
end
The lock
method will keep retrying can_lock?
as long as it fails, so this suggests we should be able to simply retry a connection that fails with a bad file descriptor. locked
is just a wrapper to automatically release the lock after the block is executed. Moving up the stack trace to confirm why we are locking, we find this beautifully self-documenting method:
# lib/selenium/webdriver/firefox/launcher.rb
def launch
socket_lock.locked do
find_free_port
create_profile
start_silent_and_wait
start
connect_until_stable
end
self
end
The driver is using a port as a mutex to ensure that only one process is scanning for a free port at any one time. It creates a connection on the newly found port, then releases the locked port, which allows another process to find a free port. Since all eight processes are being started up on the “default” selenium port, they will all queue up behind each other (continually spinning on the can_lock?
method—recreating the situation replicated by our minimal test case above) and one at a time scan through potential ports.
I was curious as to the reasoning behind this logic, so thought I’d try my luck in IRC. I guessed that the #selenium
channel would exist, and indeed it had a promising 58 members in it. Asking who I could talk to about the Ruby web driver led me to a discussion with jarib
who pointed me towards the wiki, which documents the following justification:
The approach of using a socket as a mutex was chosen because it allows
the same algorithm to be used by different JVMs or even languages at the
same time without fear of starting multiple instances of Firefox all listening
on the same port (which looks terrible, I can assure you :) The mechanism
is simple enough so that most languages can be used to implement it without
requiring any special libraries.
That seems fair. It also contains a description of the start up process, which matches what we figured out above, though with a bit more detail.
This investigation suggest two conclusions. First, that silently rescuing Errno::EBADF
in the can_lock?
method is a safe and desired thing to do. Second, that explicitly specifying unique ports for each process would likely avoid the problem all together, as well as giving a faster (though likely negligible in practice) start up time.
Wrapping Up
What at first blush appeared to be a nasty hard to find error from the depths of our low level-libraries, turned out to be fairly simple to track down. This is one of the major strengths of Ruby: third-party code is transparent and easy to spelunk around in. It is trivial to investigate and play around with all but the lowest level of code (in the C libraries).
To sharpen your skills, try substituting out the SocketLock
for a simpler file lock (does a unique file exist or not?) and seeing how that would affect the selenium-webdriver
code base.
(This issue hasn’t been fixed in selenium as of this writing—if it affects you can follow the status on this ticket)
Tune in next week for more exciting adventures in the code jungle!