License to SIGKILL

Schneeman, Richard Schneeman.
Every program wants to live forever. What happens when a program is forced to exit before it’s done running, and why would we want to do that?
Unix Signals
Feel free to skip if you are familiar with signals.
In Unix, processes can communicate with each other using predefined signals. You can see a list of Unix signals here. This ability to communicate is extremely important in a process-oriented program. For example, the Puma webserver can add concurrency by spawning child “worker” processes. It accepts requests into a master process and then hands them off to the next available child. If the system that is running the Puma master process needs to shut down or restart, we don’t want all current requests to be stopped in their tracks. Instead, we want the child workers to finish processing their requests if they can, clean up any external connections or temporary files they may have generated, and then exit. The system can do this safely by sending a signal to the parent “master” process, which in turn passes it on to the child processes.
You may have seen the movie Tron Legacy. The movie opens with a hacker breaking into a corporate network. The CEO sees it happening and deftly responds by typing a $ kill -9 command into the terminal. On Linux (and Mac OS X), kill -9 sends signal number 9, which is SIGKILL, to a process. SIGKILL means “end now without cleanup”. This is similar to using CTRL+ALT+DELETE on Windows (though Windows is not POSIX compliant and doesn’t offer the full set of Unix signals).
When we need our long-running processes to exit gracefully, the signal SIGKILL is too strong. That signal forces processes to exit immediately and can leave your system in a bad state. What should you use instead? The signal SIGTERM (signal number 15) is the “termination signal”. This tells a program that it needs to stop what it’s doing and clean up before exiting.
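The signal numbers mentioned above can be checked from Ruby itself. This is a small sketch using Signal.list and Signal.signame from Ruby's core library; the numbers are the usual values on Linux and macOS, and the variable names are only for illustration.

```ruby
# Look up the numbers behind the two signals discussed above.
term_number = Signal.list["TERM"]
kill_number = Signal.list["KILL"]

puts "SIGTERM is signal #{term_number}"
puts "SIGKILL is signal #{kill_number}"

# Signal.signame maps a number back to its name (without the SIG prefix).
term_name = Signal.signame(15)
puts "Signal 15 is #{term_name}"
```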
Live and Let Die
When Ruby receives a SIGTERM signal it raises a SignalException error. In Ruby, an exception can happen at any point while a process is running, so critical cleanup code should always use an ensure block.
begin
  # do something
ensure
  # clean up something
end
There are notable caveats here. For example, an exception can be raised while an ensure block is already running, so we can’t always rely on it to run to completion. For an in-depth look into errors in Ruby, I recommend Avdi’s Exceptional Ruby. That being said, it’s still a best practice to use ensure blocks to safeguard your code.
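To see the SignalException for yourself, you can send SIGTERM to the current process and rescue the exception. This is only a sketch for inspection: rescuing SignalException like this keeps the process alive, which is usually not what you want in real code. The caught variable is just an illustrative name.

```ruby
caught = nil

begin
  # Send SIGTERM to this very process, then give Ruby a moment to deliver it.
  Process.kill("TERM", Process.pid)
  sleep 1
rescue SignalException => e
  # Rescuing here keeps the process alive so we can inspect the exception.
  caught = e
end

puts caught.class  # the exception class Ruby raised
puts caught.signm  # the signal name
puts caught.signo  # the signal number
```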
Since we already have this failsafe behavior, Ruby uses it when a SignalException is raised. To verify this we can write a trivial script:
Thread.new do
  begin
    while true
      sleep 1
    end
  ensure
    puts "ensure called"
  end
end

current_pid = Process.pid
signal = "SIGTERM"
Process.kill(signal, current_pid)
When you run this you’ll see:
ensure called
Terminated: 15
You’ll notice that, in addition to the “ensure called”, we also get the number of the signal that was used to exit the process (15, which corresponds to SIGTERM). Neat. This behavior is really convenient, since any program that has error handling is already equipped to exit gracefully. By putting sensitive cleanup in an ensure block, we’re making it more likely that the program will do the right thing. After all, the ensure blocks get called and then the program exits. Note that if you re-run the program with SIGKILL instead, it exits with a different number and we don’t get output from the ensure block.
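The difference between the two signals can also be observed from a parent process. The sketch below starts a child Ruby process, signals it, and reports what the child printed and which signal ended it. The helper name signal_child and the CHILD_SCRIPT constant are made up for this illustration; the "ready" line is only there so the parent does not signal the child before it has entered its begin block.

```ruby
require "rbconfig"

# Script run by the child: loop until signalled, clean up in ensure.
CHILD_SCRIPT = <<~RUBY
  $stdout.sync = true
  begin
    puts "ready"
    sleep
  ensure
    puts "ensure called"
  end
RUBY

# Hypothetical helper: start the child, send it `signal`, and return
# everything it printed after "ready" plus the signal that terminated it.
def signal_child(signal)
  output = IO.popen([RbConfig.ruby, "-e", CHILD_SCRIPT]) do |io|
    io.gets                        # wait until the child is inside begin
    Process.kill(signal, io.pid)
    io.read                        # read until the child exits
  end
  [output, $?.termsig]
end

term_output, term_signal = signal_child("TERM")
kill_output, kill_signal = signal_child("KILL")

puts "TERM: output=#{term_output.inspect} signal=#{term_signal}"
puts "KILL: output=#{kill_output.inspect} signal=#{kill_signal}"
```

With SIGTERM the child gets a chance to run its ensure block before it dies; with SIGKILL it does not.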
Tomorrow Never Dies
I’m sure you’ve had a frustrating app on your computer that was frozen and wouldn’t die no matter how many times you clicked the “close” button. Some stubborn programs will never exit, no matter how many times you send SIGTERM to them. This can happen when the program gets stuck trying to clean itself up. We can reproduce this easily:
thread = Thread.new do
  begin
    while true
      sleep 1
    end
  ensure
    while true
      puts "ensure called"
      sleep 1
    end
  end
end

current_pid = Process.pid
signal = "SIGTERM"
Process.kill(signal, current_pid)
The output will look like this:
ensure called
ensure called
ensure called
ensure called
ensure called
# ...
ensure called
It will never end until the machine is restarted or SIGKILL is sent. Instead of this trivial example, it’s easy to imagine your Ruby program waiting on a database query or network call to finish. If that call hangs and your program never gets a response, it will never exit. That’s why it’s critical to put a timeout on sensitive code, though be careful with timeout.rb.
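One way to bound a network call without reaching for timeout.rb is to use the timeouts that a client library already provides. The sketch below uses Net::HTTP's open_timeout and read_timeout against a local TCPServer that accepts connections but never answers. That server is only a stand-in for a hung service, and the half-second limit is an arbitrary value chosen for the example.

```ruby
require "net/http"
require "socket"

# Stand-in for a hung service: it listens but never sends a response.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

# The third argument (nil) disables any proxy picked up from the environment.
http = Net::HTTP.new("127.0.0.1", port, nil)
http.open_timeout = 1    # seconds allowed for connecting
http.read_timeout = 0.5  # seconds allowed for waiting on a response
http.max_retries = 0     # do not silently retry the request

result = begin
  http.get("/")
  :responded
rescue Net::ReadTimeout
  # The call gave up instead of blocking forever.
  :timed_out
ensure
  server.close
end

puts result
```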
It’s important to note that all ensure blocks in scope will be called when a SignalException is raised. This means that, in addition to your own code, the ensure blocks in any of your dependencies will be called too. If your system is hanging on exit and you can’t find an errant ensure block in your own code, it may be coming from a library you’re using.
Say (Dr.) No to Signal Trapping
Another way that you can prevent a program from exiting is to use Signal.trap. When you run this code, the signal will get captured and the program will not exit.
Signal.trap('TERM') do
  puts "Die Another Day"
end

current_pid = Process.pid
signal = "SIGTERM"
Process.kill(signal, current_pid)
When you execute the program, you get the output "Die Another Day" but it continues to execute. It is possible to trap a signal and then re-raise it, but this is a very large hammer. We can’t depend on a signal being sent to the program, nor can we rely on the code in the trap block getting run. Worse yet, when we do get a signal, the system needs us to clean up and exit as quickly as possible. The best practice is to use ensure blocks whenever possible and only resort to signal trapping when it’s really necessary.
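If you do end up needing a trap, one common approach is to keep the handler tiny: record that a shutdown was requested and let the normal flow of the program notice it and finish up. The sketch below is one way that could look, not a recommendation to trap by default. The shutdown_requested flag and the two-second deadline are assumptions made for the example.

```ruby
shutdown_requested = false

# Keep the handler minimal: just record that a shutdown was requested.
previous_handler = Signal.trap("TERM") { shutdown_requested = true }

Process.kill("TERM", Process.pid)

# The main flow polls the flag; the deadline keeps this sketch from hanging.
deadline = Time.now + 2
sleep 0.01 until shutdown_requested || Time.now > deadline

# Put back whatever handler was installed before, so a later TERM behaves normally.
Signal.trap("TERM", previous_handler)

puts "shutdown requested: #{shutdown_requested}"
```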
From Russia with Love and Signals
So far, we’ve looked at how your Ruby code handles signals, but how would you know what signals to send? Before any restart or shutdown you would send a SIGTERM to the process to let it clean up, then monitor it to see if it shuts down in a reasonable time frame. If it doesn’t, send a SIGKILL to shut down the process, ending any infinite ensure blocks. You would make a note of when your process does not exit from a SIGTERM, as it could mean that, when you force kill the process, you’re interrupting some important work or cleanup. The company I work for, Heroku, goes through these steps every time you deploy or restart your application. If, for some reason, your application won’t exit on time, the system emits an R12 – Exit Timeout and records the error on your dashboard view so you can investigate later.
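The TERM, wait, then KILL sequence described above can be sketched in a few lines. This is an illustration of the general pattern only, not how Heroku or any particular process manager implements it. The helper name stop_gracefully, its keyword arguments, and the deliberately stubborn child script are all made up for the example.

```ruby
require "rbconfig"

# Hypothetical helper: ask a child to stop, escalate to SIGKILL if it
# has not exited within grace_seconds. Returns [outcome, Process::Status].
def stop_gracefully(pid, grace_seconds: 2.0, poll_interval: 0.1)
  Process.kill("TERM", pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace_seconds

  loop do
    # WNOHANG makes wait2 return nil instead of blocking while the child runs.
    _, status = Process.wait2(pid, Process::WNOHANG)
    return [:exited_on_term, status] if status
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep poll_interval
  end

  Process.kill("KILL", pid)
  _, status = Process.wait2(pid)
  [:killed, status]   # worth logging: cleanup was cut short
end

# A child whose cleanup never finishes, like the infinite ensure loop earlier.
STUBBORN_SCRIPT = <<~RUBY
  $stdout.sync = true
  begin
    puts "ready"
    sleep
  ensure
    sleep   # cleanup that never finishes
  end
RUBY

reader, writer = IO.pipe
pid = Process.spawn(RbConfig.ruby, "-e", STUBBORN_SCRIPT, out: writer)
writer.close
reader.gets   # wait until the child is inside its begin block

outcome, status = stop_gracefully(pid, grace_seconds: 0.5)
reader.close

puts "outcome=#{outcome} signal=#{status.termsig}"
```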
While it’s difficult to conceptualize, an exception might stop your entire program at any time, so it’s nice to know that adding ensure blocks to the places that should already have them is all you need to do to be safe. Whether you’re working for Her Majesty’s Secret Service or in an IT department at a Casino Royale, you can take a Quantum of Solace knowing that your programs can exit gracefully.