License to SIGKILL
Every program wants to live forever. What happens when a program is forced to exit before it’s done running, and why would we want to do that?
Feel free to skip if you are familiar with signals.
In Unix, processes can communicate with each other using pre-defined signals. You can see a list of Unix signals here. This ability to communicate is extremely important in a process-oriented program. For example, the Puma webserver can add concurrency by spawning child “worker” processes. It accepts requests into a master process and then hands them off to the next available child. If the system that is running the Puma master process needs to shut down or restart, we don’t want all current requests to simply be stopped in their tracks. Instead, we want the child workers to finish processing their requests if they can, clean up any external connections or temporary files they may have generated, then exit. The system can safely do this by sending a signal to the parent “master” process, which in turn passes it on to the child processes.
You may have seen the movie Tron Legacy. The movie opens with a hacker breaking into a corporate network. The CEO sees it happening and deftly responds by typing a $ kill -9 command into the terminal. This kill command in Linux (and Mac OS X) sends signal number 9, which is SIGKILL, to a process.
SIGKILL means “end now without cleanup”. This is similar to ending a task with CTRL+ALT+DELETE on Windows (though Windows is not POSIX compliant and doesn’t support Unix signals in the same way).
When we need our long-running processes to exit gracefully, the signal SIGKILL is too strong. That signal forces processes to exit immediately and can leave your system in a bad state. What should you use instead? The signal SIGTERM (signal number 15) is the “termination signal”. This tells a program that it needs to stop what it’s doing and clean up before exiting.
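If you want to check these numbers yourself, Ruby exposes the platform's signal table through Signal.list. Here is a quick sketch; the numbers 9 and 15 are what Linux and macOS report, and other platforms may differ.

```ruby
# Signal.list returns a Hash that maps signal names (without the
# "SIG" prefix) to the numbers used on the current platform.
term_number = Signal.list["TERM"]
kill_number = Signal.list["KILL"]

puts "SIGTERM is signal #{term_number}"
puts "SIGKILL is signal #{kill_number}"

# Signal.signame goes the other way, from a number back to a name.
puts "Signal 15 is SIG#{Signal.signame(15)}"
```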
Live and Let Die
When Ruby receives a SIGTERM signal it raises a SignalException error. In Ruby, an exception can happen at any point while a process is running, so critical clean up code should always go in an ensure block:

    begin
      # do something
    ensure
      # clean up something
    end
There are notable caveats here. For example, an exception can be raised while an ensure block is already running, so we can’t always rely on it to finish executing. For an in-depth look into errors in Ruby, I recommend Avdi’s Exceptional Ruby. That being said, it’s still a best practice to use ensure blocks to safeguard your code.
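One detail that makes ensure a better fit than rescue here: SignalException inherits from Exception rather than StandardError, so a plain rescue does not catch it, while ensure still runs. The sketch below raises the exception by hand instead of sending a real signal, and simulate_sigterm is just an illustrative name.

```ruby
# Raise the same exception class Ruby uses for SIGTERM and record
# which handlers see it on the way out.
def simulate_sigterm
  events = []
  begin
    begin
      raise SignalException, "SIGTERM"
    rescue StandardError
      events << "rescued as StandardError"   # never reached
    ensure
      events << "ensure ran"
    end
  rescue SignalException => e
    # Caught here only so the demo can inspect the result.
    events << "caught signal #{e.signo}"
  end
  events
end

p simulate_sigterm
# => ["ensure ran", "caught signal 15"]
```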
Since we already have this failsafe behavior, Ruby uses it when a SignalException is raised. To verify this we can write a trivial script:

    # Background thread that loops forever inside a begin/ensure
    Thread.new do
      begin
        while true
          sleep 1
        end
      ensure
        puts "ensure called"
      end
    end

    # Send SIGTERM to ourselves
    current_pid = Process.pid
    signal      = "SIGTERM"
    Process.kill(signal, current_pid)
When you run this you’ll see:
    ensure called
    Terminated: 15
You’ll notice that, in addition to the “ensure called”, we also get the number of the signal that was used to exit the process (15, which corresponds to SIGTERM). Neat. This behavior is really convenient, since any program that has error handling is already equipped to gracefully exit. By putting sensitive operations in an ensure block, we’re making it more likely that the program will do the right thing. After all, the ensure blocks get called and then the program exits. Note that if you re-run the program with SIGKILL instead, it exits with a different number and we don’t get output from the ensure block.
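To compare the two signals side by side, here is a rough sketch that starts a separate Ruby process, signals it, and checks whether its ensure block got to print anything. The helper name ensure_ran_after? and the "ready" handshake are my own illustration and assume a Unix-like system.

```ruby
require "rbconfig"

# Start a child Ruby process that sleeps inside begin/ensure, send it
# the given signal, and report whether its ensure block produced output.
def ensure_ran_after?(signal)
  child_script = <<~RUBY
    $stdout.sync = true
    begin
      puts "ready"
      sleep
    ensure
      puts "ensure called"
    end
  RUBY

  IO.popen([RbConfig.ruby, "-e", child_script]) do |child|
    child.gets                       # wait until the child is inside the begin block
    Process.kill(signal, child.pid)
    child.read.include?("ensure called")
  end
end

puts "SIGTERM ran ensure: #{ensure_ran_after?("TERM")}"
puts "SIGKILL ran ensure: #{ensure_ran_after?("KILL")}"
```

On Linux or macOS I would expect this to report true for SIGTERM and false for SIGKILL.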
Tomorrow Never Dies
I’m sure you’ve had a frustrating app on your computer that was frozen and wouldn’t die no matter how many times you clicked the “close” button. Some stubborn programs will never exit, no matter how many times you send SIGTERM to them. This can happen when the program gets stuck trying to clean itself up. We can reproduce this easily:
    thread = Thread.new do
      begin
        while true
          sleep 1
        end
      ensure
        # The cleanup itself never finishes
        while true
          puts "ensure called"
          sleep 1
        end
      end
    end

    current_pid = Process.pid
    signal      = "SIGTERM"
    Process.kill(signal, current_pid)
The output will look like this:
    ensure called
    ensure called
    ensure called
    ensure called
    ensure called
    # ...
    ensure called
It will never end until the machine is restarted or SIGKILL is sent. Instead of this trivial example, it’s easy to imagine your Ruby program waiting on a database query or network call to finish. If it is hung and your program never gets a response, it will never exit. That’s why it’s always critical to time out sensitive code, though be careful with timeout.rb.
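One way to keep a stuck cleanup step from blocking exit forever, without reaching for timeout.rb, is to run it in its own thread and only wait a bounded time for it. Treat this as a sketch rather than a recommendation for every case: finished_within? is a made-up helper name, and the abandoned thread is simply left behind (Ruby kills remaining threads when the main thread exits).

```ruby
# Hypothetical helper: run a cleanup block in a background thread and
# wait at most `limit` seconds for it. Returns true if it finished.
def finished_within?(limit, &cleanup)
  worker = Thread.new(&cleanup)
  # Thread#join returns nil when the limit expires, the thread otherwise.
  !worker.join(limit).nil?
end

puts finished_within?(1) { :quick_cleanup }   # fast cleanup
puts finished_within?(0.2) { sleep 2 }        # cleanup that hangs too long
```

The first call should print true and the second false. Where the library you are calling offers its own timeout settings, those are usually the better tool.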
It’s important to note that all ensure blocks in scope will be called when a SignalException is raised. This means that, in addition to your own code, the ensure blocks in any of your dependencies will run too. If your system is hanging on exit and you can’t find an errant ensure block that you’ve committed, it may be coming from a library you’re using.
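A small illustration of that unwinding order, with library_call standing in for a method from some dependency (both method names are hypothetical). The exception is raised by hand and rescued at the end only so the demo can print its log.

```ruby
# Stand-in for a method inside a gem you depend on.
def library_call(log)
  yield
ensure
  log << "library ensure"
end

# Your own code, which wraps the library call in its own ensure.
def app_work(log)
  library_call(log) { raise SignalException, "SIGTERM" }
ensure
  log << "app ensure"
end

log = []
begin
  app_work(log)
rescue SignalException
  # Swallowed only so this demo keeps running; a real program would exit.
end

p log
# => ["library ensure", "app ensure"]
```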
Say (Dr.) No to Signal Trapping
Another way that you can prevent a program from exiting is to use Signal.trap. When you run this code, the signal will get captured and the program will not exit.

    Signal.trap('TERM') do
      puts "Die Another Day"
    end

    current_pid = Process.pid
    signal      = "SIGTERM"
    Process.kill(signal, current_pid)
When you execute the program, you get the output "Die Another Day" but it continues to execute. It is possible to trap and re-raise the same signal; however, this is a very large hammer. We can’t depend on a signal being sent to the program, nor can we rely on the code in the trap block getting run. Worse yet, when we do get a signal, the system needs us to clean up and exit as quickly as possible. The best practice is to use ensure blocks whenever possible and only resort to signal trapping when it’s really necessary.
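If you do end up needing a trap, one common way to keep it small is to have the handler do nothing but flip a flag, and let the main loop decide when to stop. The sketch below is only an illustration: run_until_term is a made-up helper, and the signal is sent from inside the loop purely to simulate an outside SIGTERM.

```ruby
# Hypothetical helper: work through jobs until a TERM arrives, then
# stop before starting the next one.
def run_until_term(jobs)
  stop = false
  previous = Signal.trap("TERM") { stop = true }   # handler only sets a flag
  completed = []
  jobs.each do |job|
    break if stop
    yield job
    completed << job
  end
  completed
ensure
  # Put back whatever handler was installed before us.
  Signal.trap("TERM", previous || "DEFAULT")
end

completed = run_until_term([:first, :second, :third]) do |job|
  # Simulate an external SIGTERM arriving during the second job.
  Process.kill("TERM", Process.pid) if job == :second
  sleep 0.1   # give the handler a moment to run
end

p completed
# => [:first, :second]
```

The job that was in flight gets to finish, and the third one is never started.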
From Russia with Love and Signals
So far, we’ve looked at how your Ruby code handles signals, but how would you know what signals to send? Before any restart or shutdown you would send a SIGTERM to the process to let it clean up, then monitor it to see if it shuts down in a reasonable time frame. If it doesn’t, send a SIGKILL to shut down the process, ending any infinite ensure blocks. You would make a note of when your process does not exit from a SIGTERM, as it could mean that, when you force kill the process, you’re interrupting some important work or cleanup process. The company I work for, Heroku, goes through these steps every time you deploy or restart your application. If, for some reason, your application won’t exit on time, the system emits an R12 – Exit Timeout and records the error on your dashboard view so you can investigate later.
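The TERM, wait, then KILL sequence can be sketched in a few lines of Ruby. To be clear, this is not how Heroku implements it; stop_process and start_child are hypothetical helpers, the grace periods are arbitrary, and the code assumes a Unix-like system.

```ruby
require "rbconfig"

# Hypothetical supervisor helper: ask politely, wait, then force.
def stop_process(pid, grace_period)
  Process.kill("TERM", pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace_period
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    # With WNOHANG, Process.wait returns nil while the child is still running.
    return :graceful if Process.wait(pid, Process::WNOHANG)
    sleep 0.05
  end
  Process.kill("KILL", pid)   # out of patience
  Process.wait(pid)           # reap the killed child
  :forced
end

# Start a child Ruby process and block until it prints its first line.
def start_child(script)
  reader, writer = IO.pipe
  pid = Process.spawn(RbConfig.ruby, "-e", script, out: writer)
  writer.close
  reader.gets
  reader.close
  pid
end

polite   = '$stdout.sync = true; puts "ready"; sleep'
stubborn = 'Signal.trap("TERM") {}; $stdout.sync = true; puts "ready"; loop { sleep 1 }'

puts stop_process(start_child(polite), 2)
puts stop_process(start_child(stubborn), 0.5)
```

The polite child should be reported as graceful and the one that swallows SIGTERM as forced, which is the case you would want to log and investigate.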
While it’s difficult to conceptualize, an exception might stop your entire program at any time, so it’s nice to know that adding ensure blocks to places that should already have them is all you need to do to be safe. Whether you’re working for Her Majesty’s Secret Service or in an IT department at a Casino Royale, you can take a Quantum of Solace knowing that your programs can exit gracefully.