This article was sponsored by PagerDuty. Thank you for supporting the sponsors who make SitePoint possible.
The word DevOps is a portmanteau of development and operations. It’s a relatively new term used in agile system administration.
In the past, developers would build products, services and infrastructure, and then the responsibility for maintaining them would shift to sysadmins.
Instead, DevOps emphasises communication and collaboration between developers and IT system operators and integrates people better into the software workflow — from development to production — keeping a closer connection to agile values and principles.
When you’re providing SaaS, customers expect 24/7 uptime. When your marketing team is selling your product as an indispensable piece of software, your customers won’t know what to do when it goes down. Adopting a DevOps approach means everyone is invested in writing good code and keeping uptime strong.
Keeping your sanity while being on-call
No matter what the team structure, keeping systems up at all times inevitably means having your DevOps team on-call. Problems don’t come up according to a schedule, so you’ll need a team prepared to deal with them at all hours.
Being on-call can be daunting to a new team member. Solving problems and fixing issues is easy during the day when you have caffeine in the bloodstream, music on the stereo, and team members you can call on.
But it’s a little different when you’re awoken at 2am by an SMS and need to quickly take action to fix an issue that’s potentially costing your company a lot of money every second.
Below are some tips I’ve picked up over the years of being on-call for SitePoint and other organizations.
Who’s on point?
Generally we make sure everyone who is on-call is aware of those on the same roster. Once an alert goes out, the person who responds first is tasked with fixing the issue directly, while others can help by offering advice and information to make the fix easier.
This approach avoids a situation where multiple developers, all trying to fix the same issue, unintentionally make the problem worse, or create new problems. This can be an implicit agreement, as we have, or someone can take the lead at the time. Either way, having a person “on point” keeps things clear and prevents further issues.
Other implicit cues can help your dev team to be aware of a situation. I’ll only ever log in to our chat client out of office hours to address an issue, so if someone sees me online, they’ll know I’m busy fixing a problem and can offer help.
Knowing other people are going to be woken up by a second or third alert if I don’t fix the problem quickly also acts as a strong motivator.
It’s also important to cultivate a culture of collaboration for DevOps. If someone is taking control of a fix, but is out of their depth and feeling under pressure, they need to know they can ask others for help without being grilled for being slow with a fix.
Always be prepared
I take my laptop with me almost everywhere I go when I’m on-call. The exception is when I know I’ll be popping out for a short trip and access to a machine set up for work will only be a few minutes away. I’m quite used to it now, and it doesn’t hinder me.
Once you’re in the habit of ensuring a laptop is always close by, you don’t even think of it. The most frustrating aspect is if I want to go to the pool, where I may not hear alerts on my phone, or go for a run. If it’s the latter, I can still do so, but I’ll have to run laps close to home, rather than a long run that takes me far away from a computer.
This may sound obvious, but make sure your machine can cope with the tasks you’ll be asking of it. Does it have the credentials and certificates that you need? Does it have Bluetooth connectivity to enable tethering? If your on-call machine isn’t one you normally use, make sure it’s up-to-date and test it regularly.
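One way to make "test it regularly" concrete is a small pre-flight script that reports anything missing from the on-call machine. The sketch below is only an illustration: the file paths are hypothetical placeholders, and a real check would list whatever keys, certificates and VPN profiles your own setup depends on.

```python
from pathlib import Path

# Hypothetical examples only; replace with the files your own setup needs.
REQUIRED_PATHS = [
    "~/.ssh/id_ed25519",         # SSH key for production hosts (assumed name)
    "~/.config/vpn/work.ovpn",   # VPN profile (assumed name)
]


def missing_items(required_paths):
    """Return the entries from required_paths that do not exist on this machine."""
    return [str(p) for p in required_paths if not Path(p).expanduser().exists()]


if __name__ == "__main__":
    missing = missing_items(REQUIRED_PATHS)
    if missing:
        print("Not ready for on-call. Missing:")
        for item in missing:
            print("  " + item)
    else:
        print("All required files are present.")
```

Running something like this at the start of each on-call shift is cheaper than discovering a missing certificate at 2am.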
Tailor your environment
Make sure your house is on-call-friendly before an issue arises. You’ll need a consistent physical environment to make sure you can navigate it in the dark while stressed and confused after sleep.
Some nights I’ve stayed up late and gone to bed, only to be woken by an SMS 20 minutes later. That’s the worst possible time.
When I’m woken up in a haze, barely able to open my eyes or stand, at least I’ll know exactly where my phone is, where my computer is, and I can make it there without really having to think about it. From there I can set about fixing whatever problem has arisen.
Another tip: Choose a dark background for your desktop wallpaper and keep the brightness turned down — or use an app like Redshift — to avoid being blinded as you log in to fix an issue.
Prevention is better than cure
Early on in my employment at SitePoint, when I first took over the sysadmin duties, I was woken by an alarm in the very early hours of the morning almost every single day. I was a complete wreck by the end of it. That’s a pretty big motivator to fix things.
After working on the underlying issues, now I get an alert maybe once a month outside of business hours.
These days our infrastructure is such that issues are more likely to present themselves while everyone is in the office, when developers push new code.
Improving your infrastructure — and putting it to the test outside of a real problem — means you can be confident in your systems and know what to do when something goes wrong.
Know the infrastructure
Knowing your organization’s dependencies and how they relate can help you quickly understand the root cause of an issue. I can often solve problems more quickly because I know how the system is laid out, so I know what to check first. Of course, the hardest problems are the ones you don’t expect.
Related to this: Make sure your documentation is up to date and covers all the bases. Bad documentation can hinder rather than help.
The right tools
Of course, keeping yourself sane while on-call is easier when you have tools that get out of your way and help you focus on the things that matter.
PagerDuty is an operations performance platform aimed at giving you a single view of your infrastructure, meaning events and incidents can be handled by a team spread across the world, with everyone aware of issues as they come up.
This level of situational awareness extends further, with the service offering detailed analytics measuring team and system performance for incident response, as part of their enterprise plan. With tools like these you can help improve your team’s mean time to acknowledge an incident, as well as its mean time to resolve.
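To make the two metrics concrete, here is a minimal sketch of how mean time to acknowledge and mean time to resolve could be computed from incident timestamps. The `Incident` class and the sample data are made up for illustration; this is not PagerDuty's analytics code, just the arithmetic behind the terms.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical record of one incident's key timestamps."""
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mean_time_to_acknowledge(incidents):
    """Average delay between an alert firing and someone acknowledging it."""
    seconds = mean((i.acknowledged - i.triggered).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents):
    """Average delay between an alert firing and the incident being resolved."""
    seconds = mean((i.resolved - i.triggered).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up sample data: one 2am incident and one mid-afternoon incident.
sample = [
    Incident(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 4), datetime(2024, 1, 1, 2, 30)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 2), datetime(2024, 1, 2, 14, 20)),
]

print(mean_time_to_acknowledge(sample))  # 0:03:00
print(mean_time_to_resolve(sample))      # 0:25:00
```

Tracking both numbers separately is useful because they point at different problems: a slow acknowledgement suggests alerting or roster issues, while a slow resolution suggests gaps in tooling, documentation or system knowledge.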
During an incident, alerts come through email, phone call, push notification, or through integrations with other services. The service has continuous routing and automatic escalation to make sure every alert is given the attention it warrants.
PagerDuty has more than 100 integrations with services like AppDynamics, Crashlytics, New Relic and Sensu. But if a service you use isn’t on the list, the PagerDuty API can work with any system that can make an HTTP API call or send an email.
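As a rough idea of what "any system that can make an HTTP API call" looks like in practice, the sketch below builds and posts a trigger event using only the Python standard library. The endpoint URL and field names are assumptions based on PagerDuty's Events API v2 and should be checked against the current API documentation; the routing key shown in the usage comment is a placeholder, not a real key.

```python
import json
import urllib.request

# Assumed endpoint (Events API v2); verify against the current documentation.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the JSON body for a 'trigger' event (field names are assumed)."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event, url=EVENTS_URL, timeout=10):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (not executed here, since it needs a real integration key):
# event = build_trigger_event("YOUR_ROUTING_KEY", "Disk almost full", "web-01")
# send_event(event)
```

Keeping the payload builder separate from the network call makes it easy to test the event contents without paging anyone.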
When it comes to making on-call easier, PagerDuty has a wealth of scheduling options, with Follow-the-Sun schedules for global teams, meaning each team member in a given location can work during business hours (never wake up at 2am again!). There are also options for secondary on-call rosters to automatically escalate an incident if the first person does not respond (it happens!).
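The two scheduling ideas above can be sketched in a few lines. This is an illustration of the concepts only, not how PagerDuty implements them: the team names, UTC offsets, business hours and 15-minute timeout are all hypothetical, and the example ignores weekends and daylight saving time.

```python
from datetime import datetime, timedelta

# Hypothetical teams as (name, UTC offset in hours).
TEAMS = [("Melbourne", 10), ("London", 0), ("San Francisco", -8)]


def in_business_hours(utc_now, utc_offset_hours, start=9, end=17):
    """True if the local time at the given offset falls within business hours."""
    local = utc_now + timedelta(hours=utc_offset_hours)
    return start <= local.hour < end


def teams_on_duty(utc_now, teams=TEAMS):
    """Follow-the-sun: return the teams currently inside their working day."""
    return [name for name, offset in teams if in_business_hours(utc_now, offset)]


def who_to_notify(minutes_since_trigger, escalation_chain, timeout_minutes=15):
    """Escalation: move one step down the chain per timeout without a response."""
    level = min(minutes_since_trigger // timeout_minutes, len(escalation_chain) - 1)
    return escalation_chain[level]


print(teams_on_duty(datetime(2024, 1, 1, 10, 0)))     # ['London']
print(who_to_notify(0, ["primary", "secondary"]))     # primary
print(who_to_notify(20, ["primary", "secondary"]))    # secondary
```

Capping the escalation level at the end of the chain means the last person on the roster keeps being notified rather than the alert falling off the end unanswered.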
The service is also smart about avoiding the “crying wolf” alerts, sending one alert for each incident in a service you’re responsible for, and only when that incident requires urgent action. If multiple services are generating alerts at the same time, PagerDuty will bundle the alerts and notify you once (you’ll still be able to see each individually).
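The bundling idea can be illustrated with a simple time-window grouping. This is a sketch of the general technique under assumed rules (alerts within 60 seconds of the first alert in a bundle are grouped together); it is not PagerDuty's actual algorithm, and the alert dictionaries are made up.

```python
def bundle_alerts(alerts, window_seconds=60):
    """Group alerts that arrive within window_seconds of a bundle's first alert."""
    bundles = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if bundles and alert["time"] - bundles[-1][0]["time"] <= window_seconds:
            bundles[-1].append(alert)   # close enough: join the current bundle
        else:
            bundles.append([alert])     # too late: start a new bundle
    return bundles


def notification_text(bundle):
    """One notification per bundle; the individual alerts stay available."""
    services = sorted({a["service"] for a in bundle})
    return f"{len(bundle)} alert(s) across {', '.join(services)}"


# Made-up alerts, with "time" in seconds since some reference point.
alerts = [
    {"time": 0, "service": "web"},
    {"time": 10, "service": "db"},
    {"time": 500, "service": "web"},
]

for bundle in bundle_alerts(alerts):
    print(notification_text(bundle))
# 2 alert(s) across db, web
# 1 alert(s) across web
```

The on-call person gets two notifications instead of three, and each bundle still holds the individual alerts for later inspection.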
Once you’ve resolved an incident and have caught your breath, you can dive into what went wrong with the service’s detailed event timelines, and then look for root causes or trends with its analytics services.
Being on-call can be a daunting experience for a new DevOps team member. But with the right approach, a culture of collaboration, knowledge of the infrastructure, and the right tools, the experience of a 2am wake-up call can be manageable, and you can solve issues without losing too much sleep — or sanity.
How do you manage being on-call? Do you have any tips? Have you tried PagerDuty? Let us know in the comments below.
Adam Bolte is SitePoint's systems administrator and a free software activist. He has been running various GNU/Linux distributions as his desktop of choice since 1998, and has a tendency to install the Linux kernel onto any device he owns.