The Beginner’s Guide to Being On-call


This article was sponsored by PagerDuty. Thank you for supporting the sponsors who make SitePoint possible.
The word DevOps is a portmanteau of development and operations, and it’s a relatively new term used in agile system administration. In the past, developers would build products, services and infrastructure, and then the responsibility for maintaining them would shift to sysadmins. DevOps instead emphasises communication and collaboration between developers and IT system operators, and integrates people better into the software workflow, from development to production, keeping a closer connection to agile values and principles.

When you’re providing SaaS, customers expect 24/7 uptime. When your marketing team is selling your product as an indispensable piece of software, your customers won’t know what to do when it goes down. Adopting a DevOps approach means everyone is invested in writing good code and keeping uptime strong.

Keeping your sanity while being on-call

No matter what the team structure, keeping systems up at all times inevitably means having your DevOps team on-call. Problems don’t come up according to a schedule, so you’ll need a team prepared to deal with them at all hours. Being on-call can be daunting to a new team member. Solving problems and fixing issues is easy during the day when you have caffeine in the bloodstream, music on the stereo, and team members you can call on. But it’s a little different when you’re awoken at 2am by an SMS and need to quickly take action to fix an issue that’s potentially costing your company a lot of money every second. Below are some tips I’ve picked up over the years of being on-call for SitePoint and other organizations.

Who’s on point?

Generally we make sure everyone who is on-call knows who else is on the same roster. Once an alert goes out, the person who responds first is tasked with fixing the issue directly, while others can help by offering advice and information to make the fix easier. This approach avoids a situation where multiple developers, all trying to fix the same issue, unintentionally make the problem worse or create new problems. This can be an implicit agreement, as we have, or someone can explicitly take the lead at the time. Either way, having a person “on point” keeps things clear and prevents further issues.

Other implicit cues can help your dev team stay aware of a situation. Outside office hours, I’ll only ever log in to our chat client to address an issue, so if someone sees me online, they’ll know I’m busy fixing a problem and can offer help. Knowing other people are going to be woken up by a second or third alert if I don’t fix the problem quickly also acts as a strong motivator.

It’s also important to cultivate a culture of collaboration for DevOps. If someone is taking control of a fix but is out of their depth and feeling under pressure, it’s important that they know they can ask others for help without being grilled for being slow with a fix.

Always be prepared

I take my laptop with me almost everywhere I go when I’m on-call. The exception is when I know I’ll be popping out for a short trip and access to a machine set up for work will only be a few minutes away. I’m quite used to it now, and it doesn’t hinder me. Once you’re in the habit of ensuring a laptop is always close by, you don’t even think of it. The most frustrating aspect is if I want to go to the pool, where I may not hear alerts on my phone, or go for a run. If it’s the latter, I can still do so, but I’ll have to run laps close to home, rather than a long run that takes me far away from a computer. This may sound obvious, but make sure your machine can cope with the tasks you’ll be asking of it. Does it have the credentials and certificates that you need? Does it have Bluetooth connectivity to enable tethering? If your on-call machine isn’t one you normally use, make sure it’s up-to-date and test it regularly.
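The questions above can be turned into a small check you run before each shift. The following is a minimal sketch, not a definitive tool: the file paths and command names in it are hypothetical placeholders, so substitute whatever credentials and programs your own setup needs.

```python
import shutil
from pathlib import Path

# Hypothetical examples only: replace with the files and tools you actually need.
REQUIRED_FILES = ["~/.ssh/id_ed25519", "~/.config/work/vpn.conf"]
REQUIRED_COMMANDS = ["ssh", "git"]


def preflight(files, commands):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for name in files:
        # expanduser() turns "~" into the current user's home directory
        if not Path(name).expanduser().is_file():
            problems.append(f"missing file: {name}")
    for cmd in commands:
        # shutil.which() returns None when the command is not on PATH
        if shutil.which(cmd) is None:
            problems.append(f"missing command: {cmd}")
    return problems


if __name__ == "__main__":
    issues = preflight(REQUIRED_FILES, REQUIRED_COMMANDS)
    for issue in issues:
        print(issue)
    print("ready for on-call" if not issues else f"{len(issues)} problem(s) found")
```

Running this on the machine you intend to carry tells you about a missing key or tool while it is still easy to fix, rather than in the middle of an incident.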

Tailor your environment

Make sure your house is on-call-friendly before an issue arises. You’ll need a consistent physical environment to make sure you can navigate it in the dark while stressed and confused after sleep. I’ve stayed up late some nights and gone to bed, only to be woken by an SMS 20 minutes later. That’s the worst possible time. When I’m woken up in a haze, barely able to open my eyes or stand, at least I’ll know exactly where my phone is, where my computer is, and I can make it there without really having to think about it. From there I can set about fixing whatever problem has arisen. Another tip: Choose a dark background for your desktop wallpaper and keep the brightness turned down, or use an app like Redshift, to avoid being blinded as you log in to fix an issue.

Prevention is better than cure

Early on in my employment at SitePoint, when I first took over the sysadmin duties, I was woken up by an alarm in the very early hours of the morning almost every single day. I was a complete wreck by the end of it. That’s a pretty big motivator to fix things. After working on the underlying issues, I now get an alert maybe once a month outside of business hours. These days our infrastructure is such that issues are more likely to present themselves while everyone is in the office, when developers push new code. Improving your infrastructure, and putting it to the test outside of a real problem, means you can be confident in your systems and know what to do when something goes wrong.

Know the infrastructure

Knowing your organization’s dependencies and how they relate can help you to quickly understand the root cause of an issue. Often I’m able to solve problems more quickly because I know how the system is laid out, and so I know what to check first. Of course, the hardest problems are the ones you don’t expect. Related to this: Make sure your documentation is up to date and covers all the bases. Bad documentation can hinder rather than help.
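One way to make “what to check first” concrete is to write the dependencies down as data. The sketch below assumes a hypothetical dependency map (the service names are invented for illustration) and lists everything a failing service relies on, deepest dependencies first, on the assumption that a root cause is more likely to sit lower in the stack.

```python
# Hypothetical dependency map: each service lists what it directly depends on.
DEPENDENCIES = {
    "website": ["app-server", "cdn"],
    "app-server": ["database", "cache"],
    "database": ["storage"],
    "cache": [],
    "cdn": [],
    "storage": [],
}


def check_order(service, deps):
    """Return the service and everything it depends on, deepest dependencies first."""
    order = []
    seen = set()

    def visit(name):
        if name in seen:  # already handled; also guards against cycles
            return
        seen.add(name)
        for dep in deps.get(name, []):
            visit(dep)
        order.append(name)  # appended only after all of its dependencies

    visit(service)
    return order


print(check_order("website", DEPENDENCIES))
# ['storage', 'database', 'cache', 'app-server', 'cdn', 'website']
```

Keeping a map like this next to your documentation also gives you a quick way to notice when the documentation has fallen behind the real system.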

The right tools

Of course, keeping yourself sane while on-call is easier when you have tools that get out of your way and help you focus on the things that matter. PagerDuty is an operations performance platform aimed at giving you a single view of your infrastructure, meaning events and incidents can be handled by a team spread across the world, with everyone aware of issues as they come up. This level of situational awareness extends further, with the service offering detailed analytics measuring team and system performance for incident response, as part of its enterprise plan. With tools like these you can help improve your team’s mean time to acknowledge an incident, as well as its mean time to resolve.

During an incident, alerts come through email, phone call, push notification, or through integrations with other services. The service has continuous routing and automatic escalation to make sure every alert is given the attention it warrants. PagerDuty has more than 100 integrations with services like AppDynamics, Crashlytics, New Relic and Sensu. But if a service you use isn’t on the list, the PagerDuty API can work with any system that can make an HTTP API call or send an email.

When it comes to making on-call easier, PagerDuty has a wealth of scheduling options, with Follow-the-Sun schedules for global teams, meaning each team member in a given location can work during business hours (never wake up at 2am again!). There are also options for secondary on-call rosters to automatically escalate an incident if the first person does not respond (it happens!). The service is also smart about avoiding “crying wolf” alerts, sending one alert for each incident in a service you’re responsible for, and only when that incident requires urgent action. If multiple services are generating alerts at the same time, PagerDuty will bundle the alerts and notify you once (you’ll still be able to see each individually).
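To illustrate the idea of escalating to a secondary roster, here is a minimal sketch of the logic in plain Python. It is not PagerDuty’s implementation or API; the function name, the roster and the 15-minute timeout are all assumptions made up for the example.

```python
def escalation_plan(roster, timeout_minutes, acked_after=None):
    """List (minute, person) notifications until someone acknowledges.

    roster: people in escalation order (primary first).
    timeout_minutes: how long each person has before the next one is paged.
    acked_after: minutes until acknowledgement, or None if nobody responds.
    """
    plan = []
    for level, person in enumerate(roster):
        notify_at = level * timeout_minutes
        if acked_after is not None and acked_after < notify_at:
            break  # acknowledged before this level would be paged
        plan.append((notify_at, person))
    return plan


# Hypothetical roster: the primary acknowledges after 20 minutes,
# so the secondary has already been paged at minute 15.
print(escalation_plan(["primary", "secondary", "manager"], 15, acked_after=20))
# [(0, 'primary'), (15, 'secondary')]
```

Walking through a plan like this with the team makes it clear who will be woken, and when, if the first person misses an alert.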
Once you’ve resolved an incident and have caught your breath, you can dive into what went wrong with the service’s detailed event timelines, and then look for root causes or trends with its analytics services.
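If you want to track mean time to acknowledge and mean time to resolve yourself, the arithmetic is simple. The sketch below uses invented incident timestamps and assumes both figures are measured from the moment the incident was triggered; definitions vary between teams and tools, so treat this as one possible convention rather than the one any particular service uses.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved).
INCIDENTS = [
    ("2024-01-05 02:00", "2024-01-05 02:04", "2024-01-05 02:34"),
    ("2024-01-19 14:10", "2024-01-19 14:12", "2024-01-19 14:22"),
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def minutes_between(start, end):
    """Minutes elapsed between two timestamps in TIME_FORMAT."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60


def mtta_mttr(incidents):
    """Return (mean minutes to acknowledge, mean minutes to resolve)."""
    mtta = mean([minutes_between(t, a) for t, a, _ in incidents])
    mttr = mean([minutes_between(t, r) for t, _, r in incidents])
    return mtta, mttr


print(mtta_mttr(INCIDENTS))  # (3.0, 23.0)
```

Comparing these two numbers month to month is a quick way to see whether changes to your roster or infrastructure are actually helping.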

Conclusion

Being on-call can be a daunting experience for a new DevOps team member. But with the right approach, a culture of collaboration, knowledge of the infrastructure, and the right tools, the experience of a 2am wake-up call can be manageable, and you can solve issues without losing too much sleep — or sanity. How do you manage being on-call? Do you have any tips? Have you tried PagerDuty? Let us know in the comments below.

Frequently Asked Questions (FAQs) about Being On-Call

What are the best practices for managing on-call schedules?

Managing on-call schedules effectively is crucial to ensure that your team is not overwhelmed and that issues are resolved promptly. Here are some best practices:


1. Rotation: Rotate on-call duties among your team members to prevent burnout. This also ensures that everyone gets a chance to learn and grow.

2. Clear Expectations: Set clear expectations about what being on-call entails. This includes response times, responsibilities, and escalation procedures.

3. Training: Provide adequate training to your team members so they are prepared to handle any issues that may arise.

4. Tools: Use scheduling tools to manage and track on-call schedules. This helps avoid confusion and ensures that everyone knows who is on-call at any given time.

5. Time Off: After a particularly challenging on-call shift, give your team members some time off to rest and recharge.
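The rotation practice in the list above can be sketched in a few lines. This is a minimal example assuming a simple weekly round-robin; the team names and start date are hypothetical, and a real schedule would also need to handle holidays and swaps.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(team, start, weeks):
    """Return (week_start_date, person) pairs, cycling through the team in order."""
    people = islice(cycle(team), weeks)
    return [(start + timedelta(weeks=i), person) for i, person in enumerate(people)]


# Hypothetical three-person team, starting on a Monday.
for week_start, person in weekly_rotation(["ana", "ben", "chi"], date(2024, 1, 1), 4):
    print(week_start.isoformat(), person)
```

Publishing the output somewhere the whole team can see it covers the “Tools” point too: everyone knows who is on-call in any given week.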

How can I improve my on-call experience?

Improving your on-call experience involves a combination of preparation, communication, and self-care. Here are some tips:


1. Preparation: Understand your responsibilities and familiarize yourself with the systems you’ll be monitoring.

2. Communication: Keep lines of communication open with your team. If you’re unsure about something, don’t hesitate to ask.

3. Self-Care: Take care of your physical and mental health. Get enough sleep, eat healthy, and take breaks when needed.

4. Tools: Use tools that can help you manage your on-call duties more efficiently. This could include alerting tools, incident management tools, or communication tools.

5. Feedback: Provide feedback to your team and management about your on-call experience. This can help improve the process for everyone.

What are some common challenges of being on-call and how can I overcome them?

Being on-call can be challenging, but these challenges can be overcome with the right strategies. Some common challenges include:


1. Burnout: This can be prevented by rotating on-call duties among team members and ensuring everyone gets adequate rest.

2. Lack of Training: Ensure that you receive proper training before you start your on-call duties. If you feel unprepared, speak up and ask for more training.

3. Communication Issues: Keep lines of communication open with your team. Use communication tools to stay connected and informed.

4. Technical Issues: Familiarize yourself with the systems you’ll be monitoring. If you encounter a problem, don’t hesitate to ask for help.

5. Stress: Take care of your mental health. Practice stress management techniques like deep breathing, meditation, or yoga.

How can I prepare for my first on-call shift?

Preparing for your first on-call shift can be daunting, but with the right preparation, you can handle it confidently. Here are some tips:


1. Understand Your Responsibilities: Know what is expected of you during your on-call shift. This includes response times, tasks, and escalation procedures.

2. Get Trained: Make sure you receive adequate training. This should cover the systems you’ll be monitoring and how to handle common issues.

3. Set Up Your Workspace: Ensure you have a quiet, comfortable place to work. Have all the necessary tools and resources at your disposal.

4. Communicate: Let your team know when you’re starting your shift. Keep them updated about any issues or challenges you encounter.

5. Take Care of Yourself: Get enough sleep before your shift. Eat healthy and stay hydrated.

What tools can help me manage my on-call duties more effectively?

There are several tools that can help you manage your on-call duties more effectively. These include:


1. Alerting Tools: These tools can notify you when there’s an issue that needs your attention.

2. Incident Management Tools: These tools can help you track and manage incidents. They can also help with communication and collaboration.

3. Scheduling Tools: These tools can help you manage and track on-call schedules.

4. Communication Tools: These tools can help you stay connected with your team. They can also facilitate collaboration and information sharing.

5. Documentation Tools: These tools can help you document incidents and solutions. This can be useful for future reference and for training purposes.

How can I handle the stress of being on-call?

Being on-call can be stressful, but there are ways to manage this stress. Here are some tips:


1. Take Breaks: Don’t forget to take breaks. Even a short walk or a few minutes of deep breathing can help reduce stress.

2. Practice Self-Care: Take care of your physical and mental health. Get enough sleep, eat healthy, and exercise regularly.

3. Stay Connected: Keep in touch with your team. Knowing that you’re not alone can help reduce stress.

4. Ask for Help: If you’re feeling overwhelmed, don’t hesitate to ask for help. Your team is there to support you.

5. Practice Mindfulness: Mindfulness techniques like meditation or deep breathing can help you stay calm and focused.

What should I do if I encounter a problem I can’t solve during my on-call shift?

If you encounter a problem you can’t solve during your on-call shift, don’t panic. Here are some steps you can take:


1. Document the Problem: Write down everything you know about the problem. This includes what you’ve tried, any error messages, and when the problem started.

2. Reach Out to Your Team: Let your team know about the problem. They may have encountered it before and can provide guidance.

3. Escalate the Issue: If you can’t solve the problem, escalate it to a higher level of support. Be sure to provide them with all the information you’ve gathered.

4. Learn from the Experience: Once the problem is resolved, take the time to understand what went wrong and how it was fixed. This can help you handle similar issues in the future.

How can I balance my personal life with my on-call duties?

Balancing your personal life with your on-call duties can be challenging, but it’s not impossible. Here are some tips:


1. Set Boundaries: Let your friends and family know when you’re on-call and what that means. Ask for their understanding and support.

2. Plan Ahead: If you know you’ll be on-call, plan your personal activities around your schedule. This can help reduce stress and prevent conflicts.

3. Take Time Off: After a particularly challenging on-call shift, take some time off to rest and recharge.

4. Use Tools: Use tools that can help you manage your on-call duties more efficiently. This can free up more time for your personal life.

5. Ask for Help: If you’re feeling overwhelmed, don’t hesitate to ask for help. Your team is there to support you.

What are some common mistakes to avoid when being on-call?

When you’re on-call, there are some common mistakes you should avoid. These include:


1. Not Being Prepared: Make sure you understand your responsibilities and are familiar with the systems you’ll be monitoring.

2. Ignoring Self-Care: Don’t neglect your physical and mental health. Get enough sleep, eat healthy, and take breaks when needed.

3. Poor Communication: Keep your team informed about any issues or challenges you encounter. Use communication tools to stay connected.

4. Not Asking for Help: If you’re unsure about something or feeling overwhelmed, don’t hesitate to ask for help.

5. Not Learning from Mistakes: When an issue is resolved, take the time to understand what went wrong and how it was fixed. This can help you handle similar issues in the future.

How can I improve my skills and knowledge for being on-call?

Improving your skills and knowledge for being on-call involves continuous learning and practice. Here are some tips:


1. Training: Take advantage of any training opportunities offered by your organization. This can help you understand the systems you’ll be monitoring and how to handle common issues.

2. Learn from Others: Learn from your team members who have more experience with being on-call. They can provide valuable insights and advice.

3. Stay Updated: Keep up with the latest trends and developments in your field. This can help you handle new challenges more effectively.

4. Practice: The more you practice, the more confident you’ll become. Use your on-call shifts as opportunities to learn and grow.

5. Ask for Feedback: Ask for feedback from your team and management. This can help you identify areas for improvement and develop your skills.

Adam Bolte

Adam Bolte is SitePoint's systems administrator and free software activist. He has been running various GNU/Linux distributions as his desktop of choice since 1998, and has a tendency to install the Linux kernel onto any device he owns.
