Interview: How SitePoint Manages and Prioritizes Monitoring
This post was sponsored by PagerDuty. Thank you for supporting the sponsors who make SitePoint possible!
Like many developers, I expect stuff to just work … and I throw a tantrum when it doesn’t! Behind the scenes, many technical people are working their magic to integrate hardware, software and services into workable cohesive systems. In this article I interview Jude Aakjaer about his DevOps duties and experiences at SitePoint.
Craig Buckler: Hey Jude! (Sorry, couldn’t resist.) Could you tell us who you are and what you do for SitePoint?
Jude Aakjaer: Hey Craig. Not a problem — unsurprisingly I do get that quite often!
I’m one of the developers working on products and systems at both SitePoint and Learnable. That means backend programming in Ruby and PHP but also DevOps tasks.
CB: What are the biggest challenges and issues you face daily?
JA: Definitely sorting the signal from the noise. If we jumped at every package update email and website exception we would never get any work done!
As well as updates, we also need to fix issues and bugs with the code. It doesn’t matter how good or robust your code is — errors will occur. The challenge is identifying which problems require immediate attention and which can be examined as part of a wider refactoring task.
CB: Where do you receive alerts from?
JA: We use a variety of tools to monitor different parts of applications and services.
For our Ruby on Rails websites, we use a notifier gem (Airbrake) that alerts us whenever our code throws an exception or there are other unexpected events. We also use an external monitoring website (Wormly) which is configured to detect certain HTTP responses. Lastly, we use the AWS CloudWatch monitoring service which alerts us about hardware problems or failures.
Alerts are primarily sent by SMS and email. As you can imagine, messages arrive from many different applications and tools. We are constantly looking to improve our monitoring tools.
CB: How do you prioritize alerts? Do you rank them by business impact, long-term importance, difficulty, whoever shouts loudest, or other factors?
JA: Alert priorities are context sensitive and we manually determine the order. Obviously if one of our websites has fallen over, that takes highest priority! Other alerts — such as disk space reaching certain levels — are scheduled into weekly review tasks and attended to in a more relaxed manner.
Many of the processes have been in place for a number of years and we can quickly identify what needs to be done. For example, the Wormly alerts are always important. Airbrake reports application-specific issues and we’ll examine the issue frequency to decide when it should be fixed.
We encourage our developers to tackle at least one recurring error per sprint. This also allows us to keep the error reporting noise down to a minimum.
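To make the triage rules Jude describes concrete, here is a minimal Python sketch. It is an illustration only, not SitePoint's actual tooling: the `Alert` fields, the source labels, the bucket names and the frequency threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical values: none of these labels or numbers come from the interview.
URGENT_SOURCES = {"wormly"}       # external checks are always treated as important
FREQUENT_ERROR_THRESHOLD = 50     # assumed number of occurrences per week


@dataclass
class Alert:
    source: str           # e.g. "wormly", "airbrake", "cloudwatch"
    kind: str             # e.g. "site_down", "exception", "disk_space"
    occurrences: int = 1  # how often the same issue was seen this week


def triage(alert: Alert) -> str:
    """Return 'immediate', 'this_sprint' or 'weekly_review' for an alert."""
    # A site outage or an external-check failure jumps the queue.
    if alert.kind == "site_down" or alert.source in URGENT_SOURCES:
        return "immediate"
    # Application errors are ranked by how often they recur.
    if alert.source == "airbrake" and alert.occurrences >= FREQUENT_ERROR_THRESHOLD:
        return "this_sprint"
    # Everything else, such as disk-space warnings, waits for the weekly review.
    return "weekly_review"
```

For example, `triage(Alert("cloudwatch", "disk_space"))` lands in the weekly review, while any alert with `kind="site_down"` is handled immediately.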
CB: How do you plan monitoring for new systems and services?
JA: Monitoring has a variety of flavors but must be considered from the start.
First, we want to monitor the actual servers the application runs on. Since we’re using AWS for deployment, the built-in CloudWatch statistics let us discover issues such as consistently high CPU and memory usage, running out of disk space or unresponsive servers.
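The idea of "consistently high" usage can be sketched as a check that several recent samples in a row all exceed a threshold. The Python below is a hypothetical illustration, not the CloudWatch API: the function name, the 80% threshold and the three-sample window are assumptions, and the samples are assumed to arrive as a plain list of percentages.

```python
from typing import Sequence


def is_sustained_high(
    samples: Sequence[float],
    threshold: float = 80.0,  # assumed CPU percentage
    periods: int = 3,         # assumed number of consecutive samples
) -> bool:
    """Return True when the most recent `periods` samples all exceed `threshold`."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(samples) < periods:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-periods:])
```

Requiring several consecutive breaches means a single spike does not raise an alert, which helps keep the noise down.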
We then monitor the program code itself. The tools report fatal exceptions or unexpected events within the application.
Lastly, we monitor applications as they are seen from the outside world. The monitoring systems send HTTP requests to key pages and compare the responses with expected results, such as a successful request, a redirect, or even an error.
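An outside-in check of this kind can be sketched in a few lines of Python using only the standard library. This is not how Wormly works internally; the function names and the page list are hypothetical, and the sketch compares status codes only, without following redirects, so that an expected 301 can be verified.

```python
import http.client
from typing import Callable
from urllib.parse import urlsplit


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Request a URL once, without following redirects, and return its status code."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


def check_pages(
    expected: dict[str, int],
    fetch: Callable[[str], int] = fetch_status,
) -> list[str]:
    """Return one message per page whose status differs from the expected one."""
    failures = []
    for url, want in expected.items():
        try:
            got = fetch(url)
        except (OSError, http.client.HTTPException) as exc:
            failures.append(f"{url}: request failed ({exc})")
            continue
        if got != want:
            failures.append(f"{url}: expected {want}, got {got}")
    return failures
```

A call such as `check_pages({"https://example.com/": 200, "https://example.com/old-page": 301})` returns an empty list when every page behaves as expected. The URLs here are placeholders. The `fetch` parameter exists so the comparison logic can be tested without touching the network.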
All our new applications and services should follow this process. Of course, sometimes something slips through. When that occurs, we write additional tests to detect that event in the future. Getting tripped up the first time a problem occurs is one thing — you’re in trouble if it occurs twice!
We employ various tools and technologies but, naturally, our requirements evolve. It’s important for tools to grow with us.
CB: What advice would you give to someone on a team that’s transitioning to a DevOps model?
JA: That’s a broad question but, at heart, it’s about understanding the concerns of both developers and system administrators. Developers want an environment which can be built and deployed quickly so they can continue with the more interesting issues of application development. System administrators want to ensure that security and privacy follow best practice and that the architecture scales. There are times when these two sets of concerns conflict; a pragmatic approach is recommended.
Crucially, you should be constantly building and deploying applications and servers. Your orchestration and deployment scripts must be constantly exercised and improved. You should avoid snowflake systems which few people understand or can recreate. Ideally, aim for phoenix systems which can be burnt and reborn at a moment’s notice by anyone on the team.
Treat your servers like cattle — not pets! It’ll give you the confidence to create new stacks or scale quickly on demand.
CB: Thanks Jude. We appreciate all your efforts in keeping the SitePoint.com services up and running.
PagerDuty: Stop Incidents Becoming Emergencies
Not every company has a team of experts ready to pounce on every alert. PagerDuty can help manage incidents, increase visibility and improve collaboration. The core features:
- PagerDuty is quick to set up and integrates with more than 100 systems
- monitoring is aggregated in a single place — everything can be viewed on one dashboard
- alerts are effective — use SMS, push notifications, phone calls, email or whatever method suits you
- automated escalation policy rules can be defined — the system can prioritize work for you
- you can schedule, collaborate and analyze your systems with ease.
For more information, visit PagerDuty.com.