In a perfect world, every time we rolled out code at the end of a sprint, it would work perfectly in production. There would never be any bugs, and there would never be any issues that forced us to roll back code that has already been deployed.
Of course, we don’t live in a perfect world. That’s one of the reasons why we have agile in the first place. Agile isn’t about pretending that your world is perfect. It’s about adapting to reality, and iterating to improve your processes and your flexibility so that when problems arise you’re able to deal with them.
One of the problems that comes up frequently for teams is the discovery of a new bug in production right in the middle of a sprint. Your team has finished deploying, all the tests passed, and everything has been pushed out to production so customers can start using it.
But maybe an edge case that wasn’t considered comes up. Maybe some aspect of the code that wasn’t fully tested comes to the surface, and starts causing problems for users. How’s your agile team supposed to respond to that?
There are many different approaches to dealing with bugs in production that come up during a sprint. Choosing the one that works best for your team depends on how your company is structured, how critical the bug is, and what matters most to your product owner and your customer.
The Minimal Impact Option
If a bug in production is the result of a previous sprint’s work, and it’s having a negative effect on users, the simplest thing to do whenever possible is to roll back the production server to the state it was in before the last sprint’s release was deployed. At the very least, this will keep the bug from affecting any more users.
Doing this requires having a production deployment system set up to support clean rollbacks. An agile team with the ability to push code into production should ideally be working in an environment that supports continuous deployment, or at the very least deployment tags that allow you to roll back your production servers to a previous state. It’s times like this that you really appreciate having strong deployment or devops engineers on the team.
If it’s possible to solve the problem that simply, the product owner may choose to write a bug story to be worked on in the next sprint. That will prevent this current sprint from being interrupted, and reduce the impact on the team’s velocity. Handling bugs this way also allows the team to consider more carefully the potential impact of the bug, and the best way to fix it.
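As a rough illustration of what tag-based rollback means, here is a minimal Python sketch. It only models the bookkeeping in memory: the `ReleaseHistory` class and the `sprint-41` and `sprint-42` tag names are hypothetical, and a real deployment system would also have to restore the servers themselves to the tagged build.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReleaseHistory:
    """Ordered record of the tags deployed to production (hypothetical model)."""

    deployed: List[str] = field(default_factory=list)

    def deploy(self, tag: str) -> None:
        # Record every release so that earlier states stay addressable.
        self.deployed.append(tag)

    @property
    def current(self) -> str:
        if not self.deployed:
            raise RuntimeError("nothing has been deployed yet")
        return self.deployed[-1]

    def rollback(self) -> str:
        """Return production to the tag that preceded the current one."""
        if len(self.deployed) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self.deployed.pop()  # drop the faulty release from the history
        return self.current


history = ReleaseHistory()
history.deploy("sprint-41")
history.deploy("sprint-42")  # the release that introduced the bug
restored = history.rollback()
print(restored)  # sprint-41
```

The point of the sketch is that a rollback is only as clean as the record of what was deployed. If every release is tagged, returning to the previous state is a lookup rather than an investigation.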
The Deep Exploration Option
Sometimes fixing a bug in production isn’t as simple as it sounds. For example, the bug could have had an effect on the data being entered into the application, or the bug may actually exist in the data layer. In this case, database recovery may be necessary, which introduces a whole range of other difficulties.
Recognizing the potential scope of a bug is the responsibility of the product owner in concert with the engineering team. When a bug is discovered, it may be necessary for the product owner to pull one or more engineers into meetings to discuss the depth of the impact and make a plan of action. Of course, the team’s velocity in the sprint will likely be reduced merely because of the need to assess the extent of the damage and propose a viable solution.
If the bug is urgent enough and the prognosis is uncertain, it may be necessary to introduce a new spike within the current sprint, and have somebody on the engineering team start looking ahead toward what’s going to be necessary to fix the bug in the next sprint. Bugs can be difficult to estimate because of their unknown nature, and it’s usually a good idea not to assign points to a bug for that reason. However, having one engineer take away a little bit of effort from the current sprint can pay off in the long run, without holding back the whole team.
The Urgent Effort Option
It’s not always possible to put off a bug fix until the next sprint. Sometimes a bug is so critical, and affects such an important aspect of the product, that it’s necessary to implement a fix during the current sprint. Ideally, this effort won’t require the entire development team. It’s the product owner’s responsibility to assess the scope of the damage, and decide whether it’s worth introducing a new story in the middle of a sprint to address a critical bug.
Introducing new stories in the middle of a sprint is never a good idea. A good scrum master should work with the product owner to try to limit changes to a sprint that’s in progress. But that doesn’t mean that it’s never necessary, and a good scrum master should also be able to communicate clearly to the team when and why it’s important to adjust the backlog if that’s the best option.
The goal in this case is to have as small an impact on the sprint as possible. Perhaps the developers who worked on the section of code that is causing the problems can be pulled off the stories they’re working on, and temporarily assigned to fix the bug. Of course, any stories they’re working on will suffer, and there won’t be any points earned in the sprint for work done on a bug from a previous sprint.
The Nuclear Option
If a critical bug is discovered in production code, the presence of the bug is causing serious problems, and more than half of the development team is needed to work in concert to fix it, sometimes the only thing to do is to stop the sprint and start a new one.
This should always be the last resort for a product owner. While the product owner always has the option of stopping a sprint, doing so breaks the continuity of any work in progress, and the velocity calculations related to that sprint are lost as well. For planning purposes, all the work already done within that sprint should be considered forfeit.
Of course, that doesn’t mean that the work is actually discarded. But from an engineering standpoint, stopping work on a story and then resuming it later can have such a negative impact on the focus and concentration needed from an engineering team that this option should only be used in the most extreme of cases.
If your team goes for this option, don’t make the mistake of trying to create a mini bug-sprint that is shorter than a typical sprint. If you think the bug can be fixed in less than the time that it takes to complete a single sprint, keep the backlog from the stopped sprint and let the engineers start working on those stories again once the bug is fixed. Your overall velocity should account for the effort needed to fix bugs created by your team, not pretend that it doesn’t exist. Creating sprints of varying lengths is a serious scrum anti-pattern that will destroy your ability to track your team’s real velocity.
An Ounce of Prevention
Depending on the kind of code your company is deploying and the size of your user base, it’s often a good idea to use canary servers to test the impact of any changes in production on a subset of your users. This allows you to ferret out possible edge cases quickly, and limit the impact on the overall customer base.
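One common way to pick that subset of users is to hash a stable user identifier into a bucket, so the same user always lands on the same side. The Python sketch below assumes this approach. The function names, the `canary` and `stable` pool labels, and the 5 percent default are all hypothetical, and real canary setups often do this at the load balancer rather than in application code.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically decide whether a user belongs to the canary group."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash the id so bucket assignment is stable across requests and servers.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket is 0..99
    return bucket < percent


def pick_pool(user_id: str, percent: int = 5) -> str:
    """Name the server pool (hypothetical labels) that should serve this user."""
    return "canary" if in_canary(user_id, percent) else "stable"


print(pick_pool("user-12345"))
```

Because the assignment is deterministic, a user who hits a bug on the canary pool keeps hitting it there, which makes the problem easier to reproduce, and setting the percentage to zero sends everyone back to the stable pool.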
Even if canary servers aren’t an option, it’s always a good idea to develop your code with toggles that allow you to turn newly added features on or off at will. Scrum is about developing and deploying full top-to-bottom features each sprint. If a new feature is added this way, and it turns out to have a bug in it that doesn’t have broader implications, toggling it off in production may allow the team to continue moving forward while minimizing the impact on the users.
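A feature toggle can be as simple as a named flag checked at the point where the new behavior starts. The following Python sketch shows the idea under simple assumptions: the `FeatureToggles` class, the `bulk_discount` feature, and the checkout function are hypothetical, and a production system would usually read its flags from configuration or a flag service instead of memory.

```python
from typing import Dict, List


class FeatureToggles:
    """In-memory flag store (hypothetical). Unknown flags default to off."""

    def __init__(self, defaults: Dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def checkout_total(prices_cents: List[int], toggles: FeatureToggles) -> int:
    """Sum an order in cents, applying the new discount only when toggled on."""
    total = sum(prices_cents)
    if toggles.is_enabled("bulk_discount") and len(prices_cents) >= 3:
        total -= total // 10  # new feature: 10% off orders of three or more items
    return total


toggles = FeatureToggles({"bulk_discount": True})
order = [1000, 2000, 3000]
print(checkout_total(order, toggles))  # 5400

# A bug is reported in the new feature: switch it off without redeploying.
toggles.set("bulk_discount", False)
print(checkout_total(order, toggles))  # 6000
```

The old code path stays intact behind the flag, so turning the feature off restores the previous behavior while the fix is planned.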
Don’t Forget the Retrospective
Of course, issues such as these should always be discussed at the retrospective, to make sure that everybody is on the same page about what happened, and how to prevent it or deal with it more effectively in the future.
Ultimately the product owner is responsible for making these calls. The agile engineering team needs to be working in a way that allows for the most productive and efficient response in the case of a bug in production.
I've worked as a Web Engineer, Writer, Communications Manager, and Marketing Director at companies such as Apple, Salon.com, StumbleUpon, and Moovweb. My research into the Social Science of Telecommunications at UC Berkeley, and while earning an MBA in Organizational Behavior, showed me that the human instinct to network is vital enough to thrive in any medium that allows one person to connect to another.