Chapter 8 Understanding What Went Wrong
By working out what went wrong, you'll learn a great deal about your product, your business, and the way you respond to a crisis.
Gather all the clues
Before you think about explaining things to your customers, you need to understand what went wrong yourself. If you followed our advice on Monitoring and Team Communication, you’ll already have a good set of information to help you identify how things unfolded.
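One way to see how things unfolded is to merge what your monitoring recorded with what your team said into a single, time-ordered list. The sketch below is only an illustration of that idea: the Event record, the build_timeline function, the source labels, and the sample entries are all hypothetical, and your own tools will export this information in their own formats.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One clue from the incident (hypothetical record shape)."""
    timestamp: datetime
    source: str   # e.g. "monitoring" or "chat"; labels are illustrative
    detail: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from every source into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Sample data invented for illustration only.
monitoring = [
    Event(datetime(2024, 1, 1, 9, 0), "monitoring", "database latency alert"),
    Event(datetime(2024, 1, 1, 9, 12), "monitoring", "site returns errors"),
]
chat = [
    Event(datetime(2024, 1, 1, 9, 5), "chat", "on-call engineer paged"),
]

timeline = build_timeline(monitoring, chat)
for event in timeline:
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.detail}")
```

Reading the merged list from top to bottom makes gaps easier to spot, such as a long delay between an alert firing and anyone responding to it.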
Infrastructure and Processes
Even if the failure appears to lie in one part of the product, take this opportunity to examine your business processes and the way you handled the crisis as well, using what you’ve learned throughout this guide. This can be even more important than how you fix the physical failure.
There is no single cause
All too often you’ll find yourself pinning the blame on a single element, such as ‘the database server failed,’ but this is seldom the true root cause. Why did the database fail? Why was there no failover in place? Why did it take so long to restore the backups? Why didn’t Bob answer his pager?
There are no ‘causes’ of a failure, only ‘contributors.’ Take your time to properly understand all of the events which put you in a position where your service failed.
"For complex systems there is a myth that deserves to be busted, and that is the assumption that for outages and accidents, there is a single unifying event."
Mitigate the contributing factors
There’s no sure-fire way to prevent things from going wrong again, but you can take reasonable steps to mitigate some, if not most, of the contributing factors. It’s important that you do this, not only for your own peace of mind, but because it is the key step in beginning to rebuild trust with your customers.