Chapter 3 Being the First to Know
If your customers are having to tell you that your service is down, then you're already on the back foot. The first step in tackling a service outage is being the first to know.
Use tools like Pingdom and New Relic to watch for anything out of the ordinary, both inside your own infrastructure and in the infrastructure of any third-party services you rely on.
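To make the idea concrete, here is a minimal sketch of the kind of check a hosted monitoring tool runs for you on a schedule. It is not how Pingdom or New Relic work internally. The function name, the thresholds, and the example URLs are all assumptions for illustration, and a real setup would run checks from several locations and feed the results into alerting.

```python
import time
import urllib.request


def check_endpoint(url, timeout=5.0, slow_after=2.0, opener=urllib.request.urlopen):
    """Classify one HTTP check as 'ok', 'slow', or 'down'.

    `opener` is injectable so the function can be exercised without a network.
    The thresholds are illustrative assumptions, not recommendations.
    """
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except OSError:
        # Covers connection failures, timeouts, and the HTTP error
        # responses that urlopen raises as exceptions.
        return "down"
    elapsed = time.monotonic() - started
    if status >= 500:
        return "down"
    return "slow" if elapsed > slow_after else "ok"


if __name__ == "__main__":
    # Hypothetical targets: one of your own endpoints and one third-party dependency.
    targets = {
        "our API": "https://api.example.com/health",
        "payment provider": "https://status.example.net/",
    }
    for name, url in targets.items():
        print(f"{name}: {check_endpoint(url)}")
```

Watching the third-party URL alongside your own means that a failure in a dependency shows up in the same place as a failure in your own service.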
Also think about how you monitor social media and email from your customers, because sometimes a customer will catch an unusual edge case, or a gap in your monitoring, before your tools do. If you’re only checking your email and social profiles every few hours, you might get a nasty surprise the next time you look.
Ensure the right people are notified
All of the monitoring in the world isn’t worth a dime if it’s notifying a member of the ops team who’s asleep, or sending emails to a member of staff who left last month.
Use tools like PagerDuty and OpsGenie to help ensure that the right members of the team are reliably notified.
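The routing rule behind this advice can be sketched in a few lines. The example below is an illustration only and does not use the PagerDuty or OpsGenie APIs. The `Responder` type, its fields, and the `fallback` contact are hypothetical names chosen for the sketch. It skips anyone who has left the team or is off shift, and escalates to a fallback contact when nobody is available.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    active: bool      # False once someone has left the company
    shift_start: int  # hour of day, 0-23, in one agreed time zone
    shift_end: int    # exclusive; a shift may wrap past midnight

    def on_shift(self, hour):
        if self.shift_start <= self.shift_end:
            return self.shift_start <= hour < self.shift_end
        # Shift wraps past midnight, e.g. 22 -> 6.
        return hour >= self.shift_start or hour < self.shift_end


def who_to_page(responders, hour, fallback):
    """Return the names to notify at `hour`, or the fallback if nobody qualifies."""
    available = [r.name for r in responders if r.active and r.on_shift(hour)]
    return available or [fallback]


team = [
    Responder("alice", active=True, shift_start=8, shift_end=16),
    Responder("bob", active=True, shift_start=16, shift_end=22),
    Responder("carol", active=False, shift_start=22, shift_end=8),  # left last month
]

print(who_to_page(team, hour=9, fallback="ops-manager"))
print(who_to_page(team, hour=3, fallback="ops-manager"))
```

At 09:00 the alert goes to alice. At 03:00 the only scheduled responder is no longer with the company, so the alert escalates to the fallback rather than disappearing into an unread inbox.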
As soon as you know, tell your customers
Even before you try to understand the unfolding situation, be sure to let customers know that things aren’t looking normal. Be open and explain that you’re still investigating. This early, proactive approach will help cut down on the hubbub from customers and allow your team to focus on finding out what’s going wrong.