This past Sunday around 6am ET, Amazon Web Services (AWS), Amazon’s global cloud computing business unit, suffered a major outage that affected millions of customers worldwide. AWS powers many popular websites, including Netflix, Pinterest, Expedia, IMDb, and Reddit. Many of my own clients rely on AWS services.
Multiple AWS systems failed simultaneously: DynamoDB, Cognito, and CloudWatch. The last of these, CloudWatch, is Amazon’s own cloud monitoring platform. So the system used to monitor Amazon’s cloud went offline, unable to alert anyone to its own failure. Everyone assumed things were humming along until applications relying on AWS services began to fail.
More and more Software-as-a-Service (SaaS) companies rely on cloud hosting providers like AWS for their infrastructure. When an outage strikes, it isn’t localized the way a power company’s outage is. The whole world feels it.
One popular service in my area is Shipt, an on-demand grocery delivery service, which uses AWS. Customers were able to place orders, but never received their groceries. Shipt started receiving lots of angry Facebook messages, tweets, and emails. Nobody was manning the phones at 6am on a Sunday.
The outage was resolved after several hours, but most end customers were unaware of the root cause of their favorite sites’ downtime.
I think customers understand that websites can go down. They just don’t want to be surprised. Sunday’s event was about lack of information. When you have to Google news sites to see if a website or app is having problems, that’s a problem.
SaaS companies, like the ones affected by AWS’s outage, are generally powerless to resolve underlying issues with their cloud provider. So, they need to be far more proactive in customer notification when problems arise.
There’s nothing more embarrassing and trust-killing than finding out from your customers that your website is having problems.
I’ve implemented many third-party and custom monitoring solutions for my clients. One client, a $400M online retailer, had frequent issues with its infrastructure provider, so I delivered a monitoring solution that notified the marketing team of outages. The team could quickly get in front of an issue by posting a notice about checkout problems on the website, and they were on call to respond to social media inquiries. This one change stopped most customer service calls and rebuilt trust with customers.
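To make the idea concrete, here is a minimal sketch of that kind of check in Python. It is not the solution I delivered to that client. The endpoint URLs, the function names, and the wording of the notice are all assumptions made up for illustration. It probes a few customer-facing pages and, if any fail, hands a plain-language notice to whatever `notify` callback you supply (an email sender, a chat webhook, a site banner).

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List

# Hypothetical endpoints; replace with the pages that matter to your customers.
ENDPOINTS: Dict[str, str] = {
    "storefront": "https://shop.example.com/",
    "checkout": "https://shop.example.com/checkout/health",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, and so on


def find_failures(
    endpoints: Dict[str, str], fetch: Callable[[str], int] = http_status
) -> List[str]:
    """Return the names of endpoints that did not answer with a 2xx status."""
    return [name for name, url in endpoints.items() if not 200 <= fetch(url) < 300]


def build_notice(failures: List[str]) -> str:
    """Build a customer-facing notice, or an empty string if all is well."""
    if not failures:
        return ""
    affected = ", ".join(sorted(failures))
    return (
        f"We are aware of a problem affecting: {affected}. "
        "We are working on it and will post updates here."
    )


def run_check(
    endpoints: Dict[str, str],
    notify: Callable[[str], None],
    fetch: Callable[[str], int] = http_status,
) -> List[str]:
    """Probe every endpoint and call notify once if anything is down."""
    failures = find_failures(endpoints, fetch)
    if failures:
        notify(build_notice(failures))
    return failures
```

You would run something like `run_check(ENDPOINTS, notify=print)` on a schedule, ideally from a machine outside the provider you are monitoring, so that the checker does not go down along with the thing it is checking.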
Acquiring new customers is far more expensive than keeping existing ones, so online businesses should focus on how proactively they handle outages. Handle them well, and customers will trust you and stick around.
What steps are you taking to respond to outages and build trust?
Postscript: As I write this, Skype has been offline globally for most of the day.
© 2015 Mark Richman. All Rights Reserved.