On Tuesday, February 28, 2017, Amazon was behind one of the largest outages the web has suffered to date. The AWS S3 meltdown in the Northern Virginia (US-EAST-1) region took down a range of popular websites during business hours.
Affected enterprises included Trello, Travis CI, Quora, Medium, Docker’s Registry Hub, GitHub, GitLab, Signal, Slack, Razer, Imgur, Twitch.tv, Xero, Zoom.us, Salesforce.com, Strava and SiriusXM.
Websites, Image Storage, Apps and Internet of Things Were Affected
The cloud outage also crippled sites that rely on Amazon for image storage. These included Zendesk, Expedia, Flipboard, Coursera, Bitbucket, Heroku, Twilio, Mailchimp, the Autodesk cloud, Citrix, and Yahoo! Mail. Smartphone apps, cameras and appliances connected to the Internet of Things were also affected.
The outage lasted approximately five hours during the business day. Initially, Amazon offered no explanation as to why one of its largest cloud facilities went down and took a significant portion of the Internet with it.
Just a Typo?
A couple of days later, Amazon informed the public of the rather anticlimactic reason for the outage: a typographical error made by an employee.
The staffer, a member of the Amazon Simple Storage Service (S3) team, was debugging a performance issue that had been causing the S3 billing system to run much more slowly than usual when they mistyped a command. Needless to say, the keystrokes had the opposite of the intended effect.
The Amazon tech team has an established playbook from which employees draw to resolve specific issues within their system. The command used was intended to remove a small number of S3 servers to address the billing system problem. Unfortunately, the command was entered incorrectly, and it took a much larger set of servers offline than intended.
Five Hours of Downtime Is Unacceptable
The outage affected the storage and indexing subsystems of S3 and left Amazon unable to serve client requests for S3, as well as for Lambda and EC2 functions. Websites large and small were affected, as were numerous cameras, apps and “smart” gadgets connected to the Internet of Things.
Amazon cloud storage is known for being affordable and reliable, and it has grown rapidly in the last few years. However, this outage is quite a blow to that reputation. Restoring service took far longer than it should have.
Is Your Company Protected?
Amazon says it is putting safeguards in place to help prevent this from happening again, including partitioning the system into smaller “cells” so that debugging work affects only a small area at a time. Amazon is also auditing its other operational tools and adding further protections to them.
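To make the idea of a safeguard concrete, here is a minimal sketch of the kind of check an operations tool could run before taking servers offline. It is an illustration only, not Amazon's actual tooling: the function name plan_removal, the CapacityError exception, the min_remaining threshold and the server names are all hypothetical.

```python
class CapacityError(Exception):
    """Raised when a removal request would leave too few servers running."""


def plan_removal(fleet, requested, min_remaining):
    """Validate a removal request before any server is touched.

    fleet         -- names of all servers currently in service (hypothetical)
    requested     -- names of the servers the operator asked to remove
    min_remaining -- smallest number of servers that must stay in service
    """
    fleet_set = set(fleet)
    requested_set = set(requested)

    # Reject names that are not part of the fleet (a likely sign of a typo).
    unknown = requested_set - fleet_set
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")

    # Refuse the request if it would cut capacity below the safe minimum.
    remaining = len(fleet_set) - len(requested_set)
    if remaining < min_remaining:
        raise CapacityError(
            f"removing {len(requested_set)} servers would leave {remaining}, "
            f"below the required minimum of {min_remaining}"
        )

    return sorted(requested_set)


# Example with a made-up fleet of 100 servers and a minimum of 90 in service.
fleet = [f"s3-index-{n:03d}" for n in range(100)]
plan = plan_removal(fleet, fleet[:3], min_remaining=90)
```

With these example values, a request to remove three servers passes, while a mistyped request covering 40 servers raises CapacityError before anything is taken offline. The point of the design is that the tool, rather than the person typing, enforces the limit.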
Clearly, you can never have too many safeguards when it comes to your company’s technology infrastructure. If you have doubts about the reliability of your cloud server solution, Nachman Networks of Washington DC, Northern Virginia & Maryland can help. Give us a call at (703) 600-3301 to set up a consultation. You can also send us an email at firstname.lastname@example.org. We look forward to working with you!