AWS Outage: Lessons for Organizations to Mitigate the Fallout

AWS, the world’s largest cloud service, suffered a devastating service outage in its US-EAST-1 Region, quickly affecting internet services worldwide. Let’s look at what happened, and the lessons organizations should learn to mitigate the impact of the next big public cloud outage.

AWS emerged as the world’s first public cloud service in 2005, long before Microsoft, Google, and others unveiled their solutions. The cloud service still reigns as the top cloud provider, amassing a revenue of $16.11 billion in Q3 2021 and maintaining a sizable 41% share in the cloud infrastructure market. However, despite a decade and a half of advancements and improvements, it remains prone to periodic outages like other cloud services.

The latest outage struck the company’s US-EAST-1 region, located in northern Virginia and among the oldest of 25 AWS regions distributed worldwide, of which eight are in the U.S. alone. Launched in 2006, the US-EAST-1 region comprises six availability zones that are interconnected with high-bandwidth networking to offer low latency and high throughput. It is the only AWS region that supports Cluster Compute instances and, until 2017, served as the default Region for businesses accessing resources from the AWS Management Console. The US-EAST-2 region in Ohio, which went live in 2016, now shoulders this responsibility.


Zero Hour: The Outage

It is not known precisely at what hour the outage struck US-EAST-1, but online services, websites, and games relying on AWS began stuttering at around 10:45 AM ET on Tuesday, December 7. The list of affected sites and online services included Facebook, Netflix, Tinder, Robinhood, Roku, Venmo, McDonald’s, Southwest Airlines, Instacart, Disney+, the Associated Press, and some Amazon services like Ring security cameras, IMDb, Prime Video, and Alexa. Several universities and government entities were also affected.

Considering an outage of this scale affected online services worldwide and quickly garnered an avalanche of media queries, Amazon didn’t waste much time in soothing nerves. AWS said the outage affected the global console landing page hosted in US-EAST-1, as well as services like Amazon Connect, Amazon DynamoDB, Amazon Elastic Compute Cloud, and the AWS Management Console and Support Center.

“We are experiencing API and console issues in the US-EAST-1 Region. We have identified the root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1,” the cloud infrastructure giant reported on the day of the outage.

The company then began releasing hourly updates about its response to the outage. At 6 PM ET, it said, “We have mitigated the underlying issue that caused some network devices in the US-EAST-1 Region to be impaired. We are seeing improvement in availability across most AWS services. All services are now independently working through service-by-service recovery.”

“We continue to work toward full recovery for all impacted AWS Services and API operations. In order to expedite overall recovery, we have temporarily disabled Event Deliveries for Amazon EventBridge in the US-EAST-1 Region. These events will still be received & accepted, and queued for later delivery.”

By 7:30 PM ET, AWS said it had resolved all network device issues and was actively working on recovering impaired services, such as SSO, Connect, API Gateway, ECS/Fargate, and EventBridge. In the meantime, it advised AWS cloud users to switch to other region-specific consoles, such as the US-WEST-2 console, to keep running their online services.


Understanding the Root Cause

On Friday, December 10, AWS finally named the root cause behind the outage and provided fresh updates. In a detailed statement, the company said the outage began at 10:30 AM ET on December 7 when an automated code execution, designed to scale the capacity of one of the AWS services hosted in the main AWS network, triggered unexpected behavior from a large number of clients inside the internal network.

“This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.”
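The feedback loop AWS describes, in which errors trigger retries that add yet more load, is commonly dampened on the client side with exponential backoff plus random jitter. The sketch below illustrates that general technique only; it is not AWS's remediation, and the function name, parameters, and default values are hypothetical.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on a connection failure, wait a randomized,
    exponentially growing delay before trying again.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Cap the exponential window, then pick a random point inside it
            # ("full jitter") so many clients do not retry in lockstep.
            window = min(max_delay, base_delay * (2 ** attempt))
            sleep(window * rng())
```

Spreading retries over a growing, randomized window means that thousands of clients hitting the same failure are less likely to reconnect at the same instant, which is the pattern that keeps congested devices congested.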

The performance issues immediately reduced the availability of real-time monitoring data, which hampered AWS’s ability to find the source of the congestion. However, after observing elevated internal DNS errors, AWS’s internal operations teams fully resolved the DNS resolution errors by 12:28 PM ET, reducing the load on the impacted networking devices. Through sustained remediation efforts while system visibility was impaired, the teams recovered the network devices by 5:22 PM ET. Affected services like API Gateway, EventBridge, AWS Security Token Service (STS), and AWS container services, including Fargate, ECS, and EKS, were also restored as the day progressed.

“The impairment to our monitoring systems delayed our understanding of this event, and the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region,” AWS said.

“While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”


Why Do Outages Like These Occur?

The massive outage affecting AWS’s US-EAST-1 Region isn’t the first time a leading public cloud provider has suffered an outage of this scale. For instance, the Google Cloud Platform (GCP) suffered a major network service outage in November that impacted Google App Engine services across multiple regions. As a result, several popular websites and games, such as Snapchat, Shopify, Discord, Pokemon Go, and Nest, suffered downtime.

According to BMC Software, there are many reasons why leading public cloud services routinely suffer outages. The most prominent is power failure, since massive data centers need scalable, round-the-clock electricity. Other causes include Distributed Denial of Service (DDoS) attacks that overload data centers with incoming traffic, human error (personnel giving incorrect commands), software and technical issues, networking issues, and periodic maintenance.

Most of these causes cannot be predicted in advance, especially minor technical faults or software bugs. For instance, the recent outage affecting AWS’s US-EAST-1 Region stemmed from an automated code execution that is routinely run to scale AWS services hosted in the main AWS network. It set off a cascading series of malfunctions affecting many network devices and AWS services.

BMC Software recommends that users of public cloud services, notwithstanding tall claims made by service providers, assume that a cloud outage will happen and plan corrective measures to reduce the impact.

“If the impact of an outage for a certain duration is not acceptable for healthy business operations, it may be suitable to invest in high availability SLAs. Similarly, additional monitoring, visibility and control capabilities may be required on part of customers to ensure that a possible cloud outage is least impactful toward their business,” the company says.
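To judge whether an outage "for a certain duration" is acceptable, it can help to convert an availability percentage into a downtime budget. The sketch below does that arithmetic; the targets shown are illustrative round numbers, not the terms of any provider's actual SLA, and the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Illustrative targets only; check the provider's SLA for real figures.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min per 30 days")
```

For scale, the network-device impairment described above ran from roughly 10:30 to 5:22 PM ET, about 412 minutes, which is far beyond the roughly 43 minutes a 99.9% monthly target would allow.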


How to Recover From Cloud Outage Fallout

Kevin Beasley, CIO at VAI, tells Toolbox that no organization is ever 100% safe from an outage, but having a disaster recovery strategy in place helps it prevent downtime and lost opportunity. “The AWS outage is the most recent example of why it’s imperative to implement a disaster recovery plan and to have solid systems in place that can automatically update consumers about an outage. Without a solid disaster recovery plan and a technology team well versed in data backup, companies are left vulnerable to information outages such as the AWS outage and other devastating incidents.”

It’s becoming increasingly imperative for business leaders to adopt backup strategies and technologies in preparation for outage recovery. Implementing solutions such as ERP cloud backup systems, detailed disaster recovery plans, and automated monitoring can help organizations prepare better for outages like these, he adds.

Dan Johnson, director for global compliance and continuity at Ensono, says that implementing a robust disaster recovery plan requires frequent assessments and analyses and different prevention, preparedness, response, and recovery measures. “These disaster plans have the capability to minimize the impact on business performance by moving workloads and data over to the recovery site, ultimately managing all data until systems are restored.”

Johnson adds that aside from running a disaster recovery plan, another sound strategy to prevent disruptions from outages is implementing a multi-cloud strategy, which allows disaster recovery plans to work across providers when experiencing issues or a complete outage. “While many organizations express concern around cost and security in multi-cloud plans, the benefits of these strategies are best for workloads in the long run.”
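At its simplest, moving workloads to a recovery site or a second provider comes down to trying endpoints in priority order and using the first one that passes a health check. The sketch below shows only that selection logic. The endpoint URLs are made-up placeholders, the health check is simulated, and real failover also involves data replication and DNS or load-balancer changes that are out of scope here.

```python
def first_healthy(endpoints, is_healthy):
    """Return the first endpoint whose health check passes, in priority order."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except OSError:
            continue  # treat network errors as "unhealthy" and try the next one
    raise RuntimeError("no healthy endpoint available")


# Hypothetical endpoints: primary region, standby region, second provider.
ENDPOINTS = [
    "https://api.us-east-1.example.com",
    "https://api.us-west-2.example.com",
    "https://api.other-cloud.example.com",
]

# Stand-in health results for the demo; a real check would probe each endpoint.
simulated_status = {ENDPOINTS[0]: False, ENDPOINTS[1]: True, ENDPOINTS[2]: True}
print(first_healthy(ENDPOINTS, simulated_status.get))
```

Keeping the health check as a parameter means the same ordering logic can be exercised in a drill with simulated failures before it is needed in a real outage.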

