Amazon.Com Inc (NASDAQ: AMZN)'s Amazon Web Services (AWS) has revealed the cause of a Dec. 7 outage that took down parts of its own services, along with third-party websites and online platforms that rely on AWS.
What Happened: In a post on the AWS website, the company explains that an automated process caused the outage, which began around 10:30 a.m. ET on Tuesday in the Northern Virginia (US-EAST-1) region.
The post indicates that an automated activity to scale the capacity of one of the AWS services hosted in the main AWS network triggered unexpected behavior from a large number of clients inside the internal network.
“This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks,” the post reads. “These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.”
When it comes to the impact of the outage, AWS says customer workloads were not directly affected by the internal networking issues, although the outage did impact a number of AWS services, which in turn impacted customers using those service capabilities.
Because the main AWS network was not affected, some customer applications that did not rely on these capabilities experienced only minimal impact from this event.
What’s Next: AWS indicates that several steps have been taken to prevent a recurrence of this event. The company immediately disabled the scaling activities that triggered this event and will not resume them until it has deployed all necessary remediations.
“Our systems are scaled adequately so that we do not need to resume these activities in the near-term,” the post indicates. “Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but a latent issue prevented these clients from adequately backing off during this event.”
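For readers unfamiliar with the term, request back-off generally means that a client waits longer between each retry so that a congested service has room to recover. The Python sketch below illustrates one common variant, capped exponential back-off with random jitter. It is not AWS's client code, which the post does not publish; the function names, the `ConnectionError` trigger, and the default timings are all assumptions chosen for illustration.

```python
import random
import time


def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound (seconds) for the wait before retry number `attempt`.

    The bound doubles with each attempt and is capped at `cap`.
    Both defaults are illustrative values, not figures from AWS.
    """
    return min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts=6, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Call `operation`, retrying on ConnectionError with jittered back-off.

    Hypothetical helper: the wait before each retry is drawn uniformly
    between 0 and the current ceiling, so that many clients failing at
    the same moment do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            # Give up and surface the error once the retry budget is spent.
            if attempt == max_attempts - 1:
                raise
            sleep(rng.uniform(0, backoff_ceiling(attempt, base, cap)))
```

With these defaults a client would wait at most 0.1 seconds before its first retry, at most 0.2 seconds before its second, and so on up to the cap. A client that instead retried immediately would add to the kind of connection surge the post describes.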
AWS officials say they are developing a fix for this issue and expect to deploy this change over the next two weeks. They have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event.
Photo: Courtesy of Unsplash