AWS Outage Lessons Learned: If Netflix Can Suffer, So Can You

On Christmas Eve and continuing into Christmas Day, AWS had a “Service Event” centered on the ELB (Elastic Load Balancing) service in the US-East region. Although only a small percentage of ELBs were functionally disabled and unable to route traffic to their backend servers, all ELBs in the region experienced a time interval in which they could not scale, nor could changes be made to their configuration, such as adding or removing backend servers from the load balancing pool (for the full details, see the AWS post-mortem).

Many AWS customers were affected at some level – a few more negatively than others – as every ELB in the region was involved to some degree. Among these AWS customers was Netflix, whose Cloud Architect Adrian Cockcroft described the Netflix view on the outage in his blog post.

This brings me to an interesting point that many in the IT community (including Netflix) are discussing in regard to this outage: the benefits of a multi-region setup and how such a configuration can help in these situations.

Don’t Get Seduced by Vendor-Specific Solutions

While we here at RightScale are big believers in using multiple regions for disaster recovery purposes, in many situations a multi-region configuration is overkill for day-to-day production operations. Many of our recommended best practices for production deployments can be found in my white paper, Building Scalable Applications In the Cloud: Reference Architecture and Best Practices.

The vast majority of RightScale customers run their production systems in a single region, and we advise them to avoid vendor-specific tools in order to reduce the potential for hidden dependencies that these tools may introduce. A key takeaway from this recent AWS outage is that while it did affect the entire US-East region, it only affected a single vendor-specific service in the region – the Elastic Load Balancing service.

If your website had been using a different instance-based load balancing solution (HAProxy, nginx, etc.) you would have been totally isolated from this failure and seen no service impact. In a previous blog post I provided some tips for fine-tuning your cloud architecture, of which Tip #9, “Be Wary of Cloud Lock-in,” cautions in part:

The use of vendor-specific tools and virtual appliances may make deploying an application easier in the short term, but many times these services are integrated or tied into other services that can result in cascading outages if one of these underlying services suffers a service disruption. The use of vendor-neutral solutions insulates your application tiers from these service integrations, as well as creating a cloud-portable solution…

Using a vendor-specific tool such as ELB makes things easier during the setup phase, but it carries two costs. It locks you into the vendor, since that tool is not available from other cloud providers, and (more importantly in this case) it can result in a service disruption due to a cascading effect. The ELB state data that was inadvertently deleted by a manual mistake (more on this later) was relevant to only a very small percentage of ELBs, yet the entire ELB infrastructure was affected in some way: for a time, no ELB in the region could scale or accept configuration changes.
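To make the contrast concrete, here is a minimal Python sketch of the pool bookkeeping that an instance-based load balancer keeps on its own server. It is an illustration only, not how HAProxy or nginx is implemented; the `BackendPool` class and the addresses are hypothetical. The point is that adding, removing, or sidelining a backend is a local operation that does not depend on a provider-side control plane being available.

```python
class BackendPool:
    """Toy round-robin pool; all names here are hypothetical."""

    def __init__(self, backends):
        self._backends = list(backends)      # ordered list of "host:port" strings
        self._healthy = set(self._backends)  # backends currently eligible for traffic
        self._counter = 0                    # total picks so far, drives the rotation

    def add(self, backend):
        # Local change: no external API call is needed to grow the pool.
        if backend not in self._backends:
            self._backends.append(backend)
            self._healthy.add(backend)

    def remove(self, backend):
        if backend in self._backends:
            self._backends.remove(backend)
            self._healthy.discard(backend)

    def mark_down(self, backend):
        # Called when a health check fails; the backend stays configured.
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Rotate over the healthy backends only.
        candidates = [b for b in self._backends if b in self._healthy]
        if not candidates:
            raise RuntimeError("no healthy backends available")
        backend = candidates[self._counter % len(candidates)]
        self._counter += 1
        return backend


pool = BackendPool(["10.0.0.1:80", "10.0.0.2:80"])
first = pool.pick()
second = pool.pick()
pool.mark_down("10.0.0.1:80")   # simulate a failed health check
third = pool.pick()
pool.add("10.0.0.3:80")         # scale out without any outside dependency
fourth = pool.pick()
```

A production load balancer adds active health checks, connection draining, and its own redundancy, so this sketch shows only where the state lives, not a complete design.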

The Benefit of Using Loosely Coupled Components

One of the mantras that we preach in the world of cloud architecture best practices is to use loosely coupled components. This outage illustrates that what might appear to be a standalone component may in actuality have tight integrations with other infrastructure behind the scenes, and thus the degradation of one aspect of the system can have a cascading effect on other components. Netflix has done a lot of amazing things with regard to its cloud automation tools (take a look at some of the company’s open-source projects – Simian Army in particular is awesome), but its continued dependence on AWS-specific components has always been a mystery to me.

Regarding the root cause of the outage, GigaOm provided a thoughtful take on what happened and Netflix’s subsequent response. It is noteworthy that the original issue was caused by a developer manually running a process that was “currently being automated,” as we advise our customers to automate all processes that touch production systems. (And we do practice what we preach at RightScale in running our own complex web of interconnected systems.)

Any automated process that requires a manual “kick-off” (such as what I am assuming might have been the case in the AWS ELB situation) should be subject to both access controls (the “who” that can do it) and logging/auditing (the “what” the “who” did, and “when”). Manual errors are to be expected in any environment because humans are imperfect beings, so we should relinquish as much control to our automated systems as possible. The only mistakes they make are the ones we tell them to make. 🙂
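As a rough illustration of that "who, what, and when" idea, the Python sketch below wraps a manually triggered job in an access check and an audit log. Everything in it is an assumption made for the example: the `AUTHORIZED_OPERATORS` allowlist, the `audited` decorator, and the `purge_stale_state` job are hypothetical, and none of it reflects how AWS or RightScale implement their own controls.

```python
import functools
import getpass
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

# Hypothetical allowlist; a real system would consult an identity service.
AUTHORIZED_OPERATORS = {"alice", "bob"}


class NotAuthorized(PermissionError):
    """Raised when an operator may not kick off the requested action."""


def audited(action):
    """Gate a manual kick-off behind an access check and record who/what/when."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, operator=None, **kwargs):
            who = operator or getpass.getuser()
            when = datetime.now(timezone.utc).isoformat()
            if who not in AUTHORIZED_OPERATORS:
                audit_log.warning("DENIED who=%s what=%s when=%s", who, action, when)
                raise NotAuthorized(f"{who} may not run {action}")
            audit_log.info("START who=%s what=%s when=%s", who, action, when)
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Failures are audited too, then re-raised unchanged.
                audit_log.exception("FAILED who=%s what=%s", who, action)
                raise
            audit_log.info("DONE who=%s what=%s", who, action)
            return result
        return wrapper
    return decorator


@audited("purge-stale-state")
def purge_stale_state(dry_run=True):
    # Placeholder for the real maintenance job; defaults to a safe dry run.
    return "dry run: nothing deleted" if dry_run else "state purged"


outcome = purge_stale_state(operator="alice")
```

Defaulting the job to a dry run is a deliberate choice in this sketch: the destructive path then has to be requested explicitly, which leaves one more trace in the audit trail.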

Highly Available, Resilient Systems Are the Answer

This AWS outage serves as a good illustration of why “best practices” have earned the name. When shortcuts are taken or oversights creep into the mix, what may appear at first blush to be a small, localized issue can potentially cascade into a bigger problem and adversely affect your infrastructure and the applications that depend on it. On the surface, individual components can appear to be distanced from those that are suffering the outage or performance degradation, but in reality they can also be potential points of failure.

The tools are out there to build highly available, resilient systems. How effectively you use these tools will dictate your tolerance to infrastructure service disruptions that will continue to occur, however infrequently. To try some of these vendor-neutral solutions, user permission features, logging and auditing mechanisms, and more, get a free trial of RightScale Cloud Management.