New EC2 Instance Types and Coordinated Failures in the Cloud

Amazon announced a new series of instance types today that have no local storage, which got me thinking about some of the failure modes we’ve seen. Specifically, AWS released the “m3” series of instances with 15 or 30 GB of memory, 4 or 8 cores, and no local disk storage, which means that the root volume (from the image) and all additional data storage must use EBS volumes. The move to servers without local disks has long been expected and is no surprise, given that mechanical spindles must be one of the top causes of failure. Putting disks into redundant storage systems such as EBS improves durability and manageability, which are good things.

What got me thinking is that the move towards EBS-only instance types seems to make it more difficult to run highly available distributed services. The design of distributed systems, such as replicated data stores, assumes that servers fail relatively independently of one another: if a server that holds a piece of data fails, it is very unlikely that the other servers holding replicas of the same data fail at the same time. To account for many servers failing at once, one has to ensure that replication spans the groups of servers that are likely to fail together, which is what Amazon’s availability zone concept is intended to enable.
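
To make the independence assumption concrete, here is a rough back-of-the-envelope sketch. The probabilities are made-up illustrative numbers, not measured AWS figures, and the function name and the simple "one shared dependency takes out every replica" model are my own assumptions.

```python
def p_all_replicas_down(p_server, replicas, p_shared=0.0):
    """Probability that every replica of a piece of data is down at once.

    p_server: assumed chance that a single server is down, independently
              of the others (illustrative value, not a measurement).
    replicas: number of servers holding a copy of the data.
    p_shared: assumed chance that a shared dependency (for example a
              common storage subsystem) fails and takes all replicas
              down together.
    """
    independent = p_server ** replicas
    # Either the shared dependency fails, or it survives and every
    # server happens to fail on its own.
    return p_shared + (1.0 - p_shared) * independent


independent_only = p_all_replicas_down(0.01, 3)
with_shared = p_all_replicas_down(0.01, 3, p_shared=0.001)
print(independent_only, with_shared)
```

Under these assumed numbers, even a rare shared failure dominates the result, which is why it matters whether replicas share a dependency and not just how many replicas there are.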

Instances using EBS storage volumes are tightly coupled to the EBS system, and it’s hard to ignore the fact that the EBS system was involved in most of the recent large-scale outages in EC2. This means that the EBS system is a potential source of coordinated failures that one has to take into account when deploying highly available distributed systems. EBS is probably the most criticized part of EC2, so this realization gave me pause. To sort out my thoughts, I started to list some relevant observations so I could balance them against each other:

  • several memorable outages started within EBS and took down a large number of servers as a result
  • servers within an availability zone can be in the same rack, on the same UPS, or attached to the same router(s), which are also sources of coordinated failures
  • EBS is a distributed storage system, not a monolithic one, so different instances are likely to be connected to different storage servers/subsystems
  • AWS has clearly made progress in isolating the availability zones at the EBS level: the EBS issues last week did not propagate across zones (although some other issues did ripple through). It would still be great if AWS provided information on how earlier issues have been resolved
  • any good replication set-up must ensure that data is replicated across zones and not just within a zone; this is true whether one uses EBS or not
  • a typical replication factor is 3 and sometimes higher, which means that multiple replicas have to be in the same zone when one is using one of the smaller regions that have only two zones
  • at the end of the day, for critical data/services, a DR replication to a different region or cloud provider is required
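
The replication-factor point in the list above can be checked with a small sketch. It assumes a simple round-robin placement across zones; the zone names and both helper functions are hypothetical, and real data stores use their own placement strategies.

```python
from collections import Counter
from itertools import cycle, islice


def place_replicas(zones, replication_factor):
    """Assign replicas to zones round-robin, spreading them as evenly as possible."""
    if not zones:
        raise ValueError("at least one zone is required")
    return list(islice(cycle(zones), replication_factor))


def max_replicas_lost(placement):
    """Number of replicas lost if the most heavily used zone fails as a unit."""
    return max(Counter(placement).values())


# Hypothetical zone names, for illustration only.
two_zone = place_replicas(["zone-a", "zone-b"], 3)
three_zone = place_replicas(["zone-a", "zone-b", "zone-c"], 3)

print(two_zone, max_replicas_lost(two_zone))
print(three_zone, max_replicas_lost(three_zone))
```

With a replication factor of 3 and only two zones, one zone necessarily holds two replicas, so a single zone-wide failure can remove a majority of the copies. With three zones, the same failure costs only one replica.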

After re-reading these observations I felt a bit better. If one follows these best practices, one should end up with a system that is as resilient to coordinated EBS failures as it is to other coordinated failures in an availability zone, so the use of EBS doesn’t seem to introduce significant new factors.

Another concern is consistent performance, which wasn’t a strength of EBS until the introduction of provisioned IOPS (the ability to reserve a certain minimum rate of I/O operations per second). Coordinated performance degradations are sometimes much more difficult to troubleshoot and remedy than outright failures, where the failover decision is obvious. For this reason, I would highly recommend the use of provisioned IOPS in distributed systems whose performance hinges on I/O.

An additional angle to consider is cost: EBS volumes, and provisioned IOPS on top of them, increase the overall operating cost. In contrast, the 4 local disks of an m1.xlarge instance are included in the base instance cost. It’s perhaps an apples vs. oranges comparison in some cases, but the costs do add up when launching clusters of servers.

After summing everything up, I feel better about the new instance types without local disks, but I have to admit that we’ll continue to prefer the m1.xlarge instances that have 4 local disks for our Cassandra (distributed NoSQL storage) clusters, if only because it’s easier to reason about them and to troubleshoot when something doesn’t perform as expected.

If you have additional observations or thoughts about this topic, please do leave a comment so we all learn more! You can also talk to us about these topics at the AWS re:Invent conference later this month.