RightScale Server Orchestration and Amazon SWF launch


The launch of Amazon’s SWF (see also Werner’s blog) is a good opportunity to talk about some of the exciting new automation features that we have in the works and will make available in coming releases. We’ve been using SWF as one of our back-end services for a number of months, and it’s a pretty awesome service that greatly accelerated the development of our orchestration features. In my mind, automation is the most fundamental innovation in cloud computing. It enables all the business benefits (pay as you go, scale on demand, resiliency, predictability, etc.), and it lets us increase both the scale at which we use computing and the reliability of services. Cloud computing is inconceivable without automation throughout the entire stack.

RightScale has focused on automation from day one. We provided auto-scaling of server arrays early on: automatically launching and terminating servers based on monitoring metrics, such as CPU load. Something most newcomers don’t appreciate is that making the call to launch the next server when the CPU load goes up on the running ones is not the difficult part. The difficult part is bringing the new servers into full operation. That involves loading all required software, configuring everything, and connecting the server with other services, such as load balancers and databases. This is why a big piece of the RightScale functionality concerns itself with configuration management and with automating the entire boot process, all the way to the point where the application is in production.

In coming releases we are continuing to build on this platform by introducing server orchestration. Server orchestration uses a workflow language that lets you automate at the level of RightScale resources, such as servers and deployments. The first functionality we implemented lets you customize the three key pieces of auto-scaling: (1) deciding when to scale up or down and by how many servers, (2) launching new servers, and (3) terminating existing servers.

The way this works is that the RightScale system calls a user-defined decision function every minute to find out whether the server array should be scaled up or down and by how much. The decision function simply returns an integer that indicates how many servers to launch (a value > 0) or terminate (a value < 0), or zero to keep the server count the same. The decision function can retrieve monitoring data using our API and perform a calculation similar to the built-in one, or it can do something completely different. An interesting example would be to use knowledge of application-specific state and metrics to better predict requirements. You may be able to tell early that a flash event is coming and that you need to launch a large number of servers all at once. That’s just one example; the sky is really the limit, and I know some of our customers have pretty cool ideas in this area!
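To make the contract concrete, here is a hypothetical Ruby sketch of such a decision function. It takes per-server CPU load averages as input; in practice these would be fetched via our monitoring API. The function name, target load, and bounds are illustrative, not part of our API:

```ruby
# Hypothetical custom scaling decision function.
# Returns > 0 to launch that many servers, < 0 to terminate,
# and 0 to keep the server count the same.
def scaling_decision(cpu_loads, target_load: 0.6, min_servers: 2, max_servers: 20)
  current = cpu_loads.size
  avg = cpu_loads.sum / current.to_f
  # Estimate how many servers would bring the average load to the target,
  # then keep the count within the array's configured bounds.
  desired = (current * avg / target_load).ceil.clamp(min_servers, max_servers)
  desired - current
end
```

A smarter version could replace the averaging with application-specific prediction, as described above, while keeping the same integer-returning contract.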

When the decision function asks for more servers, RightScale runs a scale-up workflow to actually launch them. This puts you in control of how the servers are launched and creates an interesting opportunity to carefully manage where they are launched. For example, you may want to ensure your servers are spread evenly across a number of datacenters for availability reasons. Or you may want to launch wherever it’s cheapest. Similarly, the scale-down workflow can be picky about which servers are terminated. In the built-in auto-scaling we terminate the oldest servers to ensure a continuous refresh of the running stock, but for some applications it’s preferable to terminate the youngest servers. In addition, the scale-down workflow can gracefully shut down the application, take a last backup, save away log files, and then terminate the server.
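The selection part of a scale-down workflow can be sketched as a simple policy function. This is an illustration only: the `Server` struct and the strategy names below are made up for the example and are not RightScale API:

```ruby
# Illustrative server record; launched_at is a timestamp or epoch value.
Server = Struct.new(:id, :launched_at)

# Pick `count` servers to terminate. :oldest_first mirrors the built-in
# behavior (continuous refresh of the running stock); :youngest_first
# suits applications where long-lived servers are more valuable.
def servers_to_terminate(servers, count, strategy: :oldest_first)
  sorted = servers.sort_by(&:launched_at)
  sorted = sorted.reverse if strategy == :youngest_first
  sorted.first(count)
end
```

The real workflow would then run the graceful-shutdown steps (stop the application, take a last backup, save log files) on each selected server before terminating it.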

As we designed the orchestration functionality we kept coming back to two key requirements: concurrency and fault tolerance. We need to express concurrent activities with ease because, when one operates on many servers, it’s the only way tasks complete in a reasonable amount of time. For example, to perform a rolling upgrade on a number of servers the orchestration ought to grab a set of servers, run them through the upgrade process in parallel, and then move on to the next set.
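The rolling-upgrade pattern above can be sketched in plain Ruby, using threads as a stand-in for the workflow engine’s concurrency. The `rolling_upgrade` helper and the block it takes are hypothetical, not our workflow language:

```ruby
# Take servers in fixed-size batches, upgrade each batch concurrently,
# and wait for the whole batch to finish before moving to the next.
def rolling_upgrade(servers, batch_size: 2, &upgrade)
  servers.each_slice(batch_size) do |batch|
    # One thread per server in the batch; join all before continuing,
    # so at most batch_size servers are out of service at once.
    batch.map { |server| Thread.new { upgrade.call(server) } }.each(&:join)
  end
end
```

Keeping the batch size well below the array size is what keeps the application serving traffic throughout the upgrade.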

Where orchestration becomes really exciting is when it is used to recover from failures and automatically relaunches failed resources, possibly in a different datacenter or cloud. That immediately raises the question about the resiliency of the orchestration process itself: what if it is affected by the same broader failure and can’t perform the recovery? Similar concerns arise when an orchestration process runs for a long time. The array auto-scaling example above could be implemented using a “parent” workflow that runs forever and invokes the decision function and scaling sub-workflows periodically. And again, this execution must be resilient to failures.

In order to provide good support for concurrency and to offer a fault-tolerant execution environment, we decided to base our orchestration system on a workflow language built around the open-source Ruote workflow engine. Ruote offers a multitude of very nice structured concurrency constructs. For example, you can express strategies such as “run concurrently and wait for all” or “run concurrently, wait for the first, then cancel the rest.” The latter may sound unusual, but it’s useful when you need a resource and want to try multiple avenues and pick the first one that succeeds.
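The “wait for the first, then cancel the rest” semantics can be illustrated in plain Ruby with threads. This is only a sketch of the behavior; in Ruote the same strategy is expressed declaratively in the workflow definition rather than with explicit threads:

```ruby
# Run all tasks concurrently, return the first result that arrives,
# and cancel the slower attempts.
def first_to_succeed(tasks)
  results = Queue.new
  threads = tasks.map { |task| Thread.new { results << task.call } }
  winner  = results.pop        # block until the first task finishes
  threads.each(&:kill)         # cancel the remaining branches
  winner
end
```

This is exactly the shape of “try multiple avenues to obtain a resource and take whichever succeeds first,” for example attempting to launch a server in several datacenters at once.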

Amazon SWF came in very handy to ensure fault-tolerant execution of the Ruote workflows. We retargeted Ruote to leverage Amazon SWF as an execution back-end with the result that workflows are executed by many servers distributed across multiple availability zones. SWF takes care of scheduling the execution of workflow actions, collecting the results, and then atomically handing the results back to Ruote so it can schedule the next wave of actions. The result is a highly resilient orchestration system that can continue the execution of workflows in the face of major failures.

We’re obviously very excited about the upcoming features and can’t wait to make them available to our customers. Now that Amazon SWF is live we’re on the home stretch and hope to be ready for a private beta shortly after the upcoming release. If you’re interested in early access, please send me an email.