Effective Engineering in the Cloud: The Worker Tier

This is the fourth in series of articles about how we deploy on AWS and how it has helped us evolve and grow our business with a focus on being responsive to our customer’s needs, rather than trying to spend the least amount on our product hosting. In my first article, I provided a high-level view of the technology we use. In this post, I’ll describe our Worker Tier and how we use AWS to scale up and down as our load changes.

In our worker tier, we deploy on EC2 instances and Autoscale to grow and shrink our worker pool as our load changes. Just like in our web tier, all of the EC2 instances that we launch use the same base image, which has required OS and third-party software pre-installed.  We launch our EC2 instances over multiple Availability Zones, but again, stick with a single Region. We run all of our workers outside of a VPC but secure them using IAM and security groups.

We run multiple Autoscale groups – essentially one per content type (docs, calendars, sites, mail, backup, restore, export, etc.), since each of these content types have different runtime needs. This allows us to easily adjust to large customer ingest where some domains have a lot of email and others have a ton of documents. We scale up based on the size of our SQS queues and scale down based on the average CPU of each Autoscale group. AWS makes it super easy to pull these (and many other) metrics from, and set triggers in, CloudWatch.

Scaling up is easy; we just increase the size of an Autoscale group, and if there are not enough workers in that group, AWS will launch a few more. We use Chef to deploy our code and configure these instances by simply passing a user-data script into the instance, which tells the Chef client what role to configure. Each of our worker boxes (currently we use m1.large instances) has a pre-set number of worker processes configured, so launching a new worker box increases our backup throughput by the number of workers configured for that backup type. There is a fine dance between maximum utilization and leaving enough headroom to handle unexpected spikes, so we continuously monitor our worker instances (using Nagios and Graphite) to ensure that we take full advantage of their CPU and Memory.

Scaling down is bit more tricky for our workers, since most of our jobs are long-running. It can take many hours to process a new user’s backup or some of our larger restore jobs. It is very “uncloudy” to force (or even expect) an EC2 instance to stay up for that full job duration, so we have built resumability into all of our worker processes. This is reasonably complex code, and it behaves differently for each of the worker processes, but at the high level we:

  • Keep track of the progress for each process in a database;
  • Requeue any work in progress if we get a gentle shutdown signal;
  • Monitor for zombie jobs that were terminated ungracefully and restart them when we find them.

We use the smallest instance class possible (m1.large currently) for our workers since that provides the most flexibility in terms of scaling our worker pool up and down. We try to shut down worker boxes aggressively in order to keep our overall Autoscale CPU high. We do not want to keep a bunch of lightly-utilized worker boxes running, since they are charged by the hour. While optimizing for cost savings is not our top priority, we can’t use a pure brute force approach of keeping a maximum worker pool running. Doing so at scale, you can rack up some serious AWS charges if you don’t pay attention to shutting things down (just ask how we know ;->).

AWS is a key enabler in terms of the way that we can scale our worker pools as our load increases over time. It also provides a super resilient deployment model so that our workers can always run in a healthy Availability Zone. The ease with which we can scale our worker pools down is also a key contributor to operational efficiency, while at the same time being responsive to the needs of our customers. In my next post, I’ll describe our storage tier along with the good, bad, and ugly of S3.