Effective Engineering in the Cloud: The Storage Tier

This is the fifth in an ongoing series of articles about how we deploy on AWS and how it helps us evolve and grow our business with a focus on being responsive to our customers' needs, rather than on spending the least amount possible on product hosting. In my first article, I provided a high-level view of the technology we use. In this post, I'll describe our Storage Tier and how we use S3 for all of our permanent storage. You can read about the worker tier, the DB tier, and the web tier in my earlier posts.

As I described in my post about our DB tier, we store the metadata associated with all of the content that we back up in RDS, but the content itself is stored on S3. We chose S3 because it is easy to interact with, scales without limit, is super reliable (11 9's durable, 4 9's available), and comes with a reasonably low startup cost. With S3, you only pay for the storage you use, so this is again one of the most "cloudy" services from AWS that we use: no need to provision racks of storage in advance of our needs, just write content and pay as you go. But, as with all Good Things, there are a few catches.

Be sure you know which access method you need to use. A LIST request costs roughly 10X a GET request, and when you make hundreds of millions of requests per day, that delta makes a significant difference. This is a subtle difference that you will likely not catch during your initial coding (we didn't), so be sure you have monitoring and CloudWatch alerts set up on your AWS bill so that you find out you have a problem before the end of your billing cycle.
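To make that concrete, here is a minimal sketch of the kind of billing alert we mean, using boto3 to put a CloudWatch alarm on the EstimatedCharges metric. The threshold, alarm name, and SNS topic ARN are placeholders for illustration, not our actual configuration, and your own setup may differ.

```python
# Sketch: a CloudWatch alarm on the estimated AWS bill, so a surprise in
# request costs surfaces mid-cycle instead of on the invoice.
# Billing metrics live only in us-east-1 and must be enabled in the
# account's billing preferences. Threshold and SNS topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-charges-high",            # placeholder name
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,                            # billing metric updates a few times a day
    EvaluationPeriods=1,
    Threshold=10000.0,                             # placeholder dollar threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
)
```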

S3 is also slow (compared to EBS or direct-attached storage), and does not "really" behave like a file system. While we store the content for all of our readily accessible backups on S3, we write in a massively parallel way so that the cost of writing to S3 does not bottleneck our overall backup throughput. You can iterate over content stored in S3 using the list command, but the only search criterion available is a "path" prefix. Also, for any list or delete operations (more on this in a moment), we use bulk interactions. One other way we reduce our S3 cost is to de-duplicate content at the file level (for example, multiple users sharing the same document within a domain) and to compress content before writing it to S3.
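Here is a rough boto3 sketch of the shape of those interactions: compressing and de-duplicating content on write, listing keys under a prefix with a paginator, and deleting in bulk batches of up to 1,000 keys. The bucket name and key layout are invented for illustration and are not our actual scheme.

```python
# Sketch of the S3 access patterns described above, using boto3.
# The bucket name and key layout are illustrative only.
import gzip
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backup-content"  # placeholder bucket

def store_blob(domain_id: str, data: bytes) -> str:
    """Compress content and key it by content hash, so identical files
    shared within a domain are stored once (file-level de-duplication)."""
    digest = hashlib.sha256(data).hexdigest()
    key = f"{domain_id}/{digest}.gz"
    s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(data))
    return key

def delete_prefix(prefix: str) -> None:
    """List keys under a "path" prefix (the only server-side filter S3
    offers) and delete them in bulk batches of up to 1,000 keys."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": objects})
```

Batching deletes this way matters for cost as well as speed: one bulk request replaces up to 1,000 individual DELETE calls.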

Since our product does not impose any storage space limitations on our backups, we end up with an enormous amount of content stored in S3. It is impractical in terms of time and expense to do real-time backups of our backups, so in addition to our distributed architecture, we heavily leverage S3’s durability and IAM roles to ensure that only one process in our whole environment has delete permissions. This process only deletes expired user content (per our service agreement) and is the most reviewed, tested, and monitored code that we deploy.
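As a rough illustration of that "only one process can delete" idea, the sketch below applies a bucket policy that denies object deletes to every principal except a single expiration role. The bucket name, role ARN, and statement ID are placeholders, not our actual setup.

```python
# Sketch: a bucket policy that denies object deletes to everything except
# one dedicated expiration role. Names and ARNs are placeholders; apply
# this kind of policy with care, since a broad Deny can lock you out.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backup-content"  # placeholder bucket
EXPIRATION_ROLE_ARN = "arn:aws:iam::123456789012:role/content-expiration"  # placeholder

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDeleteExceptExpirationRole",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotLike": {"aws:PrincipalArn": EXPIRATION_ROLE_ARN}
            },
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```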

So, there you have it. AWS S3 is a great tool for persisting content. It is super easy to use and by far the most reliable AWS service we depend on. It scales seamlessly as our business grows, but, as is always the case, things get more expensive when you do them at scale, so we have done a fair amount of work to optimize our storage costs. In my next post I'll talk a bit about our search tier.

