Effective Engineering in the Cloud: The DB Tier
This is the third in series of articles about how we deploy on AWS and how it has helped us evolve and grow our business with a focus on being responsive to our customer’s needs, rather than trying to spend the least amount on our product hosting. In my first article, I provided a high-level view of the technology we use. In this post, I’ll describe our DB Tier and how AWS has helped us, plus a few of the trade-off decisions we have made.
For our DB tier we deploy MySQL on RDS. We use application-level sharding and take advantage of multi-az deployments for some (but not all) of our DB instances. Our DB tier is our least cloudy (in contrast to the web tier), because with RDS you cannot scale an instance up or down based on load. We have to add RDS instances as our data load grows and keep them running full time. Consequently, we take advantage of RDS Reserved Instances which allow us to pay now and save later with a break-even point at 6 months.
OK, so why MySQL? Initially, this was a legacy technology decision, but as we implement new products and features, we find that MySQL works extremely well. Also, there is a large community of support and knowledge, so when we get into a bind, the path out is pretty clear. MySQL also fits into our model of keeping things simple and loosely-coupled. Now, in order to scale up in support of our customer adoption, we have partitioned our database at the application level. All DB interactions occur against an individual user, so we have a single index DB that keeps track of which partition holds a user’s data and a large number of data partitions. By using an index-lookup scheme rather than MySQL’s built-in sharding, we can easily migrate users from one partition to another. In general, we let content live where it is initially written, but over time a partition will become unbalanced (in terms of CPU, or storage space, or connection count), so we move some users out of that partition. Moving a user’s content is a simple dump/load/delete script and imposes no downtime for any customer.
RDS is generally reliable, and we have never had an issue with MySQL itself, but since RDS runs on top of ECS instances, with attached EBS storage, our RDS instances do go down once in a while. For mission-critical databases, like our index DB, we use multi-az deployments so that if the primary is unreachable, AWS will swap over to the hot standby in another availability zone. While this failover does not happen immediately, our retry logic is able reconnect once the standby instance is promoted to master.
RDS also automatically manages backups and allows a point-in-time restore down to a 5-minute window. One thing to be aware of is that while a restore from a snapshot will be complete in under an hour, a point-in-time restore starts with the most recent full snapshot (taken daily) and then runs through the transaction log. For a busy database, this recovery can be painfully slow. We once had a 35 hour restore; thankfully, that was a test run!
I consider RDS as the foundation of our AWS deployment. We grow it over time using well-established engineering practices, and it provides a solid base upon which we can build our customer-facing features. In my next post, I’ll describe our worker tier, which is where most of our work actually happens.