AWS offers multiple options for hosting databases that serve OLTP workloads: you can host your own managed database on Amazon EC2 instances, or use Amazon RDS, which is managed by AWS. RDS manages high availability, automated backups, database upgrades, OS patches, security, and read replicas. RDS also offers the cloud-native Amazon Aurora database engine, which is compatible with MySQL and PostgreSQL. Aurora delivers higher throughput than the standard MySQL and PostgreSQL databases.
While running production workloads on hosted databases with Amazon RDS or Amazon EC2, you may have encountered the following questions:
In this post, I provide best storage practices for running production workloads on Amazon RDS or EC2 instance–hosted databases.
Compared to test, QA, or staging environments, production workloads require fast and consistent I/O performance. While relational databases can be used for multiple purposes, their most common use case is to host an online transaction processing (OLTP) workload. Amazon RDS, EC2 instance–hosted databases, and Aurora use different types of storage, as shown below:
Amazon RDS provides three storage types:
The I/O capacity of the instance is based on the storage type and size. If the DB instance is configured with a gp2 volume, the baseline IOPS capacity is three times the storage size in GiB. For example, a DB instance with a 100-GiB gp2 volume has a baseline capacity of 300 IOPS. The more storage you provision, the higher the IOPS capacity.
In addition to baseline IOPS capacity, gp2 volumes also deliver burst capacity of up to 3,000 IOPS for extended periods of time. The burst feature is limited to volumes of 1 TiB or less. DB instances for MySQL, MariaDB, Oracle, and PostgreSQL can be configured with 20 GiB–32 TiB of storage, but the baseline ranges from a minimum of 100 to a maximum of 16,000 IOPS. So a gp2 volume of 5.34 TiB or more delivers the same baseline: 16,000 IOPS.
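The gp2 baseline rules above can be sketched as a small calculation. This is illustrative only (the function name is made up); the 3 IOPS per GiB rate and the 100–16,000 IOPS clamp come from the behavior described in the text:

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume: 3 IOPS per GiB of storage,
    clamped to the 100-16,000 IOPS range described above."""
    return min(max(3 * size_gib, 100), 16000)

# A 100-GiB volume gets a 300 IOPS baseline.
print(gp2_baseline_iops(100))       # 300
# Small volumes still get the 100 IOPS minimum.
print(gp2_baseline_iops(20))        # 100
# Large volumes are capped at the 16,000 IOPS maximum.
print(gp2_baseline_iops(6 * 1024))  # 16000
```

This also shows why provisioning storage beyond roughly 5.34 TiB buys no additional baseline IOPS on gp2: the cap has already been reached.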
If your production workload requires high OLTP, and fast, consistently high throughput performance, you should configure your DB instance with io1 volumes. Compared to gp2 volumes, which deliver a maximum baseline of 16,000 IOPS, io1 volumes can deliver up to 40,000 IOPS for DB instances for MySQL, MariaDB, Oracle, and PostgreSQL and up to 32,000 for SQL Server instances.
If you find your IOPS usage consistently going beyond 16,000, you should modify the DB instance and change the storage type from gp2 to io1. Amazon RDS also offers magnetic storage, but it’s not suitable for an OLTP workload requiring consistent I/O performance and low latency.
The Magnetic storage type is not recommended for I/O-intensive workloads because the maximum storage is less than that of gp2 or io1. The IOPS capacity is also limited to a maximum of 1,000 IOPS.
gp2 storage is ideal for a wide variety of database workloads. For this storage type, architect database read and write workloads so that the sum of the ReadIOPS and WriteIOPS values does not exceed the baseline IOPS capacity at any given time.
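That check can be sketched as a simple guard. This is illustrative (the function name is made up); in practice the ReadIOPS and WriteIOPS samples would come from CloudWatch:

```python
def exceeds_baseline(read_iops: float, write_iops: float, baseline_iops: float) -> bool:
    """True if combined read + write IOPS exceed the volume's baseline capacity."""
    return read_iops + write_iops > baseline_iops

# 100-GiB gp2 volume -> 300 IOPS baseline (3 IOPS per GiB).
print(exceeds_baseline(180, 150, 300))  # True: relying on burst capacity
print(exceeds_baseline(100, 150, 300))  # False: within baseline
```

Alerting when this condition holds for a sustained period gives early warning before burst capacity is exhausted.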
Burst capacity may be available for an extended period of time, but once it is exhausted, consistently high read and write IOPS degrade instance performance. This degradation shows up as increased WriteLatency or ReadLatency values. gp2 storage typically delivers single-digit millisecond latency, but overusing IOPS can push latency above 10 ms.
The following images show increased WriteLatency values as WriteIOPS consistently consumes baseline 300 IOPS capacity on an Amazon RDS DB instance. In this example, the Amazon RDS PostgreSQL instance is hosted on a t2.small instance with a 100-GiB gp2 volume.
The above image shows WriteIOPS consistently consuming the 300 IOPS baseline capacity.
The above image shows WriteLatency spiking to 25 milliseconds due to overuse of IOPS.
As a best practice, make sure that your workload doesn’t go beyond the IOPS capacity of the instance. Some of the ways to reduce ReadIOPS values are to:
Amazon RDS DB instances for MySQL, MariaDB, Oracle, and PostgreSQL offer RDS read replicas. These are separate DB instances kept in sync with the source DB instance by replaying database transaction logs, so any data modification on the source DB instance is applied to the read replica. With a read replica, you reduce the load on your source DB instance by routing read queries from your applications to the read replica, and you free up IOPS capacity for additional write activity at the source DB instance.
With read replicas, it’s important to monitor replication lag. Generally, high replication lag is caused by high write activity at the source DB instance.
In Amazon RDS DB instances, you can monitor replica lag by the CloudWatch metric ReplicaLag. If you find high replica lag, you should also monitor write activity at the source DB instance. This can be accomplished by monitoring CloudWatch metrics WriteIOPS and WriteThroughput. If the source DB instance is IOPS-deficient (that is, all IOPS capacity is being used by the write and read workload), the replica also keeps lagging.
One of the reasons behind lagging replicas is that in most DB engines, read replica recovery is a single-threaded process. That means the higher the write load at the source instance, the further behind the read replicas fall, and any additional write activity at the source DB instance increases the lag even more. Apart from the CloudWatch ReplicaLag metric, you can also monitor lag with SQL queries.
In PostgreSQL, read replica lag can be calculated using the following query:
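A commonly used form of this query is sketched below. Run it on the replica; note that `pg_last_xact_replay_timestamp()` returns NULL until at least one transaction has been replayed, so handle that case in monitoring code:

```sql
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
       AS replica_lag_seconds;
```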
In MySQL, you can check the replication status with the following command:
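For example, from the mysql client (the `Seconds_Behind_Master` field of the output reports the lag in seconds, and is NULL if replication is not running):

```sql
SHOW SLAVE STATUS\G
```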
With Amazon RDS read replicas, configure the client so that a certain level of latency or a replication error at a replica triggers a connection attempt to another replica endpoint.
A good way to ensure that your application can find the healthiest replica is to call CloudWatch metrics to find the current values of ReplicaLag and read/write latency. Replication lag can also be found with SQL commands, as shown in the previous examples. You can also find the current state of a replica by calling the AWS Command Line Interface (AWS CLI) command describe-db-instances. If the current state of the replica is anything other than “replicating,” the client should try connecting to another replica.
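The replica-status check can be sketched like this. The JSON below is a trimmed, hypothetical describe-db-instances response (real responses contain many more fields per instance), and the helper function is illustrative, not an AWS API:

```python
import json

# Trimmed, hypothetical output of:  aws rds describe-db-instances
# Real responses contain many more fields per instance.
response = json.loads("""
{
  "DBInstances": [
    {"DBInstanceIdentifier": "replica-1",
     "StatusInfos": [{"StatusType": "read replication", "Status": "replicating"}]},
    {"DBInstanceIdentifier": "replica-2",
     "StatusInfos": [{"StatusType": "read replication", "Status": "error"}]}
  ]
}
""")

def replicating_instances(resp):
    """Return identifiers of replicas whose read-replication status is 'replicating'."""
    healthy = []
    for inst in resp["DBInstances"]:
        for info in inst.get("StatusInfos", []):
            if info["StatusType"] == "read replication" and info["Status"] == "replicating":
                healthy.append(inst["DBInstanceIdentifier"])
    return healthy

print(replicating_instances(response))  # ['replica-1']
```

A client could route read traffic only to the identifiers this returns, falling back to the source instance if the list is empty.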
Apart from the benefit of distributing read transactions, read replicas can also be used to shard your data. Following the shared-nothing architecture of shards, you can create read replicas corresponding to each of your shards and promote them when you decide to convert them into standalone shards.
Amazon RDS DB instances should have sufficient RAM so that your full working set resides in memory. When read queries can be served from memory, communication with the storage volumes is reduced, which frees up ReadIOPS capacity that can instead be used for writes.
There is no straightforward way to find the size of a working dataset. Look at the read queries and find out how much data is being hit. For example, if the size of a database is 100 GiB, and the working set is 20 GiB, you should use an Amazon RDS DB instance with at least 20 GiB of memory. That allows you to have the complete working set in memory.
Amazon EBS volumes are highly available, persistent block-level storage volumes that can be attached to an EC2 instance in the same Availability Zone. EBS volumes are ideal for EC2 instance–hosted databases; using EC2 instance ephemeral storage for a database is not recommended.
By using EBS storage volumes with EC2 instances, you can configure volumes with any RAID levels. For example, for greater I/O performance, you can opt for RAID 0, which can stripe multiple volumes together. RAID 1 can be used for data redundancy because it mirrors two volumes together.
Regardless of RAID configuration, EBS volume data is replicated across secondary servers to avoid any data loss. RAID 5 and RAID 6 are not recommended on EC2 instance–hosted databases because the I/O performance is not as good as RAID 0 or RAID 1.
The following table shows the advantages and disadvantages between using these two different configurations of RAID, and suggests possible use cases.
| RAID level | Advantage | Disadvantage | Possible use case |
|------------|-----------|--------------|-------------------|
| RAID 0 | Superior I/O performance compared to fault tolerance | Loss of a single volume causes complete data loss | The database requires high throughput more than data availability, and the data is reproducible |
| RAID 1 | Superior fault tolerance compared to I/O performance | Lower write performance | The data is critical, and database fault tolerance is more important than I/O performance |
If a database instance is facing storage issues and running into problems such as high commit time and high latencies, sometimes changes in the application can mitigate this degradation. You can modify applications to enable exponential backoff or error retries.
Exponential backoff makes applications wait progressively longer between retries for consecutive error responses. While some algorithms use an incremental delay, most exponential backoff algorithms use a randomized delay (jitter).
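A minimal sketch of exponential backoff with full jitter is shown below. The function name and defaults are illustrative; in practice you would catch only the driver-specific transient error classes rather than all exceptions:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry operation(), sleeping an exponentially growing, randomized
    ("full jitter") delay between consecutive failures."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

The randomized delay spreads retries from many clients over time, which avoids synchronized retry storms against an already stressed database.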
Use certain best practices to achieve faster failover in Amazon RDS Multi-AZ instances and Aurora clusters. Enable TCP keepalive parameters and set them aggressively so that, if your client can no longer reach the DB instance, any active connections are closed quickly. This also allows applications to react to failover faster and connect quickly to the new endpoint.
You can also reduce the DNS caching timeout at the client so that read and write connections are established quickly to the appropriate endpoints after a failover. Some server-side TCP parameters can be modified as well to speed up failover. For example, in PostgreSQL, keepalive behavior is controlled by the tcp_keepalives_count, tcp_keepalives_idle, and tcp_keepalives_interval parameters.
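For example, in postgresql.conf these parameters might be tightened from their defaults. The values below are purely illustrative, not a recommendation; tune them for your environment and test failover behavior:

```
tcp_keepalives_idle = 30       # seconds of inactivity before the first keepalive probe
tcp_keepalives_interval = 10   # seconds between unacknowledged probes
tcp_keepalives_count = 3       # lost probes before the connection is considered dead
```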
Monitoring the health of instance storage on a regular basis identifies the early onset of a performance issue before it has a severe effect on database performance. Some of the storage CloudWatch-related metrics that you should regularly monitor are listed here.
Along with the previously mentioned metrics, you should also monitor the following CloudWatch metrics:
As mentioned previously, Amazon RDS DB instances and EC2 instances have IOPS dependency on storage volumes. The gp2 and io1 storage types have their own IOPS limits.
If your workload requires higher IOPS performance and higher throughput, you may plan to migrate to Aurora, a high-performance, highly available, and cost-effective solution for high-throughput workloads. Currently, Aurora offers MySQL- and PostgreSQL-compatible engines.
With Aurora, there is technically no IOPS limit, but throughput can be limited by the underlying Aurora instance. For better throughput, choose a higher Aurora instance class.
Aurora is best suited for applications requiring little to no latency at any given IOPS. It’s designed to handle a high velocity of data, providing higher throughput than the traditional MySQL and PostgreSQL engines. As a row-store database, it’s ideally suited for high-volume, highly concurrent OLTP workloads.
Another use case of Aurora is hybrid transaction/analytical processing (HTAP). Aurora supports up to 15 replicas, each of which runs within 15–20 milliseconds of the write instance. With the recently added Amazon Aurora Parallel Query feature, query processing is pushed down to the Aurora storage layer. The query uses potentially thousands of storage nodes in an Aurora cluster to process, refine, and aggregate data before sending it to the compute node.
In this post, you learned about best storage practices for running a production workload on Amazon RDS DB instance and EC2 instance–hosted databases. These practices involved the following:
You also learned about Aurora and how its proprietary storage performs differently from EBS volumes. All of this knowledge helps you run production workloads smoothly and without issues on AWS databases. For details on how Aurora handles database speed and availability using its storage layer, see the Database Blog post Introducing the Aurora Storage Engine.