Optimizing AWS RDS and EC2: Addressing Performance and Scaling Bottlenecks

This topic is discussed in episode #006 of our Cloud & DevOps Pod

When it comes to cloud infrastructure, two of the most frequently used Amazon Web Services (AWS) offerings are Relational Database Service (RDS) and Elastic Compute Cloud (EC2). Both are essential for building scalable applications, but as your workloads grow, you might encounter performance bottlenecks that can severely impact your system. In this post, we’ll explore common performance issues with AWS RDS and EC2 and discuss strategies for optimizing them, focusing on instance types, CPU credits, and storage.

Understanding RDS Performance Bottlenecks

RDS is a managed relational database service that many businesses rely on. However, as applications scale and demand increases, RDS often becomes the first bottleneck. Even though your EC2 instance might seem to be running smoothly with low CPU utilization, it’s common to experience a sluggish application due to RDS performance issues. Why does this happen?

Instance Types: The Role of T-Series Instances

AWS offers a variety of instance types for different workloads, but one of the most commonly used for cost-efficiency in RDS is the T-series (T2, T3, etc.). These instances are often used in development and testing environments because they offer a balance between cost and performance. However, in production, these instance types come with limitations, especially when handling increasing loads.

The key feature of T-series instances is burstable performance. Each instance has a baseline CPU level that it can sustain indefinitely (30% per vCPU for a T3.large, for example), and it earns CPU credits whenever it runs below that baseline. When the load exceeds the baseline, the instance enters the "burst zone," where it spends those credits. Once the credits are exhausted, the instance throttles back to its baseline performance, often leading to significant performance degradation.

In a typical scenario, your application might experience a spike in demand, resulting in more database queries. Initially, the RDS instance handles the extra load well by using CPU credits to burst beyond its baseline capacity. But after an extended period, those credits deplete, and the instance can only operate at its baseline CPU level, causing the entire application to slow down or even halt.
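To make the credit arithmetic concrete, here is a minimal Python sketch that estimates how long a burst can last. The default figures (2 vCPUs, 36 credits earned per hour, a maximum balance of 864 credits) are assumptions based on AWS's published numbers for a T3.large, so check the current documentation for your own instance class. The function name is hypothetical.

```python
def burst_minutes(utilization, credit_balance, vcpus=2, credits_per_hour=36.0):
    """Estimate how many minutes a burstable instance can hold a CPU level.

    utilization: average CPU utilization across all vCPUs, from 0.0 to 1.0.
    credit_balance: CPU credits currently banked.
    Returns None when the load is at or below the baseline, because the
    balance never drains in that case.
    """
    # One CPU credit is one vCPU running at 100% for one minute.
    spent_per_minute = utilization * vcpus
    earned_per_minute = credits_per_hour / 60.0
    net_drain = spent_per_minute - earned_per_minute
    if net_drain <= 0:
        return None
    return credit_balance / net_drain


if __name__ == "__main__":
    # Assumed T3.large figures: 2 vCPUs, 36 credits/hour, 864-credit cap.
    minutes = burst_minutes(utilization=0.9, credit_balance=864)
    print(f"90% CPU with a full balance lasts about {minutes / 60:.1f} hours")
```

Under these assumptions, a full credit balance covers roughly 12 hours at 90% CPU, which is why the slowdown tends to show up well after the traffic spike began rather than at the start of it.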

T3 Unlimited: A Temporary Fix

To mitigate the problem of depleted CPU credits, AWS introduced T3 Unlimited, a mode where you pay extra to continue bursting after your credits are exhausted. This prevents your instance from being throttled back to baseline performance. However, while T3 Unlimited solves the immediate issue, it may not be the most cost-effective option for long-term scalability. Over time, paying for constant CPU bursting can become more expensive than upgrading to a more appropriate instance type like the M5 or M6 families, or to Graviton-based instances (T4g, M6g).
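A rough way to reason about that trade-off is to estimate the monthly surcharge for bursting and compare it with the price of a fixed-performance instance. The sketch below is a simplified model, not AWS's billing logic: it assumes the surcharge is proportional to average utilization above the baseline, and every price in it is a hypothetical placeholder that you should replace with the rates for your region and engine.

```python
def unlimited_surcharge(avg_utilization, baseline_utilization, vcpus,
                        hours, price_per_vcpu_hour):
    """Approximate the extra charge for sustained bursting above baseline."""
    # Only utilization above the baseline consumes surplus credits.
    excess = max(0.0, avg_utilization - baseline_utilization)
    return excess * vcpus * hours * price_per_vcpu_hour


def cheaper_option(burstable_hourly, fixed_hourly, surcharge, hours):
    """Return which option costs less over the period: 'burstable' or 'fixed'."""
    burstable_total = burstable_hourly * hours + surcharge
    fixed_total = fixed_hourly * hours
    return "fixed" if fixed_total < burstable_total else "burstable"


if __name__ == "__main__":
    hours = 730  # roughly one month
    # All prices below are hypothetical placeholders, not real AWS rates.
    surcharge = unlimited_surcharge(
        avg_utilization=0.8, baseline_utilization=0.3, vcpus=2,
        hours=hours, price_per_vcpu_hour=0.05,
    )
    print(f"Estimated surcharge: {surcharge:.2f}")
    print(cheaper_option(burstable_hourly=0.08, fixed_hourly=0.10,
                         surcharge=surcharge, hours=hours))
```

The useful part is the shape of the comparison: the higher your average utilization sits above the baseline, the sooner the fixed-performance instance wins.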

Graviton: The Power of ARM Architecture

The recommendation for many workloads today is to move to Graviton-based instances (T4g, M6g), which use the ARM64 architecture. These instances offer better performance at a lower cost compared to their Intel-based counterparts. Whether you’re running PostgreSQL, MySQL, or any other supported database, the transition to Graviton is relatively smooth, and the performance gains are substantial.

Graviton-based instances are particularly beneficial if you’re using RDS because they provide more efficient CPU usage while maintaining a lower price point. Keep in mind that T4g is still a burstable family with CPU credits, so if your application is experiencing frequent performance issues due to exhausted CPU credits, a fixed-performance Graviton instance such as M6g is the one to reach for. In that case, the switch is almost always a no-brainer.

EC2 and RDS Storage: Moving from GP2 to GP3

While CPU-related performance issues are common, storage bottlenecks can also affect the performance of your EC2 or RDS instances. AWS offers several types of Elastic Block Store (EBS) volumes, with GP2 and GP3 being the most commonly used for general-purpose workloads.

The Problem with GP2 Storage

With GP2, the amount of IOPS (Input/Output Operations Per Second) you get is directly tied to the volume size: the baseline is 3 IOPS per GiB, with a minimum of 100 IOPS. A 1 TB GP2 volume therefore has a baseline of roughly 3,000 IOPS, which is enough for many workloads, but smaller volumes, such as 100 GB (about 300 IOPS) or 500 GB (about 1,500 IOPS), might not have sufficient baseline IOPS to meet the demands of a growing application.
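The size-to-IOPS relationship is simple enough to sketch in a few lines of Python. This assumes the documented GP2 rule of 3 IOPS per GiB with a floor of 100 and a ceiling of 16,000; the function name is hypothetical.

```python
def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a GP2 volume: 3 per GiB, clamped to 100..16,000."""
    return min(16000, max(100, 3 * size_gib))


if __name__ == "__main__":
    for size in (20, 100, 500, 1000):
        print(f"{size} GiB -> {gp2_baseline_iops(size)} baseline IOPS")
```

Running it for a few sizes shows how quickly small volumes fall behind: the only way to raise the baseline on GP2 is to buy more storage.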

In many cases, when an RDS instance starts slowing down, it’s not just because of CPU limitations but also because the storage volume can’t keep up with the I/O generated by the queries being made. Smaller GP2 volumes can burst above their baseline by spending I/O credits, but once those credits run out, the volume drops back to its baseline IOPS. At that point the database becomes sluggish, and the application can experience significant delays. In some cases, it can feel like the database has "frozen," when in reality, it’s simply waiting for more I/O credits to accumulate.
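How long a burst lasts, and how long the "frozen" period takes to recover from, can be estimated from the credit bucket. The sketch below assumes the documented GP2 figures of a 5.4 million credit bucket and a 3,000 IOPS burst ceiling, and it assumes the volume starts with a full bucket; the function names are hypothetical.

```python
GP2_BURST_IOPS = 3000          # burst ceiling for smaller GP2 volumes
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full bucket


def _baseline_iops(size_gib):
    # 3 IOPS per GiB, clamped to the 100..16,000 range.
    return min(16000, max(100, 3 * size_gib))


def gp2_burst_seconds(size_gib):
    """Seconds a full bucket sustains 3,000 IOPS; None if no burst is needed."""
    baseline = _baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    # Credits drain at the burst rate and refill at the baseline rate.
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


def gp2_refill_seconds(size_gib):
    """Seconds an idle volume needs to refill an empty bucket."""
    return GP2_CREDIT_BUCKET / _baseline_iops(size_gib)


if __name__ == "__main__":
    print(f"100 GiB burst: {gp2_burst_seconds(100) / 60:.0f} minutes")
    print(f"100 GiB refill: {gp2_refill_seconds(100) / 3600:.0f} hours")
```

For a 100 GiB volume this works out to about half an hour of full-speed bursting followed by hours of refill time, which matches the pattern of a database that is fine in the morning and crawling by the afternoon.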

GP3: A Better Storage Solution

The introduction of GP3 storage solves many of the problems associated with GP2. With GP3, IOPS and throughput are decoupled from storage size: every volume gets a baseline of 3,000 IOPS regardless of its size, and you can provision more on top of that. This means that even if you only need a 100 GB volume, you can provision up to 16,000 IOPS independently, allowing for much better performance without having to increase your storage size unnecessarily.
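As an illustration of provisioning IOPS independently of size, here is a small Python sketch that builds and validates the parameters for a GP3 change on an EBS volume. The limits it checks (3,000 to 16,000 IOPS, at most 500 IOPS per GiB) are assumptions based on the GP3 limits described here and in the EBS documentation at the time of writing, and the helper name and volume ID are hypothetical.

```python
def gp3_modify_params(volume_id, size_gib, iops=3000):
    """Build keyword arguments for an EBS volume change to GP3.

    Raises ValueError if the requested IOPS fall outside the assumed limits.
    """
    if not 3000 <= iops <= 16000:
        raise ValueError("GP3 IOPS must be between 3,000 and 16,000")
    if iops > 500 * size_gib:
        raise ValueError("GP3 allows at most 500 IOPS per GiB")
    return {"VolumeId": volume_id, "VolumeType": "gp3", "Iops": iops}


if __name__ == "__main__":
    # Hypothetical volume ID, for illustration only.
    params = gp3_modify_params("vol-0123456789abcdef0", size_gib=100, iops=16000)
    print(params)
    # With boto3 installed and credentials configured, these parameters
    # could be passed to the EC2 client:
    #   boto3.client("ec2").modify_volume(**params)
```

The validation is kept separate from the API call so the limits can be checked without touching a real volume.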

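For RDS users, this is a game-changer. Applications with smaller datasets but high transaction rates can now optimize for performance without paying for storage they don’t need. If you’re still using GP2, it might be time to migrate to GP3 to take advantage of these performance benefits. Keep in mind that RDS applies its own GP3 rules on top of EBS (for example, additional IOPS can only be provisioned above a storage-size threshold that depends on the engine), so check the limits for your configuration before migrating.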

The Bandwidth Bottleneck

While GP3 solves many storage-related issues, there’s another performance constraint to be aware of: bandwidth between the instance and the EBS volume. Each EC2 instance type has a baseline and maximum bandwidth limit for communicating with its EBS storage. For example, a T3.large instance might offer up to 16,000 IOPS, but it can only sustain this maximum performance for short bursts.

After those bursts, the bandwidth reverts to the baseline, and you may find that your instance can no longer handle the required throughput. If your application is storage I/O-heavy, you might need to consider upgrading to a larger instance size with higher sustained bandwidth or opting for an instance family that supports higher baseline throughput.
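Put differently, the IOPS your application actually sees is the smaller of what the volume provides and what the instance can push to EBS. A minimal sketch of that check follows; the instance limit used in the example is a hypothetical placeholder, so look up the baseline EBS figures for your own instance type.

```python
def effective_iops(volume_iops, instance_iops_limit):
    """Return (usable IOPS, limiting component) for a volume on an instance."""
    if instance_iops_limit < volume_iops:
        return instance_iops_limit, "instance"
    return volume_iops, "volume"


if __name__ == "__main__":
    # Hypothetical numbers: a 16,000 IOPS volume on an instance whose
    # sustained (baseline) EBS limit is 4,000 IOPS.
    iops, limiter = effective_iops(volume_iops=16000, instance_iops_limit=4000)
    print(f"Usable IOPS: {iops} (limited by the {limiter})")
```

If the limiter is the instance, provisioning more IOPS on the volume will not help until the instance itself is resized.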

Optimizing IOPS for RDS

When it comes to RDS, another option for optimizing performance is to increase your provisioned IOPS. If you have a database that handles a lot of read/write operations and is frequently under load, increasing your IOPS capacity can make a significant difference.

It’s important to note that when you scale the storage volume, your IOPS credits are replenished. This can provide a temporary boost in performance, but if your database continues to experience high demand, you might find yourself needing to scale up repeatedly. Provisioning additional IOPS or moving to GP3 storage will give you more flexibility and better long-term performance without requiring constant intervention.

Scaling Considerations for EC2 and RDS

As your application grows, scaling your infrastructure becomes inevitable. While AWS provides auto-scaling options for both EC2 and RDS, there are still limitations and quotas to be aware of. For instance, auto-scaling your EC2 instances is simple in theory, but it’s not uncommon to run into service quotas that limit how much you can scale at once. These quotas can prevent your application from responding quickly to demand spikes, leading to delays and potential downtime.

Similarly, scaling your RDS instance comes with its own challenges. If you’re running a microservices architecture, you might have multiple databases that need to scale independently. Provisioning one terabyte of storage for every single database might not be feasible, both from a cost and performance perspective. Instead, you’ll need to carefully manage IOPS and CPU utilization across each database to ensure optimal performance without breaking the bank.
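To see why sizing every database for IOPS gets expensive, the sketch below compares the total storage a fleet would need on GP2 and on GP3 to sustain a target IOPS per database. It assumes the EBS rules used earlier in this post (GP2 at 3 IOPS per GiB, GP3 with a 3,000 IOPS baseline and up to 500 IOPS per GiB), ignores GP2 bursting, and does not model RDS-specific limits; the function names and fleet figures are hypothetical.

```python
import math


def gp2_size_for_iops(data_gib, target_iops):
    """Smallest GP2 volume (GiB) whose baseline meets the target IOPS."""
    return max(data_gib, math.ceil(target_iops / 3))


def gp3_size_for_iops(data_gib, target_iops):
    """Smallest GP3 volume (GiB) that can be provisioned to the target IOPS."""
    if target_iops <= 3000:
        return data_gib  # covered by the GP3 baseline
    return max(data_gib, math.ceil(target_iops / 500))


def fleet_storage_gib(databases):
    """Return (gp2_total, gp3_total) for a list of (data_gib, target_iops)."""
    gp2_total = sum(gp2_size_for_iops(d, i) for d, i in databases)
    gp3_total = sum(gp3_size_for_iops(d, i) for d, i in databases)
    return gp2_total, gp3_total


if __name__ == "__main__":
    # Hypothetical fleet: ten databases, 50 GiB of data each, 3,000 IOPS each.
    gp2_total, gp3_total = fleet_storage_gib([(50, 3000)] * 10)
    print(f"GP2: {gp2_total} GiB, GP3: {gp3_total} GiB")
```

In this hypothetical fleet, GP2 needs about a terabyte per database just to reach the IOPS target, while GP3 only needs room for the data itself.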

Conclusion

Optimizing the performance of AWS RDS and EC2 is crucial as your workloads grow. By understanding the limitations of instance types, CPU credits, and storage IOPS, you can make informed decisions to avoid bottlenecks. Moving to Graviton instances, adopting GP3 storage, and carefully managing your instance bandwidth can provide significant performance gains while keeping costs in check. In the world of cloud computing, knowing when and how to optimize can make all the difference in ensuring a seamless user experience for your application.

Edward Viaene
Published on April 4, 2024