The Amazon Redshift Workload Manager (WLM) is critical to managing query performance. Amazon Redshift runs queries in a queueing model. The default WLM configuration has a single queue with five slots. In almost all cases, that default configuration will not match your workloads, and you will need to tweak it. Configuring the WLM for your workloads provides two main benefits:
- Scaling workloads by giving them enough resources (e.g. concurrency and memory)
- Isolating and protecting your predictable workloads (e.g. batch operations) from your unpredictable workloads (e.g. ad hoc queries from reporting tools)
You can have up to 8 queues with a total of up to 50 slots. By default, a query runs in a single slot. Queries are routed into queues based on user groups and query groups. Setting up your WLM the right way will eliminate queue wait times and disk-based queries.
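To make that concrete, here is a minimal sketch of a manual WLM setup with two user queues plus a default queue, expressed in Python as the kind of JSON you would pass to the cluster's wlm_json_configuration parameter. The field names follow that format as we understand it, and the group names ("etl", "ad_hoc") are hypothetical; verify the details against the Redshift documentation before applying anything.

```python
# Illustrative sketch of a manual WLM configuration: two user queues routed by
# user group, plus a default queue. Field names follow the wlm_json_configuration
# format as we understand it; group names are hypothetical.
import json

wlm_config = [
    {
        # Queue 1: predictable batch/ETL workload, routed by user group
        "user_group": ["etl"],
        "query_concurrency": 5,        # slot count for this queue
        "memory_percent_to_use": 50,   # share of cluster memory
    },
    {
        # Queue 2: unpredictable ad hoc queries from reporting tools
        "user_group": ["ad_hoc"],
        "query_concurrency": 10,
        "memory_percent_to_use": 30,
    },
    {
        # Default queue: catches anything not routed above
        "query_concurrency": 5,
        "memory_percent_to_use": 20,
    },
]

print(json.dumps(wlm_config, indent=2))
```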
To set up your WLM for your workloads, we recommend following a four-step process (the first three steps are sketched in code right after this list):
- Separate users
- Define workloads
- Group users into workloads
- Select slot count & memory % per queue
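In practice, the first three steps come down to giving each application its own database user and grouping those users by workload. Here is a minimal sketch that only prints the SQL you would run against the cluster; the user and group names are hypothetical placeholders.

```python
# Sketch of steps 1-3: separate users, define workloads, and map users to
# workloads via Redshift user groups. Names and the password placeholder are
# hypothetical; run the statements with your usual SQL client or driver.
ddl_statements = [
    # One database user per application, instead of shared logins
    "CREATE USER etl_user PASSWORD '<redacted>';",
    "CREATE USER dashboard_user PASSWORD '<redacted>';",
    # One group per workload
    "CREATE GROUP load;",
    "CREATE GROUP ad_hoc;",
    # Map each user to the workload it belongs to
    "ALTER GROUP load ADD USER etl_user;",
    "ALTER GROUP ad_hoc ADD USER dashboard_user;",
]

for statement in ddl_statements:
    print(statement)
```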
We’ve written a detailed guide on tuning WLM that describes the four steps. The approach will help you eliminate queue wait times and reduce disk-based queries. Both slow your cluster down, so let’s take a closer look.
Eliminate queue wait times by matching queue slot count to peak concurrency
If you’ve used Redshift for any period of time, you may have come across a situation where a query that used to run for two seconds suddenly starts running much, much slower. The most common reason for this is queueing: the query was waiting in a queue because the number of slots in its queue was too low for the number of queries executing concurrently.
The default configuration allows you to run five concurrent queries in one queue. That means if five queries are executing, the sixth one will queue until a slot becomes available.
The goal is to ensure that queries are not waiting in the queue. This can be done by matching the slot count of the queue with the actual concurrency of the queries running in that queue.
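To see whether queries are actually waiting, you can look at the STL_WLM_QUERY system table, which records queue and execution time per query. The sketch below assumes an already-open database connection (for example via psycopg2 or redshift_connector; the connection setup is not shown), and the service-class filter is an assumption you should adjust for your cluster.

```python
# Hedged sketch: report average and peak queue wait per WLM queue using the
# STL_WLM_QUERY system table. Times in that table are in microseconds.
QUEUE_WAIT_SQL = """
SELECT service_class,
       COUNT(*)                          AS queries,
       AVG(total_queue_time) / 1000000.0 AS avg_queue_seconds,
       MAX(total_queue_time) / 1000000.0 AS max_queue_seconds
FROM   stl_wlm_query
WHERE  service_class > 5   -- user-defined queues (assumption; adjust for your cluster)
GROUP  BY service_class
ORDER  BY avg_queue_seconds DESC;
"""

def print_queue_waits(conn):
    """Print per-queue wait statistics; `conn` is an open DB-API connection."""
    with conn.cursor() as cur:
        cur.execute(QUEUE_WAIT_SQL)
        for service_class, queries, avg_s, max_s in cur.fetchall():
            print(f"queue {service_class}: {queries} queries, "
                  f"avg wait {avg_s:.1f}s, max wait {max_s:.1f}s")
```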
You can eliminate queue wait times by:
- Increasing the slot count for your queues
- Reducing concurrency by distributing queries more evenly throughout the day
Another benefit of this approach is that it lets you use Short Query Acceleration for Amazon Redshift (“SQA”) the right way, because SQA has a downside: activating it consumes memory within the cluster – which takes us to disk-based queries.
Reduce disk-based queries by assigning enough memory to your queues
Increasing slot count to eliminate queuing can have an adverse side effect: disk-based queries. “Disk-based” means the query runs out of RAM and begins using the hard drive. A query goes disk-based when its memory need exceeds the memory per slot of its queue. The memory per slot is calculated as:
memory per slot = memory assigned to the queue / slot count
Since each queue is assigned a fixed percentage of a cluster’s memory (a value you’ll set when you configure your WLM queue), adding more slots will decrease the memory per slot.
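The arithmetic is simple enough to sanity-check by hand. The sketch below uses a hypothetical node size and queue settings purely for illustration; plug in your own cluster's numbers.

```python
# Back-of-the-envelope memory-per-slot calculation for one queue. The node
# memory figure is a hypothetical example, not a quoted spec -- check your
# node type's memory in the AWS documentation.
node_count = 4
memory_per_node_gb = 244          # hypothetical node size for illustration
queue_memory_percent = 40         # the % of cluster memory assigned to this queue
slot_count = 10                   # concurrency configured for this queue

cluster_memory_gb = node_count * memory_per_node_gb
queue_memory_gb = cluster_memory_gb * queue_memory_percent / 100
memory_per_slot_gb = queue_memory_gb / slot_count

print(f"memory per slot: {memory_per_slot_gb:.1f} GB")
# Doubling slot_count halves the memory per slot -- which is exactly how
# adding slots can push more queries to disk.
```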
Disk-based queries cause two major problems:
- Queries slow down because they need more I/O
- Concurrency goes up, which causes more queueing
When the frequency of disk-based queries goes up, a chain reaction can occur: more I/O drives up CPU usage, which in turn makes queries run slower, increasing overall concurrency.
As a rule of thumb, maintain your queues so that on average fewer than 10% of queries go disk-based.
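To keep an eye on that 10% rule of thumb, you can count queries that have at least one disk-based step. The sketch below again assumes an open database connection, and it uses the SVL_QUERY_SUMMARY and STL_QUERY system tables; treat the user filter and time window (both tables only retain a few days of history) as assumptions to adjust for your environment.

```python
# Hedged sketch: estimate the share of recent queries that spilled to disk,
# by comparing queries with a disk-based step (SVL_QUERY_SUMMARY) against all
# user queries recorded in STL_QUERY over the same retention window.
def disk_based_ratio(conn):
    """Return the fraction of recent queries with at least one disk-based step."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(DISTINCT query) "
            "FROM svl_query_summary WHERE is_diskbased = 't';"
        )
        disk_based = cur.fetchone()[0]
        # userid > 1 excludes the internal system user (assumption; adjust as needed)
        cur.execute("SELECT COUNT(*) FROM stl_query WHERE userid > 1;")
        total = cur.fetchone()[0]
    return disk_based / total if total else 0.0

# Example usage (connection setup not shown):
# print(f"{disk_based_ratio(conn):.1%} of queries went disk-based")
```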