Introduction
The AWS team recently released a new feature of their Automatic WLM that we’re really excited about: Query Priorities. This feature addresses the main limitation of Auto WLM head-on: it lets you prioritize some queries over others. That is a crucial performance capability for meeting data SLAs.
We first covered Redshift’s new Automatic WLM feature on our blog before Query Priority was fully released, and found mixed results: it was highly effective (maybe too effective!) at reducing the percentage of disk-based queries, but it had the side effect of increasing overall queuing on our clusters, since big queries consumed most of the cluster's memory. For our use case, that tradeoff didn’t work: the additional queuing caused unacceptable delays in our data pipeline. Since writing that post, we have reverted all of our clusters to our well-tuned Manual WLM.
We’ve already covered the basics of setting up Automatic WLM with Query Priorities at a high level; in this post, we take a first look at the performance we’re seeing with the new feature.
Automatic WLM: the basics
As a reminder, Automatic WLM dynamically assigns cluster resources to queries as they run. This is in contrast to Manual WLM, where you need to manually choose a memory and concurrency value for each queue that queries run in. The benefit of having Automatic WLM choose these values for you is less about making it easier (optimally configuring Manual WLM is pretty easy), and more about making it dynamic–in theory each query should get the right amount of cluster resources, no more, no less. This should lead to a more optimal use of your cluster, resulting in a higher query throughput and less wasted memory.
In practice however, Automatic WLM (without Query Priority) has no way of knowing the impact additional queue time will have on your data SLAs and hence your business. So the net result of Automatic WLM on your data SLA might be negative, even if the cluster’s use of resources is more efficient by some measure.
This is exactly the effect we saw in our earlier post: long-running queries ran up to 3x faster (and fewer went to disk), since big queries were allocated more resources. But this came at the expense of additional queue time for queries overall. The net result was a decrease in overall throughput in our data pipeline. Manual WLM allows us to tune that tradeoff, letting some queries go disk-based if it means keeping the overall latency of our pipeline below our data SLA.
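If you want to track this tradeoff on your own cluster, the system tables give a rough picture. The sketch below approximates the disk-based share of queries and the average queue time; svl_query_summary and stl_wlm_query are standard Redshift system views, but treat the filter on service classes as an assumption to adapt for your setup.

```sql
-- Rough sketch: approximate the two sides of the tradeoff from the
-- system tables (these views only retain a few days of history).

-- 1) Share of recent queries that went disk-based
SELECT COUNT(DISTINCT CASE WHEN is_diskbased = 't' THEN query END)::float
       / COUNT(DISTINCT query) AS pct_disk_based
FROM svl_query_summary;

-- 2) Average queue time (seconds) across user queues
SELECT AVG(total_queue_time) / 1000000.0 AS avg_queue_sec
FROM stl_wlm_query
WHERE service_class > 4;   -- assumption: skip Redshift's internal queues
```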
Automatic WLM with Query Priority
Automatic WLM with Query Priorities is intended to solve exactly this problem: it lets you tell Amazon Redshift how to prioritize queries when allocating cluster resources. Query Priorities are managed via Automatic WLM query queues, with each queue assigned a priority:
- Highest
- High
- Normal
- Low
- Lowest
As with Manual WLM, queries are assigned to queues using either database user groups or “query groups” (a label you can set in the database session using the SET query_group TO statement before you submit a query). The idea is that you simply rank your workloads in terms of priority, and Redshift handles allocating cluster resources optimally.
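For example, routing by query group looks roughly like this. This is a sketch: the label mirrors one of our queue labels, and the workload query itself is a placeholder.

```sql
-- Tag the current session so its queries land in the queue whose
-- query group matches this label (label shown here is illustrative).
SET query_group TO 'intermix_write';

-- ... run the workload's queries here ...

-- Clear the label when done
RESET query_group;
```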
Redshift uses these query priorities in three ways:
- When queries are submitted to the cluster, Redshift uses the priority to decide which queries should run and which should queue. Queries in lower priority queues will still run, but will queue longer on average than queries in higher priority queues.
- When queries run, higher priority queries are allocated more resources (including memory, CPU, and I/O throughput) than lower priority queries. Lower priority queries get all of the system resources when there are no higher priority queries.
- Queries in lower priority queues may get aborted (and requeued) more often to make room for queries in higher priority queues. This occurs only when a higher priority query arrives.
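You can observe the first two effects in the WLM history. The sketch below compares queue time and execution time by priority; we're assuming the query_priority column in STL_WLM_QUERY and that Automatic WLM queues use service classes 100 and above, so check the system table docs for your cluster version.

```sql
-- Sketch: average queue and execution time per Auto WLM priority.
SELECT
    TRIM(query_priority)               AS priority,
    COUNT(*)                           AS queries,
    AVG(total_queue_time) / 1000000.0  AS avg_queue_sec,
    AVG(total_exec_time)  / 1000000.0  AS avg_exec_sec
FROM stl_wlm_query
WHERE service_class >= 100             -- Auto WLM user queues
GROUP BY 1
ORDER BY avg_queue_sec DESC;
```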
Setting up our first cluster
Following the guidelines in our Automatic WLM configuration post, we looked at one of our development Redshift clusters and first enumerated the workloads we have running. We then assigned each workload's users to a database user group:
| Workload | Database User Group |
| --- | --- |
| Automated queries from other parts of our system | intermix_app |
| Data loads and Transformations | intermix_write |
| Validation tests | intermix_readonly |
| Unloads | intermix_unload |
| Maintenance queries | intermix_maintenance |
| DBLink queries | intermix_dblink |
| Developer Ad-hoc queries | intermix |
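Creating those user groups and adding each workload's users is plain SQL. A hypothetical sketch (the user names below are placeholders, not our actual accounts):

```sql
-- Hypothetical: one database user group per workload.
CREATE GROUP intermix_app;
CREATE GROUP intermix_write;
CREATE GROUP intermix_readonly;
CREATE GROUP intermix_unload;
CREATE GROUP intermix_maintenance;
CREATE GROUP intermix_dblink;

-- Attach each workload's users to its group (placeholder user names).
ALTER GROUP intermix_write ADD USER etl_loader;
ALTER GROUP intermix_app   ADD USER app_service;
```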
Next, we assigned a priority to each workload and decided if any should have Concurrency Scaling enabled:
| Workload | Database User Group | Priority | Concurrency Scaling |
| --- | --- | --- | --- |
| Data loads and Transformations | intermix_write | Highest | No |
| Automated queries from other parts of our system | intermix_app | High | Yes |
| Validation tests | intermix_readonly | Normal | No |
| Unloads | intermix_unload | Low | No |
| Maintenance queries | intermix_maintenance | Lowest | No |
| DBLink queries | intermix_dblink | High | Yes |
| Developer Ad-hoc queries | intermix | Low | Yes |
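Once the new configuration is active, one way to confirm that each workload is actually landing in its intended priority is to join the WLM history to the submitting user. A sketch, assuming the query_priority column in STL_WLM_QUERY (populated under Automatic WLM):

```sql
-- Sketch: recent query counts per user and priority under Auto WLM.
SELECT
    u.usename,
    TRIM(w.query_priority) AS priority,
    COUNT(*)               AS queries
FROM stl_wlm_query w
JOIN pg_user u ON u.usesysid = w.userid
WHERE w.service_class >= 100      -- Auto WLM user queues
GROUP BY 1, 2
ORDER BY u.usename, queries DESC;
```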
Finally, we rebooted the cluster and started checking our Intermix.io dashboard for results.
Results
At a high-level, our results looked markedly improved compared to our previous Automatic WLM tests. We saw a significant improvement in average execution time (light blue) accompanied by a corresponding increase in average queue time (dark blue):
The net result was a small (14%) decline in overall query throughput.
As before, these changes were driven by a dramatic decrease in execution times for our slow, disk-based queries:
Next, we looked at a breakdown of execution time by workload using the Insights feature of our dashboard:
This again showed a dramatic improvement in average execution times, though the overall count of queries (i.e. query throughput) went down slightly:
We also saw an increase in queue times for all users, which is expected given that Redshift must queue lower priority queries in order to allocate resources for higher priority queries:
Finally, we saw an expected increase in aborted queries for lower priority workloads (which includes queries that are evicted and re-queued to make room for higher priority queries):
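To quantify this on your own cluster, the final_state column in STL_WLM_QUERY records whether a query completed, was rejected, or was evicted; a rough sketch:

```sql
-- Sketch: how often queries at each priority completed vs. were evicted.
SELECT
    TRIM(query_priority) AS priority,
    TRIM(final_state)    AS final_state,
    COUNT(*)             AS queries
FROM stl_wlm_query
WHERE service_class >= 100          -- Auto WLM user queues
GROUP BY 1, 2
ORDER BY 1, 2;
```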
Our Interpretation
We suspect we didn’t see a significant improvement with Automatic WLM and Query Priority because our cluster workloads are dominated by a single, very resource-intensive batch processing pipeline (executed by the intermix_write group). Once data is processed and reduced by this workload, it is offloaded into other databases for serving (either Elasticsearch or RDS via DBLink). So while our clusters do run a mix of workloads, the bulk of the work comes from this single resource-intensive transformation pipeline, leaving Redshift little room to shift resources between workloads and optimize.
Further, with Manual WLM we’re able to tune our clusters for optimal performance given this dominant transformation workload, which is itself a complex mix of queries. Our data SLA metric isn’t simply “minimize disk-based queries” or “minimize query execution time for this group of queries”; it is “decrease the final latency at the end of this complex data pipeline”.
Finally, since our workload is dominated by write queries, we don’t gain much from enabling Concurrency Scaling or Short Query Acceleration (SQA) to offload read-only queries.
Conclusions and Recommendations
Although Automatic WLM was slightly slower than our Manual WLM configuration, we think it is the right way to approach WLM: it lets you, as a Redshift administrator, prioritize workloads to align with your business needs, rather than tuning lower-level knobs (concurrency and memory) to achieve the same result. And as the AWS Redshift team improves its WLM tuning algorithms, we expect the performance of Automatic WLM to keep improving.
We recommend that most Redshift admins try Automatic WLM, especially if any of the following are true:
- You have not optimally tuned your Manual WLM configuration for your particular workloads (e.g. you’re using Redshift’s default Manual WLM configuration)
- Your cluster runs a mix of workloads that have different priorities (i.e. some workloads need consistent performance, others don’t)
- You want to improve the performance of long-running queries, especially if they are disk-based.
If you do use Automatic WLM, we strongly recommend enabling Query Priorities if you have any workloads for which consistent performance is important (and who doesn’t?). We also recommend enabling SQA for your cluster, and enabling Concurrency Scaling for any priority queues that run eligible queries. Both will reduce, and may even eliminate, queuing for those workloads, especially during times of increased query volume (think BI tools where your users are querying your warehouse in real time).
We’d love to hear your thoughts and whether Automatic WLM with Query Priorities has worked for you–reach out to us in our Redshift community Slack channel and share your experience with the community!