One of the things we hear most often from our customers is that data is becoming more and more critical to their business, which is driving the need for data SLAs.
Take, for example, one customer who builds hardware devices for consumers. They use their Amazon Redshift cluster to monitor the health of those devices (billions of events per day), power customer support, and manage the complex supply chain for their overseas production lines.
This company relies on the data warehouse to run their business. If the data pipelines don’t complete in time, or queries are slow or fail, the business loses revenue and customer satisfaction plummets.
We see this trend across all of our customers. In response, businesses are beginning to demand service level agreements (SLAs) for the queries and reports built on top of the data warehouse. These SLAs track a number of different service level indicators; the most prominent are listed below, followed by a rough sketch of how you might measure them:
- Query latency (did the query finish within a certain amount of time?)
- Query availability (did the query complete successfully at all?)
- Data recency (is the data fresh?)
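As a rough illustration of what these indicators boil down to, here is a minimal Python sketch that computes them directly from Amazon Redshift’s STL_QUERY system table. The connection settings and the `events` / `loaded_at` names used for the recency check are placeholders, and availability is approximated as the share of non-aborted queries.

```python
"""Rough sketch: measure query latency, availability, and data recency
against Amazon Redshift's STL_QUERY system table.

Connection settings and the `events` / `loaded_at` names are placeholders.
"""
from statistics import quantiles

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin", password="...",
)
cur = conn.cursor()

# Latency & availability: pull the last 24 hours of queries from STL_QUERY.
cur.execute("""
    SELECT DATEDIFF(ms, starttime, endtime) AS duration_ms, aborted
    FROM stl_query
    WHERE starttime > DATEADD(hour, -24, GETDATE())
""")
rows = cur.fetchall()

durations = [r[0] for r in rows]
p95_ms = quantiles(durations, n=100)[94]                 # 95th percentile latency
availability = 1 - sum(r[1] for r in rows) / len(rows)   # share of non-aborted queries

# Recency: how stale is the newest row in a load target table? (placeholder names)
cur.execute("SELECT DATEDIFF(minute, MAX(loaded_at), GETDATE()) FROM events")
recency_minutes = cur.fetchone()[0]

print(f"P95 latency: {p95_ms / 1000:.1f}s  "
      f"availability: {availability:.2%}  "
      f"data recency: {recency_minutes} min")
```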
In fact, we have one customer that demands a 10-second SLA for any query executed through their BI reporting tool. Analysts get fed up watching the “spinning wheel of death” while they wait for a dashboard to load or a query to complete.
Measuring & Improving Data SLAs for Query Latency
Measuring service levels for query latency is trickier than it seems. To understand why, consider the sheer scale of workloads that a typical data warehouse handles. Our average customer runs over one million queries per week on their Amazon Redshift cluster. These queries span three categories of applications:
- ETL tools (get data “in”)
- Modeling tools (“ELT” workloads to model data and create derived tables)
- BI tools (get data “out” of the warehouse)
At the same time, each category of workload is growing:
- Teams add more BI users over time
- ELT transformations grow when new data sources are added, or new models are created
- New data sources result in more data being loaded in
So a single SLA across all queries is meaningless. What you really need is to measure SLAs on a much more granular basis: for each data application, down to the individual user. For a real-world example, consider a company using Looker as their BI application. How do you easily filter out all the noise and see only the Looker queries? How do you monitor the performance of a single model?
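One rough way to approach this by hand: if each data application connects to Redshift with its own database user (Looker deployments are often set up this way), you can group queries by that user. A minimal sketch, assuming psycopg2 and placeholder connection details:

```python
"""Sketch: per-app P95 latency by grouping Redshift queries on the database user.

Assumes each data app (e.g. a `looker` user) connects with its own database user;
connection details are placeholders.
"""
from collections import defaultdict
from statistics import quantiles

import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="admin", password="...")
cur = conn.cursor()

cur.execute("""
    SELECT u.usename, DATEDIFF(ms, q.starttime, q.endtime) AS duration_ms
    FROM stl_query q
    JOIN pg_user u ON u.usesysid = q.userid
    WHERE q.starttime > DATEADD(day, -7, GETDATE())
""")

latencies_by_app = defaultdict(list)
for usename, duration_ms in cur.fetchall():
    latencies_by_app[usename].append(duration_ms)

# Per-app P95 over the past week; filter to a single app (e.g. 'looker') as needed.
for app, durations in sorted(latencies_by_app.items()):
    p95 = quantiles(durations, n=100)[94] if len(durations) > 1 else durations[0]
    print(f"{app:<20} queries={len(durations):>7}  P95={p95 / 1000:.1f}s")
```

This only approximates per-app attribution; it says nothing about individual Looker models, which would require parsing query text or other metadata.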
Measuring SLAs is hard, but it is only the beginning of the journey. Once you can measure the SLA for each of your data applications, the next step is to improve it by finding and fixing the issues that impact your workloads and users.
Introducing the Cluster Health Dashboard
The Cluster Health Dashboard helps data teams measure and improve their SLAs. It does this by surfacing:
- SLA measures for any connected app, down to individual users
- Proactive Recommendations to improve performance and optimize costs
- The top 5 slowest queries, with quick links to Query Optimization Recommendations
Measure Data SLAs for Any App or User
To see the latency distribution for any connected app (see here for a list of supported apps), simply modify the filter at the top-right of the screen.
Query latency follows a power-law distribution: most queries in the warehouse run quickly, with a long tail of slow queries.
Measuring query latency is therefore best done by looking at different points across that distribution, the most common being the P95 and P99 latencies. These are shown prominently at the top of the page.
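To see why the tail percentiles matter, here is a small, self-contained illustration using simulated heavy-tailed latencies (not real customer data):

```python
"""Illustration with synthetic heavy-tailed latencies: why P95/P99 matter.

The numbers below are simulated (lognormal), not measured on a real cluster.
"""
import numpy as np

rng = np.random.default_rng(42)
latencies_s = rng.lognormal(mean=0.0, sigma=1.5, size=1_000_000)  # heavy right tail

p50, p95, p99 = np.percentile(latencies_s, [50, 95, 99])
print(f"P50={p50:.1f}s  P95={p95:.1f}s  P99={p99:.1f}s")
# With this simulated distribution the median is ~1s while P95/P99 are an order of
# magnitude slower, so an SLA defined on the average or median would hide exactly
# the slow queries that users feel.
```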
Further, we show you a distribution of latencies across the entire cluster, grouped by user (by default). Click on the chart to quickly go to the queries on the Discover page.
Example of SLA metrics for “Looker PDT” queries.
Cluster Recommendations
Our team has significant experience running Amazon Redshift at scale. In fact, we run a very large fleet of Amazon Redshift clusters as part of our product architecture. Based on this experience, we surface observations and Recommendations on how to optimize the cost and performance of the cluster. The Recommendations cover a range of cost and performance optimizations.
Find & Fix Slow Queries
The “Top 5 Longest Running Queries” list defaults to showing the slowest queries in your cluster over the past day.
Click on any of these queries to go to a detail page with recommendations for optimizing that query.
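If you want to pull a similar “slowest queries over the past day” list straight from Redshift’s system tables yourself, here is a rough sketch (connection details are placeholders):

```python
"""Sketch: the five longest-running queries of the past day, from STL_QUERY."""
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="admin", password="...")
cur = conn.cursor()
cur.execute("""
    SELECT q.query,
           DATEDIFF(seconds, q.starttime, q.endtime) AS duration_s,
           TRIM(q.querytxt) AS sql_text
    FROM stl_query q
    WHERE q.starttime > DATEADD(day, -1, GETDATE())
    ORDER BY duration_s DESC
    LIMIT 5
""")
for query_id, duration_s, sql_text in cur.fetchall():
    print(f"query {query_id}: {duration_s}s  {sql_text[:80]}")
```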