Here at Integrate.io, we’re constantly building products and innovating for Amazon Redshift users, and we’d like to think we have our finger on the pulse of what Redshift customers are saying. The number one driver of change that we’ve seen for our clients is that they’re experiencing huge growth in data and query volumes.
A 2016 report by IDG found that the average company now manages 163 terabytes (163,000 gigabytes) of information—and the figures have undoubtedly only increased since the survey was released. Of course, having more data at hand also means that you need higher query volumes to extract all of the insights they contain.
We’ve observed three main drivers behind the growth in data and query volumes among our customers:
-
Growth in the number of data sources connected to a Redshift cluster, and the volume of data coming from those sources. Among many of our clients, we’ve observed exponential data growth rates—even doubling every year.
-
Growth in the number of workflows that are running once data resides inside a cluster. The reason is simple: the more data and sources you have, the more ways you have to join and combine that data.
-
Growth in the number of data consumers who want more data, in more combinations, at a higher frequency. Much of this growth is driven by how easy dashboard tools make it to explore data.
When building data pipelines with Amazon Redshift, all of these growth factors make it quite challenging to monitor and detect anomalies across all of the queries running on the system. For example, a lot of our clients use Looker as their tool for data exploration, and one of the most common complaints with the platform is slow Looker dashboards.
One way of approaching this problem is to follow best practices for setting up your workload management in Amazon Redshift the right way. You get a more granular view of your users and the resources they use, and the WLM gives you the power to isolate them from each other.
But even after optimizing your WLM configuration, you’re still lacking critical insights. For example:
- If you detect a large spike in query volume, how can you identify its cause?
- If you have a data pipeline that is experiencing an overall increase in latency, which individual queries are contributing to the increase?
- If you detect a large spike in Looker PDT performance, which queries are causing the issue?
Until now, it’s been quite hard to surface all of this information in a single place. That’s why we built a way to intuitively and powerfully visualize these insights, helping to make your life as a data engineer easier when working with Amazon Redshift.
The new feature is called “Query Insights,” and it’s available immediately in your Integrate.io dashboard. Below, we’ll check out some examples of using the Query Insights feature in Integrate.io.
Detect a Huge Increase in Query Volumes in Amazon Redshift
In this example, we’ll show how you can use Query Insights to quickly identify that a huge spike in query volume happened, as well as the specific queries that caused it.
The Integrate.io dashboard clearly shows that there has been a query count spike at 8:59 p.m. on January 24. In the left panel in the Integrate.io user interface, click on “query group” to segment the chart by query group. Click on the chart to reveal a slide-in list of query groups, sorted in descending order of Count (i.e. the number of queries). Finally, click on the “View Queries” icon to jump to the Query Details list, where you can see exactly which queries are causing the problem.
Identifying Which Query Group is Causing a Batch Pipeline Slow Down in Amazon Redshift?
In this example, we’ll show you how to use Query Insights to find the cause of an overall increase in latency.
Click on the graph icon in the main Query Insights dashboard to group the data by app. This reveals that the “integrate_collector” app is experiencing a large increase in latency. Next, click on the “integrate_collector” app so that you can group by a custom attribute.
This app has two custom attributes, “dag” and “task”, that appear on the left navigation panel. This is because these queries are tagged using the Integrate.io annotation format. First group by “dag”, and then by “task”, to isolate the specific task that is seeing a spike in latency. Finally, group by “query group” to see the specific queries that are causing a problem.
Find the Query in Amazon Redshift causing a Looker PDT Latency Spike
In this example, let’s monitor our Looker PDT queries to find the queries which are the slowest.
First, click on the “Looker PDT” tag in the dashboard in order to only view data for Looker PDT queries. As you can see in the left navigation panel, the Looker PDT app has a few custom attributes. Click on “model” to group the queries by model name. We immediately see that the “supply chain” model is the slowest. We can click on it and then group by “query group” to find the individual queries causing a problem.
What’s Next
Query Insights is a tremendously valuable tool in your Redshift toolkit, but we’re only getting started. Keep your eyes open for a new feature “Transfer Insights” soon, which will allow you to monitor the users and apps that are loading data and rows into your Amazon Redshift cluster.
Want to try Query Insights out for yourself? Take the opportunity to get a personalized tour of Query Insights, as well as see what’s next on our product roadmap and provide your feature requests and feedback. Sign up today for a free trial of Integrate.io.