A query optimization that dramatically reduces runtime for queries that use window functions.
The Simplified Problem
A common problem we see our customers solving in SQL is how to compare adjacent rows within each group of a dataset. For example, say you have the following data in the product_prices table:
Product  | updated_on          | Price
---------|---------------------|------
widget_A | 2019-07-03 12:00:00 | $105
widget_A | 2019-07-03 11:00:00 | $110
widget_A | 2019-07-03 10:00:00 | $100
widget_B | 2019-07-02 06:00:00 | $255
widget_B | 2019-07-02 05:00:00 | $235
How do you efficiently SELECT the price change at each updated_on time for each product? That is, we'd like to see a result set of:
Product  | Most recent updated time | Price change
---------|--------------------------|-------------
widget_A | 2019-07-03 12:00:00      | -$5
widget_A | 2019-07-03 11:00:00      | $10
widget_B | 2019-07-02 06:00:00      | $20
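If you want to follow along, here is a minimal sketch of the table and its sample data. The column types are assumptions for illustration; the actual DDL isn't shown in this post.

-- Hypothetical DDL: the column names come from the examples above,
-- but the types are assumptions.
create table product_prices (
    product    varchar(32),
    updated_on timestamp,
    price      decimal(10, 2)
);

insert into product_prices values
    ('widget_A', '2019-07-03 12:00:00', 105),
    ('widget_A', '2019-07-03 11:00:00', 110),
    ('widget_A', '2019-07-03 10:00:00', 100),
    ('widget_B', '2019-07-02 06:00:00', 255),
    ('widget_B', '2019-07-02 05:00:00', 235);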
This type of problem arises in many domains. Another common example occurs when you want to compute the time difference between consecutive events. For example, you may have a table that logs a timestamp for each visit a user makes to your website, and you want to analyze the time between visits (we'll return to this example after introducing the better solution below).
A common but sub-optimal way we see customers solve this problem is to use the ROW_NUMBER() window function together with a self join. The algorithm is straightforward: first number the rows within each product by updated_on (descending) using ROW_NUMBER(), then self join the table, mapping each row to the row immediately before it in time. In SQL:
with ordered_prices as (
    select product,
           updated_on,
           price,
           row_number() over (partition by product order by updated_on desc) as row_number
    from product_prices
)
select op1.product,
       op1.updated_on,
       op1.price - op2.price as price_change
from ordered_prices op1
join ordered_prices op2
  on op1.product = op2.product
 and op1.row_number = op2.row_number - 1
This query produces the desired result set, but at the cost of a join. For real-world (i.e. less contrived) examples, solutions like this also often need to incorporate additional sub-selects or other expensive operations which compound the effect of the join and can make the query run excruciatingly slow.
This class of problem can manifest in other ways as well: sometimes, instead of a self join, the query may simply select from ordered_prices where row_number = 1 to get the most recent updated_on and price for each product (see the sketch below). But the general pattern is the same: using ROW_NUMBER() to label rows for further processing or joining.
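For concreteness, that filtering variant looks something like this, reusing the same CTE as above:

with ordered_prices as (
    select product,
           updated_on,
           price,
           row_number() over (partition by product order by updated_on desc) as row_number
    from product_prices
)
-- row_number = 1 is the most recent row per product,
-- because the window is ordered by updated_on desc
select product,
       updated_on,
       price
from ordered_prices
where row_number = 1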
A Better Solution
If you’re using ROW_NUMBER() in a query just to apply filters or joins later on in your transformation, there may be a better way. Rather than co-opting ROW_NUMBER() to compare neighboring rows, a better solution is to apply the window function that was designed for this problem: LAG(). The LAG window function returns a value from an earlier row within the same partition, at a given offset from the current row (one row back by default). The resulting SQL becomes:
select product,
       updated_on,
       price - lag(price) over (partition by product order by updated_on) as price_change
from product_prices
This SQL orders the rows within each product by updated_on and subtracts the previous row's price from the current row's price. This eliminates the need for the join and can greatly simplify the query plan. One note: this result set will include a NULL price_change for the first row of each product (where there is no previous row to compare against), but those rows can be trivially eliminated with a WHERE clause of price_change is not null when the results are used, or by wrapping this query in a subselect, as sketched below.
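Because window functions are evaluated after the WHERE clause, the filter has to live in an outer query. A minimal sketch of that wrapper:

select product,
       updated_on,
       price_change
from (
    select product,
           updated_on,
           price - lag(price) over (partition by product order by updated_on) as price_change
    from product_prices
) changes
-- drops the first row of each product, where there is no previous price
where price_change is not null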
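The same approach handles the page-visit example from earlier. As a sketch, assuming a hypothetical page_visits table with user_id and visited_on columns (these names are illustrative, not from the original example):

select user_id,
       visited_on,
       -- seconds since this user's previous visit; NULL for their first visit
       datediff(second,
                lag(visited_on) over (partition by user_id order by visited_on),
                visited_on) as seconds_since_last_visit
from page_visits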
A Real-World Example
The seed for this blog post was planted when a potential customer came to us with the question we hear all the time: “Why are our queries slow?” After instrumenting their cluster with Integrate.io, they were able to quickly identify their worst queries, see why they were slow, and learn what to do about them. In their case, one of the worst offending queries used this ROW_NUMBER() pattern and triggered a large number of Query Recommendations in their Integrate.io Dashboard.
Since the query exhibited the ROW_NUMBER() pattern above, it caused multiple unnecessary table scans, an expensive join, significant IO skew, and a large amount of intermediate data written to disk during execution. The query operated over a single table of 876 million rows (not a huge table, but large enough to bring Amazon Redshift to its knees if not optimized properly) and was taking over 5.5 hours to execute!
After reviewing our Recommendations, they reimplemented their query using the LAG function to eliminate the join (which also removed two additional sub-selects and greatly simplified the query), and the execution time dropped to 30 seconds. From 5.5 hours to 30 seconds, simply by reviewing targeted Query Recommendations and implementing a straightforward fix: talk about a good ROI on time spent optimizing SQL. Their exact feedback after seeing the new query runtimes? “OMFG. takes 30s. Wow”
Lessons Learned
Amazon Redshift is a fantastic general-purpose data warehouse. But since it is so easy to add data and new workloads to your cluster, it can be very hard to know where and how to start optimizing SQL queries.
Analysts and data engineers often wonder which queries are the worst offenders and where limited SQL tuning effort will give the biggest bang for the buck. Knowing where to put query tuning effort is crucial, since seemingly simple queries can have inefficiencies lurking that compound to slow a query by several orders of magnitude.
The good news: this is the problem that Integrate.io Query Recommendations was designed to solve. We help you pinpoint the root cause of query slowness so that you can spend your limited time optimizing the right queries in the right way.