Amazon Redshift is a distributed, shared-nothing database that scales horizontally across multiple nodes. Query execution time is very tightly correlated with:
- the # of rows and data a query processes.
- the amount of data moving between nodes.
Below is an example of a poorly written query, and two optimizations to make it run faster.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
with
table1_cte as
(
select * from table1
),
table2_cte as
(
select * from table2
),
select
*
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
|
Optimization #1: Limit Rows Processed by using a WHERE clause
Queries can run faster by minimizing the amount of data moving between nodes. In practice, this means being careful when writing multi-stage queries where the results of one stage feeds into the next.
In the case of our example query, modifying your ‘WHERE’ clauses to only select rows needed will minimize the amount of data that needs to be moved around and speed up the query.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
with
table1_cte as
(
select * from table1 where created_at>'{{l_bound}}' and created_at <'{{u_bound}}'
),
table2_cte as
(
select * from table1 where created_at >'{{l_bound}}' and created_at <'{{u_bound}}'
),
select
*
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
|
Optimization #2: Limit Columns Scanned
Amazon Redshift is a columnar-oriented database. As a result, scanning a table doesn’t read each row in its entirety. Instead, individual columns can be scanned without needing to read other columns. This means that you should be careful to only select columns that you will use for your query. Try avoiding using a
‘SELECT *’
operation in all cases.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
with
table1_cte as
(
select id,name,address from table1 where start_time>'{{l_bound}}' and start_time<'{{u_bound}}'
),
table2_cte as
(
select id,name,address from table1 where start_time>'{{l_bound}}' and start_time<'{{u_bound}}'
),
select
a.name,b.address
from table1_cte as a
JOIN table2_cte as b
ON a.id = b.id
|
Using these two optimizations when writing queries will have dramatic positive effects on your query speeds.