Row Skew results when a table uses KEY based distribution and the values in the DISTKEY column are not evenly distributed. The row skew metrics is a positive integer ranging from 1 to the number of rows in the table. Row skew is the ratio of:
- number of rows on the node containing the most number of rows for the table
- number of rows on the node containing the least number of rows for the table
High row skew results in uneven node disk utilization (cost) and slower queries (performance).
The chart below shows a real-world example. With uneven disk utilization, a single node(s) ends up having more rows for that table. This can be a major (cost) problem if you need to add more nodes in your cluster just because a single node is skewed.
With high row skew, doing a straight SELECT on that table will be slower than otherwise. This is because one node has more data than the next, and the query execution must wait for the “slowest” node to send up its data to the leader.
There are two options to eliminate row skew:
- selecting a DISTKEY that is random, or
- change the distribution style to EVEN or ALL
The exception to tolerate row skew is if – and only if – you make a conscious decision to optimize a single query. See the section “Use DISTKEYs Only When Necessary” in this article for more information.