Amazon Redshift is a petabyte-scale data warehouse that has been widely adopted since its release in October 2012. With Redshift, it’s easy to spin up a cluster, pump in data, and begin performing advanced analytics in under an hour.
Because it’s so easy to start using Redshift, however, data engineers often skip Redshift best practices when setting up a cluster. Cutting corners when setting up Redshift may create performance issues down the line, and you’ll pay the price later as your data volume and pipeline complexity grows. Some of the common Redshift pain points are slow queries and lack of workload scalability.
You may have already seen our article on the top performance tuning techniques for Amazon Redshift. In this post, we’ll focus on exactly the opposite topic: the top 3 things not to do when setting up an Amazon Redshift cluster. By scrupulously avoiding these issues, you’ll be paving the way for success as the complexity of your data pipeline grows.
Setting up Redshift Clusters: Don’t Use the Masteruser
For many people, the process of setting up Amazon Redshift looks like this: when launching a Redshift cluster, you create a masteruser, which by default has access to the initial database. Next, the masteruser’s login gets shared, such that ETL pipelines, scheduled jobs, and dashboard tools all log in with the same user.
The problem with this approach is that you lose granularity: it gets much more difficult to understand which people are doing what and running which queries. As you add more users, troubleshooting bad queries starts to become harder and harder. A single masteruser may work if you only have 3 to 5 users accessing Redshift, but it becomes simply intractable once you have 10 or more.
Instead, use Redshift’s CREATE USER command, which creates a new database user account, and create individual logins to isolate your workloads—one user, one login, no exceptions. This way, you’ll have more control and better visibility into your workloads. We also recommend grouping your users by type of workload (e.g. loads, transforms, ad hoc). This will come in handy later when you set up workload management (WLM).
Meanwhile, you should keep the masteruser idle—don’t run any queries with it. The masteruser can serve as your lifeline in case your cluster becomes unresponsive. When nothing else works, you’ll be able use it to run queries (see below).
Setting Up Redshift Clusters: Don’t use a Single Schema
By default, a database in Redshift has a single schema with the name PUBLIC. Also by default, all users have CREATE and USAGE privileges on this PUBLIC schema.
Yet with a single schema and read/write access for all users, you’re creating dependencies that become hard to untangle as data, user and query volume grow. More users create more joins across more tables, which means that you’ll be gradually boxing yourself in. Don’t do it.
Instead, plan your data architecture well in advance. Start with two different schemas: a “raw schema” where you load all your data, and a “data schema” that’s available for running ad hoc queries. You can move data between the schemas with your transform jobs.
As a rule of thumb, users in each role should only have access to the schemas and tables that they need, and no more. That’s also another reason why you should put in place individual users and user groups.
Setting up Redshift Clusters: Don’t use the default WLM queue
Amazon Redshift operates in a queueing model, with query queues and slots. You can use the Amazon Redshift Management Console to define query queues. Redshift then routes queries to the appropriate queues at runtime.
You also have the option to use Amazon Redshift’s automatic workflow management (WLM)—but we wouldn’t recommend it. User sessions and queries can overlap. Some queries can consume cluster resources for long periods of time, which in turn impacts the performance of other queries. That’s why you should isolate your workloads from each other.
By default, Amazon Redshift configures two query queues:
- A superuser queue reserved for superusers. You can’t configure this queue, and it does not appear in the Redshift console.
- A default user queue to run queries. The default configuration for the queue is to run 5 concurrent queries.
There’s a 99 percent chance that the default queue will not work for you (although every company’s workloads are different). But setting up WLM is something people like to ignore because it’s so dang easy to dump data into Redshift—and when things get slow, you can simply add another node. However, this becomes a very expensive proposition over time. Consider that the performance increase of adding more nodes without WLM is degressive.
In our blog post “4 Simple Steps to Set Up your WLM in Amazon Redshift,” we describe in detail how to set up your cluster for better workload scalability. We recommend checking out the whole article, but here are the main four steps:
- Categorize your users into 3 groups (load, transform, ad hoc).
- Define 3 new WLM queues: one each for your loads, transforms and ad hoc queries.
- Assign the user groups to their corresponding queues.
- Make the default queue a catch-all queue with a concurrency of 1 and memory of 1 percent. Do not assign any workloads to this queue.
What’s more, any queries that are not routed to other queues should run in the default queue, which acts as your insurance if something goes wrong. By not assigning your default user to any queue, it will run in the default queue, which will always have that one slot available.
Conclusion
To sum up, the three biggest mistakes to avoid when setting up an Amazon Redshift cluster are:
- Using the masteruser for everything (instead of individual accounts).
- Using a single schema (instead of two separate schemas for data loading and ad hoc queries).
- Using the default WLM queue (instead of three separate WLM queues for loads, transforms, and ad hoc queries).
By avoiding these three critical issues when setting up your Amazon Redshift cluster, you’ll be poised for more scalable workloads and long-term success. However, this is just the start of Redshift performance tuning at scale. As you load more data into your cluster and add more users, things can become slow.
When it’s critical for you to have excellent Amazon Redshift performance, it makes sense to use a product like intermix.io alongside an ETL tool. intermix.io helps you understand how your queries and data loads are performing, giving you complete visibility into your Redshift clusters.