What is AWS Redshift Spectrum?
AWS Redshift Spectrum is a feature that comes automatically with Redshift. It is the tool that allows users to query foreign data from Redshift. Foreign data, in this context, is data that is stored outside of Redshift. When you are creating tables in Redshift that use foreign data, you are using Redshift’s Spectrum tool. Here’s some more info about Redshift Spectrum if you’re interested!
What is the EXTERNAL
command?
EXTERNAL
is the SQL clause for specifying foreign data. It is used within a CREATE command to specify that the SQL object you are creating (a schema or table) is referring to an “external” data source.
When To Use This Service
- You have a lot of data in S3 that you wish to query with common SQL commands, this is common for teams who are building a data lake in S3
- You don’t want to store data directly in Redshift due to storage costs
- See how Integrate can help manage Redshift costs
Setting Up the AWS Environment
In order for AWS Spectrum Redshift to work, you need your AWS environment configured accordingly. Since this process involves two AWS services communicating with each other (Redshift & S3), you need to create IAM roles accordingly. Redshift will assume this IAM role when it communicates with S3, so the role needs to have S3 access. It’s kind of like Redshift saying to S3 “hey, I have this IAM role’s permissions, can I view these files?” And then S3 looks at that IAM role’s permissions and allows Redshift in.
When you configure the IAM role, you need to make sure the following two policies are attached in order for Redshift to have access to S3: AmazonS3ReadOnlyAccess
& AWSGlueConsoleFullAccess
Once you’ve created your IAM role, you need to attach this role to your Redshift cluster. You can do this in the AWS Redshift console by selecting your Redshift cluster, then clicking “Actions” in the top right corner, and then “Manage IAM Roles” in the dropdown. Here, you can add your IAM role to Redshift.
Creating Your Schema
Now that you’ve got your Redshift cluster configured, you can start executing SQL to build out your Spectrum environment. This starts with creating a schema. Here is some sample code for creating a schema called “sample”:
|
create external schema sample from data catalog database 'dev' iam_role 'arn:aws:iam::263101679790:role/example-iam-role' create external database if not exists;
|
Let’s dissect this a little:
|
create external schema sample from data catalog
|
In this first line, we are creating a schema and calling it “sample.” The “data catalog” refers to where metadata about this schema gets stored. The default “data catalog” for Redshift is AWS Athena. If you are using a Hive Metastore, you can create your schema from hive metastore.
Here we are specifying the name of our external database within our Redshift cluster:
|
iam_role 'arn:aws:iam::263101679790:role/example-iam-role'
|
This is where we tell Redshift which IAM role to use when reaching out to S3. When configuring your Redshift cluster, you can assign several IAM roles to your cluster (like we did earlier). But assigning the role to the cluster is not enough, you also need to specify which role you are using when you execute your queries.
|
create external database if not exists;
|
If the database, dev
, does not already exist, we are requesting the Redshift create it for us.
You can find more tips & tricks for setting up your Redshift schemas here.
Creating Your Table
Now that we have an external schema with proper permissions set, we will create a table and point it to the prefix in S3 you wish to query in SQL. There are some “gotchas” to be aware of before we start:
- All of the files in your prefix must be the same format:
- Same headers, delimiters, file types, etc
- You must specify an S3 prefix/folder, not a single file
Here is a SQL command which will create an external table with CSV files that are on S3:
|
create external table sample.animals(name varchar, age integer, species varchar) row format delimited fields terminated by ',' stored as textfile location 's3://redshift-example-123/animals/csv/' table properties ('skip.header.line.count'='1');
|
Let’s break this down:
|
create external table sample.animals(name varchar, age integer, species varchar)
|
We are specifying an “external” table within our “sample” schema called “animals”. The table will have three columns, “name”, “age”, and “species.”
|
row format delimited fields terminated by ','
|
Here we are saying that each field within our CSV is separated by a comma:
|
stored as textfile location 's3://redshift-example-123/animals/csv/'
|
We are telling Redshift that the data is stored as a textfile, as opposed to a Parquet file or other file format. We are also specifying the S3 prefix in which to find our CSV files. Note that we are not specifying the actual file(s), we are specifying the prefix/folder they are in.
|
table properties ('skip.header.line.count'='1');
|
Finally, we are warning redshift that our CSV files contain a header and that header can be skipped when querying the data.
Now that we’ve successfully created the table, we can query the data! What’s important to remember is that Redshift hasn’t examined the data yet. You’ve essentially made a promise to Redshift that your data is formatted in a certain way and is stored in a certain location. It is not until you query the data will Redshift tell you if you’ve kept your promise.
Querying Your Table
Now you can query your table the same way you would any other table!
|
select * from sample.animal;
|
We hope you’ve found this tutorial useful. For more information about working with Redshift, please visit our blog.