In this guide, we’ll go over the Redshift COPY command: how it can be used to import data into your Redshift database, its syntax, and a few issues you may run into.
What is the Redshift COPY command?
Redshift COPY: Syntax & Parameters
How to load data from different S3 regions
Improving Redshift COPY Performance
ERROR: S3ServiceException
Specifying the S3 Region
What is the Redshift COPY command?
The Redshift COPY command, funnily enough, copies data from one source and loads it into your Amazon Redshift database. The source can be one of the following items:
- An Amazon S3 bucket (the most common source)
- An Amazon EMR cluster
- An Amazon DynamoDB table
- An external host (via SSH)
If your table already has data in it, the COPY command will append rows to the bottom of your table. One important thing to note about the Redshift COPY command is that it does not check for or remove duplicate rows.
Because of this, the recommended way of deduplicating records in Amazon Redshift is to use an “upsert” operation: load the new data into a staging table, then merge it into the target table.
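Here’s a rough sketch of the upsert pattern; the table names, bucket path, and IAM role below are placeholders, so adjust them to your own schema:
begin transaction;

-- Load the incoming data into a temporary staging table shaped like the target
create temp table users_staging (like users);

copy users_staging
from 's3://your-bucket/users/'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>';

-- Delete target rows that are being replaced, then insert the fresh copies
delete from users
using users_staging
where users.id = users_staging.id;

insert into users
select * from users_staging;

end transaction;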
Redshift COPY: Syntax & Parameters
The COPY command is pretty simple. To use it, you need three things:
- The name of the table you want to copy your data into
- Your data source (see list of potential data sources above)
- Authorization to access your data source (usually either an IAM role or the access ID and secret key of an IAM user)
Here’s a simple example that copies data from a text file in S3 to a table in Redshift:
copy catdemo
from 's3://awssampledbuswest2/tickit/category_pipe.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';
If the source file doesn’t naturally line up with the table’s columns, you can specify the column order by including a column list in your COPY command, like so:
copy catdemo (column1, column2, etc.)
from 's3://awssampledbuswest2/tickit/category_pipe.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';
AWS assumes that your source is a UTF-8, pipe-delimited text file. If it is not, you need to let COPY know by using the FORMAT AS parameter (for example, FORMAT AS CSV or FORMAT AS JSON). You can learn more about the exact usage in the AWS documentation for COPY. If you have any questions, let us know in the comments!
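For instance, loading a CSV version of the same data might look like the following (the bucket path and IAM role here are placeholders):
copy catdemo
from 's3://your-bucket/category.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
format as csv
region 'us-west-2';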
How to load data from different S3 regions
Some people run into trouble trying to copy data from their own S3 buckets to a Redshift cluster. This can easily happen when an S3 bucket is created in a region different from the one your Redshift cluster is in. The error message you get is not exactly the clearest and can be quite confusing. Fortunately, the error is easy to avoid by adding one extra parameter.
Improving Redshift COPY Performance: Eliminating Unnecessary Queries
By default, the Redshift COPY command automatically runs two commands as part of the COPY transaction:
- “COPY ANALYZE PHASE 1|2”
- “COPY ANALYZE $temp_table_name”
Redshift runs these commands to determine the correct encoding for the data being copied, which may be useful when a table is empty. In the following cases, however, the extra queries are useless and should be eliminated:
- Performing a COPY into a temporary table (i.e. as part of an upsert operation).
- Performing a COPY when the table already has data in it. In Redshift, the data encoding of an existing table cannot be changed. Even if the COPY command determines that a better encoding style exists, it’s impossible to modify the table’s encoding without a deep copy operation.
In one example, a single COPY command generated 18 “analyze compression” commands and a single “copy analyze” command.
These extra queries can create performance issues for other queries running on Amazon Redshift. For example, they can saturate the slots in a WLM queue, causing all other queries to queue up and wait.
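If you suspect this is happening, one rough way to check is to query Redshift’s STV_WLM_QUERY_STATE system view and count how many queries are queued versus executing in each service class (this is just a sketch):
-- Count in-flight queries by WLM service class and state (queued vs. executing)
select service_class, state, count(*) as query_count
from stv_wlm_query_state
group by service_class, state
order by service_class, state;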
The solution is to adjust the COPY command parameters to add COMPUPDATE OFF and STATUPDATE OFF, which disable the automatic compression analysis and statistics update during the load. Below is an example of a COPY command with these options set:
-- Load data into the staging table
COPY users_staging (id, name, city)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx'
COMPUPDATE OFF STATUPDATE OFF;
ERROR: S3ServiceException
For a regular COPY command to work without any special options, the S3 bucket needs to be in the same region as the Redshift cluster.
copy one_column ("number")
from 's3://test-buckets/region_test'
CREDENTIALS 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXXX'
delimiter '\t';
If not, you may get an error similar to this:
ERROR: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid
DETAIL:
-----------------------------------------------
error: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid
code: 8001
context: Listing bucket=test-buckets prefix=region_test
query: 12715653
location: s3_utility.cpp:529
process: padbmaster [pid=31787]
-----------------------------------------------
The error message is a bit ambiguous. What it is actually saying is that you need to add another option to your COPY command to specify your S3 bucket’s region.
Specifying the S3 Region
To do this, add the REGION option to your COPY command. For example, if our S3 bucket were in the us-west-2 region, we would add it to our COPY command like so:
copy one_column ("number")
from 's3://test-buckets/region_test'
CREDENTIALS 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXXX'
delimiter '\t'
region as 'us-west-2';
This method can also be used to verify a Redshift cluster's region if you are not sure which region it is in. If a COPY succeeds without the REGION argument, that confirms the Redshift cluster is in the same region as your S3 bucket.
If you'd like more help with Redshift and managing your data warehouse, contact our support team or click get started to start your free trial.