In this guide, we’ll go over the Redshift COPY command: how it can be used to import data into your Redshift database, its syntax, and a few issues you may run into.
What is the Redshift COPY command?
Redshift COPY: Syntax & Parameters
How to load data from different S3 regions
Improving Redshift COPY Performance
ERROR: S3ServiceException
Specifying the S3 Region
What is the Redshift COPY command?
The Redshift COPY command, funnily enough, copies data from one source and loads it into your Amazon Redshift database. The source can be one of the following items:
- An Amazon S3 bucket (the most common source)
- An Amazon EMR cluster
- An Amazon DynamoDB table
- An external host (via SSH)
If your table already has data in it, the COPY command will append rows to the bottom of your table. One important thing to note about the Redshift COPY command is that it does not check for or remove duplicate rows.
Because of this, the recommended way of deduplicating records in Amazon Redshift is to use an “upsert” operation: load the new data into a staging table, then merge it into the target table.
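Here’s a rough sketch of the upsert pattern; the table names, bucket path, and IAM role below are placeholders, so adjust them to your own schema:
begin transaction;

-- Load the incoming data into a temporary staging table shaped like the target
create temp table users_staging (like users);

copy users_staging
from 's3://your-bucket/users/'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>';

-- Delete target rows that are being replaced, then insert the fresh copies
delete from users
using users_staging
where users.id = users_staging.id;

insert into users
select * from users_staging;

end transaction;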
Redshift COPY: Syntax & Parameters
The COPY command is pretty simple. To use it, you need three things:
- The name of the table you want to copy your data into
- Your data source (see list of potential data sources above)
- Authorization to access your data source (usually either an IAM role or the access ID and secret key of an IAM user)
Here’s a simple example that copies data from a text file in S3 to a table in Redshift:
copy catdemo
from 's3://awssampledbuswest2/tickit/category_pipe.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';
If the source file doesn’t naturally line up with the table’s columns, you can specify the column order by including a column list in your COPY command, like so:
copy catdemo (column1, column2, etc.)
from 's3://awssampledbuswest2/tickit/category_pipe.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';
AWS assumes that your source is a UTF-8, pipe-delimited text file. If it is not, you need to let COPY know by using the FORMAT AS parameter (for example, FORMAT AS CSV or FORMAT AS JSON). You can learn more about the exact usage in the AWS documentation for COPY. If you have any questions, let us know in the comments!
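For instance, loading a CSV version of the same data might look like the following (the bucket path and IAM role here are placeholders):
copy catdemo
from 's3://your-bucket/category.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
format as csv
region 'us-west-2';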
How to load data from different S3 regions
Some people run into trouble trying to copy data from their own S3 buckets to a Redshift cluster. This can easily happen when an S3 bucket is created in a region different from the one your Redshift cluster is in. The error message you get is not exactly the clearest and can be quite confusing. Fortunately, the error is easy to avoid by adding one extra parameter.
Improving Redshift COPY Performance: Eliminating Unnecessary Queries
By default, the Redshift COPY command automatically runs two commands as part of the COPY transaction:
- “COPY ANALYZE PHASE 1|2”
- “COPY ANALYZE $temp_table_name”
Redshift runs these commands to determine the correct encoding for the data being copied, which may be useful when a table is empty. In the following cases, however, the extra queries are useless and should be eliminated:
- Performing a COPY into a temporary table (i.e. as part of an upsert operation).
- Performing a COPY when the table already has data in it. In Redshift, the data encoding of an existing table cannot be changed. Even if the COPY command determines that a better encoding style exists, it’s impossible to modify the table’s encoding without a deep copy operation.
In one example, a single COPY command generated 18 “analyze compression” commands and a single “copy analyze” command.
These extra queries can create performance issues for other queries running on Amazon Redshift. For example, they can saturate the slots in a WLM queue, causing all other queries to queue up and wait.
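If you suspect this is happening, one rough way to check is to query Redshift’s STV_WLM_QUERY_STATE system view and count how many queries are queued versus executing in each service class (this is just a sketch):
-- Count in-flight queries by WLM service class and state (queued vs. executing)
select service_class, state, count(*) as query_count
from stv_wlm_query_state
group by service_class, state
order by service_class, state;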
The solution is to adjust the COPY command parameters to add COMPUPDATE OFF and STATUPDATE OFF, which disable the automatic compression analysis and statistics update during the load. Below is an example of a COPY command with these options set:
-- Load data into the staging table
COPY users_staging (id, name, city)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx'
COMPUPDATE OFF STATUPDATE OFF;
ERROR: S3ServiceException
For a regular COPY command to work without any special options, the S3 bucket needs to be in the same region as the Redshift cluster.
copy one_column ("number")
from 's3://test-buckets/region_test'
CREDENTIALS 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXXX'
delimiter '\t';
If not, you may get an error similar to this:
ERROR: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid
DETAIL:
-----------------------------------------------
error: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid
code: 8001
context: Listing bucket=test-buckets prefix=region_test
query: 12715653
location: s3_utility.cpp:529
process: padbmaster [pid=31787]
-----------------------------------------------
The error message is a bit ambiguous. What it is actually saying is that you need to add another option to your COPY command to specify your S3 bucket’s region.
Specifying the S3 Region
To do this, add the REGION option to your COPY command. For example, if our S3 bucket were in the us-west-2 region, we would add it to our COPY command like so:
copy one_column ("number")
from 's3://test-buckets/region_test'
CREDENTIALS 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXXX'
delimiter '\t'
region as 'us-west-2';
This method can also be used to verify a Redshift cluster's region if you are not sure which region it is in. If a COPY succeeds without the REGION argument, that confirms the Redshift cluster is in the same region as your S3 bucket.
If you'd like more help with Redshift and managing your data warehouse, contact our support team or click get started to start your free trial.