Amazon Redshift offers amazing performance at a fraction of the cost of traditional BI databases. But when it comes to data manipulation such as INSERT, UPDATE, and DELETE queries, there are some Redshift specific techniques that you should know, in order to perform the queries quickly and efficiently. In this article, I’d like to introduce one of such techniques we use here at Integrate.io.
COPY Command is your friend
If you want to insert many rows into a Redshift table, the INSERT query is not a practical option because of its slow performance. Instead, Redshift offers the COPY command provided specifically for bulk inserts. It lets you upload rows stored in S3, EMR, DynamoDB, or a remote host via SSH to a table. It’s much more efficient compared to INSERT queries when run on a huge number of rows. So, the COPY command is good for inserting a large number of rows. But what if you want to UPDATE and/or DELETE a large number of records?
UPSERT
UPSERT (UPdate or inSERT) is a common technique to insert or update a large number of rows to a table. It works this way:
-
Create a staging table using the same table definition as the target table.
CREATE TABLE users_staging (
id integer,
name varchar(255),
city varchar(255),
PRIMARY KEY (id)
);
-
Upload rows to the staging table using the COPY command. Rows you want to insert and rows you want to update may be mixed together in the staging table.
COPY users_staging (id, name, city)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx';
-
Run an UPDATE query to update rows in the target table, whose corresponding rows exist in the staging table.
UPDATE users
SET name = s.name, city = s.city
FROM users_staging s
WHERE users.id = s.id;
-
Run an INSERT query to insert rows which do not exist in the target table.
INSERT INTO users
SELECT s.* FROM users_staging s
LEFT JOIN users
ON s.id = users.id
WHERE users.id is NULL;
As you can see, you can perform bulk inserts and updates with 3 commands, COPY, UPDATE and INSERT. But what if you also have rows that need to be deleted?
UPSERT + Bulk DELETEs
Since UPSERT doesn’t handle deletes, you need to issue another set of commands to delete rows from the target table. You can use a staging table to delete rows all at once.
-
Create a staging table. This staging table, unlike the staging table for UPSERT, may omit columns other than the primary key columns because only the primary key columns will be used.
CREATE TABLE users_staging2 (
id integer,
PRIMARY KEY (id)
);
-
Upload the rows to be deleted to a staging table using a COPY
command.
COPY users_staging2 (id)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx';
-
Run a DELETE
query to delete rows from the target table whose primarykeyexist in the staging table. The query might look like this:
DELETE FROM users
USING users_staging2 s
WHERE users.id = s.id;
With the two additional commands (COPY
and DELETE
) you can bulk insert, update and delete rows. But it’s a total of 2 COPY
commands and 3 data manipulation commands (INSERT
, UPDATE
and DELETE
.) It’s a lot of queries especially if you have many tables or if you want to update data frequently. Also, if you looks at these INSERT
, UPDATE
and DELETE
queries, all 3 involves a JOIN
. This is a very expensive operation we’d like to avoid if possible. Now, what can we do?
DELSERT comes to your rescue!
At Integrate.io, we use a technique we call DELSERT (DELete and inSERT) to improve the bulk upload performance. It’ll cut down the number of commands from 5 to 3 and the number of JOIN queries from 3 to 1. Let me show you how it works.
-
Create a staging table. In addition to columns from the target table, add anextracolumn which tells that the rowisfor insert, update or delete.
CREATE TABLE users_staging3 (
id integer,
name varchar(255),
city varchar(255),
row_type varchar(1), -- ‘i’, ‘u’ or ‘d’: indicating the type
-- of operation to be done to the row.
PRIMARY KEY (id)
);
-
Upload all rows (insert, delete, update) to a staging table using a COPY
command. Each row has a value indicating what it’s for, insert/update/delete, in the extra column.
COPY users_staging3 (id, name, city, row_type)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx';
-
Run a DELETE query to delete rows from the target table whose primary key exists in the stagingtablefor delete or update.
DELETE FROM users
USING users_staging3 s
WHERE users.id = s.id AND (row_type = ‘u’ OR row_type = ‘d’);
-
Run an INSERT
query to insert rows marked for insert or update.
INSERT INTO users (id, name, city)
SELECT id, name, city FROM users_staging3 s
WHERE row_type = ‘i’ OR row_type = ‘u’;
As you can see, a set of updates are done using only 3 SQL queries (COPY
, DELETE
and INSERT
) instead of the previous 5. Also, unlike our original UPSERT
, this INSERT
does not involve a JOIN, so it is much faster than the INSERT
query used in an UPSERT
.
Summary
Here’s a summary of the queries used in (1) an UPSERT + bulk DELETE; vs., (2) DELSERT.
UPSERT + Bulk DELETE
BEGIN;
-- Create a staging table for UPSERT
CREATE TABLE users_staging (
id integer,
name varchar(255),
city varchar(255),
PRIMARY KEY (id)
);
-- Upload data into the staging table
COPY users_staging (id, name, city)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx';
-- Update records
UPDATE users
SET name = s.name, city = s.city
FROM users_staging s
WHERE users.id = s.id;
-- Insert records
INSERT INTO users
SELECT s.* FROM users_staging s
LEFT JOIN users
ON s.id = users.id
WHERE users.id is NULL;
-- Drop the staging table
DROP TABLE users_staging;
-- Create a staging table for delete
CREATE TABLE users_staging2 (
id integer,
PRIMARY KEY (id)
);
-- Upload data to the staging table
COPY users_staging2 (id)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx';
-- Delete records
DELETE FROM users
USING users_staging2 s
WHERE users.id = s.id;
-- Drop the staging table
DROP TABLE users_staging2;
COMMIT;
DELSERT
BEGIN;
-- Create a staging table
CREATE TABLE users_staging (
id integer,
name varchar(255),
city varchar(255),
row_type varchar(1),
PRIMARY KEY (id)
);
-- Upload data to the staging table
COPY users_staging (id, name, city, row_type)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx';
-- Delete data
DELETE FROM users
USING users_staging s
WHERE users.id = s.id AND (row_type = ‘u’ OR row_type = ‘d’);
-- Insert data
INSERT INTO users (id, name, city)
SELECT id, name, city FROM users_staging s
WHERE row_type = ‘i’ OR row_type = ‘u’;
-- Drop the staging table
DROP TABLE users_staging;
COMMIT;
Conclusion
While UPSERT
is a fairly common and useful practice, it has some room for performance improvement, especially if you need to delete rows in addition to just INSERT
s and UPDATE
s. DELSERT
is a more streamlined alternative, which minimizes the number of queries and also improves the performance of some of the queries.
How Integrate.io Can Help
Integrate.io provides continuous, near real-time replication between RDS, MySQL and PostgreSQL databases to Amazon Redshift. The Integrate.io Sync tool is an intuitive, powerful, cost-effective way to automatically sync, capture and replicate the changes from your transactional databases to your data warehouse on AWS in a single interface with no manual scripting!
References