MySQL is the world's most popular free, open-source database because it is powerful, flexible, and extremely reliable. But when it comes to data analytics, many companies turn to Amazon Redshift to complement MySQL. There are several ways to replicate your MySQL data to Redshift, but first, let's dig a little deeper into why you should replicate your MySQL database to Redshift at all.

(Image source: https://www.c-sharpcorner.com/article/what-is-the-most-popular-database-in-the-world/)

Five things to know about moving data from MySQL to Redshift:

  • Tens of thousands of companies use MySQL to power their web-based applications and services.
  • However, MySQL is quickly bogged down by even the smallest analytical queries, making data analytics difficult and putting your entire application at risk of crashing.
  • Redshift is built to handle petabytes of data and deliver analytics in a fraction of the time.
  • The benefits of moving data from MySQL to Amazon Redshift include maintaining application performance, scalability, and faster analytics.
  • When you pair MySQL and Redshift, you also eliminate the risk of crashing your production database while running queries.

    How to Move Data From MySQL to Redshift?

    You can connect MySQL and Redshift for better performance, scalability, and other benefits. Many companies that use MySQL to power their web applications choose Redshift for data analytics.

    There are four ways to move data between MySQL and Redshift: Import & Export, Incremental SELECT & COPY, Change Data Capture (CDC) with Binlog, and Integrate.io ETL.

    Why Replicate Data From MySQL to Redshift?


    There are several reasons why you should replicate MySQL data to the Redshift data warehouse.

    • Maintain application performance. As we’ve already mentioned, running analytical queries on your production MySQL database can severely impact its performance. It could even cause it to crash. Analytical queries are incredibly resource-intensive and require dedicated computing power.
    • Analyze ALL of your data. As an OLTP database (Online Transaction Processing), MySQL is designed for transactional data like customer records and financial data. However, you want to draw insights from your entire set of data, including the non-transactional type. You can use Redshift to capture and analyze all of your data in one place.
    • Faster analytics. Redshift is a Massively Parallel Processing (MPP) data warehouse, which means it can crunch huge sets of data in a fraction of the time. MySQL, on the other hand, struggles to scale to the computing power needed for large, modern analytical queries. Even a MySQL replica database will struggle to achieve the same speed as Redshift.
• Scalability. MySQL was designed to run on a single-node instance, not on modern distributed cloud infrastructure. Therefore, scaling beyond a single node requires time- and resource-intensive techniques like sharding or primary-replica setups. All of this slows down the database even further.

    How to Connect Amazon Redshift With MySQL

    Because of MySQL’s inherent weaknesses, many companies replicate data to Redshift for their analytics needs. There are four ways to accomplish this:

    • Import & Export
    • Incremental SELECT & COPY
    • Change Data Capture (CDC) with Binlog
    • Integrate.io ETL

    Import & Export

The simplest way to replicate to Redshift is to export your entire MySQL database. However, it is also the least efficient method.

    There are three steps:

    • Export
    • Transform
    • Import (or load)


    Step 1: Export

    To start, export data using MySQL’s mysqldump command. A typical mysqldump command looks like this:

    $ mysqldump -h yourmysqlhost -u user mydatabase mytable1 mytable2 --result-file dump.sql
    

The output of this command is a SQL file containing the statements needed to recreate your tables and data. You cannot run this SQL on Redshift as is; you'll have to transform it into a format suitable for Redshift import.

    Step 2: Transform

For the best upload performance, convert the exported data into TSV (tab-separated values) files, upload those files to Amazon S3, and then load them into a Redshift table with the COPY command.

For example, a row of data in your MySQL dump would look like this:

mysql> INSERT INTO `users` (`id`, `firstname`, `lastname`) VALUES (1923, 'John', 'Smith'), (1925, 'Tommy', 'King');
    

After transformation to TSV, those rows will look like this:

    1923 John Smith

    1925 Tommy King

    Note that values are separated by a tab character (\t).

    You may also have to convert data values to be Redshift compatible. This is because MySQL and Redshift support different column and data types.

    For example, the DATE value ‘0000-00-00’ is valid in MySQL, but will throw an error in Redshift. You have to convert the value into an acceptable Redshift format, like ‘0001-01-01.’
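If you script the transformation yourself, the value-conversion step might look like the minimal Python sketch below. It assumes the rows have already been extracted to a flat CSV file; the file names and the zero-date rule are illustrative only.

import csv

def to_redshift_value(value):
    """Rewrite MySQL values that Redshift rejects."""
    if value == "0000-00-00":          # valid DATE in MySQL, an error in Redshift
        return "0001-01-01"
    return value

# Assumes the exported rows have already been written to a comma-separated file.
with open("users_export.csv", newline="") as src, open("users.tsv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow([to_redshift_value(v) for v in row])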

    Step 3: Import (Load)

    After you’ve transformed your MySQL statement, the last step is to import it from S3 to Redshift. To do this, simply run the COPY command:

COPY users
FROM 's3://my_s3_bucket/unload-folder/users_'
CREDENTIALS 'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key'
DELIMITER '\t';
    

    Downsides of Import & Export

    Although Import and Export is the simplest way to replicate to Redshift, it is not ideal for frequent updates.

For example, it takes roughly 30 minutes to export 18 GB of data from MySQL over a 100 Mbps network (18 GB is about 144 gigabits, or roughly 24 minutes of raw transfer time before any overhead). It takes another 30 minutes to import that data into Redshift, and any connection issue during import or export forces you to start the process over.

    Incremental SELECT & COPY is a more efficient method of replicating MySQL to Redshift.

    Incremental SELECT & COPY

If Import & Export is too slow for your needs, incremental SELECT & COPY might be your answer.

    The SELECT & COPY method only updates the records that have changed since the last update. This takes considerably less time and bandwidth compared to importing and exporting your entire dataset. SELECT & COPY enables you to sync MySQL and Redshift much more frequently.

    To use incremental SELECT & COPY, your MySQL table has to meet a couple of conditions:

• The table must have an updated_at column whose timestamp is updated every time a row is changed.
• The table must have a unique key (or a composite unique key).

    Like Import & Export, there are three steps to this method:

    1. Export

    Incremental SELECT exports only the rows that have changed since the last update. The SELECT query you run on MySQL looks like this:

SELECT * FROM users WHERE updated_at >= '2016-08-12 20:00:00';
    

    Save the result to a file for transformation.
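As a rough sketch, the incremental export can also be scripted with a MySQL client library such as pymysql; the connection details, column list, and file name below are assumptions for illustration.

import csv
import pymysql

LAST_SYNC = "2016-08-12 20:00:00"   # recorded after the previous successful sync

conn = pymysql.connect(host="yourmysqlhost", user="user",
                       password="secret", database="mydatabase")
with conn.cursor() as cur, open("users_increment.tsv", "w", newline="") as out:
    cur.execute(
        "SELECT id, firstname, lastname, updated_at FROM users WHERE updated_at >= %s",
        (LAST_SYNC,),
    )
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in cur:                  # stream rows straight into the TSV file
        writer.writerow(row)
conn.close()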

    2. Transformation

This transformation step is the same as in the Import & Export method. Transform the MySQL data into TSV format for Redshift.

    3. Import

    At this point, your MySQL TSV file includes both updated rows and newly inserted rows. You cannot simply run a COPY command straight to your destination Redshift table. This would cause the updated rows to be duplicated.

    To avoid duplicate rows, use the DELSERT (DELete + inSERT) technique:

    1. Create a temporary table on Redshift with the same definition as the destination table.
    2. Run COPY command to upload data to the temporary table.
3. Delete rows from the destination table that also exist in the temporary table. It will look like this:
    DELETE FROM users USING users_staging s WHERE users.id = s.id;
    

    Where “id” is the unique key of the table.

4. Finally, insert rows from the temporary table into the destination table:

    INSERT INTO users (id, firstname, lastname, updated_at) SELECT id, firstname, lastname, updated_at FROM users_staging s;
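Putting the four DELSERT steps together, a minimal sketch using the psycopg2 driver (Redshift speaks the PostgreSQL wire protocol) might look like this; the cluster endpoint, credentials, and S3 path are placeholders.

import psycopg2

conn = psycopg2.connect(host="your-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="awsuser", password="secret")
with conn, conn.cursor() as cur:
    # 1. Temporary staging table with the same definition as the destination table.
    cur.execute("CREATE TEMP TABLE users_staging (LIKE users);")

    # 2. Load the incremental TSV file from S3 into the staging table.
    cur.execute("""
        COPY users_staging
        FROM 's3://my_s3_bucket/unload-folder/users_'
        CREDENTIALS 'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key'
        DELIMITER '\t';
    """)

    # 3. Remove destination rows that are about to be replaced.
    cur.execute("DELETE FROM users USING users_staging s WHERE users.id = s.id;")

    # 4. Insert the new and updated rows.
    cur.execute("""
        INSERT INTO users (id, firstname, lastname, updated_at)
        SELECT id, firstname, lastname, updated_at FROM users_staging s;
    """)
# The with-block commits the transaction on success and rolls back on error.
conn.close()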
    

    Downsides of SELECT & COPY

Incremental SELECT & COPY is more efficient than Import & Export, but it has limitations of its own.

    The main problem is that rows deleted from your MySQL table stay in Redshift indefinitely. This isn’t an issue if you want to keep historical data on Redshift while purging old data from MySQL. Otherwise, deleted rows in Redshift can cause major headaches during data analysis.

    Another drawback of this method is that it doesn’t replicate table schema changes. When a column is added or dropped from the MySQL table, you need to make the corresponding change on the Redshift table manually.

    Finally, the query used to pull updated rows from a MySQL table can affect the performance of your MySQL database.

    If any of these drawbacks are a dealbreaker, this next method is for you: Change Data Capture with Binlog.

    Change Data Capture With Binlog

Change Data Capture (CDC) is a technique that captures changes made to data in MySQL and applies them to the destination Redshift table. It's similar to incremental SELECT & COPY in that it only imports changed data, not the entire database.

    Unlike Incremental SELECT & COPY, however, CDC allows you to achieve true replication of MySQL to Redshift.

To use the CDC method with a MySQL database, you must use the binary log (Binlog). The Binlog lets you capture data changes as a stream, enabling near real-time replication.

    Binlog not only captures data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from MySQL are also deleted in Redshift.

    Getting Started with Binlog

    When you use CDC with Binlog, you are actually writing an application that reads, transforms, and imports streaming data from MySQL to Redshift.

There is an open-source library called mysql-replication-listener that you can use to do this. This C++ library offers a streaming API to read data from the MySQL Binlog in real time. Higher-level wrappers are also available for a couple of languages, including kodama (Ruby) and python-mysql-replication (Python).

    1. Setup

First, set the MySQL configuration parameters to enable the Binlog. The key parameters are log_bin, binlog_format, expire_logs_days, and replicate-wild-do-table. The log_bin parameter sets the path where Binlog files are stored:

    log_bin = /file_path/mysql-bin.log
    

binlog_format determines how Binlog events are stored in the Binlog file. There are three supported formats: STATEMENT, MIXED, and ROW.

STATEMENT format saves queries in the Binlog files as is (e.g., UPDATE users SET firstname='Tom' WHERE id=293;). Although it keeps Binlog files small, it causes problems when used for replication. For replication to Redshift, use ROW format.

ROW format saves the changed values in the Binlog files. It increases the Binlog file size but ensures data consistency between MySQL and Amazon Redshift.

expire_logs_days determines how many days Binlog files are kept. We recommend keeping Binlog files for a couple of days, so you have time to address any issues that arise during replication.

Finally, specify the tables you'd like to replicate in the replicate-wild-do-table parameter. Only the specified tables go into the Binlog files.

If you use a MySQL replication slave server as the source, it's important to set log-slave-updates to TRUE. Otherwise, data changes made on the replication master will not be logged in the Binlog.

Also, your MySQL account needs the following privileges to perform replication-related tasks (a quick way to verify both the Binlog settings and these grants is sketched after this list):

    • REPLICATION SLAVE
    • SELECT
    • RELOAD
    • REPLICATION CLIENT
    • LOCK TABLES
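Before building the CDC application, it's worth confirming the server is configured as described above. A minimal sketch using the pymysql package (connection details are placeholders) could check the Binlog settings and the account's grants:

import pymysql

conn = pymysql.connect(host="yourmysqlhost", user="repl_user", password="secret")
with conn.cursor() as cur:
    # Confirm the Binlog is enabled and uses ROW format.
    for variable in ("log_bin", "binlog_format", "expire_logs_days", "log_slave_updates"):
        cur.execute("SHOW VARIABLES LIKE %s", (variable,))
        print(cur.fetchone())        # e.g., ('binlog_format', 'ROW')

    # Confirm REPLICATION SLAVE/CLIENT, SELECT, RELOAD, and LOCK TABLES are granted.
    cur.execute("SHOW GRANTS")
    for grant in cur.fetchall():
        print(grant[0])
conn.close()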

    2. Export & Transformation

    When you use the Binlog, “export” is really a real-time data stream of your MySQL Binlog files. How the Binlog data is delivered depends on the API you use.

    For example, with kodama, Binlog data is delivered as a stream of binlog events.

    Kodama lets you register event handlers for different event types (insert, update, delete, alter table, create table, etc.). Your application will receive Binlog events. It will then generate an output ready for Redshift import (for data changes) or schema change (for table schema changes).

The data change import is similar to the Transformation step of our other replication methods. Unlike the others, however, Binlog allows you to handle delete events. You need to handle delete events carefully to keep Redshift in sync without hurting upload performance.
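The python-mysql-replication library mentioned above works the same way. A minimal, heavily simplified reader might look like the sketch below; the connection settings and server_id are placeholder assumptions, and the Redshift-loading logic is left as comments.

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

MYSQL_SETTINGS = {"host": "yourmysqlhost", "port": 3306,
                  "user": "repl_user", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,            # must be unique among servers connecting as replicas
    blocking=True,            # keep waiting for new events (near real-time)
    resume_stream=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

try:
    for event in stream:
        for row in event.rows:
            if isinstance(event, WriteRowsEvent):
                pass   # queue row["values"] for a batched COPY into Redshift
            elif isinstance(event, UpdateRowsEvent):
                pass   # use row["before_values"] / row["after_values"] for a DELSERT
            elif isinstance(event, DeleteRowsEvent):
                pass   # issue a DELETE against the Redshift table
finally:
    stream.close()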

    3. Import

The final step of moving data from MySQL to Redshift is to import your Binlog data stream.

The problem is that Redshift doesn't have streaming upload functionality. Instead, use the DELSERT import technique we outlined in the Incremental SELECT & COPY method.

    Downsides of Binlog

    Binlog is the ideal method of replication from MySQL to Redshift, but it still has downsides:

    Building your CDC application requires serious development effort.

    In addition to the data streaming flow described above, you will have to build:

• Transaction management. Track your position in the Binlog stream so that if an error forces your application to stop while reading Binlog data, it can pick up where it left off.

    • Data buffering and retry. Similarly, Redshift can become unavailable while your application is sending data. Your application needs to buffer unsent data until the Redshift cluster comes back online. If this step is done incorrectly, it can cause data loss or duplicate data.

• Table schema change support. A table schema change Binlog event (ALTER/ADD/DROP TABLE) arrives as a native MySQL SQL statement that does not run on Redshift as is. To support table schema changes, you'll have to convert MySQL statements into the corresponding Amazon Redshift statements, as sketched below.
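As a rough illustration of that last point, the conversion layer often boils down to a type-mapping table plus statement rewriting. The mapping below is a partial, illustrative sketch, not an authoritative list.

# Partial, illustrative mapping of MySQL column types to Redshift equivalents.
MYSQL_TO_REDSHIFT_TYPES = {
    "tinyint": "SMALLINT",
    "int": "INTEGER",
    "bigint": "BIGINT",
    "double": "DOUBLE PRECISION",
    "datetime": "TIMESTAMP",
    "text": "VARCHAR(65535)",   # Redshift has no unbounded TEXT type
}

def convert_add_column(table, column, mysql_type):
    """Rewrite a MySQL ADD COLUMN change as a Redshift ALTER TABLE statement."""
    redshift_type = MYSQL_TO_REDSHIFT_TYPES.get(mysql_type.lower(), mysql_type.upper())
    return f"ALTER TABLE {table} ADD COLUMN {column} {redshift_type};"

print(convert_add_column("users", "age", "int"))
# ALTER TABLE users ADD COLUMN age INTEGER;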

If you don’t want to spend developer resources on your own CDC Binlog application, there’s a fourth method: using Integrate.io ETL to migrate MySQL to Redshift.

    Integrate.io: MySQL to Redshift Replication in Minutes

    Integrate.io’s ETL pipeline utilizes the MySQL Binlog to replicate data to Redshift in near real-time.

    Integrate.io manages the entire MySQL to Redshift migration, so you don’t have to. You can even synchronize multiple MySQL databases (and other types of databases as well) to Redshift at the same time.

    Setting up Integrate.io is so simple that you can do it over your lunch break:

    1. Enable binary logging on MySQL.
2. Provide Integrate.io with access credentials for your MySQL servers and for Amazon Redshift.
    3. Recreate your MySQL tables on Amazon Redshift.
    4. When you’re ready, Integrate.io will replicate your MySQL data to Redshift.
    5. Whenever a change is made in the Binlog (data or schema change), Integrate.io will automatically replicate those changes to Redshift.

    In addition, Integrate.io automatically maps MySQL data types into formats used by Redshift. This eliminates the need for your team to do this manually.

    Integrate.io isn’t the only ETL-as-a-Service on the market, but it is the simplest and most reliable ETL. Our world-class, 24/7 customer support will proactively monitor your pipeline to ensure you’re always up and running.

    How Integrate.io Can Help Move Data From MySQL to Redshift

    You rely on MySQL to power your business, but its limitations with data analytics are well-known. Redshift provides a simple, powerful solution to your BI needs. Together, MySQL and Redshift can push your business to the next level.

    As you’ve seen, there are numerous ways to replicate data from MySQL to Redshift. Methods range from simple to complex, and painfully slow to nearly real-time. The method you choose depends on several factors, such as:

    • Replication frequency
    • Size of your MySQL dataset
    • Available developer resources

    The fastest, truest replication method is Change Data Capture (CDC), which utilizes MySQL’s Binlog. The downside is that it requires developer hours to build and maintain the application.

    That’s where an off-the-shelf tool like Integrate.io comes in.

    Use Integrate.io for fast, reliable replication from MySQL to the Redshift database without the hassle and headache. Its ETL pipeline utilizes the MySQL Binlog and replicates data to Amazon’s data warehouse quickly. Or you can synchronize several MySQL databases to Redshift simultaneously. Integrate.io also provides over 200 data connectors that effortlessly sync datasets between data sources and destinations, such as various relational databases, transactional databases, SaaS tools, data warehouses, and data lakes.

    Other benefits of Integrate.io include:

    • Create no-code ETL and Reverse ETL pipelines in minutes with a drag-and-drop interface.
    • Get the fastest ETL data replication in the industry and unify your data every 60 seconds, helping you create a single source of truth for reporting.
    • Optimize API management with secure, self-hosted REST API code automation that powers your data products.
    • Improve data observability with custom alerts and monitoring.

Take advantage of Integrate.io's 14-day free trial and get more value when moving data from MySQL to Redshift. Or, if you want to solve unique data migration and data transfer problems, schedule an intro call with an expert. Integrate.io will identify your pain points and provide you with solutions one-on-one.

    FAQs

    How to load data from SQL Server to Redshift?

    You can load data from SQL Server to Amazon Redshift using several methods:

    1. Manual Method: This involves exporting data from SQL Server, preparing the data, uploading it to Amazon S3, and then copying it into Redshift.
      • Prepare Your Redshift Cluster: Ensure your Redshift cluster is running, create an IAM role with S3 access, attach the IAM role to your Redshift cluster, and open necessary inbound rules.
      • Export Data from Microsoft SQL Server: Choose the tables to export from SQL Server, then export the data to flat files (CSV format is commonly used) using SQL Server Management Studio (SSMS) or a command-line tool like BCP.
      • Prepare the Data Files: Cleanse the data to conform to Redshift's requirements, which might include removing or escaping special characters and converting date and time formats. If the data files are large, split them into smaller chunks to optimize the upload and copy process.
      • Upload Data to Amazon S3: Create an S3 bucket, then use the AWS Command Line Interface (CLI) or the S3 Management Console to upload your CSV files to the S3 bucket.
      • Prepare Your Redshift Database: Create schemas and tables in Redshift to match the structure of the SQL Server data.
      • Copy Data from S3 to Redshift: Use the Redshift COPY command to load data from the S3 bucket into the Redshift table.
      • Verify Data Integrity: Run queries to verify that the data integrity has been maintained.
      • Optimize Performance: Analyze the tables to update the query planner's statistics and apply sort keys and distribution keys to optimize query performance.
    2. Using Airbyte: Airbyte allows you to replicate data from SQL Server to Redshift by configuring both as Airbyte Cloud source and destination, respectively. After setting the replication frequency, choose the tables to sync and set the sync mode for each table individually.
    3. Custom ETL Scripting: Custom ETL scripting gives you full control over the code and enables you to customize it to meet your specific requirements. However, custom ETL scripting can be time-consuming and resource-intensive to set up and maintain.
      • Set up your environment.
      • Extract data from SQL Server: Connect to the SQL Server database and extract the data you want to migrate into Redshift.
      • Transform data: Clean and reformat the data and convert data types to prepare it for loading into Redshift.
      • Load data into Redshift: Load the extracted and transformed data into Redshift using the AWS CLI.
    4. AWS Database Migration Service (DMS): AWS DMS can be used to migrate an on-premises Microsoft SQL Server database to Amazon Redshift.

    Is Redshift faster than MySQL?

For analytical queries, yes. Redshift's Massively Parallel Processing (MPP) architecture can crunch huge datasets in a fraction of the time MySQL needs. MySQL, however, remains well suited to the transactional (OLTP) workloads it was designed for.

    Can we connect to Redshift using MySQL Workbench?

No. MySQL Workbench connects only to MySQL-compatible servers, while Redshift is based on PostgreSQL. To query Redshift, use a PostgreSQL-compatible SQL client (such as SQL Workbench/J or psql) or the Amazon Redshift query editor.

    Can I use SQL in Redshift?

    Yes, you can use SQL in Redshift. Redshift is a data warehouse that uses SQL as its query language.

    Is Redshift MySQL or PostgreSQL?

    Redshift is neither MySQL nor PostgreSQL, but it is based on PostgreSQL. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.