Automated ETL (extract, transform, load) and data integration workflows are essential for the modern data-driven organization, and they can swiftly and efficiently migrate data from sources to a target data warehouse or data lake. But ETL must run at regular intervals — or even in real-time — so how can you know which information is fresh and which information you’ve already ingested?
Solving this problem is the goal of change data capture (CDC) techniques. When should you use change data capture, exactly? In this article, we put on our “ETL expert” hat and offer advice on this topic.
What is Change Data Capture?
Change data capture (CDC) is the term for a collection of data integration techniques that identify the new or changed data from a particular source. This new or changed data can then migrate to your cloud data warehouse or data lake.
By applying change data capture, you can selectively ingest only the data that is not yet available for business intelligence (BI) and analytics workloads. CDC makes it possible to perform ETL at regular intervals, which prevents the data in your warehouse from going stale and facilitates smarter data-driven decision-making.
When Should You Use Change Data Capture?
Change data capture benefits your organization by saving a lot of time and effort in the long run, especially during the computationally intensive transformation stage of ETL. However, using change data capture may not be necessary for every data source.
For example, consider a streaming data source like a weather monitor with information that changes every time you access it. Such a data source does not require CDC, simply because all the data is “new” and there will be no residual information from the last time you performed data extraction.
On the other hand, CDC is crucial for source tables and databases that act as long-term data storage. These repositories may contain thousands or millions of records, most of which change very little between two different executions of ETL. Trying to ingest the entire database indiscriminately each time, without regard to the records that have actually changed, would be both foolish and wasteful.
So far, we’ve mainly focused on CDC’s impact on batch ETL (executing ETL at regularly scheduled intervals). CDC is a smart business practice for batch ETL, but it’s nearly essential for performing real-time ETL, in which data moves through the pipeline almost immediately after becoming available.
Batch ETL presents several problems: ingesting large amounts of data can be slow and strain your systems and network, even when limiting it to the fresh data through CDC. For this reason, many organizations run batch ETL overnight, when more resources will be available and it’s less likely to disrupt people’s work. However, this delay also means that key decision-makers will have to wait longer for access to valuable data and insights.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
For these reasons, many organizations prefer a real-time ETL (also known as streaming ETL), in which data ingestion happens to fresh data almost as soon as it’s available. Real-time ETL sends data continuously to your data warehouse, consuming fewer resources at any point in time.
Of course, knowing that new data is available requires the use of techniques such as CDC so that you know just what information to collect. If your organization depends on any kind of real-time, data-driven activity — from detecting credit card fraud to performing Internet of Things (IoT) analytics — then you need change data capture.
When Should You Use Different CDC Methods?
The question “When should you use change data capture?” is one thing, but you should also consider the question: “When should you use the different types of change data capture?” Here’s some advice on when to use different CDC methods:
1. Log-based CDC
Log-based CDC uses a database’s transaction log, which is a separate metadata file that records every database transaction (e.g., UPDATE, INSERT or DELETE statements). The goal of transaction logs is to help you recover in the event of a database crash; however, they can also help you perform CDC. Log-based CDC involves scanning through the transaction log to reconstruct the records that are new or changed since your last ETL execution.
Log-based CDC is generally accurate and efficient, but it can be more technically complex to implement. Since each database system may have a proprietary format for its transaction log, you might need to write a unique script for each system that’s part of your workflow.
2. Trigger-Based CDC
Trigger-based CDC uses database triggers, which are specialized database functions that run when a given event occurs (for example, when new data loads into the database). When a database trigger runs, it acts as a signal for the ETL process to ingest the new information.
Trigger-based CDC is a reliable way to identify changes to a database. However, executing database triggers adds a large amount of overhead to your database operations, and it’s not available for data sources other than databases (just like log-based CDC).
3. Metadata
Many data sources include metadata, such as a DATE_MODIFIED or LAST_UPDATED column, that shows when a record was last changed. Examining this column allows you to identify which records have undergone changes since the previous ETL execution, and then ingest the appropriate data.
This metadata-based approach is relatively straightforward and efficient to implement, but it has one big drawback: It probably won’t be able to identify deleted records (since they will no longer be present). If you want to cleanse the data in your data warehouse periodically by reflecting deletions at the source, this approach is likely not suitable for your needs.
How Integrate.io Can Help with Change Data Capture
We’ve talked about what change data capture is and when you should use change data capture — so how can you actually bring CDC into your organization?
For many businesses, the right answer is with an ETL tool like Integrate.io. It comes with log-based and trigger-based CDC, so you can ingest exactly the data you need — and skip the data you don’t. Integrate.io’s low-code and no-code functionality mean you don’t have to worry about the technical details of change data capture — you only think about how it makes your organization more nimble, efficient, and data-driven.
With a rich, diverse set of features — including change data capture; more than 100 pre-built integrations and connectors; and a simple drag-and-drop interface — Integrate.io is the right choice for businesses that want to automate and streamline their data integration workloads. To learn more, contact our team of data experts today for a chat about your situation, or to get started right away with your 14-day pilot of Integrate.io.