Introduction
When you’re building or improving your data pipeline, batch data processing seems simple. Pull data from a source, apply some business logic to it, and load it for later use. When done well, automating these jobs is a huge win. It saves time and empowers decision-makers with fresh and accurate data.
But this kind of ETL (Extract, Transform, and Load) process gets very tricky, very quickly. A failure or bug in a single step of an ETL process has cascading effects that can require hours of manual intervention and cleanup. These kinds of issues can be a huge time sink for the analysts or engineers who have to clean up the mess. They negatively impact data quality. They erode end-user confidence. And they can fuel anxiety when pushing out even the most innocuous changes to a data pipeline.
To avoid falling into this kind of ETL black hole, analysts and engineers need to make data pipelines that:
- Avoid loading duplicate data
- Don’t load data to the final destination when a step has failed
- Clean up after themselves
In short, they need to make data pipelines that are idempotent.
Idempotent is a mathematical term describing an operation that produces the same result whether you apply it once or many times: run an idempotent pipeline twice in a row and the second run changes nothing. Idempotent data pipelines are fault-tolerant, reliable, and easy to troubleshoot. And they’re much simpler to build than you might expect.
Here are three simple steps you can take to make your data pipeline idempotent.
Load Data to Your Pipeline in Isolated Batches
Amazon recommends several best practices for loading data. The easiest way to be sure that a buggy or failed ETL run doesn’t dirty up your data is to isolate the data that you deal with in a given run.
If you’re working with time-series data, this is reasonably straightforward. When pulling data from your initial source, filter it by time. Operate on an hour, a day, or a month, depending on the volume of your data. By operating on data in time-windowed batches like this, you can be confident that any bugs or failures aren’t impacting your data quality beyond a single window.
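To make that concrete, here’s a minimal sketch in Python of what a time-windowed extract might look like, assuming a PostgreSQL source queried with psycopg2; the raw_sales table, created_at column, and connection string are hypothetical placeholders:

```python
from datetime import date, timedelta

import psycopg2  # any DB-API driver would work the same way

# Hypothetical connection string -- point this at your own source database.
source_conn = psycopg2.connect("dbname=source_db user=etl")

def extract_daily_batch(batch_date: date):
    """Pull exactly one day's worth of rows, so a buggy or failed run
    can only ever affect that single window."""
    query = """
        SELECT *
        FROM raw_sales               -- hypothetical source table
        WHERE created_at >= %s
          AND created_at < %s        -- half-open window: no double-counted boundary rows
    """
    with source_conn.cursor() as cur:
        cur.execute(query, (batch_date, batch_date + timedelta(days=1)))
        return cur.fetchall()
```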
If you’re working with cross-sectional data, try to deal with some explicitly defined subset of that data in a given run. Knowing exactly what data has been impacted by a bug or failure makes cleanup and remediation of data pipeline issues relatively straightforward. Just delete the affected data from your final destination and run again.
If you isolate your data correctly, you never have to worry about duplicated or double-counted data points.
Use Transactions to Prevent Buggy Data Pipeline Runs
It’s important to build your pipeline so that it either loads everything or nothing at all. This will help to avoid time-consuming cleanup of failed or buggy pipeline runs. That is to say, your runs should be atomic.
If you’re loading into a SQL database, achieving atomicity for your data pipeline is as simple as wrapping all of your loading statements (inserts, copies, creates, or updates) in a transaction, with a BEGIN statement at the beginning and an END or COMMIT statement at the end.
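As a rough sketch of what that looks like when the load is driven from Python: with a driver like psycopg2, the connection’s context manager handles the BEGIN and the COMMIT (or ROLLBACK on error) for you. The connection string, table, and rows below are hypothetical:

```python
import psycopg2

# Hypothetical connection string and table -- adjust for your own warehouse.
conn = psycopg2.connect("dbname=warehouse user=etl")

rows = [("Janet", 5439, 50000, "2018-07-20")]  # output of the transform step

# psycopg2 runs everything inside `with conn:` in a single transaction:
# it commits if the block completes and rolls back if it raises, so a
# failure partway through loads nothing at all.
with conn:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO sales (employee, contract_number, annual_revenue, date) "
            "VALUES (%s, %s, %s, %s)",
            rows,
        )
```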
If the final destination for your data is not a SQL database (say, an Elasticsearch cluster or an Amazon S3 bucket), building atomicity into your pipeline is a little trickier. But it’s not impossible. You’ll simply need to implement your own “rollback” logic within the load step.
For instance, if you’re writing CSV files to an Amazon S3 bucket from a Python script, maintain a list of the object keys for the files you’ve written. Wrap your loading logic in a try-except block that deletes every object written so far if the script errors out.
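Here’s a minimal sketch of that rollback logic using boto3; the bucket name, key scheme, and load_csv_files helper are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-etl-output"  # hypothetical bucket name

def load_csv_files(local_paths):
    written_keys = []
    try:
        for path in local_paths:
            key = f"sales/{path}"  # hypothetical key scheme
            s3.upload_file(path, BUCKET, key)
            written_keys.append(key)
    except Exception:
        # "Rollback": remove everything this run managed to write so a
        # partial load never reaches downstream consumers, then re-raise.
        for key in written_keys:
            s3.delete_object(Bucket=BUCKET, Key=key)
        raise
```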
By making your data pipeline atomic, you prevent the need for time-consuming cleanup after failed runs.
Related Reading: Troubleshooting Data Loading Errors in Amazon Redshift
De-Duplicate So Your Data Pipeline Cleans Up After Itself
Now that you’ve isolated the data your pipeline deals with in a given run and implemented transactions, having your pipeline clean up after itself is a breeze. You just need to delete before you insert. It’s a simple, reliable way to avoid loading duplicate data.
Say your data pipeline is loading all of your company’s sales data from the previous month into a SQL database. Before you load it, delete any data from last month that’s already in the database. Be sure to do this within the same transaction as your load statement. This may seem scary. But if you’ve implemented your transaction correctly, your pipeline will never delete data without also replacing it.
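Sticking with the sales example, a sketch of that delete-then-insert pattern in Python with psycopg2 might look like the following; the connection string, table, and load_month helper are hypothetical, and the DELETE and INSERTs share one transaction so the window is always replaced wholesale:

```python
import psycopg2

# Hypothetical connection string -- adjust for your own warehouse.
conn = psycopg2.connect("dbname=warehouse user=etl")

def load_month(rows, month_start, month_end):
    """Replace one month's sales atomically: the DELETE and the INSERTs
    either all commit together or all roll back together."""
    with conn:  # one transaction: COMMIT on success, ROLLBACK on error
        with conn.cursor() as cur:
            # Clean up anything a previous run already loaded for this window.
            cur.execute(
                "DELETE FROM sales WHERE date >= %s AND date < %s",
                (month_start, month_end),
            )
            cur.executemany(
                "INSERT INTO sales (employee, contract_number, annual_revenue, date) "
                "VALUES (%s, %s, %s, %s)",
                rows,
            )
```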
Many SQL dialects have native support for this type of operation, typically referred to as an upsert. For instance, in PostgreSQL you could prevent loading duplicate data with the following (Amazon Redshift’s dialect is derived from PostgreSQL, but it doesn’t support ON CONFLICT, so there you’d stick with the delete-then-insert pattern above):
```sql
INSERT INTO sales (employee, contract_number, annual_revenue, date)
VALUES ('Janet', 5439, 50000, '2018-07-20')
ON CONFLICT (contract_number)
DO UPDATE SET employee = excluded.employee;
```
This SQL statement attempts to insert a new row into the database. If a row with contract number 5439 already exists (ON CONFLICT requires a unique constraint on contract_number), the existing row is updated with the new employee name instead. In this way, duplicate rows for the same contract never make it into the database.
By having your pipeline clean up after itself and prevent the loading of duplicate data, backfilling data becomes as simple as rerunning. If you discover a bug in your business logic or you want to add a new dimension to your data, you can push out a change and backfill your data without any manual intervention whatsoever.
Conclusion
ETL pipelines can be fragile, difficult to troubleshoot, and labor-intensive to maintain. They’re going to have bugs. They’re going to fail once in a while.
The key to building a stable, reliable, fault-tolerant data pipeline is not to build it so that it always runs perfectly. The key is to build it in such a way that recovering from bugs and failures is quick and easy.
The best data pipelines can recover from bugs and failures without ever requiring manual intervention. By applying the three simple steps outlined above, you’re well on your way to achieving a more fault-tolerant, idempotent data pipeline.
We’ve built an ETL performance monitoring solution with end-to-end visibility that’s used by top analytics teams. Discover more at Integrate.io.