A data pipeline is a software process that takes data from sources and pushes it to a destination. Most modern data pipelines are automated with an ETL (Extract, Transform, Load) platform.
How Does a Data Pipeline Work?
Data always needs to flow from one location to another for functions such as processing, analytics, syncing, or storage. This movement might need to occur daily, hourly, or even in real time whenever a record is updated.
A data pipeline is the software that performs this task routinely and consistently. There are three main elements of a pipeline:
- Sources: A pipeline can draw data from multiple disparate sources. For example, data can come from production systems such as a CRM, an ERP, or a sales database.
- Destination: The ultimate destination for the data. This is often a data mart, a data warehouse, a data lake, or another relational database. Usually, a pipeline has only one destination.
- Pipeline software: Software is required to export data from the sources and import it to the destination. In the past, this was performed with regularly scheduled batch jobs. That approach has largely been replaced by automated ETL, which can run in real time. The pipeline software may also transform data in transit so that it arrives formatted to the destination's schema.
One of the advantages of automated ETL is that administrators can make changes to the sources and destinations without compromising the pipeline.
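To make the three elements concrete, here is a minimal sketch of an extract-transform-load run in Python. The source and destination databases, table names, and columns are assumptions chosen for illustration; a real pipeline would point at a CRM, ERP, or sales system and a warehouse rather than local SQLite files.

```python
import sqlite3

# Hypothetical source and destination for illustration only.
SOURCE_DB = "sales.db"
DESTINATION_DB = "warehouse.db"


def extract(conn: sqlite3.Connection) -> list[tuple]:
    """Pull raw order records from the source system."""
    return conn.execute("SELECT id, customer, amount FROM orders").fetchall()


def transform(rows: list[tuple]) -> list[tuple]:
    """Drop null amounts and normalise names to match the destination schema."""
    return [
        (rid, customer.strip().upper(), round(amount, 2))
        for rid, customer, amount in rows
        if amount is not None
    ]


def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Write the transformed records into the destination table (assumed to exist)."""
    conn.executemany(
        "INSERT INTO fact_orders (id, customer, amount) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def run_pipeline() -> None:
    with sqlite3.connect(SOURCE_DB) as src, sqlite3.connect(DESTINATION_DB) as dst:
        load(dst, transform(extract(src)))


if __name__ == "__main__":
    run_pipeline()
```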
What Data Flows through a Data Pipeline?
Any data can flow through a data pipeline. Broadly speaking, this means three classifications of data:
- Structured data: Data that's already within a relational database structure. Full tables can be exported using API calls.
- Semi-structured data: This data may be handled as a file export, such as a JSON or CSV file.
- Unstructured data: In this context, unstructured can refer to anything that's not in a recognized database format. Text files, images, audio, and scans of documents are all examples of unstructured data.
To connect a source to the pipeline, the administrator must have adequate permissions to perform a data export. Each source has its own method for exporting data. In some instances, it may be necessary to write a script or application that pulls data from the source.
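As a rough sketch of what such a script might look like, the example below pulls records from a hypothetical HTTP API and writes a CSV export for the pipeline to pick up. The endpoint, token, and field layout are illustrative assumptions, not any particular product's interface.

```python
import csv

import requests

# Hypothetical CRM export endpoint and API token; substitute real values.
EXPORT_URL = "https://crm.example.com/api/v1/contacts"
API_TOKEN = "replace-me"


def pull_contacts(path: str = "contacts.csv") -> None:
    """Pull contact records from the source API and write a CSV export."""
    response = requests.get(
        EXPORT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    records = response.json()  # assumed to be a list of flat JSON objects

    if not records:
        return  # nothing to export this run

    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=sorted(records[0]))
        writer.writeheader()
        writer.writerows(records)


if __name__ == "__main__":
    pull_contacts()
```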
Automated ETL makes this process much easier. Platforms like Xplenty are pre-configured to integrate with a large library of popular data sources. This allows the platform to automatically obtain data from the sources, without the need for coding, configuration, or batch jobs.
How is Data Transformed in the Pipeline?
Data may or may not require transformation before it is delivered to the destination. Depending on the nature of the pipeline, this transformation might include:
- Data cleansing: Any erroneous values are amended or removed, including duplicates and null values.
- Data mapping: Data is converted to fit the schema of the destination database.
- Data harmonization: The data is further refined to ensure all values meet the logical rules of the destination database.
- Data enrichment: Multiple data sources are merged to create a single, more detailed source.
- Data masking: Sensitive values are obfuscated to preserve the privacy of the data's subject.
If any transformation is to occur during transit, the administrator will define rules, including a master schema. These rules have to be applied and verified manually unless an automated ETL tool is used.
An automated ETL platform applies these rules to all data that passes through the pipeline. This is done by way of a staging layer, in which data is copied into a version of the destination schema.
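A minimal sketch of such a staging step might look like the following, reusing the hypothetical fact_orders table from the earlier example. The rule check shown is an assumption for illustration, not a specific platform's behavior.

```python
import sqlite3


def load_via_staging(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Copy incoming rows into a staging table shaped like the destination,
    verify them against simple schema rules, then promote them in one step."""
    # Create an empty staging table with the same columns as the destination.
    conn.execute("CREATE TEMP TABLE staging_orders AS SELECT * FROM fact_orders WHERE 0")
    conn.executemany(
        "INSERT INTO staging_orders (id, customer, amount) VALUES (?, ?, ?)", rows
    )

    # Example rule: reject the batch if any amount is null or negative.
    bad = conn.execute(
        "SELECT COUNT(*) FROM staging_orders WHERE amount IS NULL OR amount < 0"
    ).fetchone()[0]
    if bad:
        raise ValueError(f"{bad} rows violate the destination's rules; batch rejected")

    conn.execute("INSERT INTO fact_orders SELECT * FROM staging_orders")
    conn.execute("DROP TABLE staging_orders")
    conn.commit()
```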
Where Can the Pipeline Deposit Data?
Pipelines can connect to any suitable data repository. The ultimate destination depends on the purpose of the pipeline.
- Relational database: The pipeline can connect to a simple relational database. This is most useful when integrating multiple data sources, or when data must be transformed before production use.
- Data warehouse: Pipelines are essential for data warehousing. The pipeline can be configured to gather data from multiple sources, then transform that data to fit the warehouse schema. ETL ensures that the warehouse is updated accurately in real time.
- Data mart: Marts are subsections of a warehouse, with data collated for a specific business purpose. Data pipelines can help to supply these marts, ensuring that all data within the structure is relevant.
- Data lake: A data lake is an unstructured repository for large quantities of data. When a pipeline connects to a data lake, it often won't perform any substantial transformation. Instead, the data is loaded into the lake in a raw format, and analysts use advanced tools to navigate it. This approach is known as ELT (Extract, Load, Transform), as opposed to ETL.
Most data pipelines are unidirectional, with data flowing from sources to the destination. This is often the most efficient approach, although some implementations allow for bidirectional pipelines.
Compliance and Security Implications of Data Pipelines
Transporting data carries a certain degree of risk. The data could become corrupted, be incorrectly transformed, or even be intercepted by a third party.
Strong security protocols are essential when planning a data pipeline. Automated ETL platforms remove much of the risk involved, as data is never directly exposed. Instead, the ETL platform queries the sources via API, then securely transports the data to its destination. Because there is no manual interaction with the data, the risk of exposure is minimal.
Some data protection laws, such as GDPR, can restrict the way organizations operate on data. Data quality standards must be maintained, and the data may not be allowed to leave the jurisdiction. This can be a problem when using a third-party service that is based in another country.
Also, data quality must be preserved at all times. Any errors in the final data could constitute a data protection breach, as well as compromise the integrity of the data.
For this reason, organizations must have a process for tracking data lineage: metadata that details where data originated, where it has been transported, and what transformations have taken place.
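A lineage record does not need to be elaborate. The sketch below shows the kind of metadata an organization might capture for each batch; the field names are illustrative assumptions rather than any standard.

```python
import json
from datetime import datetime, timezone


def lineage_record(source: str, destination: str,
                   transformations: list[str], row_count: int) -> str:
    """Record where a batch came from, where it went, and how it was changed."""
    return json.dumps({
        "source": source,
        "destination": destination,
        "transformations": transformations,
        "row_count": row_count,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    })


print(lineage_record("crm.contacts", "warehouse.fact_contacts",
                     ["cleanse", "map_to_schema", "mask_email"], row_count=1250))
```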