Data preparation is a workflow that produces a data set for a specific business use, such as analytics or warehousing. A typical data preparation workflow includes steps such as data acquisition, data cleansing, metadata creation, and data transformation.
What is the Data Preparation Workflow?
The nature of the data preparation workflow depends on the type of project. Often, the process starts with a conversation with relevant stakeholders.
1. Gather requirements
If this is a new data preparation project, the data team will work with stakeholders to define the end goal. For example, in an analytics project, they will discuss which data sources to work with and what the final data structure will look like.
2. Cleanse the data
Data cleansing is the process of modifying data to improve its accuracy and quality. It starts by identifying corrupt, empty, invalid, or duplicate data; these inconsistencies are then fixed by either updating or deleting the affected values.
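To make this concrete, here is a minimal cleansing sketch in Python using pandas. The column names, sample records, and validity check are illustrative assumptions, not a prescribed method.

```python
import pandas as pd

# Hypothetical customer records with common quality problems:
# duplicates, missing values, and an invalid email.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "email": ["a@example.com", "b@example.com", "b@example.com", None, "not-an-email"],
    "spend": [120.0, 80.5, 80.5, None, 42.0],
})

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Drop rows where the key field is empty.
df = df.dropna(subset=["email"])

# Delete invalid values using a simple validity check.
df = df[df["email"].str.contains("@", na=False)]

# Fill any remaining missing numeric values with a sensible default.
df["spend"] = df["spend"].fillna(0.0)

print(df)
```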
3. Extract the data
Extraction, often described as data integration, brings disparate data sources together in a single place. Sometimes this requires multiple methods, such as a combination of API calls and JSON file imports.
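As an illustration, the sketch below combines a REST API call with a local JSON export and lands both sources in one pandas DataFrame. The endpoint URL and file name are hypothetical assumptions.

```python
import json
import requests
import pandas as pd

# Pull one source from a (hypothetical) REST endpoint...
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()
api_records = response.json()  # assumes the endpoint returns a JSON array

# ...and a second source from a local JSON export (assumed file name).
with open("legacy_orders.json") as f:
    file_records = json.load(f)

# Land both sources in a single place for the later steps.
combined = pd.concat(
    [pd.DataFrame(api_records), pd.DataFrame(file_records)],
    ignore_index=True,
)
print(f"Integrated {len(combined)} rows from two sources")
```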
4. Transform the data
Imported data will need to be transformed so that it fits the needs of the end user. When all data is transformed to fit a single schema, this is known as data harmonization.
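The sketch below shows one way harmonization can look in pandas: two sources that describe the same entity with different column names and types are mapped onto a single target schema. All column names here are illustrative assumptions.

```python
import pandas as pd

# Two sources describe the same entity with different schemas (assumed names).
source_a = pd.DataFrame({"cust_id": [1, 2], "total": ["10.50", "20.00"]})
source_b = pd.DataFrame({"customer": [3, 4], "amount_usd": [15.0, 9.99]})

# One target schema with one set of types.
TARGET_COLUMNS = {"customer_id": "int64", "amount": "float64"}

def harmonize(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    # Rename each source's columns to the target names, then enforce types.
    out = df.rename(columns=column_map)[list(TARGET_COLUMNS)]
    return out.astype(TARGET_COLUMNS)

harmonized = pd.concat(
    [
        harmonize(source_a, {"cust_id": "customer_id", "total": "amount"}),
        harmonize(source_b, {"customer": "customer_id", "amount_usd": "amount"}),
    ],
    ignore_index=True,
)
print(harmonized.dtypes)
```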
5. Create metadata
The final data set will require metadata for navigation and organization. Metadata can also preserve context about the data's history, such as its lineage (the path it has taken through systems and transformations) and its provenance (where it originally came from).
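As a rough illustration, metadata can be as simple as a JSON document stored alongside the data set. Every field name and value below is a hypothetical example.

```python
import json
from datetime import datetime, timezone

# A minimal metadata record for the prepared data set, capturing lineage
# (which sources and transformations fed it) and provenance (how it was built).
metadata = {
    "dataset": "orders_curated",           # hypothetical data set name
    "created_at": datetime.now(timezone.utc).isoformat(),
    "row_count": 1842,                      # illustrative value
    "lineage": {
        "sources": ["api.example.com/v1/orders", "legacy_orders.json"],
        "transformations": ["deduplicate", "drop_null_emails", "harmonize_schema"],
    },
    "provenance": {
        "pipeline_version": "1.3.0",        # hypothetical
        "prepared_by": "data-engineering",
    },
}

# Store the metadata alongside the data set it describes.
with open("orders_curated.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```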
6. Load the data
Once prepared, the data must be delivered to its final location, such as a data warehouse. At the end of the data preparation process, the user should be able to perform their desired tasks without any further alterations to the data.
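For illustration, the sketch below delivers a prepared table into SQLite, which stands in for a real warehouse; in practice the same pandas call could target a warehouse connection (for example via SQLAlchemy). Table and column names are assumptions.

```python
import sqlite3
import pandas as pd

# Prepared data ready for delivery (toy example).
prepared = pd.DataFrame({"customer_id": [1, 2], "amount": [10.5, 20.0]})

# SQLite stands in for the warehouse here.
conn = sqlite3.connect("warehouse.db")
prepared.to_sql("orders_curated", conn, if_exists="replace", index=False)

# The end user can now query the data directly, with no further alterations.
print(pd.read_sql("SELECT COUNT(*) AS rows FROM orders_curated", conn))
conn.close()
```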
What Tools are Used for Data Preparation?
Data preparation can be performed manually when it is a small, one-off job. Data technicians can use SQL queries or an Excel spreadsheet to prepare the data.
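For example, a small one-off job might need nothing more than a hand-written SQL query. The sketch below uses an in-memory SQLite database with made-up table and column names to show the idea.

```python
import sqlite3

# An in-memory database stands in for the real source (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer_id INT, email TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "A@example.com", 10.5), (1, "A@example.com", 10.5), (2, None, 7.0)],
)

# Deduplicate, normalize case, and drop rows with missing emails in one query.
cleaned = conn.execute(
    """
    SELECT DISTINCT customer_id, LOWER(email) AS email, amount
    FROM raw_orders
    WHERE email IS NOT NULL
    """
).fetchall()
print(cleaned)  # [(1, 'a@example.com', 10.5)]
conn.close()
```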
For larger projects, data preparation is handled by an automated ETL (Extract, Transform, Load) platform such as Xplenty.
Automated ETL handles most of the resource-intensive work in the data preparation process. The advantages of ETL are:
- Extraction: ETL platforms like Xplenty include a library of integrations. This allows them to plug into multiple data sources without the need for writing code, configuring APIs, or performing manual exports. ETL allows all required data to be easily gathered in one place.
- Transformation: The ETL platform provides a staging layer in which data can be transformed before moving on. This can include harmonizing the data according to a single schema defined using tools within the ETL platform.
- Loading: After it has been cleansed, integrated, and harmonized, the data can be delivered to its final destination, such as a data warehouse or an analytics platform. As with the extraction phase, Xplenty can perform this part code-free, due to the integrations library.
The main advantage of automated ETL is that it creates an active data pipeline. The data preparation project only needs to be set up once; from then on, the pipeline provides a steady supply of fresh, business-ready data.
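Conceptually, the pipeline that an ETL platform automates looks something like the Python sketch below, run on a schedule. The source URL, destination path, and column names are all assumptions for illustration, not Xplenty's implementation.

```python
import pandas as pd

# A minimal sketch of what an automated pipeline repeats on every run:
# the same extract -> transform -> load steps, with no manual work.

def extract() -> pd.DataFrame:
    # Hypothetical source endpoint returning JSON records.
    return pd.read_json("https://api.example.com/v1/orders")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Cleanse and harmonize to the target schema (assumed column names).
    df = df.drop_duplicates().dropna(subset=["email"])
    return df.rename(columns={"cust_id": "customer_id"})

def load(df: pd.DataFrame) -> None:
    # Stand-in destination; a real pipeline would target a warehouse.
    df.to_csv("orders_curated.csv", index=False)

def run_pipeline() -> None:
    load(transform(extract()))

# A scheduler (cron, an orchestrator, or the ETL platform itself) would
# call run_pipeline() on a fixed cadence to keep the data fresh.
```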