Data cleansing is the process of modifying data to improve accuracy and quality. The cleansing process has two steps:
- Identify and categorize any data that might be corrupt, inaccurate, duplicated, expired, incorrectly formatted or inconsistent with other data sources
- Correct all dirty data by updating it, reformatting it, or removing it
Data cleansing is one of the key steps in the Extract, Transform, Load (ETL) process. ETL tools will apply pre-defined schema during the Transform stage to ensure that only high-quality data passes through to the load stage.
How Does Data Cleansing Work?
Automated data cleansing tools run through several processes when cleaning up data, such as:
-
Identifying duplicates: Either duplicate rows in a single table or multiple records within multiple databases
-
Flagging corrupted or empty cells: Corrupt and empty cells may require manual intervention to amend
-
Checking data validity: Data can grow stale over time and might no longer be accurate
-
Data validation: Matching data against a template, such as XXX-XXX-XXXX for phone numbers or {name}@{domain}.{TLD} for email
-
Checking data formats: Data is often stored in the wrong data format, such as dates and integers types that are stored as text strings
-
Histograms and regression testing: There are statistical techniques that can be applied to detect unusual data values which may require cleansing
-
Using conversion tables: Where there are several recurring variations on a common value, a conversion table can identify all variations and replace them with the preferred version, such replacing all instances of “FA,” “Fla.” and “florida” with “Florida.”
After identifying problematic data, the system must then correct it. Cleansing can be done in several ways, such as:
-
Reformatting: The system can convert rows to the correct data type.
-
Dropping: Duplicate data can be safely deleted from the data source.
-
Imputing: Missing data can sometimes be imputed from other values, or estimated based on similar records.
-
Merging: If a better data source is available, the information from that source can be used merged into the corrupted table.
-
Flagging: When there is no way to fix data automatically, the system can flag that there is an issue. Analytics modules can then ignore this data during analytics, and human operators can attempt to update the problematic data manually.
When Does Data Cleansing Take Place?
Large-scale data cleansing often happens during the transform stage of ETL. During this phase, each record can be individually checked against the organization’s data schema. This ensures that analytics operations are conducted with accurate, high-quality information.
However, ETL only impacts the data that arrives in the data warehouse. It does not cleanse any of the data sources that act as inputs, such as an organization’s CRM or ERP, which remain unchanged by the ETL process.
If an organization notices significant issues with data quality before ETL, they should conduct a full data audit. In some cases, they may need to perform extensive manual data cleansing and a review of data governance policies.