In older surveys, data scientists reported spending around 60 percent of their time on data cleansing and described it as the least enjoyable part of their job. Years later, data scientists still put a considerable share of their working hours into cleansing data. A 2020 survey found that data preparation tasks such as data cleansing now take up around 45 percent of their time, an improvement, but still a substantial chunk of the job.
While many data scientists report data cleansing as being one of the least enjoyable tasks in their job, data cleansing occupies a vital role in the ETL (extract, transform, load) process, helping to ensure that information is consistent, accurate, and high-quality. What's more, you can make data cleansing significantly less painful by following a few simple best practices. Read on to discover what data cleansing is, why it's important, and how to do it right.
Recommended Steps:
- Drop Irrelevant Data
- Get Rid of Duplicate Data
- Fix Structural Errors and Discrepancies
- Take Care of Outliers
- Drop, Impute, or Flag Missing Data
- Standardize the Data
- Validate the Data
What is Data Cleansing?
Simply put, data cleansing is the act of cleaning up a data set by finding and removing errors. The ultimate goal of data cleansing is to ensure that the data you are working with is always correct and of the highest quality. Data cleansing is also referred to as "data cleaning" or "data scrubbing."
"Computer-assisted" cleansing means using specialized software to correct errors in your data. The software works by comparing unclean data with accurate data in a database. It also checks manually-entered data against standardization rules. For example, it would change "california" to "California" when capitalizing the names of states.
Data cleansing software can make most of the necessary changes automatically, such as fixing typos. For this to work, you would need to set an auto-correction threshold score, such as 0.8 or 80%. For example, the tool would automatically change any misspelled product name if it found a correct match that surpassed the 0.8 score.
The data cleansing process is also interactive. That would be helpful if the software couldn't find a matching replacement satisfying a preset auto-correction rule. You would have to set an auto-suggestion threshold value, such as 0.6, for this to work. Based on the example above, if the software encountered a misspelled product name, it would suggest a replacement if it found a match above the 0.6 score. You would then look at the suggested replacement, and approve or decline it as appropriate.
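To make these thresholds concrete, here is a minimal sketch using Python's standard-library difflib; the 0.8 and 0.6 thresholds, the product catalog, and the misspelled inputs are illustrative assumptions, not any specific vendor's implementation.

```python
import difflib

# Hypothetical reference list of correct product names
catalog = ["Espresso Machine", "Coffee Grinder", "Milk Frother"]

AUTO_CORRECT = 0.8  # fix automatically at or above this similarity
AUTO_SUGGEST = 0.6  # otherwise, suggest for human review at or above this

def cleanse(value):
    # Score the input against every catalog entry and keep the best match
    best = max(catalog, key=lambda c: difflib.SequenceMatcher(None, value.lower(), c.lower()).ratio())
    score = difflib.SequenceMatcher(None, value.lower(), best.lower()).ratio()
    if score >= AUTO_CORRECT:
        return best, "auto-corrected"
    if score >= AUTO_SUGGEST:
        return best, "suggested -- awaiting approval"
    return value, "no confident match"

print(cleanse("Expresso Machne"))        # very close match, likely auto-corrected
print(cleanse("Coffee Grindr Machine"))  # weaker match, likely only suggested
```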
Computer-assisted data cleansing is much more accurate than a human-centric process. Plus, it is fast when cleansing large volumes of streaming data.
Data Cleansing: Why Do It?
According to one survey by Experian, companies believe that, on average, 29 percent of their data is defective. What's more, enterprise data sets can decay in quality at an alarming rate. For example, most analysts estimate that B2B customer data decays at a rate of at least 30 percent per year, and as high as 70 percent annually in industries with high turnover.
If you're ingesting tons of data from diverse sources, it's almost certain that some of this data will be streaming in "dirty." For example, social media comments or text on images may not always meet your formatting or accuracy standards. You may also receive unclean data from a structured source, such as a relational database. An example is when a value in a foreign key column doesn't match the referenced primary key.
Information that is out-of-date, corrupt, duplicated, missing, or incorrect can dramatically skew the results of your analysis and reporting processes. And it can hurt a company's bottom line too. According to Forbes, dirty data is costing business organizations up to 12% of total revenue. The goal of data cleansing is to repair the holes and inconsistencies present in your data set so that organizations dependent on accurate information can continue to enjoy the benefits of high-quality data.
Cleaning your enterprise data can fix these major issues:
- Duplication
- Irrelevance
- Inaccuracy
- Inconsistency
- Incompleteness (missing data)
- Outliers
- Lack of standardization
- Existence of data silos
Related Reading: Data Silos: What They Are (And How to Destroy Them)
4 Benefits of Data Cleansing
The direct benefits of data cleansing include:
- More accurate insights and predictions: The well-known software development principle of "garbage in, garbage out" (GIGO) asserts that poor input data will necessarily result in bad outputs. The more accurate your enterprise data, the better performance from the algorithms and models that use this data.
- Better employee productivity and efficiency: Data that is cleansed is data that people don't have to spend their time correcting themselves. Employees can do their jobs with full confidence that the information they use is up-to-date and correct to the greatest extent possible.
- Higher revenues and lower costs: According to a study by Experian, companies estimate that they lose 27 percent of their revenue, on average, due to inaccurate data. Investing time and effort into data cleansing will reap major rewards and significantly improve your bottom line.
- More satisfied customers: Higher-quality data helps you understand how to improve the customer experience at every stage of the process, from the initial contacts with prospects to customer support and retention.
Related Reading: Data Cleansing vs. Data Enriching: What's the Difference?
Data Cleansing Methods
Histograms
Running histograms (frequency counts) on the data can reveal which values appear rarely and could thus be invalid. Those values can then be updated, although updating records in place is awkward on Hadoop, which does not provide update functionality.
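As a minimal sketch, assuming a pandas DataFrame with a hypothetical state column, a simple frequency count plays the same role as a histogram and surfaces rare values worth reviewing:

```python
import pandas as pd

# Hypothetical column containing a couple of low-frequency, suspicious values
df = pd.DataFrame({"state": ["California", "California", "Texas", "california", "Texs", "Texas"]})

counts = df["state"].value_counts()
print(counts)

# Values that appear only once are candidates for manual review or correction
suspects = counts[counts == 1].index.tolist()
print("Review these:", suspects)  # e.g. ['california', 'Texs']
```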
Conversion Tables
When certain data issues are already known, for instance, that "US" is written out in several different ways, conversion tables can be used. Sort the data by the relevant key, use lookups to make the conversions, and store the results for further use.
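A conversion table can be as simple as a dictionary lookup. The sketch below assumes a pandas Series holding a country column and a hand-built mapping of known variants:

```python
import pandas as pd

# Hypothetical conversion table: every known variant maps to one canonical form
country_map = {
    "US": "United States",
    "U.S.": "United States",
    "USA": "United States",
    "United States of America": "United States",
}

countries = pd.Series(["US", "USA", "Canada", "U.S."])

# Replace known variants; values not in the table pass through unchanged
cleaned = countries.replace(country_map)
print(cleaned.tolist())  # ['United States', 'United States', 'Canada', 'United States']
```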
Tools
Various vendors such as IBM, SAS, Oracle, Lavastorm Analytics, and Talend provide data cleansing solutions. Free tools include OpenRefine and the R packages plyr and reshape2, though it is unclear whether they can handle Big Data.
Algorithms
Spell checking or phonetic algorithms can be used to fix some of the data. Nonetheless, they could also corrupt it by making the wrong suggestions, so some manual work may be necessary.
Manually
In practice, most of the data is still cleaned by hand. Even with the help of tools, histograms, and algorithms, human intervention is needed to understand and fix the data.
Data Cleansing Steps
Not all data cleansing techniques and practices are created equal. Data workers sometimes use spell checkers and histograms to find outliers in the data that are likely errors. A variety of software vendors offer complete data cleansing tools that can easily check the accuracy of your records and scrub them for duplicates. Even with these automated tools, however, some degree of manual intervention is usually required to provide a sanity check.
Here are the steps we recommend taking when performing a data cleanse:
1. Drop Irrelevant Data
Identify and get rid of irrelevant data in your database or data warehouse. That would be anything whose processing or analysis has no business value to you. Such data can skew the outcomes of your business analytics.
A good example is building a predictive maintenance model for a specific car brand. In that scenario, you wouldn't need data on all car models to train your ML algorithm. So, you would drop rows containing irrelevant vehicle entries.
You would do something similar if you were studying the general health of the population. Your tables wouldn't need a "phone number" column in that case.
However, be sure the feature you want to remove is not essential to the dataset you're analyzing. Consult with other stakeholders to make sure you don't drop data you might need one day.
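As a minimal pandas sketch of both kinds of removal, with hypothetical column names (model, mileage, phone_number):

```python
import pandas as pd

df = pd.DataFrame({
    "model": ["Brand X", "Brand Y", "Brand X"],
    "mileage": [42000, 15000, 87000],
    "phone_number": ["555-0101", "555-0102", "555-0103"],
})

# Keep only the rows relevant to the model under study
relevant = df[df["model"] == "Brand X"]

# Drop a column that adds no analytical value for this project
relevant = relevant.drop(columns=["phone_number"])
print(relevant)
```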
2. Get Rid of Duplicate Data
Duplicate data uses server or processing resources without adding value. Additionally, duplicate records may skew your insights into your customers. Thus, removing duplicate data from your warehouse is a key part of the data cleansing process.
Record duplication may result from various scenarios. An example is when you're ingesting data from multiple sources. For instance, spreadsheet files may have the same customer information as CRM records. Scraping the same web page twice can cause this problem too.
Consider the case of a customer that submits two different email addresses via two separate forms. There's a high probability of treating this information as belonging to two different customers. Data cleansing would help to spot any such duplication. You would then merge the records or remove one of them.
Any two records with the same unique ID in a database table or file are duplicates. Data cleansing merges such records to eliminate redundancies. Merging rules may differ from organization to organization.
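In pandas, exact duplicates are straightforward to drop, and records sharing a business key can be collapsed to one row; the column names below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "email": ["ann@example.com", "ann@example.com", "bob@example.com"],
    "last_order": ["2021-03-01", "2021-03-01", "2021-02-15"],
})

# Remove rows that are identical in every column
df = df.drop_duplicates()

# Or treat customer_id as the unique key and keep the first record per customer
df = df.drop_duplicates(subset=["customer_id"], keep="first")
print(df)
```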
3. Fix Structural Errors and Discrepancies
There are different types of structural errors, from typos to inconsistent capitalization. These can be a problem in categorical data or grouped datasets, so they need cleansing.
Typos may originate from the different ways in which people type strings. Consider the example of "Gender" as a categorical variable. There usually should be two classes in this case: male and female. But you may encounter more than two different categories of the variable in a dataset, for instance, "male," "Male," "M," "female," "Female," and "F" all appearing in the same column.
Data cleansing helps to recognize such mislabeled or inconsistently capitalized classes. The outcome in the "Gender" example would be a cleaner classification of grouped datasets (i.e., Male and Female).
Consider using bar plots to spot structural or capitalization errors in your datasets. These graphs show data categories along one axis and their corresponding values or characteristics on the other. By just looking, you may see that some strings mean the same thing despite their different spelling or capitalization. For example, "asphalt" and "Asphalt" may appear separately on a bar plot despite being the same type of roofing material.
One way to fix typos is to map each given value to the expected class options, manually. In the "Gender" example, you can map each string to either "Male" or "Female."
You may also run a fuzzy matching algorithm on strings you suspect contain typos. The tool works out the "edit distance" between the anticipated strings and each value in the dataset. The edit distance is the number of changes, such as insertions, replacements, or deletions, it takes to transform a source string into the intended one; the smaller the distance, the more similar the two strings are. For example, it takes three insertions (adding "a," "l," and "e" after "m") to change "m" into "male."
Take the example of the source string "landan" and the target string "London." If the edit distance between the two strings were small enough, in other words, if their similarity score cleared a preset threshold, your fuzzy matching program would match "landan" to "London." You would then correct to "London" all strings that satisfied the similarity rule.
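Here is a brief sketch of both fixes, using an explicit mapping for the "Gender" classes and Python's difflib for fuzzy correction; the 0.6 cutoff and the city names are illustrative assumptions:

```python
import difflib

# Manual mapping of every observed variant to the expected class
gender_map = {"male": "Male", "m": "Male", "female": "Female", "f": "Female", "fem.": "Female"}
genders = ["Male", "m", "fem.", "female"]
print([gender_map.get(g.lower(), g) for g in genders])  # ['Male', 'Male', 'Female', 'Female']

# Fuzzy correction against a list of expected strings (compared in lowercase
# so capitalization does not hide a match)
expected = ["London", "Paris", "Berlin"]
match = difflib.get_close_matches("landan", [e.lower() for e in expected], n=1, cutoff=0.6)
print(match)  # ['london'] -- map back to the canonical 'London' before saving
```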
4. Take Care of Outliers
Any value that substantially differs from the rest of your data may be an outlier. However, not all outliers are bad for business analytics. So, before removing any deviant values, assess their potential impact on your analysis.
Take the example of analyzing the various states in the United States. In this scenario, California would be an outlier because of its significantly larger population and economy. Nonetheless, excluding the state would substantially affect the results of your analysis. With that said, outliers impact some analytical models, such as linear regression, more than others.
The two main types of statistical outliers are:
- Univariate outlier: With this data point, only one variable has an extreme value. Excluding this type of outlier in your analysis may lead to unrealistic conclusions.
- Multivariate outlier: This type has a combination of abnormal scores on two or more variables.
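A common heuristic for spotting univariate outliers is the interquartile-range (IQR) rule; the sketch below assumes a numeric pandas Series and the conventional 1.5 x IQR fences:

```python
import pandas as pd

values = pd.Series([12, 14, 13, 15, 14, 13, 250])  # 250 looks suspicious

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences; decide case by case whether to drop, cap, or keep them
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```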
5. Drop, Impute, or Flag Missing Data
Discretion is necessary when you're deciding whether to drop, impute, or flag missing data. What you do with the missing pieces of information affects the accuracy of your analytics.
Imputing
Imputing means working out the missing value based on the other data. Imputation techniques include linear regression and calculating values like median and mean. You may also copy values from similar observations across your dataset.
Nonetheless, imputing values increases the risk of using biased data in your analysis. That can be a major concern when there are many missing values.
Moreover, imputation does not enrich your dataset in any way; it only reinforces a pattern that the other observations have already established.
Dropping
You may drop observations that have missing values when analyzing statistical data. In this case, dropping is better than imputing values that can impact computation results.
Flagging
Flagging means telling your ML algorithms about any missing values. The main argument for this approach is that it prevents the loss of information. In other words, it's essential that your analytical models can recognize that specific values are missing. Flagging is particularly useful when data is missing consistently, rather than randomly.
You may fill missing numeric data with "0." However, statistical calculations should ignore these zeros.
In the case of categorical observations, flag any missing values with "Missing." Here, you add a new category so your algorithm can learn what values are unavailable.
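A minimal sketch of all three options in pandas; the column names and the choice of median imputation are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "segment": ["Retail", None, "Retail", "B2B"],
})

# Option 1: drop observations with missing values
dropped = df.dropna()

# Option 2: impute a numeric column with its median
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Option 3: flag missingness so a model can learn from it
flagged = df.copy()
flagged["age_missing"] = flagged["age"].isna()
flagged["segment"] = flagged["segment"].fillna("Missing")
print(flagged)
```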
6. Standardize the Data
Cleansing your data includes standardizing it to have a uniform format for each value. You may start by having all strings in the same case (upper or lower).
Metric conversion may be necessary when standardizing measurements. For example, all values for height should be in the same unit. So, you may need to convert from feet to meters (or vice versa) to achieve parity. Likewise, use either dollars or cents across your dataset.
Be sure to standardize all other units of measurement in your database. These include weight, distance, and temperature. As for dates, choose either the USA style or the European format.
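A short pandas sketch of case and unit standardization, assuming hypothetical name, height, and height_unit columns:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ACME corp", "Beta labs"],
    "height": [5.9, 1.75],
    "height_unit": ["ft", "m"],
})

# Uniform string case
df["name"] = df["name"].str.title()

# Convert every height to meters (1 ft = 0.3048 m)
df["height_m"] = df.apply(
    lambda row: row["height"] * 0.3048 if row["height_unit"] == "ft" else row["height"],
    axis=1,
)
df = df.drop(columns=["height", "height_unit"])
print(df)
```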
7. Validate the Data
Validation ensures your data is correct and ready for meaningful analysis. So, check all your data for accuracy after changing it. You may need an interactive, AI-powered tool to do this. Critical considerations in the final stages of data cleansing include ensuring that:
- Your data meets pre-established range constraints
- Each input value is of the mandated data type
- There are no missing values for mandatory fields
- There are no nonsensical values
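These checks can be expressed as a handful of plain pandas assertions; the sketch below assumes hypothetical age and email columns and an allowed age range of 0-120 (dedicated validation libraries such as Great Expectations apply the same idea at scale):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, 41],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})

problems = []

# Range constraint: ages must fall inside a plausible interval
if not df["age"].between(0, 120).all():
    problems.append("age outside the 0-120 range")

# Type constraint: age must be numeric
if not pd.api.types.is_numeric_dtype(df["age"]):
    problems.append("age is not a numeric column")

# Mandatory fields: no missing email values
if df["email"].isna().any():
    problems.append("missing email values")

print("Validation issues:", problems or "none")
```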
Best Practices
When cleaning your enterprise data, it is highly recommended that every organization consider making the following adjustments:
Names: A person's full name may be written in a variety of formats. For example, the family name may be placed before or after the given name, and middle initials may be included or excluded. Data cleansing should recognize that all of these formats refer to the same person.
Dates and times: There may be even more ways to write a date than to write a name: MM/DD/YYYY, YYYY-MM-DD, and DD MM YYYY are just a few examples. In addition, times may be given in 12-hour or 24-hour format, and may or may not include seconds.
Locations: The same location may be referred to by several names, or by its latitude and longitude. For example, the United States may be written as "U.S.", "US", "USA", or "United States of America."
Numbers and currencies: Many European countries use periods rather than commas to separate groups of three digits, and use commas rather than periods as the decimal separator—exactly the opposite of the American system. Meanwhile, you may want to convert foreign currencies into their local equivalent.
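As a small sketch of date and number normalization in plain Python; the input strings and the target formats (ISO dates, American decimal notation) are illustrative assumptions:

```python
from datetime import datetime

# Dates written several ways, normalized to one canonical ISO format
raw_dates = ["03/14/2021", "2021-03-14", "14 03 2021"]
formats = ["%m/%d/%Y", "%Y-%m-%d", "%d %m %Y"]

def normalize_date(value):
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

print([normalize_date(d) for d in raw_dates])  # ['2021-03-14', '2021-03-14', '2021-03-14']

# European-style number converted to the American convention
european = "1.234.567,89"
american = european.replace(".", "").replace(",", ".")
print(float(american))  # 1234567.89
```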
Best Data Cleansing Tools
With data cleansing still playing such a large role in the daily work of data scientists, it's important to have the right tools on hand to tackle any data cleansing issues. Some of the top platforms offering data cleansing tools in 2021 include:
- Integrate.io
- Tibco Clarity
- DemandTools
- RingLead
- Melissa Clean Suite
- WinPure Clean & Match
- Informatica Cloud Data Quality
- Oracle Enterprise Data Quality
- SAS Data Quality
- IBM Infosphere Information Server
Related Reading: Top 10 Data Cleansing Tools for 2021
How Integrate.io Can Help
Data cleansing paves the way for meaningful analytics and intelligence-driven business decisions. It is a vital yet often misunderstood part of data management, helping to sanitize and fact-check information as it moves down the data pipeline.
The Integrate.io data integration platform enables you to cleanse and reformat your data during ETL, transforming it into the required target format. Curious how Integrate.io can help you expedite your data cleansing workflows? Contact our team today to schedule a 14-day demo or pilot and see how we can help you reach your goals.