Data processing is any action performed to turn raw data into useful information. "Information" is any output that is of use to an organization, such as an analytics report, a visualization, or a refined data set.
Stages of Data Processing
Data processing works in much the same way as Extract, Transform, Load (ETL): data is retrieved from its sources, passes through a transformation layer, and is finally deposited in its destination.
Usually, this breaks down into a process with the following stages:
Acquisition
First, relevant data sources are identified. These can include production databases, data repositories, and external sources. There are several data acquisition methods, such as API calls, manual file exports, or an automated ETL platform.
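As an illustration, here is a minimal sketch of API-based acquisition in Python; the endpoint URL, key, and parameters are all hypothetical placeholders, not a real service:

```python
import requests

# Hypothetical REST endpoint and key -- substitute your own source.
API_URL = "https://api.example.com/v1/orders"
API_KEY = "your-api-key"

def acquire_orders():
    """Fetch raw order records from an external source via an API call."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"since": "2024-01-01"},
        timeout=30,
    )
    response.raise_for_status()  # fail fast on HTTP errors
    return response.json()       # raw data, not yet cleaned or transformed

if __name__ == "__main__":
    raw_records = acquire_orders()
    print(f"Acquired {len(raw_records)} raw records")
```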
Preparation
Data preparation is itself a sequence of smaller processes. These include data cleansing and harmonization, to ensure that the incoming data is free of errors, duplicates, and redundancies. This preparation stage produces a set of clean data that is now ready for integration.
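For example, a minimal preparation sketch using pandas; the sample records and the harmonization rule are hypothetical:

```python
import pandas as pd

# Hypothetical raw extract; in practice this comes from the acquisition step.
raw = pd.DataFrame({
    "customer": ["Acme", "Acme", "Globex", None],
    "country":  ["US", "US", "usa", "DE"],
    "amount":   [100.0, 100.0, 250.0, 75.0],
})

clean = (
    raw
    .drop_duplicates()                      # remove exact duplicate rows
    .dropna(subset=["customer"])            # discard records missing a key field
    .assign(country=lambda df: df["country"]
            .str.upper()
            .replace({"USA": "US"}))        # harmonize inconsistent codes
)
print(clean)
```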
Integration
The integration layer involves transforming data according to a master schema so that it arrives at its destination in a standardized format. Integration will also normalize the data, giving it a more efficient structure.
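A minimal sketch of this idea in pandas, assuming a hypothetical master schema and one source whose field names differ from it:

```python
import pandas as pd

# Hypothetical master schema: every source must arrive with these columns and types.
MASTER_SCHEMA = {"customer_id": "int64", "order_date": "datetime64[ns]", "total": "float64"}

# Each source's field names are mapped onto the master schema.
SOURCE_A_MAPPING = {"custID": "customer_id", "date": "order_date", "amt": "total"}

def integrate(source_df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    df = source_df.rename(columns=mapping)                 # standardize field names
    df["order_date"] = pd.to_datetime(df["order_date"])
    return df[list(MASTER_SCHEMA)].astype(MASTER_SCHEMA)   # order columns, coerce types

source_a = pd.DataFrame({"custID": [1, 2], "date": ["2024-06-12", "2024-06-13"], "amt": ["19.99", "5.00"]})
print(integrate(source_a, SOURCE_A_MAPPING).dtypes)
```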
Organization
In some instances, the data may require indexing, sorting, or another form of organization before proceeding to the next stage. This can be done manually or by using a sorting algorithm.
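For instance, a small sorting-and-indexing sketch in pandas (the sample records are hypothetical):

```python
import pandas as pd

# Hypothetical unsorted records from the preparation step.
orders = pd.DataFrame({
    "order_id": [103, 101, 102],
    "order_date": pd.to_datetime(["2024-06-14", "2024-06-12", "2024-06-13"]),
})

# Sort chronologically and index by order_id so downstream steps can look
# records up by key instead of scanning the whole set.
organized = orders.sort_values("order_date").set_index("order_id")
print(organized)
```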
Processing
The processing work is performed manually or automatically, depending on the complexity of the data. For small data sets, an analyst may be able to perform processing with a few SQL queries, or even Excel. Larger data sets call for a dedicated analytics tool, which may use machine learning and AI to derive insights from the data.
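As a sketch of the small-data case, here are a few SQL queries run through Python's built-in sqlite3 module; the table and figures are hypothetical:

```python
import sqlite3

# A minimal, in-memory example of "processing with a few SQL queries".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 200.0)],
)

# Aggregate revenue per region -- the kind of query an analyst might run by hand.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)
```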
Visualization
The result of data processing is usually a communication that serves a business need. Graphs, charts, dashboards, reports – all of these can be the results of processing. Usually, this step is handled by an analytics expert using a visualization tool such as Tableau or chart.io.
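Tableau and chart.io are interactive tools rather than code, so as a stand-in, here is a minimal charting sketch using matplotlib with hypothetical figures:

```python
import matplotlib.pyplot as plt

# Processed output from the previous step (hypothetical figures).
regions = ["North", "South", "East"]
revenue = [320.0, 80.0, 150.0]

fig, ax = plt.subplots()
ax.bar(regions, revenue)
ax.set_title("Revenue by region")
ax.set_ylabel("Revenue (USD)")
fig.savefig("revenue_by_region.png")  # export for the report or dashboard
```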
Storage
Finally, the processed data is placed into storage. This may be a business-specific repository such as a data mart, which will allow the relevant business unit to access the processed data whenever they need it. Otherwise, the data may end up in a larger repository, such as a data warehouse or data lake.
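A minimal sketch of landing processed output in a business-specific location, assuming a hypothetical data mart path and a Parquet engine (pyarrow or fastparquet) being installed; a warehouse or lake would use its own bulk-load tooling instead:

```python
from pathlib import Path

import pandas as pd

# Hypothetical processed table from the previous steps.
processed = pd.DataFrame({"region": ["North", "South"], "revenue": [320.0, 80.0]})

mart = Path("marts/sales")                    # business-specific repository
mart.mkdir(parents=True, exist_ok=True)
processed.to_parquet(mart / "revenue_by_region.parquet")  # requires pyarrow or fastparquet
```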
Processing Unstructured Data
Data processing can be different when working with unstructured data.
Structured data already exists in relational tables before processing. Unstructured and semi-structured data can take many forms: images, audio files, application logs, JSON and CSV exports, and BLOBs (Binary Large Objects).
Such data can still be processed, although the steps differ in the ways described below.
Acquisition and Preparation
Where possible, there should be a strategic decision about which sources to include. This is not always feasible when working with a Big Data store, such as a data lake. As far as possible, the data should be validated before processing, and any broken data removed.
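A minimal validation sketch, assuming the incoming data is a directory of JSON exports (the path is hypothetical); files that fail to parse are set aside rather than processed:

```python
import json
from pathlib import Path

incoming = Path("incoming")  # hypothetical drop folder for JSON exports
valid, broken = [], []

for path in incoming.glob("*.json"):
    try:
        valid.append(json.loads(path.read_text()))
    except (json.JSONDecodeError, UnicodeDecodeError):
        broken.append(path)  # quarantine rather than let bad data through

print(f"{len(valid)} valid documents, {len(broken)} rejected")
```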
Move to a Suitable Environment
Usually, this involves moving all of the data to a suitable environment, such as Hadoop. Unstructured data doesn't fit into relational tables, which is why it has to be imported into an appropriate Big Data environment.
Perform Data Exploration
Data exploration is essential when working with larger data sets. Often, the analytics team may not have a clear picture of the full extent of the available data. Exploration is a type of preliminary analytics that helps clarify the contents of the data set and identify achievable analytics goals.
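A first pass at exploration might look like this pandas sketch, assuming a hypothetical CSV extract:

```python
import pandas as pd

# Preliminary look at an unfamiliar extract: size, types, gaps, value ranges.
df = pd.read_csv("extract.csv")  # hypothetical file name

print(df.shape)                      # how much data is there?
print(df.dtypes)                     # what kinds of fields?
print(df.isna().sum())               # where are the gaps?
print(df.describe(include="all").T)  # ranges and common values per column
```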
Introduce Some Structure
There are techniques for finding structure in unstructured data, including:
- Metadata analysis: Metadata is often structured data, and can be processed as such. This information helps to understand and map the contents of the unstructured data.
- Regular expressions: A method of recognizing values that share the same semantic meaning. For example, 12 June, June 12th, and 06/12 all express the same concept. Regular expressions can help simplify unstructured data (see the sketch after this list).
- Tokenization: A method of identifying common patterns in the data, such as recurring phrases in text. These patterns are identified by tokens, and tokens can be combined to build a semantic structure.
- Segmentation: This technique involves grouping data together based on common attributes. For example, data created on the same day could be segmented and subjected to cohort analysis.
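As promised above, here is a minimal sketch of the regular-expression technique; the patterns cover only the three example spellings, and real data would need broader rules:

```python
import re

MONTHS = {"june": "06"}  # extend for the remaining months

def normalize_dates(text: str) -> str:
    """Rewrite several spellings of the same date into one canonical MM/DD form."""
    # "12 June" -> "06/12"
    text = re.sub(r"\b(\d{1,2}) (June)\b",
                  lambda m: f"{MONTHS[m.group(2).lower()]}/{int(m.group(1)):02d}",
                  text, flags=re.I)
    # "June 12th" or "June 12" -> "06/12"
    text = re.sub(r"\b(June) (\d{1,2})(?:st|nd|rd|th)?\b",
                  lambda m: f"{MONTHS[m.group(1).lower()]}/{int(m.group(2)):02d}",
                  text, flags=re.I)
    return text

for s in ["12 June", "June 12th", "06/12"]:
    print(normalize_dates(s))  # all three print 06/12
```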
With Big Data sets, all of these operations are performed automatically by analytics tools that can navigate enormous data repositories in a relatively short time, thanks to frameworks such as MapReduce.
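To make the idea concrete, here is a toy, single-process illustration of the map and reduce phases in plain Python; a real MapReduce framework distributes these phases across a cluster:

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines standing in for a large unstructured data set.
logs = ["error disk full", "warn retry", "error timeout"]

def map_phase(line):
    """Map one record to key/count pairs (here: word occurrences)."""
    return Counter(line.split())

def reduce_phase(a, b):
    """Merge two partial results by summing counts per key."""
    return a + b

word_counts = reduce(reduce_phase, (map_phase(line) for line in logs), Counter())
print(word_counts.most_common(3))
```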
Perform Analytics and Visualization
When the data is ready, the analytics team will perform analytics operations, including visualization. Insights derived from the data are then passed on to the relevant business units.
Note that instead of an ETL approach, unstructured data generally calls for the Extract, Load, Transform (ELT) method. Because the data has no predefined structure, it can't pass through a single master schema on the way in. Instead, the data must be moved to the right environment first and then subjected to processing there.
GDPR Definition of Data Processing
The term "data processing" has a very specific meaning under the EU's General Data Protection Regulation. Under that law, data processing is defined as:
"… a wide range of operations performed on personal data, including by manual or automated means. It includes the collection, recording, organization, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction of personal data."
This definition is not limited to analytics or business intelligence. It covers any activity involving personal data that falls within the scope of GDPR.
GDPR defines a data processor as any person or organization that processes personal data on behalf of another organization, including third-party services. In that context, the organization that determines why and how the data is processed is known as the data controller.