Data mining is the process of exploring large data sets (Big Data) to reveal previously undiscovered patterns and rules. This process is also known as knowledge discovery in databases (KDD).
Data mining usually happens prior to data analytics. Mining can help to uncover patterns which the business might not previously be aware of, such as a correlation between two variables. Analytics can then be used to test a hypothesis based on insights gleaned from mining.
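As a rough illustration of the kind of pattern mining can surface, the sketch below computes the Pearson correlation coefficient between two variables. The function name `pearson` and the ad-spend and order figures are hypothetical, chosen only for this example. A real project would more likely use a statistics library.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical figures: weekly ad spend and weekly orders.
ad_spend = [10, 20, 30, 40, 50]
orders = [12, 25, 31, 48, 55]
r = pearson(ad_spend, orders)
print(f"correlation: {r:.2f}")
```

A coefficient close to 1 or -1 suggests a strong linear relationship. It does not establish cause, which is why a follow-up analytics step is used to test the resulting hypothesis.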
What is the Data Mining Process?
Data mining is a complex process involving statistics, machine learning, and database techniques.
One of the most common process models for data mining is CRISP-DM (Cross-Industry Standard Process for Data Mining), an open standard that has been in use since the 1990s. This model breaks data mining down into six steps:
1. Business Understanding
At the initial stage, stakeholders discuss the scope and objectives of the data mining project. This conversation helps to identify which data sources to use, what business outcomes are required, and what resources will be made available to the data mining team.
2. Data Understanding
Next, there will be a phase of data exploration. This involves a high-level examination of available data sources. During this phase, promising trends are highlighted, and these will be the targets for future mining. Tools such as Tableau or Grapher can help to perform this initial analysis.
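A high-level examination often starts with a simple profile of each column. The sketch below is one possible way to do that in plain Python, assuming the data arrives as a list of dictionaries. The `profile` function and the `sample` rows are hypothetical and are not part of any tool named above.

```python
def profile(rows):
    """Summarize each column: present count, missing count, and numeric range."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        # Only real numbers contribute to min/max/mean; booleans are excluded.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary[col] = {
            "present": len(present),
            "missing": len(values) - len(present),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "mean": sum(numeric) / len(numeric) if numeric else None,
        }
    return summary


# Hypothetical sample rows from one data source.
sample = [
    {"region": "north", "revenue": 120.0},
    {"region": "south", "revenue": None},
    {"region": "north", "revenue": 80.0},
]
report = profile(sample)
for column, stats in report.items():
    print(column, stats)
```

Columns with many missing values or an unexpected range are candidates for attention in the data preparation step.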
3. Data Preparation
Data is prepared as required to facilitate mining. This can include:
- Data cleansing: Errors, duplicates and other problematic values are removed from the data.
- Data integration: Multiple, disparate sources are unified into a single source.
- Data harmonization: Data is converted into a pre-defined schema.
This stage may pass through an ETL (Extract, Transform, Load) layer to automate the data preparation process. An ETL platform like Xplenty can prepare data from most common sources without the need for manual intervention.
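The three preparation tasks can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular ETL platform works. The function names, the field maps, and the `crm` and `shop` sources are all hypothetical. The sketch harmonizes each source first so that integration and cleansing can operate on a single schema, which is one reasonable ordering among several.

```python
def harmonize(row, field_map):
    """Rename source-specific fields to the target schema; drop unmapped fields."""
    return {target: row[source] for source, target in field_map.items() if source in row}


def integrate(*sources):
    """Combine several already-harmonized sources into one list of rows."""
    return [row for source in sources for row in source]


def cleanse(rows, required):
    """Drop rows with missing required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(row.get(field) in (None, "") for field in required):
            continue  # problematic value: a required field is empty
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sources with different field names for the same concepts.
crm = [
    {"cust_id": 1, "mail": "a@example.com"},
    {"cust_id": 1, "mail": "a@example.com"},  # duplicate
    {"cust_id": 2, "mail": ""},               # missing email
]
shop = [{"customer": 3, "email": "c@example.com"}]

CRM_MAP = {"cust_id": "customer_id", "mail": "email"}
SHOP_MAP = {"customer": "customer_id", "email": "email"}

unified = integrate(
    [harmonize(row, CRM_MAP) for row in crm],
    [harmonize(row, SHOP_MAP) for row in shop],
)
prepared = cleanse(unified, required=("customer_id", "email"))
print(prepared)
```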
4. Modeling
The data mining team will try a number of models to explore the available data. These models can include:
- Linear regression: Modeling the relationship between one or more input variables and a numeric outcome, and then using that relationship to predict future values
- Decision trees (including regression trees): A modeling technique that passes data through a series of branching, typically binary, decisions to reach a prediction
- Neural networks: Machine learning models that work through training examples over and over, adjusting their internal weights and gradually becoming more accurate with each iteration
To test these models effectively, it may be necessary to review the data preparation process, in which case the data mining process moves back to step three.
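As a small example of the first technique, the sketch below fits a one-variable linear regression by ordinary least squares and uses it to predict the next value. The function names `fit_line` and `predict` and the monthly sales figures are hypothetical. A production model would normally come from a statistics or machine learning library and use more than one input variable.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]

model = fit_line(months, sales)
forecast = predict(model, 6)
print(f"slope={model[0]:.1f}, intercept={model[1]:.1f}, month 6 forecast={forecast:.1f}")
```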
5. Evaluation
The results of each model are assessed to find the most appropriate candidate. Models must meet the following criteria:
- Predictive: The model can make predictive conclusions based on available data
- Accurate: Insights derived from the model must correspond with the data
- Relevant: The model must produce results that deliver the agreed-upon business objectives
If no candidate model meets these criteria, the process may move back to step four, or back to step three if further data preparation is required.
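One way to picture this step is to score each candidate against held-out data and keep the best one only if it meets an agreed threshold. The sketch below assumes mean absolute error as the accuracy measure and a `max_error` limit standing in for the business objective. Both are illustrative choices, and the candidate models and holdout figures are hypothetical.

```python
def mean_absolute_error(model, xs, ys):
    """Average absolute gap between the model's predictions and observed values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, xs, ys, max_error):
    """Return (name, error) for the best candidate, or None if none is good enough."""
    if not candidates:
        return None
    scored = [
        (mean_absolute_error(model, xs, ys), name)
        for name, model in candidates.items()
    ]
    error, name = min(scored)
    # None signals a return to the modeling or data preparation step.
    return (name, error) if error <= max_error else None


# Hypothetical candidates: a trend line and a flat average.
candidates = {
    "trend": lambda x: 10 * x + 100,
    "flat": lambda x: 130,
}

# Hypothetical holdout data that the models were not fitted on.
holdout_months = [6, 7]
holdout_sales = [158, 171]

best = select_model(candidates, holdout_months, holdout_sales, max_error=5)
print(best)
```

With a stricter limit, such as `max_error=1`, no candidate qualifies and the function returns `None`, which corresponds to moving back to step four or step three.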
6. Deployment
The data mining model is deployed and put to work against the available data. The results should fulfill the project’s objectives and deliver insights that can inform the next steps in the organization’s data analytics strategy.
How is Data Mining Performed?
Steps 3 to 6 of the CRISP-DM model generally only happen where data scientists are creating mining algorithms.
In enterprise usage, the data team will generally use a Business Intelligence (BI) platform to perform data mining. Common platforms include Tableau Server, Looker, Amazon QuickSight, and Microsoft Power BI.
These platforms can also help with insight refinement and visualization. Data mining should ultimately produce something that is useful to the business, such as a new trend or an interesting correlation.
The final step of data mining is to present insights to relevant stakeholders. These interested parties will then decide:
- Whether the data mining project met the stated goals
- Whether the correct data sources were used, or if other data sources should have been included
- Whether the mining results represent new knowledge or fit with the existing understanding of the business
- Whether the data team should proceed to perform more in-depth analytics
In some instances, further mining work might be required. This may involve returning to the beginning of the CRISP-DM process.