Data virtualization involves combining multiple data sources into a single logical view. Virtualization is achieved with middleware, and the data involved is not replicated elsewhere. However, virtualized data can be used as a live data source for many business purposes, including analytics.
When Is Data Virtualization Used?
Virtualization is commonly used when working with highly volatile data, or when processing times need to be reduced. Virtual data includes the most up-to-date information from the source database, with none of the delays that can arise from data integration and data replication.
There are several use cases for data virtualization, such as:
- Analytics: Virtualization is an easy way to bring data sources together, which makes it ideal for data analytics. Analysts have full control over the way data sources are combined, and they can adjust the structure of virtualized data without compromising the integrity of any data repositories.
- Real-time reporting: Because virtualization is fast, it can offer quicker results than most data pipelines. This can be useful in time-critical situations, such as when a financial company wants to monitor market trends.
- Data exploration: Data exploration is often the precursor to large-scale analytics. It also helps define the schemas in integration projects, such as warehousing. Virtualization can provide a quick view of the available data, allowing for speedy exploration.
- Testing: Virtual data sources are easy to set up, which is useful for testing. The flexibility of data virtualization is especially helpful in high-speed testing environments, such as DevOps.
Virtualization is not used where there is a requirement to store a persistent copy of transformed data. Virtual data views depend on the original source: if that source is lost or changed, the data will vanish from the virtual view.
Data virtualization is not to be confused with virtualization in the broader IT sense. A data virtualization platform is not a functioning database running in a container or virtual machine. Instead, it is a unified, logical view of multiple physical databases.
How Is Data Virtualization Performed?
Data virtualization is performed with data virtualization tools. These tools can be open-source or commercial, but they all operate on the same principles. The process usually looks like this:
1. Load metadata from the source
Data virtualization tools work with metadata, rather than the data itself. This metadata will include details of the source data, such as table structures, column names, and information about data lineage.
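As a minimal sketch of this step, the Python snippet below inspects a SQLite source (the tables and columns are hypothetical, created only for the example). A real virtualization tool reads catalogs through its own connectors, but the idea is the same: collect the structure, not the rows.

```python
import sqlite3

# Stand-in source database with a couple of hypothetical tables.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")

# Read metadata only: table names and column definitions, never the rows.
tables = [row[0] for row in source.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
metadata = {
    table: source.execute(f"PRAGMA table_info({table})").fetchall()
    for table in tables
}

for table, columns in metadata.items():
    # Each PRAGMA row is (cid, name, type, notnull, default, pk).
    print(table, [(col[1], col[2]) for col in columns])
```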
2. Load views from the source
The data virtualization tool will also load any views defined in the source database. The advantage of using views is that they can be applied directly to the source database, without requiring an understanding of the source’s structure.
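Again as a hedged sketch against a hypothetical SQLite source, existing view definitions can be read straight from the catalog:

```python
import sqlite3

# Stand-in source database that already defines a view (hypothetical schema).
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE VIEW order_totals AS
        SELECT customer_id, SUM(total) AS lifetime_total
        FROM orders
        GROUP BY customer_id;
""")

# Read each view's name and defining SQL; the virtualization layer can reuse
# these definitions without understanding the source's internal structure.
for name, sql in source.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'view'"):
    print(name)
    print(sql)
```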
3. Apply business rules
By using metadata and physical views from the source database, it is possible to create functioning logical views. The exact method for doing this depends on the data virtualization tool – some work with a query language like SQL; others use a drag-and-drop interface. Using their preferred tool, data engineers can apply joins to create logical views that cover multiple, disparate data sources.
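The sketch below illustrates this step with Python and SQLite, using two hypothetical local files (crm.db and billing.db) to stand in for disparate sources. A real platform federates heterogeneous systems through its own connectors, but the principle of a logical view spanning sources is the same.

```python
import sqlite3

# Two stand-in "physical" sources, created as local files purely for the sketch.
crm = sqlite3.connect("crm.db")
crm.executescript("""
    DROP TABLE IF EXISTS customers;
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'Acme Ltd'), (2, 'Globex');
""")
crm.commit()
crm.close()

billing = sqlite3.connect("billing.db")
billing.executescript("""
    DROP TABLE IF EXISTS invoices;
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO invoices VALUES (1, 1, 500.0), (2, 1, 250.0), (3, 2, 900.0);
""")
billing.commit()
billing.close()

# The virtualization layer: attach both sources and define a logical view that
# joins them. No rows are copied out of either source database.
virtual = sqlite3.connect(":memory:")
virtual.execute("ATTACH DATABASE 'crm.db' AS crm")
virtual.execute("ATTACH DATABASE 'billing.db' AS billing")
virtual.execute("""
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.name, SUM(i.amount) AS revenue
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
""")
```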
4. Make virtual data source available
The output of the logical views can be used as a data source. The actual data remains unchanged on the source database, while the data virtualization platform outputs a working representation of that data.
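Continuing the sketch above, the logical view now behaves like an ordinary table for anyone querying the virtualization layer, while the rows themselves still live only in the two source files:

```python
# Query the logical view as if it were a table; every query reads the live
# source databases rather than a stored copy.
for name, revenue in virtual.execute(
        "SELECT name, revenue FROM customer_revenue ORDER BY name"):
    print(name, revenue)
# Acme Ltd 750.0
# Globex 900.0
```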
5. Connect to business platforms
Virtual data usually exists for a specific business purpose, such as analytics. In this instance, the virtualization platform may integrate with a data visualization tool such as Chart.io or Tableau. Analysts can create graphical representations of the data, or process it according to any other relevant business needs.
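As a rough illustration of this hand-off, and continuing the same sketch, the view can be pulled into an analytics library such as pandas (assuming it is installed). BI tools such as Tableau would normally connect through their own connectors instead.

```python
# Hand the virtual view to an analytics layer; assumes the pandas library.
import pandas as pd

df = pd.read_sql("SELECT name, revenue FROM customer_revenue", virtual)
print(df.sort_values("revenue", ascending=False))
```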
Data Virtualization vs. Data Warehousing
Virtualization and warehousing are two methods for integrating data into a single, working source. However, the two methods tackle the issue in different ways.
Consolidation
With data consolidation methods such as Extract, Transform, Load (ETL), data is extracted from a source, transformed according to a schema, and copied to a destination, such as a warehouse. The transformed data persists at its new location and exists separately from the original source.
In virtualization, data stays put on its original server. The view of data is purely virtual, the result of a middleware layer that sits between the data sources and the business applications.
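The contrast can be sketched in a few lines of SQL, run here through Python's sqlite3 module with hypothetical table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, total REAL, placed_at TEXT);
    INSERT INTO source_orders VALUES (1, 120.0, '2024-01-05'), (2, 80.0, '2024-01-06');
""")

# ETL-style consolidation: transform the data and persist a copy in a new table.
con.execute("""
    CREATE TABLE warehouse_orders AS
    SELECT id, ROUND(total * 1.2, 2) AS total_with_tax, placed_at
    FROM source_orders
""")

# Virtualization-style: the same transformation defined as a view; nothing is copied.
con.execute("""
    CREATE VIEW v_orders AS
    SELECT id, ROUND(total * 1.2, 2) AS total_with_tax, placed_at
    FROM source_orders
""")
```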
Recency
When the source data changes, that change is not automatically reflected in the data warehouse unless it is transmitted along a data pipeline. If the source data goes offline, the data warehouse remains unchanged.
Data in a virtual view is the result of queries to the original data source. This means that all data available will be the most recent version. If the source is unavailable, the relevant data won’t be included in the virtual view.
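Continuing the consolidation sketch above, a change to the source shows up in the view on the next query but not in the warehoused copy:

```python
# Change the source: the view reflects it on the next query; the copy does not.
con.execute("UPDATE source_orders SET total = 200.0 WHERE id = 1")
print(con.execute("SELECT total_with_tax FROM warehouse_orders WHERE id = 1").fetchone())  # (144.0,)
print(con.execute("SELECT total_with_tax FROM v_orders WHERE id = 1").fetchone())          # (240.0,)
```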
Flexibility
Both ETL and virtualization involve passing data through a transformation layer. In virtualization, this transformation is temporary. Data owners can change the schema, refresh the view, and see a new set of results.
In ETL, the transformation is permanently represented in warehoused data. Any changes in the schema are only reflected in new data, while older data remains untouched.
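In the same sketch, changing the virtual schema is just a matter of redefining the view, whereas the warehoused table would need to be rebuilt or backfilled to match:

```python
# Redefine the view with a different business rule; results change on the next query.
con.executescript("""
    DROP VIEW v_orders;
    CREATE VIEW v_orders AS
    SELECT id, ROUND(total * 1.1, 2) AS total_with_tax, placed_at
    FROM source_orders;
""")
print(con.execute("SELECT total_with_tax FROM v_orders WHERE id = 1").fetchone())  # (220.0,)
# warehouse_orders still holds rows produced under the old transformation.
```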
Speed
Virtualization offers on-demand logical views of data. This tends to be the quickest method when working with fast-changing live data, as there is no waiting for a transformation process to complete.
However, post-transformation, warehoused data offers some efficiency advantages. The data has already been transformed by ETL into the appropriate schema, which means there are no complex views or joins to resolve at query time. The data is ready for immediate querying.
Many organizations will find that they need a mix of virtualization and warehousing, depending on the scenario. To choose between the two, decision-makers must know the needs of business users, as well as the nature of the data sources involved.