Designing a data warehouse is a crucial part of getting effective business intelligence and analytics—so why does it sometimes seem so difficult?
Fans of AMC's hit drama Mad Men might remember an episode from the show's final season, "The Monolith," when a room-sized computer is installed in the center of the office. One of the firm's clients requests a computer simply because a competing agency already has one—thus taking what might be the very first step in the digital "arms race."
Mad Men is beloved by critics and viewers for showing how the advertising industry changed throughout the 1960s, from the growth of advertising conglomerates to the rise of TV commercials. Just as advertising went through major changes in Don Draper's time, the task of designing a data warehouse is rapidly evolving as the volume, variety, and velocity of big data all increase.
Introduction
Relational database management systems (RDBMS) are still king for most organizations when it comes to data warehouses. However, designing a data warehouse with an RDBMS involves a sequential series of tasks: gathering business requirements, identifying data sources, modeling the schema (typically fact and dimension tables), and building the ETL processes that populate it.
Using relational databases when designing a data warehouse requires the schema to be well-defined before a single byte of data is loaded, a concept known as "schema on write." Even slight changes to your data sources or business questions can send you back to the drawing board to restart the design process. In the age of ever-growing data, this is simply inadequate.
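To see what schema on write means in practice, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational warehouse (the table and column names are purely illustrative): the schema must exist before any row can be loaded, and a new field from a new source forces a schema migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # an in-memory database stands in for the warehouse

# Schema on write: the table must be fully defined before any data arrives.
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount     REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")

# Rows that match the schema load fine...
conn.execute(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    (1, "Acme Corp", 99.50, "2014-06-17"),
)

# ...but a record with an unanticipated field (say, a 'channel' attribute
# from a new marketing source) cannot be loaded until the schema itself
# is migrated:
conn.execute("ALTER TABLE sales ADD COLUMN channel TEXT")
conn.execute(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    (2, "Sterling Cooper", 250.00, "2014-06-18", "tv"),
)
```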
Here at Integrate.io, we've already written a comprehensive guide to data warehouse design from start to finish. In this article, we'll take a more philosophical look at the question of designing a data warehouse. What are the challenges, and the strengths of designing a data warehouse, in an age where big data becomes larger and more complex with every passing year?
The Challenges of Big Data
When designing a data warehouse, IT professionals need to account for the most pressing big data challenges. Two crucial facts:
- The average company now manages 163 terabytes (163,000 gigabytes) of information.
- IDC predicts that the total amount of information worldwide will soar to 175 zettabytes (175 trillion gigabytes) by 2025.
The enterprise data of today arrives from a dizzying array of sources, and it comes in many different formats. Information such as application logs, sensor data, and social data may be semi-structured or entirely unstructured, which makes it difficult for a traditional relational data warehouse to process.
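To make that variety concrete, here is a small illustration (the records below are invented): three events from three different sources, each with a different set of fields, so no single fixed table schema fits them all cleanly.

```python
import json

# Three raw events from three different sources; each record has a
# different shape, so no single relational schema fits all of them.
raw_events = [
    '{"source": "app_log", "level": "ERROR", "msg": "timeout", "ts": 1402992000}',
    '{"source": "sensor", "device_id": "th-042", "temp_c": 21.7}',
    '{"source": "social", "user": "@adfan", "text": "great spot!", "likes": 613}',
]

for line in raw_events:
    event = json.loads(line)
    print(event["source"], "->", sorted(event.keys()))
```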
What's more, it's hard to ask all the right business questions in advance when designing a data warehouse, since plenty of new questions will pop up during analysis. Also, more data is arriving all the time—so the questions you want to ask right now may not be the same as the ones you want to ask next week, next month, or next year.
Designing a data warehouse for today's big data challenges means that you have to handle changing requirements and information in different formats, from different sources, all at great speed. Enter Apache Hadoop.
Designing a Data Warehouse with Apache Hadoop
Apache Hadoop is a framework for distributed storage and processing of large datasets that requires no predefined schema. In other words, data can be stored immediately without having to define anything up front, a concept better known as "schema on read." As new requirements pop up, they can be satisfied without redesigning the entire system, and new sources of data can be integrated easily.
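Here's a minimal sketch of schema on read, again in plain Python with invented field names; a list of raw JSON strings stands in for files sitting on HDFS. Nothing is defined at load time, and structure is imposed only when a question is finally asked.

```python
import json

# Load step: raw records are stored exactly as they arrive; no schema,
# no transformation. (A Python list stands in for files on HDFS.)
raw_store = [
    '{"user": "pete", "action": "view", "page": "/pricing"}',
    '{"user": "peggy", "action": "click", "button": "signup", "referrer": "ad-42"}',
    '{"user": "joan", "action": "click", "button": "demo"}',
]

# Read step: structure is applied only now, for the question being asked.
# A field nobody anticipated at load time (like 'referrer') is still usable.
events = [json.loads(line) for line in raw_store]
clicks = [e for e in events if e.get("action") == "click"]
for event in clicks:
    print(event["user"], event.get("referrer", "unknown"))
```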
Hadoop not only excels at building agile and flexible data warehouses; it can also handle all data types, from structured to semi-structured and unstructured, in huge quantities and at high velocity.
The added power of Apache Hadoop allows you to enhance your enterprise data warehouse in several ways. Below, we'll discuss three options for using Hadoop when designing a data warehouse.
1. ETL Offloading
ETL offloading is a method in which an organization preserves its traditional data warehouse for analytical purposes, while offloading the ETL process (extracting the data from various sources, transforming it, and loading it into the data warehouse) to Apache Hadoop.
ETL offloading helps the data warehouse environment better deal with big data challenges. What's more, Hadoop services such as Integrate.io run in the cloud and process your data in parallel, improving data warehouse performance and efficiency even further.
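As a rough illustration of what the offloaded transform step might look like, here is a hypothetical Hadoop Streaming mapper in Python (the field names and tab-separated input format are invented for this sketch). The cluster does the parsing, cleansing, and type normalization in parallel, so the warehouse only ever receives clean data.

```python
#!/usr/bin/env python
# mapper.py: a hypothetical Hadoop Streaming mapper performing the
# offloaded "transform" step of ETL on the cluster.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:                # drop malformed records
        continue
    customer, amount, timestamp = fields
    try:
        amount = float(amount)          # normalize the amount to a number
    except ValueError:
        continue                        # skip records with unparseable values
    # Emit cleaned key/value output; Hadoop distributes the work in parallel.
    sys.stdout.write("%s\t%.2f\t%s\n" % (customer.strip().lower(), amount, timestamp))
```

A script like this would typically be submitted with the streaming jar that ships with Hadoop, and the cleaned output would then be bulk-loaded into the warehouse.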
2. Inexpensive Raw Data Archive
Apache Hadoop can serve as a cheap archive for raw data. Hadoop can store all types of information, and it scales by adding off-the-shelf servers to the cluster.
This means that extra storage space can be added easily and relatively cheaply to your data warehouse with Hadoop. Data can be kept in the Hadoop archive indefinitely and used when necessary.
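As a sketch of what such an archive might look like, here is a minimal date-partitioned layout in Python (the directory convention and source names are illustrative, and a local path stands in for HDFS). Raw records are compressed and stored verbatim, so nothing is lost even if today's questions don't need them.

```python
import gzip
import os
from datetime import date

# A date-partitioned raw archive: archive/<source>/<yyyy>/<mm>/<dd>.
ARCHIVE_ROOT = "archive"  # in practice, a path on the Hadoop cluster

def archive_raw(source, day, lines):
    """Store raw records verbatim and compressed; nothing is parsed or dropped."""
    part_dir = os.path.join(ARCHIVE_ROOT, source, day.strftime("%Y/%m/%d"))
    os.makedirs(part_dir, exist_ok=True)
    with gzip.open(os.path.join(part_dir, "part-0000.gz"), "wt") as f:
        f.writelines(line + "\n" for line in lines)

archive_raw("app_log", date(2014, 6, 17), ['{"level": "INFO", "msg": "boot"}'])
```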
3. Operational Data Store (ODS)
An operational data store (ODS) is an additional database that lies between the operational systems (those that process day-to-day transactions) and the data warehouse. Raw data from the operational systems is simply copied to the ODS and preserved for a certain period of time, usually between a week and a month. The ETL process then runs between the ODS and the data warehouse.
The ODS architecture is an alternative approach to the traditional data warehouse layout. Using an operational data store has several advantages:
- Better performance: A full ETL process involves tasks beyond copying data, such as data cleansing, that can strain the operational systems' resources. Instead of running a full ETL between the operational systems and the data warehouse, you can improve performance by extracting data to the ODS first, then transforming and loading it into the data warehouse.
- Easier testing: The data warehouse contains only crunched data that answers predefined questions. Since all the raw data is available in the ODS, you can test ad hoc questions there long before adding them to the data warehouse.
- Full ETL: Rather than running the ETL process between the ODS and the data warehouse, you can run ETL on the ODS itself and then copy the processed data to the data warehouse. This is another way to free up data warehouse resources (see the sketch after this list).
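The sketch below traces the whole ODS flow in miniature, using in-memory SQLite databases as stand-ins for the operational system, the ODS, and the warehouse (table and column names are invented). Step 1 is the cheap verbatim copy; step 2 is where cleansing and type conversion happen, safely away from the operational systems.

```python
import sqlite3

operational = sqlite3.connect(":memory:")
ods = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

operational.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount TEXT)")
operational.execute("INSERT INTO orders VALUES (1, '  Acme Corp ', '99.50')")

# Step 1: straight copy to the ODS. Cheap, no cleansing; the operational
# system is barely touched.
ods.execute("CREATE TABLE orders_raw (id INTEGER, customer TEXT, amount TEXT)")
for row in operational.execute("SELECT id, customer, amount FROM orders"):
    ods.execute("INSERT INTO orders_raw VALUES (?, ?, ?)", row)

# Step 2: transform and load from the ODS into the warehouse. Cleansing
# and type conversion happen here, off the operational systems.
warehouse.execute("CREATE TABLE orders_fact (id INTEGER, customer TEXT, amount REAL)")
for oid, customer, amount in ods.execute("SELECT id, customer, amount FROM orders_raw"):
    warehouse.execute(
        "INSERT INTO orders_fact VALUES (?, ?, ?)",
        (oid, customer.strip(), float(amount)),
    )
```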
As it turns out, Hadoop makes for an excellent ODS solution: it's perfect for the copying process and for running ad hoc queries as batch processes. In fact, with Hadoop's cheap and scalable storage, you can keep data there indefinitely, giving you the opportunity to perform even more in-depth analyses.
While using Hadoop as an operational data store is a relatively new development, it’s becoming more and more important. An ODS powered by Hadoop enables users to run different types of queries: traditional batch MapReduce queries, interactive SQL, and high-performance stream processing, just to name a few.
Designing a Data Warehouse: The Problem with Apache Hadoop
While Hadoop has many advantages when designing a data warehouse, it's not a perfect solution. The biggest challenge of Hadoop is that it requires specialized and complex skills that many organizations simply don't have, or don't have in great enough supply.
The good news is that this skills shortage doesn't have to be a barrier when adopting Hadoop for your data warehouse. The need for a user-friendly Hadoop solution is exactly the reason we created Integrate.io, a simplified ETL and data integration platform. Integrate.io runs Hadoop clusters for you in the cloud with just the click of a button—no understanding of how to set up and manage Hadoop required.
In addition, Integrate.io has a straightforward visual drag-and-drop interface for defining the transformations and other tasks that you want to perform on the data when preparing it for your data warehouse. There's no need for any specialized programming knowledge of technologies like MapReduce, Hive, or Pig.
Our suggestion: Don’t be like Mad Men, worrying about the latest, greatest technologies just to keep up with the Joneses. Do, however, make sure you’re set up with a data warehouse that will handle all of your big data needs—especially if that involves Hadoop and Integrate.io.
Want to learn how to use Integrate.io to build clean, simple data integration pipelines to your data warehouse? Get in touch with us today for a free trial of the Integrate.io platform.
Designing in the cloud? Be sure to read our post "Designing a Big Data Warehouse on the Cloud."
Originally Published: June 17, 2014