Buzz about Big Data has been at fever pitch for over a year now. We hear a lot about how the insights we glean will propel businesses, about emerging technologies, and companies merging. But how often do we hear about the guts behind Big Data, what makes it actually work? Maybe I’m wrong, but from what I read, not often enough. So to buck that trend, let’s dive into one of the main building blocks of traditional data warehousing, ETL, and see how it fits in with current Big Data architecture.
For those who are unfamiliar, ETL stands for Extract, Transform and Load. It's taken from data warehousing methodologies that date back to the 1960s. ETL is the process of taking raw data (the data emitted by sensors, POS machines, web applications, medical devices, and the like) and making it usable. Once collected, the data is extracted from where it resides, transformed into a more readable and understandable format, and conformed to business terminology and entities. The transformed and cleaned data is then loaded into a data repository of some sort, usually a relational database.
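To make those three steps concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the CSV export, the raw column names (store, ts, amt) and the target sales table are invented purely to illustrate extracting raw records, transforming them into conformed business fields, and loading them into a small relational store (SQLite).

```python
import csv
import sqlite3


def extract(path):
    """Extract: read raw point-of-sale records from a CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows):
    """Transform: clean up types and conform names to business terminology."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "store_id": int(row["store"]),
            "sale_date": row["ts"][:10],               # keep only the date part
            "amount_usd": round(float(row["amt"]), 2)  # normalize currency precision
        })
    return cleaned


def load(rows, db_path="warehouse.db"):
    """Load: write the conformed rows into a relational table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS sales (store_id INTEGER, sale_date TEXT, amount_usd REAL)"
    )
    con.executemany(
        "INSERT INTO sales VALUES (:store_id, :sale_date, :amount_usd)", rows
    )
    con.commit()
    con.close()


if __name__ == "__main__":
    load(transform(extract("pos_export.csv")))
```

Real pipelines add scheduling, error handling and incremental loads on top of this skeleton, but the shape is the same: pull the raw data out, reshape it, push it into the warehouse.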
This is how the data architecture world worked for pretty much the past 40 years, basically since the advent of the relational database and data warehousing: raw data transformed by ETL processes and moved into the data warehouse (DWH) for consumption.
Every now and then, we hear that there's no need for ETL. The argument goes like this: we can store large amounts of data on fairly inexpensive storage platforms (such as Hadoop), we can keep semi-structured and unstructured data in the same place as structured data, and we can process and analyze all of it together, so we no longer need to implement any sort of ETL process. We can simply dump all the data we ever dreamed of storing and analyzing onto a single platform and do whatever we want with it. No more boundaries imposed by data models (tables, dimensions and facts, keys and relationships between tables); the world is our oyster and we can do as we please. Simply build the best tools ever built to explore data on Hadoop and we're good!
But are we really? No, we're not. We should still cook our food before it's served, because most of us don't like it raw. Only a select few, really sharp people can make sense of the massive quantities of data residing on Hadoop clusters "as is". The rest of us commoners don't want to, or can't, spend tedious iterations of data exploration cycles to get the job done, to do reporting, and to perform analyses. Most of us need to open our favorite BI or reporting tools, connect to a strongly defined data model, and slice and dice the data as we wish. However, that data has to be clean, with sensitive fields masked to satisfy regulations like GDPR and CCPA. It has to use the common business terminology that everybody in the organization will understand. Especially you.
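As a hedged illustration of that masking requirement, the snippet below pseudonymizes a couple of assumed PII columns (email, phone) with a salted hash inside the transform phase, so downstream BI users never see raw identifiers. It is a sketch of the idea, not a complete GDPR or CCPA compliance solution.

```python
import hashlib

# Assumed column names, for illustration only.
PII_FIELDS = {"email", "phone"}


def mask_value(value: str, salt: str = "rotate-me") -> str:
    """Replace a PII value with a truncated, salted SHA-256 digest (pseudonymization)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def mask_row(row: dict) -> dict:
    """Return a copy of the record with PII fields masked, other fields untouched."""
    return {k: mask_value(v) if k in PII_FIELDS else v for k, v in row.items()}


print(mask_row({"email": "jane@example.com", "order_total": "42.50"}))
```

The point is that this kind of masking belongs in the transform step of the pipeline, not as an afterthought bolted onto every report.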
Don’t get me wrong. In no way am I suggesting that raw data has no value and should be ignored. Quite the contrary. We should embrace the technologies that enable us to store and process raw data in the Big Data world. The data scientist should spend their days going over this raw data, trying to find the hidden truths and enigmatic trends that others have yet to comprehend: the behavioral changes of customers, markets, currency, livestock and storms. But this is a job for a select few, not for the majority of data and BI consumers, developers and analysts. These people still need their ETL. They have neither the skills nor the time to dig into the raw data and still get their job done. However, with the proper ETL tools, your everyday Joe Business Intelligence will bring a lot of value to your business.