The Problem With Data Lakes
- A data lake is a central repository that stores structured and unstructured data in its raw form.
- The problem with data lakes is that they impose no schema or structure on the data, making them nearly unusable.
- Data lakes became a dumping ground for information. As a result, using them for analysis was nearly impossible.
- Data lakes are complex, suffer from poor data quality, and are fraught with security risks.
- To become truly useful, data lakes must implement a standard architecture, along with procedures that ensure data integrity.
By W H Inmon
This is a guest post with exclusive content by Bill Inmon. Bill "is an American computer scientist, recognized by many as the father of the data warehouse. Inmon wrote the first book, held the first conference, wrote the first column in a magazine, and was the first to offer classes in data warehousing." -Wikipedia
Big Data started as a replacement for data warehouses. The Big Data vendors are loath to mention this fact today, but if you were around in the early days of Big Data, one of the central topics discussed was: if you have Big Data, do you need a data warehouse? From a marketing standpoint, Big Data was sold as a replacement for the data warehouse. With Big Data, you were free from all that messy stuff that data warehouse architects were doing. Much to the surprise of the Big Data vendors, support for data warehousing was far stronger than they had ever imagined. There were (and still are) valid reasons why data warehouses existed. Here's why.
Big Data Needs Structure
If you wanted integrated, believable data, you needed a data warehouse. Big Data had nothing to say about this aspect of data. The Big Data vendors said, "Buy my product and your problems go away." The vendors got pushback, and the feedback was clear: Big Data needed an architectural construct. The solution? The data lake. Integrate.io is a data warehouse integration solution that handles ETL, ELT, and Reverse ETL, and includes a very fast CDC platform. Use it to move data from multiple sources to a data lake of your choice with little or no code.
What is a Data Lake?
The data lake was just a big collection of structured and unstructured data thrown onto the Big Data infrastructure. The theory was that you put the data into the data lake, and anyone who needed the data could find it and use it. However, it didn't take long for people to discover that the data lake was really just a glorified data garbage dump. Why? Because there was no structure or predefined schema (the sketch after this list shows what that means in practice), which meant:
- Complexity: Data lakes are so complex that only advanced data scientists and engineers understand how to use them. And even then, making sense of the data usually requires natural language processing technology.
- Poor Data Quality: Data lakes are rife with data quality issues because there is no discipline over what goes into the lake or how it gets there.
- Security Risks: With so much information stored under no oversight, it's easy for sensitive information to be overlooked.
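To see what "no predefined schema" costs in practice, here is a minimal Python sketch. The record shapes and field names are hypothetical, but the pattern is the one the data lake forces on its users: three teams write the "same" fact in three incompatible shapes, and every consumer has to rediscover the structure on read.

```python
import json

# Hypothetical records dumped into the lake by three different teams.
# Nothing enforced a schema on write, so one fact arrives in three shapes.
raw_records = [
    '{"customer_id": 101, "revenue": "1,204.50"}',         # revenue as a formatted string
    '{"custId": "101", "rev_usd": 1204.5}',                # different names and types
    '{"customer": {"id": 101}, "revenue_cents": 120450}',  # nested id, different unit
]

# Schema-on-read: the consumer reverse-engineers every variant by hand.
total = 0.0
for line in raw_records:
    rec = json.loads(line)
    if "revenue" in rec:
        total += float(rec["revenue"].replace(",", ""))
    elif "rev_usd" in rec:
        total += rec["rev_usd"]
    elif "revenue_cents" in rec:
        total += rec["revenue_cents"] / 100

print(total)  # 3613.5 -- and this fragile mapping breaks with the next new shape
```

Every analyst repeats this archaeology for every question, which is exactly why the data went unused.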
As a result of these issues, the data sat there, and no one used it. No one could use it. In reality, the data lake was never a real architecture. It was just a buzzword used to counter the technicians who had already built a data warehouse.
The Problem With Data Lakes
The problem with garbage dumps is that they start to smell over time. Furthermore, the Big Data garbage dump was expensive. So the people who put this non-architecture out there were called upon to make their garbage dump useful. Still, they encountered problems.
Difficult for Analysis
The first thing they discovered was that they needed metadata to describe what was in the data lake. Without metadata, no one could find anything. They soon realized that metadata wasn't enough: companies need data they can rely on for analysis, and metadata only led them to the next step up the ladder. They also had to refine the data against a common data model so that it made sense from one analysis to the next. Even that didn't quite solve the problem. In fact, the data lake doesn't solve problems; it merely introduces them.
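For a concrete sense of the metadata and common model this paragraph calls for, here is a small sketch. Everything in it (the paths, fields, and catalog layout) is an illustrative assumption, not any specific product's format; the point is how much has to be written down before anyone can find data in the lake.

```python
# A minimal metadata catalog: without at least this much, nothing in
# the lake is findable. All names here are hypothetical.
catalog = {
    "sales/orders/2023/": {
        "owner": "commerce-team",
        "format": "jsonl",
        "schema": {"customer_id": "int", "revenue_usd": "float"},  # the common data model
        "refreshed": "daily",
        "source": "crm.orders",
    },
    "marketing/clicks/2023/": {
        "owner": "growth-team",
        "format": "parquet",
        "schema": {"customer_id": "int", "clicked_at": "timestamp"},
        "refreshed": "hourly",
        "source": "web.events",
    },
}

def find_datasets(field: str) -> list[str]:
    """Return the lake paths whose declared schema exposes a given field."""
    return [path for path, meta in catalog.items() if field in meta["schema"]]

print(find_datasets("customer_id"))  # ['sales/orders/2023/', 'marketing/clicks/2023/']
```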
I don't know if the path the data lake is walking down sounds familiar, but it's the same path that the people doing analysis walked a long time ago. What they discovered is that they need a data warehouse architecture: exactly 180 degrees opposite of what the vendors promised their buyers years ago. (It's hard for vendors to admit they made a mistake.) Integrate.io is a new ETL platform that removes the pain points and complexity of moving data to and from data lakes. As a no-code solution, anyone can integrate data sources quickly and easily. Email hello@integrate.io to learn more.
How to Make a Data Lake Work
When you bring in a data lake, you need to impose discipline if you don't want it turning into a garbage dump. Specifically, the discipline of an architecture over the data lake.
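What might that discipline look like in code? Here is one minimal sketch, assuming a schema checked on write rather than sorted out on read. The field names and rules are hypothetical, and a real architecture covers far more (ownership, lineage, retention), but the principle is the same: nothing lands in the lake without matching the agreed model.

```python
# Hypothetical common model that every record must match before landing.
EXPECTED = {"customer_id": int, "revenue_usd": float}

def validate_on_write(rec: dict) -> dict:
    """Reject any record that does not match the agreed schema."""
    missing = set(EXPECTED) - set(rec)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    extras = set(rec) - set(EXPECTED)
    if extras:
        raise ValueError(f"undeclared fields: {sorted(extras)}")
    for field, ftype in EXPECTED.items():
        if not isinstance(rec[field], ftype):
            raise TypeError(f"{field} must be {ftype.__name__}")
    return rec

validate_on_write({"customer_id": 101, "revenue_usd": 1204.5})  # lands in the lake

try:
    validate_on_write({"custId": "101", "rev_usd": 1204.5})     # rejected at the gate
except ValueError as err:
    print(err)  # missing fields: ['customer_id', 'revenue_usd']
```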
Yes, Big Data and data lake enthusiasts: there is such a thing as architecture. Data integrity is necessary, but it doesn't magically happen. It requires a lot of work and forethought. And we in the world of data warehousing get to say to you, "I told you so." Yes, we do remember who was so condescending to us in years past. We remember who called us an "old" idea and architecture. We remember the derision that was tossed at us. We remember who sold the IT community on the fact that we weren't needed. We remember being told that we were yesterday's news and to get lost.
You could have saved a lot of time and energy (and money) by trying to build on the past rather than trying to sweep us away. We remember.
The last laugh is really the best laugh. Big Data and data lakes didn't need to become the problem children that they are, but the vendors are to blame for the mess that has been made.
How Integrate.io Helps Clean Up The Data Lake Dump
Integrate.io empowers companies to sync data with major data warehouses and data lakes without code or complicated big data pipelines. The platform features deep e-commerce capability, enabling companies to leverage information from their data warehouses and data lakes for e-commerce. It comes with out-of-the-box connectors that sync with major warehouses and lakes, removing the challenges associated with data integration. Schedule a demo to learn more about how Integrate.io can help with your data integration needs.
Bill Inmon, the father of the data warehouse, has authored 65 books and was named by Computerworld as one of the ten most influential people in the history of computing. Bill's company, Forest Rim Technology, is based in Castle Rock, Colorado. Bill Inmon and Forest Rim Technology help companies hear the voice of their customer. See more at www.forestrimtech.com.