Introduction

From databases to data warehouses and, finally, to data lakes, the data landscape is changing rapidly as volumes and sources of data increase. With a projected annual growth rate of almost 30%, the data lake market is expected to grow from USD 3.74 billion in 2020 to USD 17.6 billion by 2026.

Also, the 2022 Data and AI Summit made it clear that data lake architecture is the future of data management and governance. The trend will likely accelerate with Databricks' release of Delta Lake 2.0, which open-sources all of the platform's APIs.

Plus, Snowflake announced some game-changing features at its own summit, making data lakes a mainstay of the industry. Governance, security, scalability, and seamless analysis of both analytical and transactional data will be the prime factors driving innovation in this domain.

Basic Anatomy of a Data Lake

According to Hay, Geisler, and Quix (2016), the three main functions of data lakes are to ingest raw data from several data sources, store it in a secure repository, and allow users to quickly analyze all data by directly querying the data lake.

A data lake, therefore, consists of three components: data storage, the data lake file format, and the data lake table format. All three support the functions mentioned above and serve as the primary building blocks of a data lake.

The data lake architecture stores data from various sources, such as traditional databases, web servers, and e-mails, through its data storage component.

The data lake file format serves as the data processing unit, where incoming data is compressed into column-oriented formats to optimize querying and exploration. Lastly, the data lake table format helps with data analytics by presenting all the underlying data files as a single logical table.

So updating one data source is reflected everywhere, as if all the sources were in a single table.

Typical data storage platforms include AWS S3, Google Cloud Storage, and Azure Data Lake Storage. Apache Parquet and Avro are versatile data lake file formats, while Apache Hudi, Apache Iceberg, and Delta Lake are well-known data lake table formats.

A Comprehensive List of 18 Data Lake Features

A data lake has become a necessity rather than a nice-to-have. But that doesn't mean an organization should invest in one blindly. Different circumstances warrant different feature sets. Below is a list of the features a data lake should ideally have.

Ability to Scale Metadata

Efficient metadata management is crucial for a data lake to maintain data quality so that a broader set of users can easily understand and derive insights from different data sets.

Darmont and Sawadogo (2021) state that data within a data lake has no explicit format, which means it can quickly become a wasted asset without metadata to describe the relevant schemas.

The authors identify three levels of metadata that a data lake system should have. Firstly, it should provide business-level information to enhance understanding of a data set. Secondly, operational metadata should cover information generated during data processing, while technical metadata should clearly describe the schema.
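
For illustration only, the sketch below records business- and operational-level metadata as table properties next to the technical schema, assuming a Spark session with Delta Lake configured; the table name and property keys are hypothetical, and many teams would use a dedicated data catalog instead.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with the delta-spark package;
# the table name and the property keys below are hypothetical.
spark = SparkSession.builder.getOrCreate()

# Business- and operational-level metadata stored alongside the table.
spark.sql("""
    ALTER TABLE orders SET TBLPROPERTIES (
        'business.owner'          = 'sales-analytics',
        'business.description'    = 'One row per confirmed customer order',
        'operational.ingested_by' = 'nightly_etl_v2'
    )
""")

# Technical metadata: the schema plus the properties set above.
spark.sql("DESCRIBE TABLE EXTENDED orders").show(truncate=False)
```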

Carry out ACID Transactions

A data lake without support for ACID properties can be a considerable setback for data governance.

Wright et al. (2007) describe ACID as an acronym for Atomicity, Consistency, Isolation, and Durability.

Atomicity ensures that only completed data processes affect data sources. So no row is added if an update fails midway.

Consistency maintains data integrity by imposing constraints like unique identifiers, positive balances in a checking account, etc.

Isolation prevents concurrent operations from interfering with one another, while durability maintains the latest data state even after a system failure.
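
To make the atomicity guarantee concrete, here is a minimal sketch using PySpark with Delta Lake; the Spark session setup and table path are assumptions, not a prescribed configuration.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available on the cluster.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

updates = spark.createDataFrame(
    [(1, 120.0), (2, 430.5)], ["account_id", "balance"]
)

# Atomicity: the append either commits to the Delta transaction log in full
# or not at all -- a job failure midway leaves no partial rows visible.
updates.write.format("delta").mode("append").save("/tmp/lake/accounts")
```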

Support for DML Operations

Data Manipulation Language (DML) is a set of commands that lets users manipulate data in databases. For example, SQL provides DML commands such as SELECT, INSERT, DELETE, UPDATE, and MERGE to perform specific operations on data.

Support for DML in a data lake simplifies governance, auditing, and change data capture (CDC) by letting users easily keep source and target tables consistent. For example, a user can run the UPDATE command to pass changes detected in a source table on to the target table based on specific filters.
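
As a hedged sketch of that pattern, the example below uses Delta Lake's MERGE support from a Spark session; the table names and the `op` column are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake; the tables
# `target` and `source_changes` and the `op` column are hypothetical.
spark = SparkSession.builder.getOrCreate()

# Propagate captured changes from a staging table into the target table.
spark.sql("""
    MERGE INTO target AS t
    USING source_changes AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```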

Flexibility in Building and Maintaining Schema

One of the advantages of a data lake over a data warehouse is that data lakes provide flexibility with schema evolution. Data warehouses require pre-defined schemas before storing a particular data set, while data lakes do not need such a schema.

Effective data lakes have data storage systems that automatically infer the schema from the structured and unstructured data sources they store. Such inference is usually termed schema-on-read, as opposed to schema-on-write, which describes the rigid schema structures of a data warehouse.
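
The sketch below illustrates schema-on-read and schema evolution with Spark and Delta Lake, assuming a configured Spark session; the paths are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake; paths are hypothetical.
spark = SparkSession.builder.getOrCreate()

# Schema-on-read: Spark infers the schema of semi-structured JSON at read time.
events = spark.read.json("/tmp/lake/raw/events/")
events.printSchema()

# Schema evolution on write: mergeSchema lets new columns join the table
# without a pre-defined schema being declared up front.
(events.write.format("delta")
       .mode("append")
       .option("mergeSchema", "true")
       .save("/tmp/lake/events"))
```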

Tracking Row-Level Table Changes

Data Lakes like Delta Lake and Snowflake allow users to track and capture changes made to tables at the row level. The feature is part of CDC, where the data lake records any change made to a source table due to an UPDATE, DELETE, or an INSERT event in a separate log.

Such tracking helps in several use cases, like optimizing ETL procedures by only processing the changes, updating BI dashboards with only the new information rather than the entire table, and helping with audits by saving all the changes made in a change log.
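
As one example of how this looks in practice, Delta Lake's change data feed can be switched on per table and then queried for only the rows that changed; the sketch below assumes a Spark session with Delta Lake configured, and the table path and starting version are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake.
spark = SparkSession.builder.getOrCreate()

# Enable row-level change tracking on an existing table (path is hypothetical).
spark.sql("""
    ALTER TABLE delta.`/tmp/lake/orders`
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the rows that changed since version 5; the _change_type column
# marks each row as an insert, update, or delete.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 5)
           .load("/tmp/lake/orders"))
changes.show()
```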

Maintaining Audit Logs, Rollback & Time Travel

Managing big data is challenging if a data lake lacks a versioning system. It becomes especially cumbersome with real-time ingestion, where new data arrives constantly. If bad data enters the stream, cleaning up such a large volume of data is very difficult.

As such, data lakes must support automatic versioning. Versioning allows time travel, letting users track and roll back to previous versions if needed, and it simplifies the management of data pipelines, which helps maintain data integrity and quality.
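
A brief sketch of what the audit log and time travel look like with Delta Lake, assuming a configured Spark session; the table path and version number are placeholders.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes an existing SparkSession configured with Delta Lake.
spark = SparkSession.builder.getOrCreate()

# Audit log: every commit is recorded with its version, timestamp, and operation.
DeltaTable.forPath(spark, "/tmp/lake/orders").history().show(truncate=False)

# Time travel: read the table exactly as it was at an earlier version.
orders_v3 = (spark.read.format("delta")
             .option("versionAsOf", 3)
             .load("/tmp/lake/orders"))
```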

Data (Table) Restoration

It is common for businesses today to perform frequent migrations of large amounts of data from one environment to another to use cost-effective data solutions. But conducting such ad-hoc migrations on data lakes may lead to irreversible setbacks that can cause businesses to lose valuable data assets.

So, data lakes should have built-in restoration capabilities that let users restore the previous state of the relevant tables using secured backups through simple commands.
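
Delta Lake, for instance, exposes a restore operation that rolls a table back to an earlier version in a single call; the sketch below assumes a configured Spark session, and the path and version are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes an existing SparkSession configured with Delta Lake.
spark = SparkSession.builder.getOrCreate()

# Roll the table back to a known-good version after a bad migration or write.
DeltaTable.forPath(spark, "/tmp/lake/orders").restoreToVersion(3)
```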

Automated File Sizing

File counts can quickly grow when dealing with large file systems such as those found in big data applications. Traditional data lakes built on Hadoop clusters cannot adjust file sizes to the volume of incoming data. The result is that the system creates many relatively small files, which adds unnecessary storage and metadata overhead.

Efficient data lakes should automatically adjust file sizes based on the volume of incoming data. Delta Lake, for instance, allows users to specify the file size of the target table or let the system adjust the size itself based on the workload and the overall size of the table. Larger tables warrant larger file sizes so that the system creates fewer files.
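
Small files can also be compacted on demand; below is a minimal sketch using Delta Lake's OPTIMIZE command (available from Delta Lake 2.0), with a hypothetical table path.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake 2.0 or later.
spark = SparkSession.builder.getOrCreate()

# Compact many small files into fewer, larger ones to reduce file-listing
# and metadata overhead on subsequent reads.
spark.sql("OPTIMIZE delta.`/tmp/lake/orders`")
```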

Managed Cleaning Service

Nargesian et al. (2019) point out the lack of efficient data cleaning mechanisms in most data lake architectures as a glaring weakness that can quickly turn a data lake into a data swamp. Since data lakes ingest data without a pre-defined schema, data discovery can become complex as the volume and types of data increase.

As such, data lake platforms like Snowflake impose certain constraints at the data ingestion stage to ensure that the data coming in is not erroneous or inconsistent, which can later result in inaccurate analysis.
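
Delta Lake offers an analogous safeguard through table constraints that reject bad rows at write time; the sketch below is illustrative only, assuming a Spark session with Delta Lake configured and a hypothetical orders table.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake; the table
# name and column are hypothetical.
spark = SparkSession.builder.getOrCreate()

# Writes that violate the constraint fail instead of polluting the table.
spark.sql("""
    ALTER TABLE orders
    ADD CONSTRAINT amount_non_negative CHECK (amount >= 0)
""")
```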

Index Management

Indexing tables can enable data lakes to speed up query execution, using indices rather than traversing the whole data set to deliver results.

Indexing is especially useful when applying filters in SQL queries as it simplifies the search. Metadata management can also play a role as it defines specific attributes of data tables for easy searchability.

However, a data lake like Snowflake does not use indexing since creating an index on vast data sets can be time-consuming. Instead, it computes specific statistics on the columns and rows of the tables and uses those for query execution.
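
Delta Lake takes a similar statistics-driven approach: rather than maintaining a traditional index, it can cluster data with Z-ordering so that file-level min/max statistics skip irrelevant files. A minimal sketch, with a hypothetical table path and column:

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake 2.0 or later.
spark = SparkSession.builder.getOrCreate()

# Co-locate rows with similar customer_id values so that file-level statistics
# let queries filtering on customer_id skip most files entirely.
spark.sql("OPTIMIZE delta.`/tmp/lake/orders` ZORDER BY (customer_id)")
```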

Managed Data Ingestion

The data ingestion feature in data lakes is sometimes not explicitly prioritized because data lakes work on the principle of "store now, analyze later." That is the beauty of storing data in its native format.

However, this can quickly become a bottleneck and turn a data lake into a swamp that is of no use for data analysis. Data lakes should therefore have some mechanism for providing early visualization of the data, giving users an idea of what it contains during the ingestion process.
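
One lightweight way to get that early visibility is to preview the inferred schema and a sample of newly landed files before committing them to the lake; the sketch below assumes a Spark session and a hypothetical landing path.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession; the landing path is hypothetical.
spark = SparkSession.builder.getOrCreate()

# Peek at freshly landed files: show the inferred schema and a small sample
# so users can see what the data contains before it is loaded into the lake.
incoming = spark.read.json("/tmp/landing/clickstream/")
incoming.printSchema()
incoming.limit(20).show(truncate=False)
```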

Support for Bulk Loading

Although not a must-have, bulk loading is beneficial when data needs to be loaded into the data lake occasionally in large volumes. Unlike loading data incrementally, a bulk load speeds up the process and improves performance.

However, higher speed is not always a good thing, since bulk loading may bypass the usual constraints responsible for ensuring that only clean data enters the lake.
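
For illustration, a bulk load with Spark typically reads an entire staged directory and appends it to the lake table in one batch rather than record by record; the paths below are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake; paths are hypothetical.
spark = SparkSession.builder.getOrCreate()

# Bulk load: ingest a whole directory of staged Parquet files in one batch
# append instead of loading records incrementally.
staged = spark.read.parquet("/tmp/staging/transactions/")
staged.write.format("delta").mode("append").save("/tmp/lake/transactions")
```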

Support for Concurrency

One of the problems with on-premises data architectures was that they could not provide high concurrency, which meant serving several users together was a hassle. Cloud platforms addressed this problem, but high concurrency was still an issue, given the restrictions of data warehouses.

Open-source platforms like Apache Spark, well-known for big data analysis, cannot support high concurrency on their own. Data lake solutions like Databricks are among the few that support high concurrency, although they do not score as well on latency, the time required to respond to user requests.

Support for Data Sharing

Data sharing has become the need of the hour with the ever-increasing pace of digitalization. With data being used for several use cases by various teams, seamless data sharing through a data catalog system is necessary for data-driven decision-making and preventing silos between business domains.

Data lakes should not only provide ways to share data across platforms seamlessly; they should also do so safely and securely, since weak access controls can quickly turn data security into an issue.
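
Delta Sharing is one open protocol built for exactly this kind of governed exchange; the sketch below uses its Python client, assuming a profile file supplied by the data provider, with hypothetical share, schema, and table names.

```python
import delta_sharing

# Assumes the delta-sharing client package and a profile file issued by the
# data provider; the share, schema, and table names are hypothetical.
profile = "/tmp/config.share"
table_url = profile + "#sales_share.retail.orders"

# Load the shared table into a pandas DataFrame -- access is governed by the
# provider, so no raw files are copied around or re-permissioned by hand.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```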

Data Partitioning

Abadi (2009) defines data partitioning as distributing data across multiple tables or sites to speed up query processing and simplify data management.

Cloud platforms like AWS recommend data partitioning for scalability and security, as partitioning prevents a single data source from occupying too much space and separates sensitive data from non-sensitive data.
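
A minimal sketch of a partitioned write with Spark and Delta Lake, assuming a configured Spark session; the column names and paths are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes an existing SparkSession configured with Delta Lake; paths and
# the partition column are hypothetical.
spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/tmp/staging/orders/")

# Partition by country so that queries filtering on a single country scan
# only that partition's files instead of the whole table.
(orders.write.format("delta")
       .partitionBy("country")
       .mode("overwrite")
       .save("/tmp/lake/orders_partitioned"))
```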

Data Security

Khine and Wang (2018) write that since data lakes depend on low-cost, open-source technologies and store both semi-structured and unstructured data, sensitive data may end up in the wrong hands within an organization.

Data lakes should therefore allow for centralized access control that is granular enough to restrict access at the row level, ensuring compliance with regulatory standards.

Data Analytics

Ravat and Zhao (2019) define a data lake as a big data analytics solution that ingests data in various formats and serves different users, like data scientists, for use cases like machine learning and business intelligence while ensuring data quality and security.

This definition makes it clear that one of the objectives of a data lake is to help users perform advanced analytics and build artificial intelligence systems that drive business competency.

Data Governance

Effective data governance is crucial for data lakes to store valuable data (Derakhshannia et al., 2019). Indeed, organizations need to build a data lake solution that provides an optimal ground between data access and data control.

A data lake architecture must have processes to maintain data quality and integrity as data sharing becomes the norm across several platforms. These processes become especially useful for cloud data lakes, where multiple users access different types of data simultaneously.

Does Your Ideal Data Lake Have These Features?

It won't be a surprise if your data lake solution does not have all of the above features. Most solutions offer only a subset, and it is largely the organization's needs that determine which features are essential.

However, integrate.io provides several tools that cover most of the mentioned features and help automate ETL processes for maintaining effective data pipelines.

Also, integrate.io uses the Parquet data file format, which significantly enhances your data lake's performance by speeding up data science and other analytical efforts.

So talk to one of our experts to gain maximum value from your data lake initiative.