As the volume of data generated from a highly diverse set of sources keeps growing, organizations are increasingly turning to solutions that help them manage data more efficiently and effectively. A robust data infrastructure is now key to an organization’s success, and timely, data-driven decision-making is what every management team strives for.
For many years, data warehouses and data lakes dominated the scene. A data warehouse is essentially a central repository of structured data, while a data lake is an all-purpose storage solution that can hold both structured and unstructured data. However, these centralized approaches have started to pose problems in terms of scalability and data quality.
This is exactly where the concept of a data mesh comes in. The term was coined by Zhamak Dehghani, director of emerging technologies at Thoughtworks, as an alternative to the monolithic infrastructures that most organizations currently use.
Table of Contents
What is a Data Mesh?
Understanding Data Mesh Intuitively
Major Components of Data Mesh Framework
Which Technologies are Required for Building a Robust Data Mesh Architecture?
How Data Mesh Differs From a Centralized Data Repository
Designing Your Own Data Mesh
Concluding Remarks
What is a Data Mesh?
A data mesh, or data mesh architecture, as defined by Zhamak Dehghani, is an approach to data management that makes it easy to share, access, and manage big data in an ever-expanding environment.
Usually, a data ecosystem is organized as a stand-alone data warehouse or data lake that is centrally managed by a data engineering team.
In contrast, a data mesh simplifies the flow from data production to data consumption. The idea is to have each domain team create, process, manage, and publish relevant data via a self-serve platform that can be used across business domains for various purposes.
The aim is to remove the common bottlenecks that organizations face with their data pipelines and data lifecycle.
A Brief Look At the Evolution of Data Architecture
During the late 1990s, the data ecosystem consisted entirely of a data warehouse where the need was to store structured data in relational databases.
With time, as data sources began to expand, organizations moved towards data lakes to store unstructured data in real time.
Moving forward, a hybrid system evolved that combined data lakes and data warehouses into a data lakehouse, where ETL teams would process the unstructured data in the data lake and load it into the data warehouse.
Later, with the advent of the cloud, storage capacities expanded, allowing organizations to ingest as much data as possible.
However, all of the above ecosystems were centrally managed, and this created problems with data integrity. The centralized structure was hard to sustain because the data management team had to deal with a large number of stakeholders. Delivering domain-oriented data was also problematic, since the central team lacked domain-level expertise.
Data ownership was yet another issue, as it was difficult to track down the data producer to resolve data-related issues.
This is why we now have the data mesh approach, which is a move toward data democratization. Such decentralization addresses the obstacles that result from a single monolithic architecture.
It also tackles the problem of teams working in silos, which occurs when each team simply reaches out to a central data engineering team without collaborating with others. Such a disconnect hurts agility and prevents scalability.
Understanding Data Mesh Intuitively
A distributed data mesh is essentially the data counterpart of a microservices architecture, which software developers have long used to speed up the delivery and integration of features through APIs.
Likewise, with a data mesh, each domain team is responsible for managing its own data. Since each domain team can have its own unique use cases, the data mesh helps by ensuring that each team has access to the relevant analytical data to perform tailored analysis.
One way of understanding this is to imagine a large restaurant with various chefs specializing in certain cuisines where each chef requires a certain set of ingredients.
Now, instead of having one person responsible for buying all of the ingredients, each chef can buy their own ingredients and share them with others if needed. With each chef responsible for their own ingredients, the process becomes highly scalable and adaptable.
Imagine a rush hour in which different customers are demanding different types of dishes. If every chef had to go through a single ingredients buyer, it would undoubtedly result in a huge mess.
Quite similarly, a data mesh makes each domain team responsible for its own data, and doing so ensures that data quality is maintained.
Major Components of Data Mesh Framework
The four principles of an effective data mesh are data ownership by the domain team, data products, self-serve data infrastructure, and federated data governance. But before we can go into the details, a word is warranted on the fundamental objectives of a data mesh.
A data mesh needs to achieve high interoperability. As in the restaurant analogy, interoperability simply means that one team’s data products, or ingredients, can be used by others without any hassle.
Secondly, data products should be easy to discover, so that any team can find the relevant data in time.
Security is also key. With democratization, it becomes harder to ensure that data is secure, so strict guidelines need to be in place to prevent security breaches.
The goal of a central data repository was to ensure a high quantity of data. With a data mesh, high quality takes center stage.
Data as a product
The data-as-a-product principle is more of a mindset that comes with a distributed data mesh rather than a technicality. It is this product thinking that helps with interoperability across different domains.
In a data mesh, each domain team creates its own data product that can be used by others. For example, the sales team can create a clean, well-documented sales data set that a data scientist can later use for a machine learning model.
Federated Governance
Since a data mesh is a decentralized system, it is crucial that certain standards are followed to ensure consistency across data products and prevent data duplication. So a data mesh implementation usually involves a central team that outlines certain practices that need to be followed by all domain teams when publishing their data products.
For instance, these practices can specify the file formats and naming conventions that need to be used when creating a data set.
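As a rough illustration, a standard of this kind could be turned into a small automated check that each domain team runs before publishing. The sketch below is hypothetical: the naming pattern, the allowed formats, and the function name check_publication are assumptions made for this example, not part of any data mesh specification.

```python
import re
from typing import List

# Hypothetical organization-wide standards agreed on by the governance team.
ALLOWED_FORMATS = {"parquet", "csv"}
# Assumed convention: <domain>.<dataset>_v<version>, lowercase with underscores.
NAME_PATTERN = re.compile(r"^[a-z]+\.[a-z][a-z0-9_]*_v[0-9]+$")


def check_publication(name: str, file_format: str) -> List[str]:
    """Return a list of governance violations (empty if compliant)."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(f"name '{name}' does not follow <domain>.<dataset>_v<N>")
    if file_format.lower() not in ALLOWED_FORMATS:
        problems.append(f"format '{file_format}' is not one of {sorted(ALLOWED_FORMATS)}")
    return problems


if __name__ == "__main__":
    print(check_publication("sales.orders_v1", "parquet"))
    print(check_publication("SalesOrders", "xlsx"))
```

Returning a list of violations rather than raising on the first one lets a team see every issue with a data product in a single run.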
Self-service Infrastructure
The self-serve practice of a data mesh spares each team from constantly going to a central data platform to fetch raw data. For instance, the finance team might simply use a financial data set that a data scientist has already created.
The data set should, however, come with clear metadata that tells the finance team what each column means, along with other information such as the date of creation and intended usage.
The self-serve system makes it easy for various data consumers to access relevant analytical data and get the most value out of the organization’s data ecosystem.
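To make the metadata requirement more concrete, here is a minimal sketch of a metadata record that could travel with a data set. The class name DatasetMetadata, its fields, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record published alongside a data set."""
    name: str
    owner_team: str
    created_on: date
    columns: Dict[str, str] = field(default_factory=dict)  # column name -> meaning

    def describe(self, column: str) -> str:
        """Explain a column, or say clearly that it is undocumented."""
        return self.columns.get(column, f"'{column}' is not documented")


# Illustrative values only; not taken from any real data set.
meta = DatasetMetadata(
    name="quarterly_revenue",
    owner_team="data-science",
    created_on=date(2023, 1, 15),
    columns={
        "rev_usd": "Recognized revenue in US dollars",
        "qtr": "Fiscal quarter, e.g. 2023Q1",
    },
)
print(meta.describe("rev_usd"))
```

A consumer such as the finance team can then look up any column without contacting the producing team, and an undocumented column is reported explicitly rather than silently ignored.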
Business Domain Ownership
With a domain-driven approach, a culture of ownership prevails. Indeed, one of the bottlenecks of a central data lake was exactly that it was difficult to identify domain owners of a certain dataset.
With a data mesh, the data products created by the domain teams make it easy for everyone to identify the domain owners. Communication between teams becomes far more streamlined, since the product owner can be approached directly.
Each team bears the responsibility of keeping its own data pipeline operational, which improves the quality of data assets across the entire organization.
Which Technologies are Required for Building a Robust Data Mesh Architecture?
Below is a list of technologies and practices that are acting as catalysts for the adoption of the data mesh.
DataOps Culture
DataOps is an emerging practice borrowed from the domain of DevOps in software engineering. DataOps mostly involves automation of the data pipelines so as to speed up the data delivery workflow.
So instead of having a large ETL team manage data processes manually, with DataOps, each domain team can deploy various tools that help automate ETL jobs and speed up the integration process so as to ensure timely delivery and high-quality data.
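The idea of an automated ETL job with a built-in quality check can be sketched in a few lines. This is a toy example under stated assumptions: the function names, the sample rows, and the specific checks are invented for illustration, and a real DataOps setup would typically run such steps through a scheduler or CI system.

```python
from typing import Dict, List

Row = Dict[str, object]


def extract() -> List[Row]:
    # Stand-in for reading from a source system; these rows are made up.
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": None},
        {"order_id": 3, "amount": "5.00"},
    ]


def transform(rows: List[Row]) -> List[Row]:
    # Drop rows with a missing amount and convert the rest to floats.
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] is not None
    ]


def quality_gate(rows: List[Row]) -> None:
    # Fail the run early instead of publishing bad data.
    if not rows:
        raise ValueError("no rows survived transformation")
    if any(r["amount"] < 0 for r in rows):
        raise ValueError("negative amount found")


def run_pipeline() -> List[Row]:
    rows = transform(extract())
    quality_gate(rows)
    return rows  # a real job would load these into the published data set


if __name__ == "__main__":
    print(run_pipeline())
```

Because the quality gate raises an exception, an automated run stops before bad data reaches consumers, which is the kind of manual check that DataOps aims to replace.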
Cloud Platforms
With the ever-increasing presence of cloud platforms, organizations today need not worry about having physical on-premises servers to store their data.
Rather, services like AWS or Google Cloud can help teams migrate data onto the cloud and transfer the responsibility of maintenance and integration to the cloud solutions vendor.
Data Catalog Tools
Data cataloging is one of the most crucial elements of a data mesh implementation. A data catalog is similar to a catalog that you may find in a library to get information on a certain book.
Data cataloging now comes built into data management platforms, for example Data Catalog on Google Cloud or Lake Formation on AWS.
A data catalog can help different teams understand domain data. For instance, the sales team can create a data set for customers and also give information about the columns, schemas, the date of creation, etc. Such metadata can help a data scientist, for example, extract more insights when building a prediction model.
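The core behavior of a catalog, registering data sets with their metadata and letting others search them, can be sketched as a toy in-memory class. The class DataCatalog and its methods are hypothetical and do not mirror the API of any particular catalog product.

```python
from typing import Dict, List


class DataCatalog:
    """Toy in-memory catalog; real catalogs persist and index this metadata."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}

    def register(self, name: str, owner: str, columns: Dict[str, str]) -> None:
        """Record a data set with its owning team and column descriptions."""
        self._entries[name] = {"owner": owner, "columns": columns}

    def search(self, keyword: str) -> List[str]:
        """Return data set names whose name, columns, or descriptions mention keyword."""
        kw = keyword.lower()
        hits = []
        for name, entry in self._entries.items():
            cols = entry["columns"]
            text = " ".join([name, *cols.keys(), *cols.values()])
            if kw in text.lower():
                hits.append(name)
        return sorted(hits)

    def owner_of(self, name: str) -> str:
        return self._entries[name]["owner"]


catalog = DataCatalog()
catalog.register(
    "sales.customers",
    "sales",
    {"customer_id": "Unique customer key", "signup_date": "Date the account was created"},
)
catalog.register(
    "finance.invoices",
    "finance",
    {"invoice_id": "Unique invoice key", "customer_id": "Customer billed"},
)
print(catalog.search("signup"))
print(catalog.owner_of("sales.customers"))
```

Storing the owner next to the column descriptions means a search result also tells the consumer which domain team to contact.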
Data Marketplace
Organizations are relying more and more on external general-purpose data sets that can be integrated with internal data sources to give a deeper understanding of a certain problem. A data marketplace facilitates this, since it is essentially an online store where different data sets can be purchased.
Data Virtualization
With a data virtualization platform, an organization can expose all its data sources through one place, and the different teams can connect to the relevant sources directly for different types of analysis.
With data virtualization, instead of having to replicate a data source from scratch, one can simply create a view and perform the required analysis as needed.
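The difference between a view and a copy can be shown with Python's built-in sqlite3 module. This is only an analogy: a single SQLite database stands in for a virtualization platform, which would normally federate queries across several separate sources. The table, view name, and rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 30.0)],
)

# The view stores only the query, not a second copy of the rows.
conn.execute(
    "CREATE VIEW revenue_by_region AS "
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region"
)

# A row added to the source after the view was created...
conn.execute("INSERT INTO orders VALUES (4, 'US', 20.0)")

# ...is reflected the next time the view is queried.
rows = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 150.0), ('US', 100.0)]
```

Since the view is evaluated at query time, consumers always see the current state of the source without maintaining a replicated data set.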
How Data Mesh Differs From a Centralized Data Repository
At this point, it should be reasonably clear what a data mesh is and how it differs from a central data repository.
Basically, a data mesh is a move towards a more democratized system where each domain team manages its own data, whereas a centralized repository is managed by a single data engineering team that handles all the access and delivery issues across the entire organization.
Limitations of Centralized Data Repository
As mentioned in the introductory section, a number of problems arise with a centralized system. Firstly, since a central data engineering team is managing access, it can lead to long delays if more and more access requests start coming from the domain teams.
Secondly, in a centralized system, pre-processing a data source is the responsibility of the engineering team. However, this requires a lot of domain-specific knowledge which a single team may not have.
Lastly, no one really knows who the actual data producer is in a centralized system. It is just one single team preparing and delivering a data set.
However, this does not mean that a data mesh is the perfect solution.
Limitations of Data Mesh
A data mesh is only as successful as each domain team is dedicated. Since a data mesh transfers the onus of data management to those who understand their data well, a careless attitude among these domain experts can put the whole system in jeopardy.
A data mesh is not just a tool. Rather, it is an approach that involves certain best practices. If they are not followed properly, the mesh can fail to meet expectations.
Designing Your Own Data Mesh
Implementing a data mesh can be a long process. To begin, an organization can start by promoting the idea of data as a product among some forward-thinking members of the company. These members can form a team to identify the data requirements of each domain team.
One of the domain teams can be selected to work on a data product. At the same time, the existing infrastructure can be modified or a new one built to support this.
Once this is built, the success can be shared with other teams as well. Gradually, as the practice expands, a federated governance system can be developed to ensure that the data mesh is self-governing and sustainable.
Concluding Remarks
A data mesh may well be the future of data management. Organizations, however, need to review the architectures they have in place and then consider whether a change is really needed.
A data mesh isn’t for everyone. If your organization is not that big, or if you are not facing any problems with the central data team, then perhaps you are better off with a centralized system. However, if you are experiencing rapid expansion and having to deal with many different data sources coming in, then a data mesh might be the way to go.