In today's digital business world, data is of utmost importance for any company. To compete, companies must be able to process large amounts of data quickly and effectively. It is crucial to create a data capture and storage architecture that meets constantly changing internal and external requirements. However, traditional centralized data architectures are increasingly reaching their limits as enterprise processes and data become more complex and global.
The Data Mesh approach offers an innovative, decentralized data architecture. It enables domains to offer high-quality data products under their own responsibility and thus significantly reduce the time until concrete, usable insights are gained from collected data (time to insight).
In this article, we introduce the Data Mesh approach and demonstrate how it can help industrial companies gain a competitive advantage by making more agile and faster decisions.
Why Conventional Centralized Data Architectures are Becoming Increasingly Ineffective
IT landscapes in industrial companies have mostly grown historically and consist of a multitude of independent applications, including a number of legacy systems. The latter are often proprietary systems that provide no interfaces or only insufficient ones. As a result, enterprise data is distributed across isolated data stores of various technologies and comes in heterogeneous forms (such as relational databases, Excel files, images, etc.). However, making informed data-based decisions requires a holistic overview and correlated analyses of the data.
One approach to this problem is to implement a centralized data architecture, for example a data warehouse or a data lake. Here, (large amounts of) data from various sources are extracted, merged, and stored in a central location. This creates a consolidated and reliable data foundation for analytics tasks (single source of truth). Experience with implementing and operating centralized data architectures has shown that these advantages come with a variety of challenges:
The data in the source systems is available in various formats and must be transformed into a uniform data model.
The data is not harmonized. For example, redundant or conflicting information may exist, or the same information may have different semantics depending on the business context. Consolidation, cleaning, and preparation of the data are therefore absolutely necessary.
The required data needs to be moved (regularly) from the source system to the target system. This process can be time-consuming and can lengthen the time to insight.
In addition to the technical aspects, it is necessary to examine which data can actually be moved and made available for analytics tasks. This may depend on regulatory, data protection, or security-related aspects, for example.
The technical implementation is carried out within the framework of ETL/ELT processes. A central data engineering team implements pipelines that (continuously) extract, transform, and store data from the source systems in the target system. This also brings various challenges:
The setup of data pipelines is often complex and prone to errors. On the one hand, this can lead to missing or incorrect data; on the other hand, it can delay the availability of required data if lengthy technical troubleshooting of the pipelines is necessary.
The data is processed and provided by IT employees who usually have little domain knowledge about the data and therefore cannot assess well which information is relevant for future data consumers.
Changes or additions to the data must always be carried out by the central IT/data team. This team becomes a bottleneck and delays the availability of relevant business information.
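To make the harmonization effort concrete, the following is a minimal sketch of the transform step in such a central pipeline. The two source layouts, the field names, and the duplicate rule are assumptions chosen for illustration; real source systems will differ.

```python
from datetime import date, datetime

# Hypothetical raw records as two source systems might deliver them.
ERP_ROWS = [{"cust_no": "C-001", "order_date": "2023-03-01", "qty": "12"}]
EXCEL_ROWS = [{"Customer": "c-001 ", "Date": "01.03.2023", "Quantity": 12.0}]


def from_erp(row: dict) -> dict:
    """Map an ERP row to the uniform model (ISO dates, numeric strings)."""
    return {
        "customer_id": row["cust_no"].strip().upper(),
        "order_date": date.fromisoformat(row["order_date"]),
        "quantity": int(row["qty"]),
    }


def from_excel(row: dict) -> dict:
    """Map an Excel row to the uniform model (German dates, float counts)."""
    return {
        "customer_id": row["Customer"].strip().upper(),
        "order_date": datetime.strptime(row["Date"], "%d.%m.%Y").date(),
        "quantity": int(row["Quantity"]),
    }


def harmonize(erp_rows: list, excel_rows: list) -> list:
    """Transform both sources and drop records that turn out to be duplicates."""
    unified = [from_erp(r) for r in erp_rows] + [from_excel(r) for r in excel_rows]
    seen = set()
    result = []
    for record in unified:
        key = (record["customer_id"], record["order_date"], record["quantity"])
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result
```

Even in this small sketch, the central team has to know that "c-001 " and "C-001" denote the same customer and that both rows describe the same order. This is exactly the kind of domain knowledge that is often missing there.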
Efficient Use of Data Through Data Mesh
Data Mesh is a socio-technical approach aimed at making data processing and utilization more efficient in a company by creating a decentralized data architecture. It is based on the idea that the domains that generate and own the data know best how their data should be used and managed. Data Mesh is based on the following four principles:
Domain Ownership
The domains are the responsible data owners. They create data for their own use and make it available to other units.
Data as a Product
The domains provide data in the form of data products. In addition to the actual data, this includes all components that are necessary for creation and provision, such as data transformations or the interface for use. A data product must comply with the enterprise-wide agreed quality standards. The format of the data is oriented towards the needs of the data consumers.
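As a rough sketch, a data product can be thought of as one object that bundles the data access, the transformation, the quality checks, and the consumer interface. The class below is hypothetical and not taken from any Data Mesh framework; the field names and the 400-hour rule in the example are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

Row = dict  # one record of a dataset


@dataclass
class DataProduct:
    """Hypothetical bundle of what a domain ships as one data product."""
    name: str
    owner_domain: str
    load_raw: Callable[[], List[Row]]            # access to the domain's source data
    transform: Callable[[List[Row]], List[Row]]  # domain-specific preparation
    quality_checks: List[Callable[[Row], bool]]  # stand-ins for the agreed standards

    def read(self) -> List[Row]:
        """Consumer interface: serve the prepared rows, or fail if any row breaks a check."""
        rows = self.transform(self.load_raw())
        failed = [r for r in rows if not all(check(r) for check in self.quality_checks)]
        if failed:
            raise ValueError(f"{self.name}: {len(failed)} row(s) violate quality standards")
        return rows


# Example: a domain exposes a consumer-friendly view of its raw counters.
def load_counters() -> List[Row]:
    return [{"machine": "M1", "hours_since_service": 480}]


def to_consumer_format(rows: List[Row]) -> List[Row]:
    # Assumed rule: a machine is due for service after 400 operating hours.
    return [
        {"machine": r["machine"], "service_due": r["hours_since_service"] >= 400}
        for r in rows
    ]


maintenance_product = DataProduct(
    name="planned_maintenance_intervals",
    owner_domain="Machine",
    load_raw=load_counters,
    transform=to_consumer_format,
    quality_checks=[lambda r: bool(r["machine"])],
)
```

Consumers only call read() and never see the raw counters, which reflects the idea that the format is oriented towards their needs rather than towards the source system.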
Self-Service Data Platform
Completed data products are made available via a self-service platform. Through a data catalog, consumers can identify relevant data and use its metadata to learn about its characteristics and possible uses. This gives other departments a quick and uncomplicated way to integrate existing data into their own analyses, or to combine several data products into a higher-value data product and offer it to other domains in turn.
Federated Governance
The governance structure is decentralized and federated, which distributes the responsibility for data management to the domains, while ensuring consistent standards and policies.
What challenges Data Mesh can solve
Data Mesh solves several problems that can typically arise with traditional centralized data architectures. These include:
Scalability
Data Mesh enables domains to create and deliver their own data products in smaller, more modular pieces. This allows them to respond more quickly to changing requirements and needs, and to scale their data products more easily.
Data quality and consistency
Domains that create and manage their own data products are better able to ensure that their data is consistent and of high quality. They have a better understanding of how their data is generated and used and can ensure that it complies with the requirements and standards of the enterprise.
Flexibility and innovation
Data Mesh enables domains to work faster and with more agility without relying on other departments or a central IT team. This allows them to create new data products more quickly and develop more innovative solutions.
Efficient IT teams
By decentralizing data production and management, domains can work more independently from IT. IT teams can focus on technological tasks and utilize their resources more efficiently.
Conditions for successful data management
To make such an architecture work effectively, some conditions for the data must be met:
Creation of a data catalog: Necessary metadata must be available for each dataset so that data can be quickly found.
Assignment of a unique address to each dataset to enable programmatic access.
Verification and assurance that data is valid and up-to-date.
Description of the semantics and syntax of data to create easily usable datasets.
Establishment of guidelines and standards for efficient data integration in different domains.
Ensuring secure access to the data.
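Several of these conditions can be expressed directly as metadata in a catalog entry. The sketch below is a hypothetical in-memory catalog, not a real catalog product; the "mesh://" address scheme, the field names, and the domain-based access rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class CatalogEntry:
    address: str                     # unique identifier for programmatic access
    description: str                 # semantics: what the data means
    schema: Dict[str, str]           # syntax: column name -> type name
    owner_domain: str
    last_updated: date
    max_age_days: int                # freshness the owner commits to
    allowed_domains: FrozenSet[str]  # which domains may read the data

    def is_current(self, today: date) -> bool:
        """True while the dataset is within its promised freshness window."""
        return today - self.last_updated <= timedelta(days=self.max_age_days)


class DataCatalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        """Accept an entry only if its address is unique and its metadata is complete."""
        if entry.address in self._entries:
            raise ValueError(f"address already in use: {entry.address}")
        if not entry.description or not entry.schema:
            raise ValueError("description and schema are mandatory metadata")
        self._entries[entry.address] = entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        """Find entries whose description mentions the keyword."""
        needle = keyword.lower()
        return [e for e in self._entries.values() if needle in e.description.lower()]

    def resolve(self, address: str, requesting_domain: str) -> CatalogEntry:
        """Look up an entry by address, enforcing the access rule."""
        entry = self._entries[address]
        if requesting_domain not in entry.allowed_domains:
            raise PermissionError(f"{requesting_domain} may not read {address}")
        return entry


catalog = DataCatalog()
catalog.register(CatalogEntry(
    address="mesh://machine/planned-maintenance-intervals",  # hypothetical scheme
    description="Planned maintenance intervals per production machine",
    schema={"machine_id": "str", "maintenance_day": "date"},
    owner_domain="Machine",
    last_updated=date(2023, 6, 1),
    max_age_days=7,
    allowed_domains=frozenset({"Customer Service"}),
))
```

A consumer would typically call search() to discover candidates, check is_current() before relying on the data, and then use resolve() with the address in code.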
What could an example in industry look like?
Data analytics enables companies to make evidence-based decisions, such as identifying customers with a high risk of churn and taking countermeasures. The challenge is that well-founded decisions require a holistic view of the data. For example, a customer is unlikely to switch suppliers solely because of occasionally defective parts that need to be exchanged. In combination with delivery delays due to (predictable) maintenance intervals of production machines, however, this risk could increase.
However, the required information is usually distributed across many different applications, and consequently data sources, owned by different domains. It is often not transparent which data from other areas of the company is even available.
The following example outlines such a scenario. The goal of the "Customer Service" department is to use data analysis to identify dissatisfied customers and proactively take countermeasures to maintain customer loyalty (actionable insight).
To get a complete picture of the situation, information from various areas of the company is useful. In the (shortened) example, data from production (Machine domain) and quality assurance (Quality Control domain) are to be used.
Domain Machine
In the field of manufacturing, various types of data are generated. In our case, information about production volume and sensor data about machine condition will be used. Since the department knows its data well, it is aware that this raw data is difficult for other departments to understand and use. However, information about necessary maintenance can provide valuable insights into production interruptions. Therefore, a data product "planned maintenance intervals" should be provided, which data consumers can use for higher-level analysis.
To do this, the raw data is extracted from the source systems and a dataset on (planned) maintenance measures is created in a transformation step. This transformation can be done using conventional processing methods, but the use of modern AI methods (predictive maintenance) is also conceivable.
The finished data product is made available throughout the company via standardized interfaces.
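A conventional, rule-based version of this transformation could look like the sketch below. The log layout and the fixed service interval are assumptions; a real implementation would use the manufacturer's maintenance rules or a predictive model instead.

```python
from datetime import date

# Hypothetical raw sensor log: (machine_id, day, operating_hours_that_day)
SENSOR_LOG = [
    ("M1", date(2023, 5, 1), 20.0),
    ("M1", date(2023, 5, 2), 22.0),
    ("M1", date(2023, 5, 3), 21.0),
    ("M2", date(2023, 5, 1), 8.0),
]

SERVICE_INTERVAL_HOURS = 40.0  # assumed rule: service after every 40 operating hours


def planned_maintenance(log, interval_hours=SERVICE_INTERVAL_HOURS):
    """Derive one record per day on which a machine reaches its service limit."""
    hours_since_service = {}
    product = []
    # Process each machine's readings in chronological order.
    for machine_id, day, hours in sorted(log, key=lambda r: (r[0], r[1])):
        total = hours_since_service.get(machine_id, 0.0) + hours
        if total >= interval_hours:
            product.append({"machine_id": machine_id, "maintenance_day": day})
            total = 0.0  # the counter restarts after the service
        hours_since_service[machine_id] = total
    return product
```

The consumers receive only machine IDs and maintenance days. The raw sensor values, which are hard to interpret outside the domain, stay with the Machine domain.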
Domain Quality Control
Similar to the situation in the manufacturing area, the quality control department also has different types of information available. In the example, these are product defects registered in a relational database and records of parts exchanged due to quality deficiencies, which are stored in Excel reports. In the transformation step, these two data sources are correlated based on customer information, and the results are provided as a new dataset called "Product Quality." This data product is also made accessible to other departments through an interface.
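The correlation step can be sketched as follows. An in-memory SQLite database stands in for the relational defect database, and a plain list stands in for rows already read from the Excel reports; the table and column names are assumptions.

```python
import sqlite3
from collections import Counter

# Stand-in for the relational defect database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE defects (customer_id TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO defects VALUES (?, ?)",
    [("C-001", "valve"), ("C-001", "pump"), ("C-002", "valve")],
)

# Stand-in for rows read from the Excel exchange reports (hypothetical columns).
EXCHANGE_ROWS = [
    {"customer_id": "C-001", "part": "valve"},
    {"customer_id": "C-003", "part": "seal"},
]


def product_quality(db, exchange_rows):
    """Correlate both sources by customer and return one row per customer."""
    defects = dict(
        db.execute("SELECT customer_id, COUNT(*) FROM defects GROUP BY customer_id")
    )
    exchanges = Counter(row["customer_id"] for row in exchange_rows)
    # Keep customers that appear in either source.
    customers = sorted(set(defects) | set(exchanges))
    return [
        {
            "customer_id": c,
            "defects": defects.get(c, 0),
            "exchanged_parts": exchanges.get(c, 0),
        }
        for c in customers
    ]
```

Customers that appear in only one source are kept with a count of zero for the other, so consumers do not silently lose records.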
Working with the Data Mesh
The availability of high-quality, curated data products is in itself already an added value for the company. However, the full potential is only realized by linking several data products: a Data Mesh is formed.
In our example, the Customer Service department wants to identify customers at risk of churn. Analysts from the department can find the two described data products through a data catalog and use the metadata to assess their suitability for their own use case.
The datasets can be used easily through the interfaces offered. It is irrelevant whether this is done through BI tools, source code, or in any other way.
In addition to utilizing the results for their own application, the Customer Service department can also provide the new dataset as its own data product to other departments in the company.
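When the data products are consumed from source code, the combination could look like the following sketch. The rows, the order data owned by Customer Service, and the scoring rule with its weights and threshold are purely illustrative assumptions, not a validated churn model.

```python
from datetime import date

# Rows as the two data products might return them (hypothetical fields).
PRODUCT_QUALITY = [
    {"customer_id": "C-001", "exchanged_parts": 2},
    {"customer_id": "C-002", "exchanged_parts": 1},
]
PLANNED_MAINTENANCE = [
    {"machine_id": "M1", "maintenance_day": date(2023, 6, 12)},
]
# Customer Service's own data: open orders and the machine scheduled for them.
OPEN_ORDERS = [
    {"customer_id": "C-001", "machine_id": "M1", "due": date(2023, 6, 12)},
    {"customer_id": "C-002", "machine_id": "M2", "due": date(2023, 6, 12)},
]


def churn_risk(quality, maintenance, orders, threshold=3):
    """Score each customer listed in the quality product (assumed weights)."""
    blocked = {(m["machine_id"], m["maintenance_day"]) for m in maintenance}
    # Count orders whose due date collides with maintenance on their machine.
    delays = {}
    for order in orders:
        if (order["machine_id"], order["due"]) in blocked:
            cid = order["customer_id"]
            delays[cid] = delays.get(cid, 0) + 1
    result = []
    for row in quality:
        cid = row["customer_id"]
        score = row["exchanged_parts"] + 2 * delays.get(cid, 0)
        result.append({"customer_id": cid, "score": score, "at_risk": score >= threshold})
    return result
```

In this toy data, customer C-001 stays below the threshold on quality issues alone and is only flagged once the expected delivery delay is taken into account. The returned list could itself be published as a new data product.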
Added value of the Data Mesh
The outlined Data Mesh approach offers the following advantages, among others:
The creation and management of data products are carried out by the domains themselves. They can accurately assess which information is valuable and directly consider compliance issues. This ensures the quality of the data for the users in Customer Service.
Questions regarding the data or change requests can be clarified directly with the responsible domain without the need for detours through central IT.
The raw data is processed at the point of origin. A complex transfer to a central location is no longer necessary. Current data is available more quickly.
By searching the central data catalog, the datasets can be easily found and identified as useful. The data can be used directly without first having to submit requests to the IT department. The time to insight is shortened.
What happens next?
In this article, we introduced the Data Mesh approach and highlighted the benefits that can be achieved compared to centralized data architectures. The example scenario presented illustrates the basic architecture of a Data Mesh and shows how existing datasets can be easily found and effectively combined to create new, creative solutions.
The implementation of a Data Mesh can be done using different methods and technologies. We would be happy to advise you on your path to a modern data architecture and support you in its implementation.