An End-to-End Guide to Data Mesh

Data Platforms:

Paweł Mitruś

Have you heard about Data Mesh? It’s the new catchphrase in data platform architecture, and many are still struggling to understand it. Our first impression was that it is so revolutionary that only greenfield projects (or entire organizations) can adopt it.

However, we have to remember that the guidance so far isn’t as clear and straightforward as, for instance, Kimball’s Data Warehouse design. This leaves a lot of room for interpretation in following the direction set by Data Mesh, and for different methods of implementation in line with data architecture principles.

In this article, we will demonstrate the practical approach that we learnt through several Data Mesh implementations. The aim is to help you understand whether this is the right enterprise data architecture to adopt. And because this framework is quite new, we’ll also explain what it is all about.

Another Data Processing Platform

Let’s start with the basics. What is data processing? The goal is fairly obvious: to collect data and translate it into usable information. The tricky part is choosing the best approach to achieve this goal.

That’s where we come in. Lingaro has helped customers worldwide and successfully delivered numerous big data platforms, ranging from traditional batch-processed Data Warehouses to Data Lakes with Lambda Architecture. Let’s face it: the offerings of the major Cloud providers enable you to create any data warehouse architecture you find suitable for processing data according to any given business case. But the key to success is always the same: finding the architectural drivers that really matter for choosing the right services and planning the implementation of your data management platform.

Fortunately, the majority of companies are aware of this and gather the most relevant information up front, including:

What are the types and the number of data sources?
What is the entire data volume, and how large are the increments?
How often should new increments be available for use, and how fast do they need to be processed?
How many different users will consume the data, what are their specific needs, and how many additional users are anticipated to start using the platform over the next years?
What are the future platform needs, for example handling 10 new sources and N times more intense loads?
Any relevant data lake architecture considerations
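The questions above can be captured as a small, explicit record, which makes it easier to compare today’s load with the expected one. The following is only a minimal sketch: the class name, field names, and all numbers are hypothetical and are not taken from any real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformDrivers:
    """Up-front answers to the questions above (all names are illustrative)."""
    source_count: int          # number of data sources
    total_volume_gb: float     # entire data volume today
    increment_gb: float        # size of a single increment
    increments_per_day: int    # how often new increments arrive
    current_users: int         # users consuming the data today
    expected_new_users: int    # users anticipated over the next years

    def daily_ingest_gb(self) -> float:
        """Volume that has to be processed per day."""
        return self.increment_gb * self.increments_per_day

    def projected(self, new_sources: int, load_factor: float) -> "PlatformDrivers":
        """Drivers after onboarding new sources and an N times more intense load."""
        return PlatformDrivers(
            source_count=self.source_count + new_sources,
            total_volume_gb=self.total_volume_gb,
            increment_gb=self.increment_gb * load_factor,
            increments_per_day=self.increments_per_day,
            current_users=self.current_users + self.expected_new_users,
            expected_new_users=0,
        )


# Hypothetical numbers, for illustration only.
today = PlatformDrivers(
    source_count=20, total_volume_gb=5_000.0, increment_gb=50.0,
    increments_per_day=4, current_users=100, expected_new_users=50,
)
future = today.projected(new_sources=10, load_factor=3.0)
print(today.daily_ingest_gb(), future.daily_ingest_gb())
```

Writing the drivers down this way forces the “10 new sources and N times more load” scenario to be stated as concrete numbers before any service is chosen.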

We’ll probably end up with something resembling the data architecture diagram below. (Kindly forgive the simplification; we cannot go into more detail in this overview).

The most important takeaways from this data architecture example are:

The Data Platform is monolithic: not necessarily physically, but from the development perspective it fits within a single team.
The team that owns the Data Platform needs broad experience in Software Development and Data Engineering and, most importantly, business knowledge and experience with the data sources used. The team’s capacity and skills are often a bottleneck.
Consumers have to work with the Data Platform team (literally with the team) on the data requirements. The Data Platform team then performs the ETL and finally publishes the data to Consumers.
Ownership isn’t clear: data flows from Sources to Consumers and is transformed by different teams.

The good news is that such an approach can perfectly handle Data Analytics Platforms for 95% of companies, both in terms of cost and performance. But there is still that 5% …

An Outstanding Data Processing Platform

Sometimes the requirements for the platform involve more abstract matters:

Potentially scaling up the infrastructure beyond the capacity of currently available components, even in the Cloud. Multi-tenant data architecture may be suitable for new solutions, but this may need to be changed depending on future requirements.
At the moment of gathering requirements there are clear and specific business cases, but there will definitely be additional cases to be onboarded in the future.
Multiple data processing and publishing teams, both internal and from 3rd-party analytics companies operating under a data processing agreement, will work with the platform to integrate data. They will use diverse technologies and differ in how technically advanced they are, and they are expected to work independently.

These requirements need to be addressed before evaluating the traditional points such as the number and type of sources, users, and other elements from the first example. Moreover, companies typically know upfront that these 3 additional factors are their real challenge. For these cases it is best to consider the Data Mesh implementation approach while sticking to data architecture best practices.
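The three factors above can be read as a simple checklist. The sketch below is only an illustration: it assumes that Data Mesh is worth considering when all three factors apply, and that threshold is our simplification for the example rather than a hard rule. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformChallenges:
    """The three 'abstract' requirements (field names are illustrative)."""
    scaling_beyond_current_components: bool
    new_business_cases_expected: bool
    independent_teams_with_diverse_tech: bool


def suggest_architecture(challenges: PlatformChallenges) -> str:
    """Assumption: consider Data Mesh only when all three factors are present."""
    if (
        challenges.scaling_beyond_current_components
        and challenges.new_business_cases_expected
        and challenges.independent_teams_with_diverse_tech
    ):
        return "data mesh"
    return "monolithic data platform"


typical = PlatformChallenges(False, True, False)
demanding = PlatformChallenges(True, True, True)
print(suggest_architecture(typical))    # monolithic data platform
print(suggest_architecture(demanding))  # data mesh
```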

It is worth noting that there isn’t much here about the business or domain knowledge of the platform team. That is intentional, as this knowledge will come with the different teams integrating with the platform. The knowledge is simply too broad to be handled by a single team, even if it consists of world-class experts.

Data Mesh in a Nutshell

Before we move on to further considerations, let’s understand the aim of the given approach:

The Data Platform design has to solve all 3 requirements from the section above (a vague “sure, this platform will handle more sources” is not enough here).
Different business domains are the Mesh components, and their ultimate goal is to produce and share data with the organization. For instance, any ETL they perform is considered their internal implementation and is owned by the domain, not by the Platform Team.
The role of the Data Platform is different, because it is the Domain Team that owns the data transformation and publishing. The Domain Team is to:
- Build ETL processes and overcome any technical obstacles on the way to data publication. They are the experts on their data and know how to design the logical transformations. They just need support from the data engineering perspective.
- Make the refined data products accessible through different standard protocols (stream or batch), so they can be consumed throughout the organization.
- Make the products browsable and discoverable, so any potential data consumer knows what they are dealing with.
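To make “browsable and discoverable” more concrete, the sketch below shows one possible shape of a data product descriptor and a searchable catalog. It is a minimal in-memory illustration under our own assumptions: the names DataProduct, Catalog, and AccessProtocol are hypothetical and do not refer to any specific tool.

```python
from dataclasses import dataclass
from enum import Enum


class AccessProtocol(Enum):
    BATCH = "batch"
    STREAM = "stream"


@dataclass
class DataProduct:
    """Descriptor published by the owning Domain Team (illustrative fields)."""
    name: str
    domain: str
    description: str
    protocols: frozenset[AccessProtocol]
    schema: dict[str, str]  # column name -> type name


class Catalog:
    """In-memory catalog that makes data products discoverable."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        key = f"{product.domain}.{product.name}"
        if key in self._products:
            raise ValueError(f"{key} is already published")
        self._products[key] = product

    def search(self, keyword: str) -> list[DataProduct]:
        """Case-insensitive match on product name or description."""
        kw = keyword.lower()
        return [
            p for p in self._products.values()
            if kw in p.name.lower() or kw in p.description.lower()
        ]


catalog = Catalog()
catalog.publish(DataProduct(
    name="orders_daily",
    domain="sales",
    description="Daily aggregated orders per store",
    protocols=frozenset({AccessProtocol.BATCH}),
    schema={"store_id": "string", "order_count": "int"},
))
for product in catalog.search("orders"):
    print(product.domain, product.name, sorted(p.value for p in product.protocols))
```

The descriptor carries the owning domain together with the schema and access protocols, so a consumer can judge a product without first contacting the team that built it.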

Because the Domain owns the data, including its transformation and publishing, the Domain can be considered the data plane. The Data Platform has the role of orchestrator, integrator, and enabler, so it should be considered the control plane.
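The split between the two planes can be sketched in a few lines. In this simplified illustration, the platform only registers and runs pipelines, while the transformation logic lives in code owned by the domain. The ControlPlane class and the sales_transform function are hypothetical names, and a real platform would of course do much more than call a function.

```python
from typing import Callable, Iterable

Record = dict[str, float]
Transform = Callable[[Iterable[Record]], list[Record]]


class ControlPlane:
    """Platform side: registers and orchestrates, knows nothing about the data."""

    def __init__(self) -> None:
        self._pipelines: dict[str, Transform] = {}

    def register(self, domain: str, transform: Transform) -> None:
        self._pipelines[domain] = transform

    def run(self, domain: str, records: Iterable[Record]) -> list[Record]:
        # The platform only triggers the run; the logic belongs to the domain.
        return self._pipelines[domain](records)


def sales_transform(records: Iterable[Record]) -> list[Record]:
    """Data plane: domain-owned logic, written by the people who know the data."""
    return [
        {**r, "net": r["gross"] - r["tax"]}
        for r in records
        if r["gross"] > 0
    ]


platform = ControlPlane()
platform.register("sales", sales_transform)
result = platform.run("sales", [
    {"gross": 100.0, "tax": 20.0},
    {"gross": 0.0, "tax": 0.0},
])
print(result)
```

Note that the platform code never inspects the records, so a new domain can be onboarded without the Platform Team learning its business rules.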

Data Mesh Architecture

Below are the key characteristics of the Data Mesh architecture:

Data Mesh is decentralized: each domain has its own resources and implementation, and domains don’t influence each other.
The Data Platform team doesn’t need broad experience regarding the domains, but should be skilled in Software Development and Data Engineering (and should also act as technical support for the domains).
Consumers are very close to the sources (ideally these are the same teams, although this setup can’t be applied in many cases). They don’t need help from the Platform team for the implementation or for the metadata integration. The platform aims to be self-service, so it doesn’t become the bottleneck.
Ownership is clear: Domain Teams are responsible for providing reliable data from the sources to consumers, with Platform Teams supporting the integration.

Data Mesh in Simple Terms

Actually, the idea of Data Mesh isn’t much different from many current “Software as a Service” applications. Such an application offers a service, say an online shop platform (e.g. Shopify). It helps to gather customers, offer the products, sell and ship them, and cover the financial aspects.

It’s built by experts in all these domains, so other experts can focus on their products: improving them and adjusting the offer to their customers’ needs. They don’t want to worry about why an image doesn’t resize well in different web browsers, let alone write the application from scratch. You can integrate with these kinds of platforms, or even build extensions for them. If this approach works well at large scale, it can also be adopted in your organization.

On the other hand, you have to remember that it takes a tremendous amount of time to create a platform like this. It’s way easier to “hardcode” most of the enterprise data center architecture features before making them customizable and accessible to non-developers. It’s like deciding on the MVP architecture: should it be monolithic or microservices?

For most cases, it’s easier, faster, and cheaper to deliver a monolithic application, especially for the very first release. And I’ll say it again: for most cases that would be enough. So let’s try not to overcomplicate our architecture in the name of following the most recent trends. But for those who recognize the key requirements described above, there is no way around it.

Meet Our Experts

Paweł Mitruś Cloud Solution Architect

I completed my studies at the Warsaw University of Technology, Faculty of Mathematics and Information, and gained my MS degree in Computer Science. I have been working with data processing & modelling for about 8 years. What I value most at work is architecture clarity, applying best practices, and efficient communication. In my free time, I like to develop my soft social skills. I believe they are the key factor in achieving any goal. I am also devoted to triathlons and specialize in the Ironman 70.3 distance.