Data Mesh — A Self-service Infrastructure at DPG Media with Snowflake

Published in Level Up Coding · 11 min read · Mar 15, 2021

By Ludmila Von Thun (Snowflake) and Wannes Rosiers (DPG Media)


Over the last decades, companies have been trying to find ways to derive the most value from their data, building data platforms with data warehouses and data lakes. But having all data in a data lake doesn’t mean that the data is treated properly or put to work in a way that drives the desired value. On the contrary, data siloed into specific use cases leads to overlooked opportunities and never gives a full picture of what influences and drives the company.

Zhamak Dehghani discusses the architectural failures that come with these big data platforms and offers a new perspective, drawing on Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data, which she calls the data mesh. The data mesh is a concept widely discussed at the moment and is said to be the next architectural and organizational shift in data architectures.

However, how do you create a self-serve data platform that enables the different domain teams to create valuable data products?

DPG Media, an early adopter of the data mesh approach, and Snowflake are exploring ways to create a self-service data platform using Snowflake combined with other tools like DBT (data build tool) to leverage Snowflake’s capabilities when setting up a data mesh architecture.

First things first: a short recap on how we got to the Data Mesh

As Max Schulte and Arif Wider describe in their tech talk, the data lake was seen as the successor of the data warehouse. It was capable of handling all the different data volumes, varieties, and velocities that came with the rise of the internet. Still, often companies ended up losing control over the constant flood of data, which led to a Data Swamp. One of the reasons was unclear ownership of the data and resulting products.

To solve the ownership issue, Zhamak Dehghani introduced the data mesh. She writes: “distributed data products oriented around domains and owned by independent cross-functional teams who have embedded data engineers and data product owners, using common data infrastructure as a platform to host, prep and serve their data assets.” In short: the shift towards a data mesh is more an organizational one; domain-centric teams have data ownership, with secure and governed access.

The infrastructure is a central part of enabling a data mesh structure. David Vellante wrote about how Snowflake’s Data Cloud enables users to “build money-making services and products around data” as its global data mesh simplifies data access, breaks down silos, and enables live data sharing on a global scale. But we want to concentrate on building this structure within one company with one deployment to show how you can build the groundwork for a global data mesh.

Let’s look at it a bit more locally.

When reading about the data mesh for the first time, several questions pop up. How do you create an environment in which multiple domains can use the data without needing multiple copies of it? How can domains use the shared infrastructure without creating resource contention for others? How do you get the best combination of governance, security, and operability?

DPG Media is a big early adopter of the data mesh in the Benelux region, combining the approach with Domain-Driven Design (DDD) and Large Scale Scrum (LeSS). Wannes Rosiers describes DPG Media’s journey clearly in a blog post. One of the difficulties seen on that journey was governing multiple, slightly different copies of the same data. Others were mitigating resource contention and keeping performance up.

How did DPG Media start its journey?

Let’s not start with ‘How’ but with ‘Why’. DPG Media had centralized all data efforts: a central data engineering team built a central data lake, resulting in large data warehouses. Although this centralization increased delivery speed and introduced standardization, key issues remained to be tackled.

The first one: data sits at the end of the chain. Who has not encountered an update to an operational system that broke a reporting system because the data team was not kept in the loop? Every IT change risks wreaking havoc on your data landscape if someone forgets to inform the data team. And even when changes are communicated on time, the IT project rarely considers analytical data usage from the start.

Next to that, you cannot expect data engineers to know every system and business domain. To give some DPG Media figures: it is unreasonable to expect 33 data engineers to know all 19 platforms (reflecting business domains) built by 500 IT people, supporting over 5,000 employees and reaching 80% of Flemish and 90% of Dutch people on a daily basis.

A single solution tackles both issues: decentralized data ownership, the key message of a data mesh. Yet you cannot simply declare overnight that ownership of analytical data is no longer centralized but decentralized. Moreover, even though a data mesh mainly has an organizational impact when you already have a data lake, it still demands the introduction of some technological components to support governance. That’s why DPG Media started with a proof of concept on two business domains: the customer domain and the tracking domain (online user behavior).

DPG Media’s tech organization is divided into areas based on Domain-Driven Design, which means that one area, to be considered as one team, is responsible for implementing solutions for one business domain. It is important to mention that since the tracking team is central to all media platforms the company owns, the tracking team was co-located with the data engineers in what they call the data area. The customer domain, however, is a separate area with its own manager, … Hence a true decentralization of analytical data ownership.

Recently, DPG Media held a retrospective with both the data area and the customer services area on the decentralization of data into a data mesh setup, both organizationally and governance-wise. This is what came out:

  • The added value for a business domain of having its own data engineers at its disposal is immense. A new world of use cases flourishes.
  • However, it demands in-depth knowledge of all data in all systems, including legacy systems, from the domain experts, where the attitude used to be “the data guy will figure it out.” One does not simply switch over responsibility!
  • When building a single domain of a data mesh, it is OK to lack certain governance. But with the ambition to scale this throughout the entire organization, that gap has a technological impact.

As mentioned before, there is a second domain where DPG Media has focused on applying the data mesh paradigm, namely tracking. A typical data organization starts right behind the source applications and builds everything up to the consumer-oriented data products. DPG Media instead put the ownership of tracking data within the tracking team itself, not only for reporting purposes but even for personalization purposes!

Moving end-to-end ownership to the tracking team implies a major technology shift and a big increase in responsibility for that team. Previously, the team worked with Google Tag Manager and a data collector like Snowplow; now they own a Kafka topic, implement pipelines within a reporting environment, and even build reports for their own usage. DPG Media made this a success by tutoring the team and placing a specialized data engineer within it. Next to that, it is important to keep the technical barrier low by offering a self-service platform that adheres to the DATSIS principles of a data mesh (Discoverable, Addressable, Trustworthy, Self-describing, Interoperable, Secure).

The technical set-up

As mentioned before, DPG Media believes that the technical barrier to playing with (and becoming the owner of) your data should be as low as possible. They also believe that software engineering best practices should be brought to data engineering. In brief: please allow SQL, but make sure it is in version control and runs through a CI/CD pipeline.
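
As a minimal sketch of what that principle can look like in practice, a dbt model is nothing more than a SQL file living in a Git repository; the model, source, and column names below are invented for illustration, not DPG Media’s actual code:

```sql
-- models/customer/subscriber_counts.sql
-- A minimal dbt model: plain SQL in version control, deployed through CI/CD.
-- All table and column names are hypothetical.
select
    brand,
    count(distinct subscriber_id) as subscriber_count
from {{ source('customer', 'subscriptions') }}  -- hypothetical dbt source
where status = 'active'
group by brand
```

Because the model is just a file in a repository, every change goes through code review and an automated test run before it reaches production, exactly the software engineering practice the data landscape was missing.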

This is where Snowflake steps in. Snowflake’s platform supports many workloads by enabling companies to load all data, structured, semi-structured, and unstructured, and make it available to the users. Snowflake’s architecture offers DPG Media scalability. As visible in the following picture, Snowflake has four layers, which we will quickly walk through: The Cloud Agnostic Layer, the Storage Layer, the Compute Layer, and the Cloud Services Layer.

Snowflake’s architecture

Snowflake separates storage from compute, and compute from compute. This has multiple implications for the domains. First, DPG Media gets the storage it needs when it needs it. The same is true for compute power, including the ability to scale up for performance and resize a compute cluster on the fly, without waiting times. Second, the separation of compute from compute allows DPG Media to give every domain its own compute resources while all domains have access to all their data. The outermost layer is the Cloud Services Layer, where the magic happens, as we like to say. With Snowflake, DPG Media can focus more on enabling their domains to use the platform than on keeping it running.
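
A hedged sketch of what this separation looks like in SQL: each domain gets its own virtual warehouse on top of the same data, so one domain’s queries never compete with another’s. Warehouse and role names are illustrative, not DPG Media’s actual setup:

```sql
-- One virtual warehouse per domain: isolated compute over shared storage.
CREATE WAREHOUSE IF NOT EXISTS customer_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- suspend after 60 idle seconds, pay only for use
  AUTO_RESUME    = TRUE;

CREATE WAREHOUSE IF NOT EXISTS tracking_wh
  WAREHOUSE_SIZE = 'MEDIUM' -- a heavier workload simply gets a bigger size
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Each domain role can only spend its own compute budget.
GRANT USAGE ON WAREHOUSE customer_wh TO ROLE customer_domain;
GRANT USAGE ON WAREHOUSE tracking_wh TO ROLE tracking_domain;
```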

DPG Media does not believe in a Swiss Army knife that solves all problems and considers its data team’s core mission to be building data pipelines, not maintaining the data platform’s components. We deliberately write components, as we do not believe there should be one platform serving all use cases, but rather a loosely coupled landscape of low-maintenance tools that together build up that data platform. Integrating those different components is a responsibility of the team; we will dive into detail later. Roughly speaking, the landscape enabling the company looks as follows.

DPG Media uses Datafy, which offers a combination of Spark and Airflow on top of Kubernetes with an easy interface to scheduling. But the main goal is to provide a low barrier for non-data engineers as well, people who do not write Scala or Python code. This is where SQL (Structured Query Language, or in this case maybe Simple Query Language) steps in, together with Snowflake.

Beyond the data mesh story, Snowflake already had a lot to offer to DPG Media, namely:

  • Performance increase
  • Maintenance decrease
  • Ability for real-time ingests
  • SQL interface to semi-structured data (e.g., the JSON format of clickstream data); see the sketch below
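
To make that last point concrete, here is a minimal sketch of querying raw JSON clickstream events stored in a Snowflake VARIANT column; the table and field names are invented for illustration:

```sql
-- Querying semi-structured clickstream data with plain SQL.
-- Table and field names are hypothetical.
SELECT
    event:page.url::string            AS page_url,
    event:user.id::string             AS user_id,
    event:collector_tstamp::timestamp AS collected_at
FROM raw_clickstream
WHERE event:event_name::string = 'page_view';
```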

Next to these, Snowflake offers other features that add to DPG Media’s data mesh set-up. Amongst them are Time Travel, Zero Copy Cloning, and Data Sharing.

  • Time Travel allows everyone who now works with the cloud data platform to make mistakes without severe implications, as data can easily be restored without having to manage backups. It gives the desired data mesh organization a healthy learning culture.
  • Zero Copy Cloning does not physically copy data but creates a logical copy of a database or table. A test environment can therefore be spun up with one command and used freely for testing, development, and other purposes without doubling the data or interfering with the data product.
  • Data Sharing is the basis for the global data mesh. It enables companies to share live, ready-to-query data with other business units, partners, and companies without giving up authority over the data. That is also what powers Snowflake’s Data Cloud, but that is another topic and would lead too far for now. An advantage of using Snowflake is that you are not limited to one platform or region: you can also scale globally across platforms and regions. You can find more information about the global data mesh in related articles by Eric Poilvet or David Vellante. All three features are sketched below.
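
A hedged sketch of these three features in SQL; the database, table, and account names are invented for illustration:

```sql
-- Time Travel: query a table as it was an hour ago, or undo a mistake.
SELECT * FROM subscriptions AT (OFFSET => -3600);
UNDROP TABLE subscriptions;   -- restore an accidentally dropped table

-- Zero Copy Cloning: a full test environment from one command, no data copied.
CREATE DATABASE analytics_test CLONE analytics;

-- Data Sharing: expose live, ready-to-query data without copying it.
CREATE SHARE customer_share;
GRANT USAGE  ON DATABASE customer_db        TO SHARE customer_share;
GRANT USAGE  ON SCHEMA   customer_db.public TO SHARE customer_share;
GRANT SELECT ON TABLE    customer_db.public.subscribers TO SHARE customer_share;
ALTER SHARE customer_share ADD ACCOUNTS = partner_account;  -- hypothetical account
```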

Back to DPG Media’s story: within the data mesh set-up, Snowflake brings even more advantages combined with Datafy/Airflow and DBT. DPG Media has a set-up where SQL code is versioned (DBT), deployed (Datafy), and scheduled and monitored (Airflow) on a scalable infrastructure (Snowflake). Compute can be dedicated to teams or departments, and row-level access management can be set up (both Snowflake).
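
As an example of that last point, here is a minimal sketch of a Snowflake row access policy that lets each domain role see only its own rows in a shared table; the policy, table, column, and role names are hypothetical:

```sql
-- Row access policy: domain roles see only their own rows,
-- the central data team sees everything. All names are illustrative.
CREATE ROW ACCESS POLICY domain_rows AS (domain_name STRING)
  RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'CENTRAL_DATA_TEAM'
    OR CURRENT_ROLE() = UPPER(domain_name);

ALTER TABLE shared_events
  ADD ROW ACCESS POLICY domain_rows ON (domain_name);
```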

From a technology perspective, this toolset lacks only one more item: a data catalog or dictionary, making data discoverable and addressable (the D and A from DATSIS). Currently, DPG Media is using a wrapper around AWS Glue but is also looking at Amundsen and Collibra.

The organizational impact

Most departments were enabled by offering the SQL interface and allowing custom transformations within their preferred reporting tools. This, however, mostly covers teams building what we call consumer-oriented data products. For source-oriented data products, such as the first data sets built by IT teams, the SQL interface is no longer sufficient; more data engineering knowledge is required. DPG Media had first centralized the data team, but to adhere purely to the data mesh set-up, you would need to decentralize it again. Therefore, DPG Media is taking a more flexible approach.

You still require a central data team to maintain the data platform. However, as said before, DPG Media does not believe in a pure platform team, as the platform is not a core task for the company. DPG Media IT is structured in areas covering certain business domains. Business domains with huge data challenges get dedicated data engineering resources, placed in the area responsible for that domain; they still need to adhere to the central data governance rules. Less data-heavy domains can borrow data engineers from a central pool, which is responsible for building data pipelines for those domains, for defining the governance rules, and for maintaining the toolset supporting them. It is important, however, that no single engineer in this pool has to cover all domains: DPG Media splits the focus, making four data engineers responsible for four domains and four others for three other domains. Next to that, they prioritize their work so that an engineer works for a single domain during an entire development sprint.

In this way, DPG Media has introduced a pragmatic approach to a data mesh that already covers a few pain points. As a wrap-up, here are the issues they have already solved:

  • Data engineers have a limited number of business domains to grasp and are placed organizationally as close as possible to the business domain owners.
  • There is a low barrier to entry for using the data platform.
  • Software engineering best practices are brought to the data landscape.
  • There is a central team managing and supporting governance rules.

This will (hopefully) not be the last time we talk about DPG Media’s approach to the data mesh. We will continue the journey together, trying to find the best way to combine both: the organizational shift and the self-serve data platform that supports it.

