Discovering Data Engineering

A collection of thoughts and learnings gathered over the years as a data engineer

Modern data workflow orchestration, part 2

The previous post gave some background on why you should try out GCP Workflows. This post is more technical and shows how to use Infrastructure as Code to easily set up GCP Workflows that also support re-runs and backfills. Solution Architecture The requirements we have are: We want re-run capabilities (idempotency) and backfill capabilities. We will use a custom Cloud Run service built with FastAPI. We want to set up workflow orchestration and scheduling of batch jobs with Infrastructure as Code (IaC)....

February 25, 2022 · 3 min · Robert Sahlin

Modern data workflow orchestration, part 1

Do we really need full-blown orchestration services like Composer (managed Airflow) in a modern GCP data stack? Bundling vs Unbundling Airflow There’s been a lot of discussion lately about Airflow’s role in the Modern Data Stack and how different tools and services are unbundling the responsibilities of Airflow. It is a very interesting discussion, but to be fair to Airflow, it was built as a workflow manager. Its flexibility has invited users to add additional responsibilities, resulting in an anti-pattern to fill the gap of a missing control plane across data tools/services....

February 25, 2022 · 4 min · Robert Sahlin

Serverless dbt on Google Cloud Platform

Data Build Tool (dbt) is an awesome open source tool created by dbt Labs, which also offers a managed service to work with dbt. But you can also host dbt yourself, fully serverless, with a collaborative setup following GitOps practices. Best of all, it is easy to set up and very cost efficient. Managed service or self-hosted? Before jumping into the setup you should ask yourself which option suits you best....

November 25, 2021 · 9 min · Robert Sahlin

Move data modeling upstream

We will see a lot of data modeling move upstream away from batch modeling in a cloud data warehouse (Modern Data Stack) to the producer continuously generating domain events instead. Why? “Source-aligned business events are not modeled or structured like the source application’s transactional database; an anti-pattern is often observed, particularly when events are sourced through Change Data Capture tooling or Data Virtualization on top of the application’s database.” Zhamak Dehghani describes this very well in the fourth chapter of Data Mesh (great read, recommend it) and it resonates very well with me....

August 13, 2021 · 1 min · Robert Sahlin

Is data mesh only for large organisations?

Is data mesh only for large organisations? Many data mesh authorities argue that it is. But I disagree: it isn’t primarily about company size at all. In fact, a data mesh can be even more suitable in a scale-up than in an enterprise. Why? IMO the maturity for a data mesh is rather based on: Pace of change in the analytical system. Degree of decentralization of the operational system....

August 13, 2021 · 2 min · Robert Sahlin