Bundling vs Unbundling Airflow
There’s been a lot of discussion lately about Airflow’s role in the Modern Data Stack and how different tools and services are unbundling Airflow’s responsibilities. It is a very interesting discussion, but to be fair to Airflow: it was built as a workflow manager, and its flexibility has invited users to add additional responsibilities, resulting in an anti-pattern where it fills the gap of a missing control plane across data tools and services. I really encourage you to read these posts. This post isn’t about that discussion, though, but about whether we can replace Airflow (Composer) with a more lightweight option.
Don’t get me wrong, Composer is a powerful beast, but it isn’t trivial to operate and definitely not cheap. Even as a managed Airflow service, it will leave you scratching your head trying to figure out Kubernetes-related alerts and errors thrown in what seems to be a random but recurring pattern. It is also easy to slip into the anti-pattern of not only orchestrating tasks but actually running them on Composer, or to start using pre-built operators that are later abandoned and may cause conflicts when upgrading the Airflow version. But there is a GCP option nowadays that may be better suited to lightweight orchestration needs: GCP Workflows.
At my employer, most of our data integration is streaming rather than batch. But some of it is batch, and we’ve been using Composer extensively over the last few years for data warehouse tasks such as transformations, validations and data feeds/exports. With the introduction of dbt (scheduled and orchestrated by GCP Cloud Scheduler and GCP Workflows), the transformation and validation tasks are now the responsibility of dbt. What is left is running Composer on a weekly/daily schedule for fewer than 10 simple DAGs that generate data feeds, and that seems like overkill.
Hence I wanted to give GCP Workflows a try and see how it compares with Composer. Here are some reflections so far:
Separation of concerns
GCP Workflows really forces you to use it ONLY for orchestrating tasks, not for actually running them. Even scheduling is the responsibility of another service (Cloud Scheduler).
Isolation of jobs
Each job in GCP Workflows runs independently of other jobs, so you avoid the risk of noisy neighbors starving your cluster of resources, or of conflicts between the package dependencies required by different jobs.
GCP Workflows is serverless and you pay per execution. A minimal Composer cluster costs approximately $1000/month, which corresponds to 130M executions per month (50 executions/s) in GCP Workflows, a volume that is probably quite rare in practice. This also lets you provide isolated orchestration capabilities in multiple projects without the significant overhead of Composer.
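The 130M figure is simple arithmetic, which a quick sanity check confirms (assuming a 30-day month):

```python
# Rough check of the break-even numbers above.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # ~30-day month

executions_per_second = 50
executions_per_month = executions_per_second * SECONDS_PER_MONTH
print(executions_per_month)  # 129_600_000, i.e. ~130M executions/month
```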
Since GCP Workflows is lightweight, isolated and supports IAM, you can give end users greater access to orchestration and opt for a “distributed” orchestration architecture rather than a more centralized one.
I also like the fact that it is so easy to set up using IaC tools such as Pulumi. There is a lot of talk about no/low code in the Modern Data Stack, but I’ve experienced that pain and appreciate having APIs and defining my workflows using code, version control and CI/CD.
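As a minimal sketch of what this looks like with Pulumi’s Python SDK (assuming the pulumi and pulumi_gcp packages are installed; the resource name, region, endpoint and workflow steps are illustrative, not from a real project):

```python
import pulumi
import pulumi_gcp as gcp

# The workflow definition itself is plain YAML source, kept in
# version control alongside the infrastructure code that deploys it.
workflow_source = """
main:
  steps:
    - export_feed:
        call: http.post
        args:
          url: https://example.com/export  # placeholder endpoint
        result: feed_result
    - done:
        return: ${feed_result.code}
"""

feed_workflow = gcp.workflows.Workflow(
    "data-feed-workflow",
    region="europe-west1",
    description="Weekly data feed export",
    source_contents=workflow_source,
)

pulumi.export("workflow_id", feed_workflow.id)
```

A `pulumi up` then creates or updates the workflow, so the full orchestration setup lives in one reviewable repository.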
Re-runs and backfills
This is perhaps where Composer has an edge, and it is often put forward as an argument for using Composer over GCP Workflows, since the latter doesn’t have those features built in. However, that is quite easy to remedy yourself, as I show in a separate blog post.
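As a rough illustration (not the approach from that post), a backfill can be as simple as looping over the missing dates and starting one workflow execution per date. This sketch assumes the google-cloud-workflows client library; the project, location, workflow name and the `date` argument are hypothetical:

```python
from datetime import date, timedelta


def backfill_dates(start: date, end: date) -> list[str]:
    """ISO dates from start to end inclusive, one per daily run."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def backfill(start: date, end: date) -> None:
    # Assumes the google-cloud-workflows client library is installed;
    # the project/location/workflow path below is illustrative.
    from google.cloud.workflows import executions_v1

    client = executions_v1.ExecutionsClient()
    parent = "projects/my-project/locations/europe-west1/workflows/data-feed"
    for run_date in backfill_dates(start, end):
        client.create_execution(
            parent=parent,
            execution=executions_v1.Execution(argument=f'{{"date": "{run_date}"}}'),
        )
```

The workflow itself reads the `date` argument and processes that partition, so re-running a single day is just one `create_execution` call.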
After trying out both options I am currently in the GCP Workflows corner, especially if your workflows are relatively simple (few tasks per job), triggered by events, or if you need a more decentralized orchestration. If you haven’t tried GCP Workflows yet, give it a shot.