We will see a lot of data modeling move upstream: away from batch modeling in a cloud data warehouse (the Modern Data Stack) and toward the producer continuously generating domain events. Why?
“Source-aligned business events are not modeled or structured like the source application’s transactional database; an anti-pattern is often observed, particularly when events are sourced through Change Data Capture tooling or Data Virtualization on top of the application’s database.”
Zhamak Dehghani describes this very well in the fourth chapter of Data Mesh (a great read, I recommend it), and it resonates strongly with me....
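To make the contrast in the quote concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `orders` table, the `status_cd` column, the `OrderShipped` event and its fields are assumptions, not taken from the book or from any real system. The first record is shaped like what Change Data Capture tooling typically emits (a mirror of the table row), the second is a domain event modeled around the business fact itself.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

# Hypothetical CDC-style record: it mirrors the application's table layout,
# so consumers must know internal details (column names, status codes).
cdc_record = {
    "op": "u",
    "table": "orders",
    "before": {"id": 42, "status_cd": 2},
    "after": {"id": 42, "status_cd": 3},
}


@dataclass(frozen=True)
class OrderShipped:
    """Hypothetical domain event: named and shaped after the business fact."""
    order_id: int
    customer_id: int
    shipped_at: str  # ISO 8601 timestamp in UTC
    items: tuple     # nested line items travel with the event


def to_message(event: OrderShipped) -> str:
    """Serialize the event together with its type name for publishing."""
    return json.dumps({"type": type(event).__name__, "data": asdict(event)})


event = OrderShipped(
    order_id=42,
    customer_id=7,
    shipped_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    items=({"sku": "A-1", "qty": 2},),
)
message = to_message(event)
```

A consumer of `message` does not need to know that status code 3 means "shipped" in the source database; that knowledge stays with the producer, which is the point of modeling upstream.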
If I had to choose two Key Performance Indicators for a data platform team:
1. Time to market for new datasets/pipelines
2. Data Downtime (periods of time when data is inaccurate)
Why?
Producing side: Scaling for data volumes is solved; scaling is now about adding new datasets/pipelines, especially as feature teams take on data ownership and microservices deliver data as a first-class deliverable in addition to the service API....
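As a rough illustration of how these two KPIs could be measured, here is a small Python sketch. It assumes that incidents are logged as (start, end) intervals and that a dataset has a "requested" and a "first delivered" timestamp; both the function names and that way of recording things are my assumptions, not an established standard. Overlapping incidents are merged so the same bad hour is not counted twice.

```python
from datetime import datetime, timedelta


def data_downtime(incidents, period_start, period_end):
    """Total time within the period during which data was inaccurate.

    incidents: iterable of (start, end) datetimes; overlaps count once.
    """
    # Keep only incidents that touch the period and clip them to it.
    clipped = sorted(
        (max(s, period_start), min(e, period_end))
        for s, e in incidents
        if s < period_end and e > period_start
    )
    total = timedelta(0)
    current_start = current_end = None
    for s, e in clipped:
        if current_end is None or s > current_end:
            # Gap before this incident: close the previous merged interval.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = s, e
        else:
            # Overlap: extend the current merged interval.
            current_end = max(current_end, e)
    if current_end is not None:
        total += current_end - current_start
    return total


def time_to_market(requested_at, first_delivered_at):
    """Elapsed time from a dataset request to its first delivery."""
    return first_delivered_at - requested_at


period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 1, 31)
incidents = [
    (datetime(2024, 1, 3, 0), datetime(2024, 1, 3, 6)),
    (datetime(2024, 1, 3, 4), datetime(2024, 1, 3, 10)),  # overlaps the first
    (datetime(2024, 1, 10, 0), datetime(2024, 1, 10, 2)),
]
downtime = data_downtime(incidents, period_start, period_end)
downtime_ratio = downtime / (period_end - period_start)
lead_time = time_to_market(datetime(2024, 1, 2), datetime(2024, 1, 9))
```

With these sample incidents the two overlapping intervals merge into one ten-hour window, so `downtime` is twelve hours out of a thirty-day period.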
Inmon vs Kimball vs Data Vault? Personally, I prefer wide, nested, and denormalized tables as my data warehouse architecture. Why?
Cloud data warehouses are designed as distributed systems with columnar storage that is separated from compute. Hence, you can efficiently query specific fields over a huge number of records, but you want to avoid joins, as they introduce overhead when shuffling data between compute instances. Also, I prefer to keep my data immutable: if something is wrong, I would rather replay the data with the new logic than perform mutations on existing data....
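Here is a toy Python sketch of that idea, under stated assumptions: the raw events, the field names and the two versions of the `line_total` logic are all made up for illustration, and plain Python structures stand in for warehouse tables. The raw events are never modified. The wide table (one row per order, customer attributes denormalized in, line items nested) is rebuilt from them, so fixing a bug means replaying with the corrected function rather than updating rows in place.

```python
# Immutable, append-only raw events (hypothetical shape).
RAW_EVENTS = (
    {"order_id": 1, "customer": {"id": 7, "country": "SE"},
     "sku": "A-1", "qty": 2, "unit_price": 10.0},
    {"order_id": 1, "customer": {"id": 7, "country": "SE"},
     "sku": "B-2", "qty": 1, "unit_price": 5.0},
    {"order_id": 2, "customer": {"id": 8, "country": "NO"},
     "sku": "A-1", "qty": 3, "unit_price": 10.0},
)


def build_wide_orders(events, line_total):
    """Rebuild the wide table from scratch: one row per order, with
    customer attributes denormalized in and line items nested."""
    rows = {}
    for e in events:
        row = rows.setdefault(e["order_id"], {
            "order_id": e["order_id"],
            "customer": dict(e["customer"]),  # copy, no join needed later
            "items": [],
            "order_total": 0.0,
        })
        total = line_total(e)
        row["items"].append(
            {"sku": e["sku"], "qty": e["qty"], "line_total": total}
        )
        row["order_total"] += total
    return list(rows.values())


def line_total_v1(e):
    """First version of the logic, with a bug: it ignores quantity."""
    return e["unit_price"]


def line_total_v2(e):
    """Corrected logic."""
    return e["qty"] * e["unit_price"]


orders_v1 = build_wide_orders(RAW_EVENTS, line_total_v1)
# The fix is a replay over the same raw events, not a mutation of orders_v1.
orders_v2 = build_wide_orders(RAW_EVENTS, line_total_v2)
```

A query such as "total per country" can then read `customer["country"]` and `order_total` straight from the wide rows, which is the kind of join-free column scan that columnar storage handles well.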