Two data platform KPIs

If I had to choose two Key Performance Indicators for a data platform team: (1) time to market for new datasets/pipelines and (2) data downtime (periods of time when data is inaccurate). Why? Producing side: scaling for data volumes is solved; now scaling is about adding new datasets/pipelines, especially as feature teams take on data ownership and microservices deliver data as a first-class deliverable in addition to the service API....

July 29, 2021 · 1 min · Robert Sahlin

Streaming Analytics affects both tracking and analytics!

Streaming Analytics will change the way you should think about both tracking and analytics in digital analytics. How? Tracking: Since tracking has been focused on decision making, it has mainly captured performance metrics rather than signals that can be used to personalize the user experience. That has to shift with the advent of streaming analytics. Analytics: In batch analytics you let the query run over your data, but in streaming analytics you let your data run over your query....
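The batch versus streaming contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page-view events, the `page` field, and both function names are hypothetical and do not come from the post or from any particular streaming engine.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_page_views(events: list[dict]) -> Counter:
    """Batch: the data is at rest and the query scans all of it in one run."""
    return Counter(event["page"] for event in events)


def streaming_page_views(events: Iterable[dict]) -> Iterator[Counter]:
    """Streaming: the query is standing and each arriving event updates its state."""
    counts: Counter = Counter()
    for event in events:
        counts[event["page"]] += 1
        # Emit a snapshot of the running result after every event.
        yield Counter(counts)


# Hypothetical page-view events, used for both modes.
events = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]

batch_result = batch_page_views(events)
stream_results = list(streaming_page_views(iter(events)))
```

In the batch function the result only exists once the whole dataset has been scanned, while the streaming function produces an updated result for every event, which is what makes it possible to act on a signal while the user is still active.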

July 29, 2021 · 2 min · Robert Sahlin

Inmon vs Kimball vs Data Vault vs Wide tables

Inmon vs Kimball vs Data Vault? Personally, I prefer wide, nested and denormalized tables as data warehouse architecture. Why? Cloud data warehouses are designed as distributed systems with columnar storage that is separated from compute. Hence, you can efficiently query specific fields over a huge number of records, but you want to avoid joins because they introduce overhead when shuffling data between compute instances. Also, I prefer to keep my data immutable, and if something is wrong I would rather replay the data with the new logic than perform mutations on existing data....

July 29, 2021 · 2 min · Robert Sahlin

Validate and monitor your BigQuery data

Data observability has gained huge momentum, and data quality is essential for any kind of analytical system, whether it is plain old reporting or advanced machine learning. I’ve seen reports that state that data engineers spend more than 30% of their time manually chasing data quality issues! That is not only a cost in terms of precious engineering time but also missed opportunities or, even worse, a loss of trust in your data and your data team....

February 6, 2021 · 3 min · Robert Sahlin

Unlimited persistent disk in Google Cloud Shell

I use Google Cloud Shell as my primary development environment. That way I can easily work from whatever computer I want, as long as it has a browser and Internet connectivity. Cloud Shell is free and comes with pretty much all the tools you need to develop services on Google Cloud Platform. But it comes with a huge limitation: it only provides 5 GB of persistent disk, which won’t last long if you work with software development....

March 5, 2020 · 2 min · Robert Sahlin