Schema evolution in streaming Dataflow jobs and BigQuery tables, part 3

In the previous post, I covered how we create or patch BigQuery tables without interrupting real-time ingestion. This post focuses on how we update the Dataflow (Apache Beam) job without interrupting real-time ingestion. 3. Dataflow: Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch and stream (continuous) processing....

November 30, 2019 · 4 min · Robert Sahlin

Schema evolution in streaming Dataflow jobs and BigQuery tables, part 2

In the previous post, I covered the protobuf (schema definition) part of the solution. This post focuses on how we create or patch BigQuery tables without interrupting real-time ingestion. 2. BigQuery: BigQuery is Google’s serverless data warehouse, and it is awesome (and I have experience with Hive, Presto, SparkSQL, Redshift, Microsoft PDW, …). It is a scalable data solution that helps companies store and query their data or apply machine learning models....

November 13, 2019 · 4 min · Robert Sahlin

Schema evolution in streaming Dataflow jobs and BigQuery tables, part 1

In the previous post, I gave an overview of MatHem’s streaming analytics platform, DataHem. This post focuses on how we manage schema evolution without sacrificing real-time data or incurring downtime in our data ingestion. The streaming analytics platform is built entirely on Google Cloud Platform and uses services such as Dataflow, BigQuery and PubSub extensively. Protobuf schemas are another important component. 1. Protocol buffers: There are many different frameworks for serialization/deserialization of data....

November 10, 2019 · 5 min · Robert Sahlin

Get all unique Firebase Analytics events in BigQuery

As I mentioned in my earlier post about the drawbacks of the entity-attribute-value data model used in Firebase Analytics and Google Analytics App + Web, it is hard to know which events, associated attributes and data types are logged without proper documentation. Another way to get an overview is to query the table itself. Below is an example of how to do it: SELECT event_name, ARRAY_AGG(struct(name, value)) as attribute FROM( SELECT event_name, param....

September 3, 2019 · 2 min · Robert Sahlin

Why Google Analytics App + Web BigQuery Export Rocks and Sucks

Google recently released Google Analytics App + Web, which is essentially something like Firebase Analytics for web (or Google Analytics version 2, if you like). This is exciting for many reasons. Two of them are: (1) Google is finally moving away from a user-session-pageview-based model to one built on users and events, and (2) it supports BigQuery export for standard users as well. That is awesome. These are actually two of the primary reasons why I built datahem....

September 2, 2019 · 2 min · Robert Sahlin