Schema evolution in streaming Dataflow jobs and BigQuery tables, part 3

In the previous post, I covered how we create or patch BigQuery tables without interrupting real-time ingestion. This post will focus on how we update the Dataflow (Apache Beam) job without interrupting real-time ingestion. Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch, and stream (continuous) processing....
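As a rough illustration of the technique this post describes, here is a minimal sketch (assuming the Beam Python SDK; the project, topic, and table names are hypothetical, and the JSON parse step stands in for the series' protobuf decoding) of relaunching a streaming pipeline with Dataflow's `update` option so the running job is replaced in place:

```python
# Minimal sketch: replace a running streaming Dataflow job without draining
# it, by submitting the new pipeline code under the same job name with
# update=True. All names below are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    streaming=True,
    job_name="entity-ingestion",  # must match the job being replaced
    update=True,                  # swap the job in place, keep in-flight data
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/entities")
     | "Parse" >> beam.Map(json.loads)  # stand-in for protobuf deserialization
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           table="my-project:analytics.entities",
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```

Dataflow validates the replacement pipeline against the running job's graph; if step names change between versions, a `transform_name_mapping` option can be supplied alongside `update`.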

November 30, 2019 · 4 min · Robert Sahlin

Schema evolution in streaming Dataflow jobs and BigQuery tables, part 2

In the previous post, I covered the protobuf (schema definition) part of the solution. This post will focus on how we create or patch BigQuery tables without interrupting real-time ingestion. BigQuery is Google’s serverless data warehouse, and it is awesome (and I have experience with Hive, Presto, SparkSQL, Redshift, Microsoft PDW, …). It is a scalable data solution that helps companies store and query their data and apply machine learning models....
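For a flavor of what a non-breaking table patch looks like, here is a minimal sketch using the google-cloud-bigquery Python client (the project, dataset, table, and field names are hypothetical, not taken from the post):

```python
# Minimal sketch: additively patch a BigQuery table schema while streaming
# inserts continue. BigQuery only allows additive changes (new NULLABLE or
# REPEATED fields) on a live table, which is what makes this safe.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table = client.get_table("my-project.analytics.entities")

new_schema = list(table.schema)  # copy the current fields
new_schema.append(
    bigquery.SchemaField("discount_code", "STRING", mode="NULLABLE"))

table.schema = new_schema
client.update_table(table, ["schema"])  # patch in place, no downtime
```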

November 13, 2019 · 4 min · Robert Sahlin

Schema evolution in streaming Dataflow jobs and BigQuery tables, part 1

In the previous post, I gave an overview of MatHem’s streaming analytics platform, DataHem. This post will focus on how we manage schema evolution without sacrificing real-time data or suffering downtime in our data ingestion. The streaming analytics platform is built entirely on Google Cloud Platform and makes extensive use of services such as Dataflow, BigQuery and PubSub. Another important component is protobuf schemas. There are many different frameworks for serialization/deserialization of data....
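To give a concrete sense of how protobuf handles serialization and schema evolution, here is a minimal Python sketch; `member_pb2` is a hypothetical module generated by `protoc` from a `Member` message and is not part of the post:

```python
# Minimal sketch, assuming protoc has generated member_pb2 from:
#
#   message Member {
#     string id   = 1;
#     string name = 2;
#     // Adding `string email = 3;` later is a backward-compatible change:
#     // old readers simply skip the unknown field on the wire.
#   }
from member_pb2 import Member  # hypothetical generated module

msg = Member(id="42", name="Robert")
payload = msg.SerializeToString()  # compact binary wire format

decoded = Member()
decoded.ParseFromString(payload)
assert decoded.name == "Robert"
```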

November 10, 2019 · 5 min · Robert Sahlin