DataHem

Automatic builds and version control of your BigQuery views

We (MatHem) have finally moved our BigQuery view definitions to GitHub and automated the builds: whenever someone on the data team adds or modifies a view definition and pushes or merges it to the master or develop branch, it triggers a build of our views in the production or test environment, respectively. Hence we get version control, and the view definitions always stay in sync with the views deployed in BigQuery. Below are two ways to set it up; both require a GitHub repo, Cloud Build and BigQuery....

Schema evolution in streaming Dataflow jobs and BigQuery tables, part 3

In the previous post, I covered how we create or patch BigQuery tables without interrupting the real-time ingestion. This post will focus on how we update the Dataflow (Apache Beam) job without interrupting the real-time ingestion. 3 Dataflow Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch and stream (continuous) processing....

Schema evolution in streaming Dataflow jobs and BigQuery tables, part 2

In the previous post, I covered the protobuf (schema definition) part of the solution. This post will focus on how we create or patch BigQuery tables without interrupting the real-time ingestion. 2 BigQuery BigQuery is Google’s serverless data warehouse, and it is awesome (and I have experience with Hive, Presto, SparkSQL, Redshift, Microsoft PDW, …). It is a scalable data solution that helps companies store and query their data or apply machine learning models....

Schema evolution in streaming Dataflow jobs and BigQuery tables, part 1

In the previous post, I gave an overview of MatHem’s streaming analytics platform DataHem. This post will focus on how we manage schema evolution without sacrificing real-time data or having downtime in our data ingestion. The streaming analytics platform is built entirely on Google Cloud Platform and uses services such as Dataflow, BigQuery and PubSub extensively. Protobuf schemas are another important component. 1 Protocol buffers There are many different frameworks for serialization/deserialization of data....

Fast and flexible data pipelines with protobuf schema registry

MatHem is growing quickly and so are the requirements for fast and reliable data pipelines. Since I joined the company a little more than a year ago, I’ve been developing an event streaming platform (named DataHem) to meet those requirements. 1 Background Before jumping into the solution architecture, I thought I would give you some background from a business perspective that has influenced the design choices. 1.1 Context MatHem is the biggest online grocery store in Sweden, and to give some brief context, this is how the business works:...