Flatten Google Analytics Custom Dimensions with a BigQuery UDF

Are you one of the lucky digital analysts who have a Google Analytics Premium account? Then you know you can export your data to Google BigQuery and analyze it in an ad hoc, exploratory manner using SQL. One frequent use case for BigQuery is to analyze many custom dimensions at the same time. The challenge is that the export stores custom dimensions in a nested/repeated structure, one array element per dimension, rather than as separate columns.

Let’s use the public Google Analytics sample “LondonCycleHelmet” and say you want to extract custom dimensions 1-3 on hit level. I don’t know what the dimensions represent, so I’ve made up that they are productCategory, loyaltyClass and existingCustomer. You want to see the named custom dimensions together with the IDs for visitor, session and hit. Since custom dimensions are nested/repeated, you need to use one of the following tricks to turn multiple rows into multiple columns:

  1. Max/Case
  2. Custom JavaScript UDF
  3. Custom SQL UDF (recommended)

This data transformation of flattening a table is also called a “pivot”, which BigQuery didn’t support natively at the time of writing. I will go through each of the tricks below.

Max/Case

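If the custom dimension is assigned on hit level, you group on hit level. That requires a unique hit ID, which in the Google Analytics dataset is the combination of fullvisitorid, visitid and hitnumber. After unnesting, each hit has one row per custom dimension. The CASE expression keeps the value only on the row with the matching index, and MAX collapses the group to that single non-NULL value.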

#standardSQL
-- Pivot custom dimensions 1-3 into columns, one output row per hit.
SELECT
  fullvisitorid,
  visitid,
  hit.hitnumber,
  -- CASE yields NULL on non-matching rows; MAX ignores NULLs and keeps the match.
  MAX(CASE WHEN customdimension.index = 1 THEN customdimension.value END) AS productCategory,
  MAX(CASE WHEN customdimension.index = 2 THEN customdimension.value END) AS loyaltyClass,
  MAX(CASE WHEN customdimension.index = 3 THEN customdimension.value END) AS existingCustomer
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`,
  UNNEST(hits) AS hit,
  -- The comma (cross) join drops hits that have no custom dimensions at all.
  UNNEST(hit.customdimensions) AS customdimension
GROUP BY fullvisitorid, visitid, hit.hitnumber
LIMIT 100
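
To make the row-to-column mechanics of the query above concrete, here is a minimal sketch in plain JavaScript (not BigQuery code) that mimics the GROUP BY plus MAX/CASE step on a handful of in-memory rows. The sample rows, their values and the helper name pivotByHit are made up for illustration; only the three column names come from the article.

```javascript
// Hypothetical sample: rows as they would look after both UNNEST steps,
// i.e. one row per (hit, custom dimension). All values are invented.
const rows = [
  { fullvisitorid: "A", visitid: 1, hitnumber: 1, index: 1, value: "helmets" },
  { fullvisitorid: "A", visitid: 1, hitnumber: 1, index: 2, value: "gold" },
  { fullvisitorid: "A", visitid: 1, hitnumber: 2, index: 1, value: "lights" },
  { fullvisitorid: "A", visitid: 1, hitnumber: 2, index: 3, value: "yes" },
];

// Column names the article assigns to custom dimension indexes 1-3.
const columns = { 1: "productCategory", 2: "loyaltyClass", 3: "existingCustomer" };

// Mimics GROUP BY (fullvisitorid, visitid, hitnumber) combined with
// MAX(CASE WHEN index = n THEN value END) for every n in `columns`.
function pivotByHit(rows, columns) {
  const groups = new Map();
  for (const row of rows) {
    const key = [row.fullvisitorid, row.visitid, row.hitnumber].join("|");
    if (!groups.has(key)) {
      const out = {
        fullvisitorid: row.fullvisitorid,
        visitid: row.visitid,
        hitnumber: row.hitnumber,
      };
      // Every pivoted column starts as null, like SQL NULL.
      for (const name of Object.values(columns)) out[name] = null;
      groups.set(key, out);
    }
    const name = columns[row.index];
    if (name === undefined) continue; // index we are not pivoting
    const out = groups.get(key);
    // Like MAX: skip nulls, keep the largest value seen so far.
    if (out[name] === null || row.value > out[name]) out[name] = row.value;
  }
  return [...groups.values()];
}

const pivoted = pivotByHit(rows, columns);
console.log(pivoted);
```

Each hit ends up as a single object with three named columns, and a dimension that was never set on that hit stays null, which is the same shape the SQL query returns.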

Custom JavaScript UDF

An alternative approach is to solve the same use case with a User Defined Function (UDF). Since we call the UDF with the custom dimensions as an array, we don’t have to unnest the custom dimensions in the SQL, and the syntax becomes cleaner.

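#standardSQL
CREATE TEMPORARY FUNCTION customDimensionByIndex(index INT64, arr ARRAY<STRUCT<index INT64, value STRING>>)
RETURNS STRING
LANGUAGE js AS """
  // Return the value of the first array element whose index matches.
  for (var j = 0; j < arr.length; j++) {
    // Loose equality, so a numeric string and a number still match.
    if (arr[j].index == index) {
      return arr[j].value;
    }
  }
  // No match: return null explicitly so the result is NULL.
  return null;
""";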

SELECT 
  fullvisitorid,
  visitid,
  hit.hitnumber,
  customDimensionByIndex(1, hit.customDimensions) as productCategory,
  customDimensionByIndex(2, hit.customDimensions) as loyaltyClass,
  customDimensionByIndex(3, hit.customDimensions) as existingCustomer
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`,
UNNEST(hits) as hit
LIMIT 100

A JavaScript UDF may run slower than plain SQL, so it is a trade-off between simplicity in writing the SQL and performance when running it. The UDF is generic enough to apply to custom dimensions at any level, i.e. session, hit or product: you pass it the customDimensions array of the level you are interested in.
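
Because the UDF body is ordinary JavaScript, it can be tried out locally (for example in Node.js) before it goes into a query. The sketch below wraps the same loop in a plain function and calls it with two hypothetical arrays standing in for session-level and hit-level custom dimensions; the sample values are made up. It also shows what happens when an index is missing: the function falls through and returns undefined, which, as far as I know, BigQuery surfaces as NULL.

```javascript
// Same logic as the UDF body above, wrapped in a plain function for local testing.
function customDimensionByIndex(index, arr) {
  for (var j = 0; j < arr.length; j++) {
    // Loose equality, as in the UDF, so 2 and "2" both match.
    if (arr[j].index == index) {
      return arr[j].value;
    }
  }
  // Falls through and returns undefined when no element matches.
}

// Hypothetical arrays standing in for customDimensions at two levels.
const sessionDims = [{ index: 4, value: "newsletter" }];
const hitDims = [
  { index: 1, value: "helmets" },
  { index: "2", value: "gold" }, // index deliberately given as a string
];

console.log(customDimensionByIndex(1, hitDims));     // helmets
console.log(customDimensionByIndex(2, hitDims));     // gold
console.log(customDimensionByIndex(4, sessionDims)); // newsletter
console.log(customDimensionByIndex(3, hitDims));     // undefined
```

The function never looks at where the array came from, which is why the same UDF works for any level of custom dimension.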

Custom SQL UDF

The third option is to use a SQL UDF, which should perform better than the JavaScript version. Felipe Hoffa (Developer Advocate at Google) was kind enough to port the JavaScript UDF to a SQL UDF.

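#standardSQL
-- Returns the value of the custom dimension with the given index, or NULL if it is absent.
-- The scalar subquery raises an error if more than one array element matches.
CREATE TEMP FUNCTION customDimensionByIndex(indx INT64, arr ARRAY<STRUCT<index INT64, value STRING>>) AS (
  (SELECT x.value FROM UNNEST(arr) x WHERE x.index = indx)
);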

SELECT 
  fullvisitorid,
  visitid,
  hit.hitnumber,
  customDimensionByIndex(1, hit.customDimensions) as productCategory,
  customDimensionByIndex(2, hit.customDimensions) as loyaltyClass,
  customDimensionByIndex(3, hit.customDimensions) as existingCustomer
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`,
UNNEST(hits) as hit
LIMIT 100

Robert Sahlin

Digital Data Scientist and Engineer, Open source enthusiast.

Stockholm, Sweden https://robertsahlin.com