BigQuery: How to write a query for an AB test and do analysis in Jupyter Notebook?


This is Jing, a data analyst with great passion for data science and big data technology. Have you ever heard about Online Conversion Rate Optimisation (CRO)? Running an AB test is a critical part in CRO, testing the hypothesis of improving users’ online experience. This blog is gonna to show you how to write a Query for an AB test in BigQuery and do a basic analysis on the test result in the Jupyter Notebook.


Before we start, I assume you already have Jupyter Notebook installed on you computer and have a project on the google cloud platform with data saved in BigQuery. And the experiment id is properly set in Google Analytics for the AB test. 

Besides, I also assume you are familiar with Google Analytics, for example, you know what is client ID, session, hits, customer dimensions and etc.

Use Case

An AB test aiming for improving online performance is implemented on an e-commence website. The goal of this AB test is to see whether the variant part performs better than the control part  at the user level in terms of add to cart (ATC) rate, conversion rate (CR) or transaction revenue. Since the User-ID has not been enabled in the Analytics Setup, the Client ID is roughly used as the identification for each user in following analysis.

Standard SQL Example in BigQuery 

Here is an example of querying AB test data from the BigQuery for this AB test analysis. This Query is written in Jupyter Notebook which is connected with BigQuery. As for how to connect them, please refer my another blog: Connect your Jupyter Notebook to the BigQueryIf you are running the query directly in the BigQuery User Interface, please replace the query parameters e.g. @START_DATE with your owns (e.g. 20180627) 

AB test BigQuery

BigQuery Export Schema for Analytics Data

Explanation about some columns in the BigQuery Dataset:

  • fullVisitorId is the unique identification for an online visitor which is also known as client ID. 
  • visitId is the identification of a session, which is not globally uniqueonly unique to a user. 
    • Combination of fullVisitorId and visitId can identify a unique session
  •  hits is a nested field which is recording all the hit level information during the session.
  • hits.experiment.experimentId is the ID of the experiment. In this case, it has the ID info of an AB test. 
  • hits.experiment.experimentVariant is the variation or combination of variations present in a hit for an experiment. In this case, it indicates the control part and the variants for an AB test. 
  • hits.eCommerceAction.action_type stands for the action type of the hit, e.g. Add product(s) to cart = ‘3’
  • totals.transactions stands for the total number of e-commerce transactions within the session.
  • totals.totalTransactionRevenue stands for the total transaction revenue, expressed as the value passed to Analytics multiplied by 10^6 
  • More information at BigQuery Export schema

How the example Query work?

The example Query has two parts:

Part 1 : Query Data at a session level

For each session, we want to know 

  • which client this session belongs to
    • fullvisitorId
  • which AB test variant this session is in
    • hits.experiment.experimentVariant
  • which device this session comes from
    • device.deviceCategory
  • whether this session has add to cart action
    • As long as there is one hit has action_type = ‘3’, the add_to_cart will be 1 for this session.
      • MAX (IF (h.eCommerceAction.action_type = ‘3’, 1, 0)) AS add_to_cart
  • whether this session has a transaction
    • As long as the totals.transactions is larger than 0, the make_a_trans will be 1 for this session.
      • IF (MAX(totals.transactions)>0, 1, 0) AS make_a_trans
  • how much transaction revenue made in this session 
    • The total transaction revenue should be the maximum totals.transactionRevenue
      • MAX(totals.transactionRevenue/1000000) AS trans_rev

Part 2 : Aggregate the data to the user (client) level

Make a Query to the session level dataset we just got, then aggregate the data to the user level. For each client, we want to know

  • how many sessions each a client has
  • how many sessions with Add to Cart action  a client has
  • how many sessions with a transaction a client has
  • how much transaction revenue a client creates 

Example Result in a Pandas DataFrame format

BigQuery result in a Pandas DataFrame

Analyse the AB test result in Jupyter Notebook

We can get a KPI overview for different devices as a beginning :

AB test KPI overview

Then, continue on a deeper analysis on Rev./User to see whether the test 1 is significantly worse than the test 0. 

  • Eliminating the outliers (users with huge number of transaction revenue will be removed)
  • Run a Mannwhitney test to see whether the difference for transaction revenue per user in this AB test is significant. 

For an example, the Rev./User distribution for all the devices :

Mannwhitney Test for Transaction Revenue per User

You can do similar analysis on the every order value to see whether the test make a significant differences on encouraging users to spend more money. There are other analysis can be done. You can explore more by yourself. 

Find the whole scrip for running the query and analysing the result below:

Congratulations! We are done here!

Now you should have successfully queried your AB test data from BigQuery and done some analysis on your AB test. 

This is not something amazing, but hopefully will make your work a little bit easier! This is Jing, thanks for reading my blog! Feel free to commend and ask questions! I will get back to you as much as possible.