BigQuery: How to write a query for an AB test and do analysis in Jupyter Notebook?

Hi,

This is Jing, a data analyst with a great passion for data science and big data technology. Have you ever heard about Online Conversion Rate Optimisation (CRO)? Running an AB test is a critical part of CRO: it tests a hypothesis about how to improve users’ online experience. This blog is going to show you how to write a query for an AB test in BigQuery and do a basic analysis of the test results in a Jupyter Notebook.

Prerequisite

Before we start, I assume you already have Jupyter Notebook installed on your computer, have a project on the Google Cloud Platform with data saved in BigQuery, and have the experiment ID properly set up in Google Analytics for the AB test.

Besides, I also assume you are familiar with Google Analytics: for example, you know what a client ID, a session, hits, and custom dimensions are.

Use Case

An AB test aimed at improving online performance is implemented on an e-commerce website. The goal of this AB test is to see whether the variant performs better than the control at the user level in terms of add to cart (ATC) rate, conversion rate (CR) and transaction revenue. Since the User-ID feature has not been enabled in the Analytics setup, the Client ID is used as a rough identifier for each user in the following analysis.

Standard SQL Example in BigQuery 

Here is an example of querying AB test data from BigQuery for this AB test analysis. The query is written in a Jupyter Notebook that is connected to BigQuery; for how to connect them, please refer to my other blog: Connect your Jupyter Notebook to BigQuery. If you are running the query directly in the BigQuery user interface, please replace the query parameters (e.g. @START_DATE) with your own values (e.g. 20180627).

[Image: the AB test query in BigQuery]
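Since the query only appears as an image above, here is a minimal sketch of what it could look like, run through the Python client inside the notebook. The project and dataset path (your-project.your_dataset) and the parameter values are placeholders you need to replace with your own; the query structure follows the two parts explained in the next sections.

from google.cloud import bigquery

client = bigquery.Client()

query = """
WITH session_level AS (
  -- Part 1: one row per session
  SELECT
    fullVisitorId,
    visitId,
    exp.experimentVariant AS variant,
    device.deviceCategory AS device_category,
    MAX(IF(h.eCommerceAction.action_type = '3', 1, 0)) AS add_to_cart,
    IF(MAX(totals.transactions) > 0, 1, 0) AS make_a_trans,
    MAX(totals.transactionRevenue / 1000000) AS trans_rev
  FROM `your-project.your_dataset.ga_sessions_*`,
    UNNEST(hits) AS h,
    UNNEST(h.experiment) AS exp
  WHERE _TABLE_SUFFIX BETWEEN @START_DATE AND @END_DATE
    AND exp.experimentId = @EXPERIMENT_ID
  GROUP BY fullVisitorId, visitId, variant, device_category
)
-- Part 2: aggregate the sessions to the user (client) level
SELECT
  fullVisitorId,
  variant,
  device_category,
  COUNT(*) AS sessions,
  SUM(add_to_cart) AS atc_sessions,
  SUM(make_a_trans) AS trans_sessions,
  SUM(IFNULL(trans_rev, 0)) AS total_rev
FROM session_level
GROUP BY fullVisitorId, variant, device_category
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("START_DATE", "STRING", "20180601"),
        bigquery.ScalarQueryParameter("END_DATE", "STRING", "20180627"),
        bigquery.ScalarQueryParameter("EXPERIMENT_ID", "STRING", "your-experiment-id"),
    ]
)
df = client.query(query, job_config=job_config).to_dataframe()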

BigQuery Export Schema for Analytics Data

Explanation of some columns in the BigQuery dataset:

  • fullVisitorId is the unique identifier for an online visitor, also known as the client ID.
  • visitId is the identifier of a session; it is not globally unique, only unique to a user.
    • The combination of fullVisitorId and visitId identifies a unique session.
  • hits is a nested field that records all the hit level information during the session.
  • hits.experiment.experimentId is the ID of the experiment; in this case, it holds the ID of our AB test.
  • hits.experiment.experimentVariant is the variation or combination of variations present in a hit for an experiment; in this case, it distinguishes the control from the variants of the AB test.
  • hits.eCommerceAction.action_type stands for the action type of the hit, e.g. Add product(s) to cart = '3'.
  • totals.transactions stands for the total number of e-commerce transactions within the session.
  • totals.totalTransactionRevenue stands for the total transaction revenue, expressed as the value passed to Analytics multiplied by 10^6.
  • More information at BigQuery Export schema

How does the example query work?

The example query has two parts:

Part 1 : Query Data at a session level

For each session, we want to know 

  • which client this session belongs to
    • fullVisitorId
  • which AB test variant this session is in
    • hits.experiment.experimentVariant
  • which device this session comes from
    • device.deviceCategory
  • whether this session has an add to cart action
    • As long as at least one hit has action_type = '3', add_to_cart will be 1 for this session.
      • MAX(IF(h.eCommerceAction.action_type = '3', 1, 0)) AS add_to_cart
  • whether this session has a transaction
    • As long as totals.transactions is larger than 0, make_a_trans will be 1 for this session.
      • IF(MAX(totals.transactions) > 0, 1, 0) AS make_a_trans
  • how much transaction revenue was made in this session
    • Since the totals fields repeat on every hit row after unnesting, taking the maximum simply picks up the session total.
      • MAX(totals.transactionRevenue/1000000) AS trans_rev

Part 2 : Aggregate the data to the user (client) level

Make a query against the session level dataset we just got, then aggregate the data to the user level. For each client, we want to know

  • how many sessions a client has
  • how many sessions with an Add to Cart action a client has
  • how many sessions with a transaction a client has
  • how much transaction revenue a client creates 

Example Result in a Pandas DataFrame format

[Image: the BigQuery result in a pandas DataFrame]

Analyse the AB test result in Jupyter Notebook

To begin, we can get a KPI overview for the different devices:

[Image: AB test KPI overview]
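Here is a minimal sketch of how such an overview could be computed, assuming the query result from above sits in the DataFrame df with the column names used in the query sketch:

kpi = (
    df.groupby(["device_category", "variant"])
      .agg(
          users=("fullVisitorId", "nunique"),
          sessions=("sessions", "sum"),
          atc_sessions=("atc_sessions", "sum"),
          trans_sessions=("trans_sessions", "sum"),
          revenue=("total_rev", "sum"),
      )
)
# Derive the KPIs: ATC rate, conversion rate and revenue per user.
kpi["atc_rate"] = kpi["atc_sessions"] / kpi["sessions"]
kpi["conversion_rate"] = kpi["trans_sessions"] / kpi["sessions"]
kpi["rev_per_user"] = kpi["revenue"] / kpi["users"]
print(kpi)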

Then, continue with a deeper analysis of Rev./User to see whether variant 1 is significantly worse than variant 0.

  • Eliminate the outliers (users with extremely high transaction revenue will be removed).
  • Run a Mann-Whitney U test to see whether the difference in transaction revenue per user in this AB test is significant (see the code sketch below).

As an example, here is the Rev./User distribution across all devices:

[Image: Rev./User distribution and Mann-Whitney test for transaction revenue per user]
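Here is a minimal sketch of the outlier filter and the Mann-Whitney U test, assuming df is the user level DataFrame from the query sketch and that the variant column takes the string values '0' (control) and '1' (variant); the 99th percentile cutoff is an assumed choice:

from scipy import stats

# Total revenue per user across all devices.
rev = df.groupby(["fullVisitorId", "variant"], as_index=False)["total_rev"].sum()

# Eliminate outliers: drop users above the 99th revenue percentile.
cutoff = rev["total_rev"].quantile(0.99)
rev = rev[rev["total_rev"] <= cutoff]

control = rev.loc[rev["variant"] == "0", "total_rev"]
variant = rev.loc[rev["variant"] == "1", "total_rev"]

# Two-sided Mann-Whitney U test on revenue per user.
stat, p_value = stats.mannwhitneyu(control, variant, alternative="two-sided")
print("U = %.1f, p = %.4f" % (stat, p_value))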

You can do a similar analysis on the average order value to see whether the test makes a significant difference in encouraging users to spend more money. There are other analyses that can be done; you can explore more by yourself.

Find the whole script for running the query and analysing the result below:

Congratulations! We are done here!

Now you should have successfully queried your AB test data from BigQuery and done some analysis on your AB test. 

This is not something amazing, but hopefully it will make your work a little bit easier! This is Jing, thanks for reading my blog! Feel free to comment and ask questions! I will get back to you as soon as I can.


Connect your Jupyter Notebook to BigQuery

Hi,

This is Jing, a data analyst with a great passion for data science and big data technology. This blog is going to show you how to connect your Jupyter Notebook to BigQuery using a service account.


Prerequisite

Before we start, I assume you already have Jupyter Notebook installed on your computer and have a project on the Google Cloud Platform with data saved in BigQuery.

In this blog, I created a project called “helloworld”, with a dataset called “example_dataset” in my BigQuery. I am going to show you how to get the data from BigQuery on the Google Cloud from a Jupyter Notebook on my local computer, a Mac to be more specific.

Let’s begin

  • Step 0: Enable BigQuery API on the Google Cloud Platform Console

Search for the BigQuery API using the search function on the Google Cloud Platform console and enable it; afterwards, the API should show up as enabled.

  • Step 1 : Get a service account key for BigQuery in a JSON file

The JSON file of the service account key is used to authenticate your Jupyter Notebook on your local PC, which will access BigQuery later on.

Go to Google Cloud Platform -> APIs & Services -> Credentials


Get a service account key via Create credentials -> Service account key


Under Select account, choose New service account and give your new service account a name; here we name it “bq-jupyternotebook”. To limit the permissions of this service account, we choose “BigQuery Admin” as its role: we want full access to BigQuery, but only to BigQuery, not to the other Google products in this project. For detailed information about service accounts, please refer to Service Accounts in Google Cloud


Click on “Create”, and you will get the key for this service account in a JSON file. Put this JSON file in a folder you created for your project. In this blog, we put it in a “key” folder under a folder called “helloworld” and also rename it to “helloworld-key.json”.
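If you prefer the command line, the same steps can be sketched with the gcloud CLI; the project and account names below follow this blog’s examples, so adjust them to your own setup:

gcloud iam service-accounts create bq-jupyternotebook
gcloud projects add-iam-policy-binding helloworld \
    --member="serviceAccount:bq-jupyternotebook@helloworld.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"
gcloud iam service-accounts keys create ~/Desktop/helloworld/key/helloworld-key.json \
    --iam-account=bq-jupyternotebook@helloworld.iam.gserviceaccount.com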

  • Step 2 : Install the Google BigQuery API client library for Python on your computer

The Google BigQuery API client library for Python includes the functions you need to connect your Jupyter Notebook to BigQuery.

Open your terminal and type in the command below. More information about the Google BigQuery API client library

pip install google-cloud-bigquery
  • Step 3 : Set the environment variable GOOGLE_APPLICATION_CREDENTIALS for your python script

Set the environment variable called “GOOGLE_APPLICATION_CREDENTIALS” to point your notebook to the service account key you just created in Step 1. This variable lets Google know where to find your authentication credentials. Run this every time before you start your Jupyter Notebook.

Type in the command below in your terminal:

export GOOGLE_APPLICATION_CREDENTIALS="/Users/jinjing/Desktop/helloworld/key/helloworld-key.json"

Be careful with the path to the service account key JSON file: you need to specify the whole path, as in the example above.
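Alternatively, you can set the variable from inside Python before the BigQuery client is created; a minimal sketch using this blog’s example path:

import os

# Must be set before the BigQuery client is created.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/jinjing/Desktop/helloworld/key/helloworld-key.json"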

  • Step 4 : Open your Jupyter Notebook and let’s get data from BigQuery

Create a new Jupyter Notebook in the folder for your project, and look at the example code below to see how it works. More examples for using the Python Client for BigQuery

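Here is a minimal sketch of such example code; the table name example_table is a placeholder for a table in your own dataset:

from google.cloud import bigquery

# The client picks up the credentials from GOOGLE_APPLICATION_CREDENTIALS.
client = bigquery.Client(project="helloworld")

query = """
    SELECT *
    FROM `helloworld.example_dataset.example_table`
    LIMIT 10
"""

# Run the query and load the result into a pandas DataFrame.
df = client.query(query).to_dataframe()
df.head()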

Congratulations! We are done here!

Now you should have successfully connected your Jupyter Notebook to BigQuery and gotten some data! Enjoy your data analysis and have fun!

This is not something amazing, but hopefully it will make your work a little bit easier! This is Jing, thanks for reading my blog! Feel free to comment and ask questions! I will get back to you as soon as I can.