BigQuery: How to write a query for an AB test and do analysis in Jupyter Notebook?

Hi,

This is Jing, a data analyst with a great passion for data science and big data technology. Have you ever heard of Online Conversion Rate Optimisation (CRO)? Running an AB test is a critical part of CRO, testing a hypothesis about improving users’ online experience. This blog will show you how to write a query for an AB test in BigQuery and do a basic analysis of the test result in a Jupyter Notebook.

Prerequisite

Before we start, I assume you already have Jupyter Notebook installed on your computer and have a project on the Google Cloud Platform with data saved in BigQuery. I also assume the experiment ID is properly set in Google Analytics for the AB test.

Besides, I also assume you are familiar with Google Analytics; for example, you know what a client ID, a session, hits and custom dimensions are.

Use Case

An AB test aimed at improving online performance is implemented on an e-commerce website. The goal of this AB test is to see whether the variant performs better than the control at the user level in terms of add to cart (ATC) rate, conversion rate (CR) or transaction revenue. Since the User-ID has not been enabled in the Analytics setup, the Client ID is used as a rough identification of each user in the following analysis.

Standard SQL Example in BigQuery 

Here is an example of querying AB test data from BigQuery for this AB test analysis. The query is written in a Jupyter Notebook which is connected to BigQuery. As for how to connect them, please refer to my other blog: Connect your Jupyter Notebook to the BigQuery. If you are running the query directly in the BigQuery user interface, please replace the query parameters, e.g. @START_DATE, with your own values (e.g. 20180627).

AB test BigQuery
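The query itself appears above only as a screenshot, so here is a sketch of what such a two-part query could look like in Standard SQL, run from Python with the query parameters mentioned above. The `your-project.your_dataset` table path is a placeholder, and the column choices follow the explanation in the next sections; adapt everything to your own export dataset and experiment.

```python
# A sketch of the AB test query and how to run it with parameters from a
# Jupyter Notebook. `your-project.your_dataset` is a placeholder -- replace
# it with your own Google Analytics export dataset.
AB_TEST_QUERY = """
WITH sessions AS (
  -- Part 1: one row per session
  SELECT
    fullVisitorId,
    visitId,
    MAX(device.deviceCategory) AS device,
    MAX(exp.experimentVariant) AS variant,
    MAX(IF(h.eCommerceAction.action_type = '3', 1, 0)) AS add_to_cart,
    IF(MAX(totals.transactions) > 0, 1, 0) AS make_a_trans,
    MAX(totals.transactionRevenue / 1000000) AS trans_rev
  FROM `your-project.your_dataset.ga_sessions_*`,
       UNNEST(hits) AS h,
       UNNEST(h.experiment) AS exp
  WHERE _TABLE_SUFFIX BETWEEN @START_DATE AND @END_DATE
    AND exp.experimentId = @EXPERIMENT_ID
  GROUP BY fullVisitorId, visitId
)
-- Part 2: aggregate the sessions to the user (client) level
SELECT
  fullVisitorId,
  variant,
  device,
  COUNT(*) AS sessions,
  SUM(add_to_cart) AS atc_sessions,
  SUM(make_a_trans) AS trans_sessions,
  SUM(IFNULL(trans_rev, 0)) AS trans_rev
FROM sessions
GROUP BY fullVisitorId, variant, device
"""

def run_ab_test_query(project, start_date, end_date, experiment_id):
    """Run the parameterised query and return the result as a DataFrame."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("START_DATE", "STRING", start_date),
        bigquery.ScalarQueryParameter("END_DATE", "STRING", end_date),
        bigquery.ScalarQueryParameter("EXPERIMENT_ID", "STRING", experiment_id),
    ])
    return client.query(AB_TEST_QUERY, job_config=job_config).to_dataframe()
```

For example, `df = run_ab_test_query('your-project', '20180601', '20180627', 'your-experiment-id')` would return the user-level table used in the rest of this blog.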

BigQuery Export Schema for Analytics Data

Explanation of some columns in the BigQuery dataset:

  • fullVisitorId is the unique identifier of an online visitor, also known as the client ID. 
  • visitId is the identifier of a session, which is not globally unique, only unique to a user. 
    • The combination of fullVisitorId and visitId identifies a unique session.
  • hits is a nested field recording all the hit-level information during the session.
  • hits.experiment.experimentId is the ID of the experiment. In this case, it has the ID info of an AB test. 
  • hits.experiment.experimentVariant is the variation or combination of variations present in a hit for an experiment. In this case, it indicates the control part and the variants for an AB test. 
  • hits.eCommerceAction.action_type stands for the action type of the hit, e.g. Add product(s) to cart = '3'
  • totals.transactions stands for the total number of e-commerce transactions within the session.
  • totals.totalTransactionRevenue stands for the total transaction revenue, expressed as the value passed to Analytics multiplied by 10^6.
  • More information at BigQuery Export schema

How does the example query work?

The example Query has two parts:

Part 1 : Query Data at a session level

For each session, we want to know 

  • which client this session belongs to
    • fullVisitorId
  • which AB test variant this session is in
    • hits.experiment.experimentVariant
  • which device this session comes from
    • device.deviceCategory
  • whether this session has add to cart action
    • As long as at least one hit has action_type = '3', add_to_cart will be 1 for this session.
      • MAX(IF(h.eCommerceAction.action_type = '3', 1, 0)) AS add_to_cart
  • whether this session has a transaction
    • As long as totals.transactions is larger than 0, make_a_trans will be 1 for this session.
      • IF(MAX(totals.transactions) > 0, 1, 0) AS make_a_trans
  • how much transaction revenue was made in this session 
    • Since the totals values repeat on every hit row after unnesting, the session’s total transaction revenue is the maximum totals.transactionRevenue.
      • MAX(totals.transactionRevenue/1000000) AS trans_rev

Part 2 : Aggregate the data to the user (client) level

Query the session-level dataset we just got, then aggregate the data to the user level. For each client, we want to know:

  • how many sessions a client has
  • how many sessions with an Add to Cart action a client has
  • how many sessions with a transaction a client has
  • how much transaction revenue a client creates 
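If you already have the Part 1 session-level rows in a DataFrame, the same user-level aggregation can also be sketched in pandas (the values below are made up purely for illustration; the column names follow Part 1):

```python
import pandas as pd

# Toy session-level data shaped like the Part 1 output (made-up values).
sessions = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B", "B", "B"],
    "variant": ["0", "0", "1", "1", "1"],
    "device": ["desktop", "desktop", "mobile", "mobile", "mobile"],
    "add_to_cart": [1, 0, 1, 1, 0],
    "make_a_trans": [1, 0, 0, 1, 0],
    "trans_rev": [25.0, 0.0, 0.0, 40.0, 0.0],
})

# Part 2 in pandas: aggregate the session rows to the user (client) level.
users = (sessions
         .groupby(["fullVisitorId", "variant", "device"], as_index=False)
         .agg(sessions=("add_to_cart", "size"),
              atc_sessions=("add_to_cart", "sum"),
              trans_sessions=("make_a_trans", "sum"),
              trans_rev=("trans_rev", "sum")))
print(users)
```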

Example Result in a Pandas DataFrame format

BigQuery result in a Pandas DataFrame

Analyse the AB test result in Jupyter Notebook

To begin with, we can get a KPI overview for the different devices:

AB test KPI overview
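As an illustration, such an overview could be computed from the user-level DataFrame like this. The numbers are toy values, not real results, and these KPI definitions are just one possible choice:

```python
import pandas as pd

# Toy user-level data shaped like the query output (made-up values).
users = pd.DataFrame({
    "variant": ["0", "0", "1", "1"],
    "device": ["desktop", "mobile", "desktop", "mobile"],
    "sessions": [120, 80, 118, 85],
    "atc_sessions": [30, 16, 35, 17],
    "trans_sessions": [12, 8, 10, 9],
    "trans_rev": [600.0, 320.0, 540.0, 350.0],
})

# One possible KPI overview per device and variant.
kpi = users.groupby(["device", "variant"]).sum()
kpi["atc_rate"] = kpi["atc_sessions"] / kpi["sessions"]
kpi["conversion_rate"] = kpi["trans_sessions"] / kpi["sessions"]
kpi["rev_per_session"] = kpi["trans_rev"] / kpi["sessions"]
print(kpi)
```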

Then, continue with a deeper analysis of Rev./User to see whether variant 1 is significantly worse than variant 0.

  • Eliminate the outliers (users with a huge amount of transaction revenue will be removed)
  • Run a Mann-Whitney U test to see whether the difference in transaction revenue per user in this AB test is significant. 
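The two steps above could be sketched like this, using simulated revenue numbers in place of the real Rev./User columns and scipy's Mann-Whitney U implementation; the 99th-percentile cutoff is just one possible outlier rule:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Simulated revenue-per-user samples for the control (0) and variant (1);
# replace these with the trans_rev column of each variant.
rev_0 = rng.exponential(scale=30.0, size=500)
rev_1 = rng.exponential(scale=28.0, size=500)

def trim_outliers(x, quantile=0.99):
    """Drop users above the given revenue quantile (a simple outlier rule)."""
    return x[x <= np.quantile(x, quantile)]

# Mann-Whitney U test on the trimmed samples.
stat, p_value = mannwhitneyu(trim_outliers(rev_0), trim_outliers(rev_1),
                             alternative="two-sided")
print(f"p-value: {p_value:.4f}")
```

A p-value below your significance level (e.g. 0.05) would indicate that the two revenue-per-user distributions differ significantly.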

For example, here is the Rev./User distribution for all devices:

Mann-Whitney Test for Transaction Revenue per User

You can do a similar analysis on the average order value to see whether the test makes a significant difference in encouraging users to spend more money. There are other analyses that can be done; you can explore more by yourself. 

Find the whole script for running the query and analysing the result below:

Congratulations! We are done here!

Now you should have successfully queried your AB test data from BigQuery and done some analysis on your AB test. 

This is not something amazing, but hopefully it will make your work a little bit easier! This is Jing, thanks for reading my blog! Feel free to comment and ask questions! I will get back to you as soon as possible.

Connect your Jupyter Notebook to Google Sheets

Hi,

This is Jing, a data analyst with a great passion for data science and big data technology. As a data analyst, working with Excel or Google Sheets is sometimes inevitable, considering that a lot of your colleagues are more comfortable seeing data or results in Excel or Google Sheets. This blog will show you how to connect your Jupyter Notebook to Google Sheets to make your workflow more automated.

Prerequisite

Before we start, I assume you already have Jupyter Notebook installed on your computer.

Let’s begin!

Step 0 : Create a project and Enable Google Drive API

Create a project for your application in the Google Cloud Console if you do not have one. In this blog, I created a project called “user deletion”, which I also used in another blog: Get Google Analytics data to your Jupyter Notebook by using Analytics Reporting API.

Go to APIs & Services -> Library, find the Google Drive API and enable it.

Step 1 : Create a Service Account for your Jupyter Notebook

Go to APIs & Services -> Credentials -> Create credentials -> Service account key -> New service account, name it “Google_Sheet” -> select the role “Service Account User” for this account -> select “JSON” -> click on “Create”. Download the key file, rename it “service_account_GS.json” and put it in a folder you are comfortable with.

Step 2 : Create a Google Sheet and share it with Service Account E-mail

Open the “service_account_GS.json” you’ve just downloaded and find your “client_email”.

Share the Google Sheet you just created with this “client_email”.

Get the sheet ID from the URL of the sheet.

Step 3: Install the libraries

Open your terminal and type in the following command:

pip install gspread oauth2client df2gspread

Step 4: Open your Jupyter Notebook

See the Notebook template here
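Since the template itself is only linked, here is a sketch of what it could contain, using the libraries installed in Step 3. The sheet URL and worksheet name are placeholders; replace them with the sheet you shared in Step 2:

```python
import re

def sheet_id_from_url(url):
    """Pull the sheet ID out of a Google Sheet URL (pure string handling)."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None

def upload_df(df, sheet_url, worksheet="Sheet1",
              keyfile="service_account_GS.json"):
    """Upload a DataFrame to the shared sheet (sketch; needs network access)."""
    from oauth2client.service_account import ServiceAccountCredentials
    from df2gspread import df2gspread as d2g
    scope = ["https://spreadsheets.google.com/feeds",
             "https://www.googleapis.com/auth/drive"]
    credentials = ServiceAccountCredentials.from_json_keyfile_name(keyfile, scope)
    d2g.upload(df, sheet_id_from_url(sheet_url), worksheet,
               credentials=credentials, row_names=False)
```

For example, `upload_df(my_df, "https://docs.google.com/spreadsheets/d/<your-sheet-id>/edit")` would push `my_df` into the first worksheet.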

Now you should see the Google Sheet is updated with the data you uploaded!


Congratulations! We are done here!

Now you should have successfully connected your Jupyter Notebook to a Google Sheet! Find more information about working with Google Sheets at df2gspread.

This is not something amazing, but hopefully it will make your work a little bit easier! This is Jing, thanks for reading my blog! Feel free to comment and ask questions! I will get back to you as soon as possible.

Get Google Analytics data to your Jupyter Notebook by using Analytics Reporting API

Hi,

This is Jing, a data analyst with a great passion for data science and big data technology. This blog will show you how to use the Google Analytics Reporting API to get Google Analytics data into your Jupyter Notebook or save it to Excel.

Prerequisite

Before we start, I assume you already have Jupyter Notebook installed on your computer and have access to a Google Analytics view of a website. Besides, I also assume you are familiar with Google Analytics; for example, you know what a client ID, a session, an event label and custom dimensions are.

Use Case : Get a list of segmented users’ client ID

To make a better explanation, I made up a use case which may sound silly. Let’s imagine a scenario where, for certain reasons, some users’ phone numbers are saved in the GA event labels during their online sessions because of some tracking setups. We want to find out these users’ client IDs so that we can later delete them to avoid potential GDPR issues.

In this blog, I will show you how to find these users and get a list of their client IDs in your Jupyter Notebook. Later on, I will show you how to delete these users in GA with one click in another blog.

In this use case, your Jupyter Notebook is an application sending requests to the Google Analytics server. So your Jupyter Notebook is also a client to Google, which also has a client ID. This is easy to confuse with the client ID in Google Analytics. Sorry for this.

Let’s begin!

Step 0 : Create a project and Enable Google Analytics Reporting API

Create a project for your application in Google Cloud Console if you do not have one. In this blog, I created a project called “user deletion”. 

Go to APIs & Services -> Library, find the Google Analytics Reporting API and enable it.

Step 1 : Create the client credential for your Jupyter Notebook for requesting Google Analytics 

Go to APIs & Services -> Credentials -> Create credentials -> OAuth client ID

Select Other, name your client ID “Jupyter_Notebook”, and click on Create -> OK


Then you can find your client credentials on the Credentials page. By clicking on the download button, you can download your client credentials as a JSON file. Rename it “client_secret.json” and put it in a folder you are comfortable with.

Step 2: Install the client library for your Jupyter Notebook

If you have done this before, skip this step. Otherwise, open your terminal and type in the commands below. More information about the Google API Client Libraries for Python. 

sudo pip install --upgrade google-api-python-client
pip install --upgrade oauth2client

Step 3: Create a segment in Google Analytics to find the target users

Go to User Explorer -> Add Segment -> Conditions, and write a regex to find the users whose event labels contain “phoneNumber =”. Then save this segment as “phoneNumber”.
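The same matching can be sketched in Python. The exact label format (“phoneNumber = <number>”) is an assumption here, so adapt the pattern to your own tracking setup:

```python
import re

# A sketch of the kind of regex the segment condition could use; the label
# format "phoneNumber = <digits>" is an assumption, not a general rule.
PHONE_LABEL = re.compile(r"phoneNumber\s*=\s*\+?\d+")

labels = ["click", "phoneNumber = 015712345678", "scroll 50%"]
flagged = [label for label in labels if PHONE_LABEL.search(label)]
print(flagged)
```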

Step 4: Clone the code from jing-jin-mc Github 

https://github.com/jing-jin-mc/GA_reporting_API

In the repository, you have all the code you need, but there are some places you need to modify to make the code work for you.

Step 5: Get the segment ID and Custom Dimension ID

Go to the Query Explorer, authenticate yourself and get the segment ID and the custom dimension ID for the Client ID in Google Analytics. When we type in Client ID, we can see that Custom Dimension 42 (on our website) is set for the Client ID.

Get the segment ID for the segment we just created in a similar way:

Step 6: Jupyter Notebook template for Getting data from Google Analytics

Now, open the Jupyter Notebook template and modify the code with the information you have gathered. The first time you run the template, you will need to authorise your notebook in the web browser. Just follow the instructions from Google and go back to your Notebook; you will see it is done.
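As a sketch of what the template does, the core of such a notebook could look like this, assuming an authorised `analytics` service object built with the Google API client (`googleapiclient.discovery.build('analyticsreporting', 'v4', ...)`). The dimension index 42 and the view/segment IDs are this blog's examples, not general values:

```python
def build_report_request(view_id, segment_id, cd_index=42,
                         start="7daysAgo", end="today"):
    """Build a Reporting API v4 request body. The custom dimension index (42)
    and the segment ID come from Query Explorer, as shown in Step 5."""
    return {"reportRequests": [{
        "viewId": view_id,
        "dateRanges": [{"startDate": start, "endDate": end}],
        # ga:segment must be included as a dimension when segments are used.
        "dimensions": [{"name": f"ga:dimension{cd_index}"},
                       {"name": "ga:segment"}],
        "segments": [{"segmentId": segment_id}],
    }]}

def get_client_ids(analytics, view_id, segment_id):
    """Run the request with an authorised service object (sketch)."""
    body = build_report_request(view_id, segment_id)
    response = analytics.reports().batchGet(body=body).execute()
    rows = response["reports"][0]["data"].get("rows", [])
    # The first dimension of each row is the client ID custom dimension.
    return [row["dimensions"][0] for row in rows]
```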

Congratulations! We are done here!

Now you should have successfully gotten the Google Analytics data for this use case into your Jupyter Notebook! Besides, I will show you how to upload the data frame to a Google Sheet in another blog to avoid copy-paste issues.

This is not something amazing, but hopefully it will make your work a little bit easier! This is Jing, thanks for reading my blog! Feel free to comment and ask questions! I will get back to you as soon as possible.

 


Connect your Jupyter Notebook to the BigQuery

Hi,

This is Jing, a data analyst with a great passion for data science and big data technology. This blog will show you how to connect your Jupyter Notebook to BigQuery by using a service account.


Prerequisite

Before we start, I assume you already have Jupyter Notebook installed on your computer and have a project on the Google Cloud Platform with data saved in BigQuery.

In this blog, I created a project called “helloworld” with a dataset called “example_dataset” in my BigQuery. I am going to show you how to get the data from BigQuery on the Google Cloud from my Jupyter Notebook on my local computer, a Mac to be more specific.

Let’s begin

Step 1: Enable BigQuery API on the Google Cloud Platform Console

Search for the BigQuery API by using the search function in the Google Cloud Platform console and enable it. You should see something like this after you have enabled it:


Step 2: Get authentication file for your Jupyter Notebook

Option 1: Create your own application default login in the terminal if you already have the Cloud SDK installed

Open your terminal and type in the command below.

gcloud auth application-default login

Make sure application_default_credentials.json has been created:

ls ~/.config/gcloud/

Option 2: A service account key for BigQuery in a JSON file. Recommended for scripts which will run as a service later on, e.g. on a virtual machine

The JSON file with the service account key is used to authenticate your Jupyter Notebook on your local PC, which will access BigQuery later on.

Go to Google Cloud Platform -> APIs & Services -> Credentials


Get a service account key via Create credentials -> Service account key


Select account -> New service account and give a name to your new service account; here we name it “bq-jupyternotebook”. To limit the permissions of this service account, we choose “BigQuery Admin” as its role: we want full access to BigQuery, but only to BigQuery, not to the other Google products in this project. For detailed information about service accounts, please refer to Service Accounts in Google Cloud


Click on “Create” and you get the key for this service account in a JSON file. Put this JSON file in a folder which you created for your project. In this blog, we put it in a “key” folder under a folder called “helloworld” and renamed it “helloworld-key.json”.


Step 3 : Install the Google BigQuery API Client Libraries for Python on your computer

The Google BigQuery API client library for Python includes the functions you need to connect your Jupyter Notebook to BigQuery.

Open your terminal and type in the command below. More information about the Google BigQuery API client library

pip install google-cloud-bigquery

Step 4 : Set the environment variable GOOGLE_APPLICATION_CREDENTIALS for your python script

Set the environment variable “GOOGLE_APPLICATION_CREDENTIALS” to point your Notebook to the service account key which you created in Step 2. This variable lets Google know where to find your authentication credentials. Include the following code in your Jupyter Notebook.

For the option 1 in Step 2

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.expanduser('~/.config/gcloud/application_default_credentials.json')

For the option 2 in the Step 2

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/Users/jinjing/Desktop/helloworld/key/helloworld-key.json'

Be careful with the path to the service account key JSON file: you need to specify the whole path. Look at the example:


Step 5 : Open your Jupyter Notebook, let’s get data from the BigQuery

Create a new Jupyter Notebook in your project folder and look at the example code to see how it works. More examples for using the Python Client for BigQuery. For option 1, you need to specify which project you are querying, e.g.

bigquery_client = bigquery.Client(project='prodounf-yew-217512')
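The example code is roughly as follows (a sketch; `example_table` is a placeholder table name inside the “example_dataset” dataset from above):

```python
def build_preview_query(project, dataset, table, limit=5):
    """Compose a simple preview query; the table name is a placeholder."""
    return f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {limit}"

def preview_table(project, dataset, table):
    """Run the query; authentication comes from GOOGLE_APPLICATION_CREDENTIALS."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client(project=project)  # project is needed for option 1
    return client.query(build_preview_query(project, dataset, table)).to_dataframe()

# e.g. preview_table('helloworld', 'example_dataset', 'example_table')
```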

Congratulations! We are done here!

Now you should have successfully connected your Jupyter Notebook to BigQuery and gotten some data! Enjoy your data analysis and have fun!

This is not something amazing, but hopefully it will make your work a little bit easier! This is Jing, thanks for reading my blog! Feel free to comment and ask questions! I will get back to you as soon as possible.