
Data Studio Report: US Unemployment Analysis

Over the last few months I have been working more and more with Google Data Studio, a tool provided by Google to explore and visualize data. During this period I have created best practices for Ecommerce reports and Publisher reports to share with Google Analytics users. I believe that Data Studio is a great solution for businesses and I will keep working on best practices and sample reports for marketers, advertisers, publishers, etc.

However, Data Studio can (and should!) be used to visualize the world around us too, helping us answer questions such as: How are we affecting our planet? How is society evolving? How do people behave across different countries and cultures? Those questions (and many others) are not easy to answer, but data visualization is the best way to understand the world around us. Thankfully, many governments and global organizations are releasing their data publicly, which means we can create stories out of those datasets to start answering the questions above.

In this article I will present a visualization using data from the US Bureau of Labor Statistics to show trends in US Unemployment Rates segmented by Age, Gender and Race / Ethnicity. This data is important because it gives us insight into how well we are doing, as a society, at providing equal opportunities to everyone. Needless to say, we have a long way to go!

US Unemployment Rates by Age, Gender and Race / Ethnicity

Below is a screenshot of the visualization; you can access the interactive version on Data Studio. If you are interested in some thoughts from the creation process, keep on reading!

US Unemployment Analysis

The Data

The data was pulled from the US Bureau of Labor Statistics and contains figures for 2014 and 2015, drilling down into Race / Ethnicity, Age and Gender. Note that the linked page is updated with new data every now and then, so what you see there is not necessarily what was used in the visualization.

The data was copied into this Google Sheet and then reformatted into a flat table (see the additional tab in the same spreadsheet). The reason for the reformatting is that the Data Studio connector for Google Sheets works only with flat tables. That’s it for data manipulation; it was not a big deal in this case, although sometimes it can be quite a hassle to get data into the right format for visualization.
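
To illustrate what "flat" means here, a flat table has one row per combination of dimensions plus a single metric column, rather than a separate column per year or group. The values below are placeholders, not actual BLS figures:

Year | Race / Ethnicity         | Gender | Age   | Unemployment Rate
2014 | Black / African American | Male   | 16-19 | x.x
2015 | Asian                    | Female | 20-24 | x.x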

Below is a screenshot of the final Data Source on Data Studio.

Data Source

The Visualizations

Since this is a publicly available report, and I wanted it to be a standalone visualization, I added my thoughts side-by-side with the visualizations; this helps viewers understand the data. Here are the points highlighted throughout the report:

  1. Unemployment rates are significantly higher for younger people (especially males) and for the Black / African American population.
  2. Among the five groups with the most significant drop in unemployment rate, three are Asian; the three groups with the most significant increase between 2014 and 2015 are all Hispanic / Latino.
  3. Black / African American groups have the highest unemployment rate, but it is improving significantly faster than the rates for Hispanic / Latino and White groups. Asian groups have the best rates, and they are improving relatively fast.

One note regarding the insights: you should aim to have between 3 and 5 main insights per page on your Data Studio reports; this will help your viewers focus their attention.

Sharing is Caring

Sharing is very simple. Once you finish the analysis, you only have to click on Share (top-right) and choose “Anyone with the link can view” (learn how). But make sure you check the data source usage guidelines to confirm that the data can be shared; this is extremely important!

Hopefully this visualization will shed some light on how we are developing as a society. But I am even more hopeful that it will inspire other people to use their data skills to start visualizing datasets that can help us create a better world.


Funnel Analysis with Google Analytics Data in BigQuery

Conversion funnels are a basic concept in web analytics, and if you’ve worked with them enough, you may have gotten to a point where you want to perform a deeper analysis than your tools will allow.

“Which steps in my funnel are being skipped? What was going on in this funnel before I defined it? Which user characteristics define segments that progress differently through my funnel?” These questions can be answered using the solution described in this article. In particular, I’m going to talk about how to use BigQuery (BQ) to analyze Google Analytics (GA) page-hit data, though the principles could be applied to any page-hit data stored in a relational database.

The Google Analytics Funnel Visualization report (see below) makes certain abstractions and has certain limitations, and advanced users can benefit from using Google BigQuery – an infrastructure-as-a-service offering which allows SQL-like queries over massive datasets.

Funnel Analysis

In this article, we’ll discuss the benefits of using BigQuery for funnel analysis as opposed to the Google Analytics user interface. In order to make the solution clear, I will go over the basic structure of an SQL query for funnel analysis and explain how to use Funneler, a simple Windows application that automates query-writing. The source code of Funneler is also provided as a Python 3 script. Please note that in order to use the specific examples provided here you will need a Google Analytics Premium account linked to BigQuery (learn more about the BigQuery Export feature).

Funnel Analysis – Google Analytics UI vs. BigQuery

The solution I propose below works as follows: a Windows application (or Python script) generates a BigQuery-dialect SQL query which tracks user sessions through a set of pages, optionally segmenting and/or filtering the sessions based on session characteristics. BigQuery’s output is a table with two columns per funnel stage: one for session counts and one for exit counts.

Below is a list of the most significant differences between GA Funnel Visualization and the solution I will be discussing.

  1. Loopbacks: If a user goes from steps 1 -> 2 -> 1, GA will register two sessions: one which goes to step 1, one which goes to step 2, and an exit from step 2 to step 1. Our query will only count one session in the above instance: a session which goes from step 1 to step 2. Furthermore, since progress through the funnel is measured by the “deepest” page reached, the above scenario will not be distinguished from a session which simply goes from step 1 -> 2.
  2. Backfilling funnel steps: GA will backfill any skipped steps between the entrance and the exit. This solution only registers actual page-hits, so you get real page-hit counts.
  3. Historical Information: GA Funnels cannot show historical data on a new funnel, whereas this workflow can be used on any date range during which GA was tracking page-hits on the selected funnel-stage pages.
  4. Advanced Segmentation: GA Funnels don’t support advanced segmentation, whereas with Group By clauses in BigQuery, you can segment the funnel on any column.
  5. Sampling: GA Funnel Visualization shows up to 50,000 unique paths, whereas BQ will contain all the page-hits that GA recorded, and allow you to query them all.

The Query

For Google Analytics data, the basis of a funnel query is a list of URLs or Regular Expressions (regex), each representing a stage in the conversion funnel.

If you have a pre-existing funnel in GA, follow the steps below to find your funnel settings:

  1. Go to Admin in GA
  2. Select the correct Account, Property, and View
  3. Go to Goals
  4. Select a Goal
  5. Click Goal Details

In this screen you will find a regex or URL for each step of the funnel. They may look like this: “/job/apply/”.

The basic process of writing the query, given the list of regexes or URLs, is as follows:

1. Create a base-level subquery for each regex

For each row which has a regex-satisfying value in the URL column, pull out fullVisitorId and visitId (together, these work as a unique session ID) and the smallest hit-number. The smallest hit-number simply serves as a non-null value which will be counted later. The result sets of these subqueries have one row per session.

SELECT
  fullVisitorId,
  visitId,
  MIN(hits.hitNumber) AS firstHit
FROM
  TABLE_DATE_RANGE([<id>.ga_sessions_], TIMESTAMP('YYYY-MM-DD'),
    TIMESTAMP('YYYY-MM-DD'))
WHERE
  REGEXP_MATCH(hits.page.pagePath, '<regex or URL>')
  AND totals.visits = 1
GROUP BY
  fullVisitorId,
  visitId

2. Join the first subquery to the second on session ID

Select the session ID, the hit-number from the first subquery, and the hit-number from the second subquery. By using full outer joins, we’re saying sessions can enter the funnel at any step. If you only want to count sessions at each stage that have also hit a previous stage, use a left join instead.


SELECT
  s0.fullVisitorId,
  s0.visitId,
  s0.firstHit,
  s1.firstHit
FROM (
  # Begin Subquery #1 aka s0
  SELECT
    fullVisitorId,
    visitId,
    MIN(hits.hitNumber) AS firstHit
  FROM
    TABLE_DATE_RANGE([<id>.ga_sessions_], TIMESTAMP('2015-11-01'),
      TIMESTAMP('2015-11-04'))
  WHERE
    REGEXP_MATCH(hits.page.pagePath, '<regex or URL>')
    AND totals.visits = 1
  GROUP BY
    fullVisitorId,
    visitId) s0
  # End Subquery #1 aka s0
FULL OUTER JOIN EACH (
  # Begin Subquery #2 aka s1
  SELECT
    fullVisitorId,
    visitId,
    MIN(hits.hitNumber) AS firstHit
  FROM
    TABLE_DATE_RANGE([<id>.ga_sessions_], TIMESTAMP('2015-11-01'),
      TIMESTAMP('2015-11-04'))
  WHERE
    REGEXP_MATCH(hits.page.pagePath, '<regex or URL>')
    AND totals.visits = 1
  GROUP BY
    fullVisitorId,
    visitId) s1
  # End Subquery #2 aka s1
ON
  s0.fullVisitorId = s1.fullVisitorId
  AND s0.visitId = s1.visitId

3. Join the third subquery to the result of the above join on session ID

Select session ID, hit-number from the first subquery, hit-number from the second subquery, and hit-number from the third subquery.

4. Join the fourth subquery to the result of the above join on session ID

Select session ID, hit-number from the first subquery, hit-number from the second subquery, hit-number from the third subquery, and hit-number from the fourth subquery.

5. Continue until all subqueries are joined in this way
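
To make steps 3 to 5 concrete, below is a minimal sketch (my own, not from the original article) of how a third subquery, s2, would join onto the result of step 2. It assumes the step-2 join is wrapped as a subquery aliased j, with s0.fullVisitorId and s0.visitId re-aliased to fullVisitorId and visitId so the ON clause can reference them; note that legacy BigQuery SQL flattens a field like s0.firstHit into s0_firstHit in j’s output:

SELECT
  j.fullVisitorId,
  j.visitId,
  j.s0_firstHit,
  j.s1_firstHit,
  s2.firstHit
FROM (
  # the entire join from step 2 goes here, with
  # s0.fullVisitorId AS fullVisitorId and s0.visitId AS visitId
) j
FULL OUTER JOIN EACH (
  # Begin Subquery #3 aka s2 (identical to s0/s1, but with the third step's regex)
  SELECT
    fullVisitorId,
    visitId,
    MIN(hits.hitNumber) AS firstHit
  FROM
    TABLE_DATE_RANGE([<id>.ga_sessions_], TIMESTAMP('2015-11-01'),
      TIMESTAMP('2015-11-04'))
  WHERE
    REGEXP_MATCH(hits.page.pagePath, '<regex or URL>')
    AND totals.visits = 1
  GROUP BY
    fullVisitorId,
    visitId) s2
  # End Subquery #3 aka s2
ON
  j.fullVisitorId = s2.fullVisitorId
  AND j.visitId = s2.visitId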

6. Aggregate results

Instead of a row for each session, we want one row with counts of non-null hit-numbers per funnel-step. Take the query so far, and wrap it with this:

SELECT
  COUNT(s0.firstHit) AS _job_details_,
  COUNT(s1.firstHit) AS _job_apply_
FROM (
  (query from 2. goes here if the funnel has two steps))

The query has a recursive structure, which means that we could use a recursive program to generate the query mechanically. This is a major advantage, because for longer funnels, the query can grow quite large (500+ lines for a 13-step funnel). By automating the process, we can save lots of development time. We’ll now go over how to use Funneler to generate the query.

Funneler

Funneler is an executable Python script (no need to have Python installed) which, when fed a JSON file containing a list of regexes or URLs, generates the SQL query in the BigQuery dialect to build that funnel. It manipulates and combines strings of SQL code recursively. It extends the functionality of the query described above, allowing sessions to be segmented and filtered on any column in the BigQuery table.

Funneler and funneler.py can be found on my Github page (https://github.com/douug).

The input to Funneler is a JSON document with the following name/value pairs:

  • Table name, with the following format: [(Dataset ID).ga_sessions_]
  • Start date: ‘YYYY-MM-DD’
  • End date: ‘YYYY-MM-DD’
  • List of regexes: one regex per funnel-step
  • Segmode: True for segmenting, False otherwise
  • Segment: The column to segment on
  • Filtermode: True for filtering, False otherwise
  • Filtercol: The column to filter on
  • Filterval: The value to filter on in the above-mentioned column

Here is an example of an input JSON:


{
  "table": "[123456789.ga_sessions_]",
  "start": "'2015-11-01'",
  "end": "'2015-11-04'",
  "regex_list": ["'/job/details/'",
        "'/job/apply/'",
        "'/job/apply/upload-resume/'",
        "'/job/apply/basic-profile/'",
        "'/job/apply/full-profile/'",
        "'/job/apply/(assessment/external|thank-you)'"],
  "segmode": "True",
  "segment": "device.deviceCategory",
  "filtermode": "False",
  "filtercol" : "hits.customDimensions.index",
  "filterval" : "23"
}

Please note the nested quotes (e.g. in the elements of the value of the key "regex_list" above). These are included because, after the JSON is ingested into a Python dictionary, the Python strings contain SQL strings, which themselves require quotes. The value of the key "filterval", however, has no inner quotes, because 23 is of type int in SQL and wouldn’t be quoted.
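
For example, the first element of regex_list, "'/job/details/'", drops into the query template with its inner quotes intact, producing a WHERE predicate like the following (based on the template shown earlier):

REGEXP_MATCH(hits.page.pagePath, '/job/details/')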

To run Funneler, go to \dist_funneler\data. Open input.json and modify the contents, then go back to \dist_funneler and run funneler.exe. Three files should appear – std_error.log, std_out.log (which contains feedback about whether Segmode or Filtermode are engaged, and where the generated query can be found), and your query. Copy and paste your query into BigQuery. Try starting with a short funnel, as it may take a few tries to format the input correctly.

Alternatively, if you are running funneler.py, it can be executed from the command line with the following:

python funneler.py input.json

In this case, the contents of the above-mentioned std_error.log and std_out.log files will appear in-console. The generated query can then be copied into your BQ instance. The resulting table should have two columns per regex/funnel-step – one for hits and one for exits – and a single row. If segmode is set to True, there will instead be one row per value in the segment column.
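
As an illustration, with segmode set to device.deviceCategory, the output for a two-step funnel might be laid out as below (the exit column names are hypothetical, and the counts are placeholders):

deviceCategory | _job_details_ | _job_details_exits | _job_apply_ | _job_apply_exits
desktop        | ...           | ...                | ...         | ...
mobile         | ...           | ...                | ...         | ...
tablet         | ...           | ...                | ...         | ...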

Hopefully these tools will help you quickly create complex queries and perform deeper analysis of your GA page-hit data.
