Predicting Ecommerce with Google Data

Predicting Ecommerce with Google Data

Predicting Ecommerce with Google Data

Here is what we will cover.

  • What is the business problem?
  • What is the data?
  • JSON Manipulation

What is the business problem?

The business problem is that we want to predict whether a user will make a purchase or not.

Data was discovered using Google Bard and parf of BigQuery. For this exercise I exported the data as a compressed JSON file. After downloading the file I unzipped it and uploaded it to Databricks. The schema looks like this:

root
 |-- channelGrouping: string (nullable = true)
 |-- customDimensions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- index: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- date: string (nullable = true)
 |-- device: struct (nullable = true)
 |    |-- browser: string (nullable = true)
 |    |-- browserSize: string (nullable = true)
 |    |-- browserVersion: string (nullable = true)
 |    |-- deviceCategory: string (nullable = true)
 |    |-- flashVersion: string (nullable = true)
 |    |-- isMobile: boolean (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- mobileDeviceBranding: string (nullable = true)
 |    |-- mobileDeviceInfo: string (nullable = true)
 |    |-- mobileDeviceMarketingName: string (nullable = true)
 |    |-- mobileDeviceModel: string (nullable = true)
 |    |-- mobileInputSelector: string (nullable = true)
 |    |-- operatingSystem: string (nullable = true)
 |    |-- operatingSystemVersion: string (nullable = true)
 |    |-- screenColors: string (nullable = true)
 |    |-- screenResolution: string (nullable = true)
 |-- fullVisitorId: string (nullable = true)
 |-- geoNetwork: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- cityId: string (nullable = true)
 |    |-- continent: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- latitude: string (nullable = true)
 |    |-- longitude: string (nullable = true)
 |    |-- metro: string (nullable = true)
 |    |-- networkDomain: string (nullable = true)
 |    |-- networkLocation: string (nullable = true)
 |    |-- region: string (nullable = true)
 |    |-- subContinent: string (nullable = true)
 |-- hits: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- appInfo: struct (nullable = true)
 |    |    |    |-- exitScreenName: string (nullable = true)
 |    |    |    |-- landingScreenName: string (nullable = true)
 |    |    |    |-- screenDepth: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |-- contentGroup: struct (nullable = true)
 |    |    |    |-- contentGroup1: string (nullable = true)
 |    |    |    |-- contentGroup2: string (nullable = true)
 |    |    |    |-- contentGroup3: string (nullable = true)
 |    |    |    |-- contentGroup4: string (nullable = true)
 |    |    |    |-- contentGroup5: string (nullable = true)
 |    |    |    |-- contentGroupUniqueViews1: string (nullable = true)
 |    |    |    |-- contentGroupUniqueViews2: string (nullable = true)
 |    |    |    |-- contentGroupUniqueViews3: string (nullable = true)
 |    |    |    |-- previousContentGroup1: string (nullable = true)
 |    |    |    |-- previousContentGroup2: string (nullable = true)
 |    |    |    |-- previousContentGroup3: string (nullable = true)
 |    |    |    |-- previousContentGroup4: string (nullable = true)
 |    |    |    |-- previousContentGroup5: string (nullable = true)
 |    |    |-- customDimensions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- customMetrics: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- customVariables: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- dataSource: string (nullable = true)
 |    |    |-- eCommerceAction: struct (nullable = true)
 |    |    |    |-- action_type: string (nullable = true)
 |    |    |    |-- option: string (nullable = true)
 |    |    |    |-- step: string (nullable = true)
 |    |    |-- eventInfo: struct (nullable = true)
 |    |    |    |-- eventAction: string (nullable = true)
 |    |    |    |-- eventCategory: string (nullable = true)
 |    |    |    |-- eventLabel: string (nullable = true)
 |    |    |-- exceptionInfo: struct (nullable = true)
 |    |    |    |-- isFatal: boolean (nullable = true)
 |    |    |-- experiment: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- hitNumber: string (nullable = true)
 |    |    |-- hour: string (nullable = true)
 |    |    |-- isEntrance: boolean (nullable = true)
 |    |    |-- isExit: boolean (nullable = true)
 |    |    |-- isInteraction: boolean (nullable = true)
 |    |    |-- item: struct (nullable = true)
 |    |    |    |-- currencyCode: string (nullable = true)
 |    |    |    |-- transactionId: string (nullable = true)
 |    |    |-- latencyTracking: struct (nullable = true)
 |    |    |    |-- domContentLoadedTime: string (nullable = true)
 |    |    |    |-- domInteractiveTime: string (nullable = true)
 |    |    |    |-- domLatencyMetricsSample: string (nullable = true)
 |    |    |    |-- domainLookupTime: string (nullable = true)
 |    |    |    |-- pageDownloadTime: string (nullable = true)
 |    |    |    |-- pageLoadSample: string (nullable = true)
 |    |    |    |-- pageLoadTime: string (nullable = true)
 |    |    |    |-- redirectionTime: string (nullable = true)
 |    |    |    |-- serverConnectionTime: string (nullable = true)
 |    |    |    |-- serverResponseTime: string (nullable = true)
 |    |    |    |-- speedMetricsSample: string (nullable = true)
 |    |    |-- minute: string (nullable = true)
 |    |    |-- page: struct (nullable = true)
 |    |    |    |-- hostname: string (nullable = true)
 |    |    |    |-- pagePath: string (nullable = true)
 |    |    |    |-- pagePathLevel1: string (nullable = true)
 |    |    |    |-- pagePathLevel2: string (nullable = true)
 |    |    |    |-- pagePathLevel3: string (nullable = true)
 |    |    |    |-- pagePathLevel4: string (nullable = true)
 |    |    |    |-- pageTitle: string (nullable = true)
 |    |    |    |-- searchCategory: string (nullable = true)
 |    |    |    |-- searchKeyword: string (nullable = true)
 |    |    |-- product: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- customDimensions: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- customMetrics: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- isClick: boolean (nullable = true)
 |    |    |    |    |-- isImpression: boolean (nullable = true)
 |    |    |    |    |-- localProductPrice: string (nullable = true)
 |    |    |    |    |-- localProductRevenue: string (nullable = true)
 |    |    |    |    |-- productBrand: string (nullable = true)
 |    |    |    |    |-- productListName: string (nullable = true)
 |    |    |    |    |-- productListPosition: string (nullable = true)
 |    |    |    |    |-- productPrice: string (nullable = true)
 |    |    |    |    |-- productQuantity: string (nullable = true)
 |    |    |    |    |-- productRevenue: string (nullable = true)
 |    |    |    |    |-- productSKU: string (nullable = true)
 |    |    |    |    |-- productVariant: string (nullable = true)
 |    |    |    |    |-- v2ProductCategory: string (nullable = true)
 |    |    |    |    |-- v2ProductName: string (nullable = true)
 |    |    |-- promotion: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- promoCreative: string (nullable = true)
 |    |    |    |    |-- promoId: string (nullable = true)
 |    |    |    |    |-- promoName: string (nullable = true)
 |    |    |    |    |-- promoPosition: string (nullable = true)
 |    |    |-- promotionActionInfo: struct (nullable = true)
 |    |    |    |-- promoIsClick: boolean (nullable = true)
 |    |    |    |-- promoIsView: boolean (nullable = true)
 |    |    |-- publisher_infos: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- referer: string (nullable = true)
 |    |    |-- social: struct (nullable = true)
 |    |    |    |-- hasSocialSourceReferral: string (nullable = true)
 |    |    |    |-- socialInteractionNetworkAction: string (nullable = true)
 |    |    |    |-- socialNetwork: string (nullable = true)
 |    |    |-- time: string (nullable = true)
 |    |    |-- transaction: struct (nullable = true)
 |    |    |    |-- affiliation: string (nullable = true)
 |    |    |    |-- currencyCode: string (nullable = true)
 |    |    |    |-- localTransactionRevenue: string (nullable = true)
 |    |    |    |-- localTransactionShipping: string (nullable = true)
 |    |    |    |-- localTransactionTax: string (nullable = true)
 |    |    |    |-- transactionId: string (nullable = true)
 |    |    |    |-- transactionRevenue: string (nullable = true)
 |    |    |    |-- transactionShipping: string (nullable = true)
 |    |    |    |-- transactionTax: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- socialEngagementType: string (nullable = true)
 |-- totals: struct (nullable = true)
 |    |-- bounces: string (nullable = true)
 |    |-- hits: string (nullable = true)
 |    |-- newVisits: string (nullable = true)
 |    |-- pageviews: string (nullable = true)
 |    |-- sessionQualityDim: string (nullable = true)
 |    |-- timeOnSite: string (nullable = true)
 |    |-- totalTransactionRevenue: string (nullable = true)
 |    |-- transactionRevenue: string (nullable = true)
 |    |-- transactions: string (nullable = true)
 |    |-- visits: string (nullable = true)
 |-- trafficSource: struct (nullable = true)
 |    |-- adContent: string (nullable = true)
 |    |-- adwordsClickInfo: struct (nullable = true)
 |    |    |-- adNetworkType: string (nullable = true)
 |    |    |-- criteriaParameters: string (nullable = true)
 |    |    |-- gclId: string (nullable = true)
 |    |    |-- isVideoAd: boolean (nullable = true)
 |    |    |-- page: string (nullable = true)
 |    |    |-- slot: string (nullable = true)
 |    |-- campaign: string (nullable = true)
 |    |-- isTrueDirect: boolean (nullable = true)
 |    |-- keyword: string (nullable = true)
 |    |-- medium: string (nullable = true)
 |    |-- referralPath: string (nullable = true)
 |    |-- source: string (nullable = true)
 |-- visitId: string (nullable = true)
 |-- visitNumber: string (nullable = true)
 |-- visitStartTime: string (nullable = true)

Selecting a subset of the data

Given that the business use case, we want a dataset that looks like this:

root
 |-- visit_date_dimension_identifier: string (nullable = true)
 |-- visitnumber: long (nullable = true)
 |-- visitid: long (nullable = true)
 |-- visitstarttime: long (nullable = true)
 |-- fullvisitorid: string (nullable = true)
 |-- hitnumber: long (nullable = true)
 |-- index: long (nullable = true)
 |-- value: string (nullable = true)
 |-- base_website_name: string (nullable = true)
 |-- hostname: string (nullable = true)
 |-- browser: string (nullable = true)
 |-- devicecategory: string (nullable = true)
 |-- operatingsystem: string (nullable = true)
 |-- medium: string (nullable = true)
 |-- action_type: string (nullable = true)
 |-- pagepath: string (nullable = true)
 |-- isentrance: boolean (nullable = true)
 |-- isexit: boolean (nullable = true)
 |-- channelgrouping: string (nullable = true)
 |-- bounces: long (nullable = true)
 |-- hitseventcat: string (nullable = true)
 |-- hitseventaction: string (nullable = true)
 |-- contentgroup2: string (nullable = true)
 |-- hitseventlabel: string (nullable = true)
 |-- transactionid: string (nullable = true)
 |-- bounce_indicator: integer (nullable = true)
 |-- total_hits: long (nullable = true)
 |-- total_page_views: long (nullable = true)
 |-- total_session_cart_amount: long (nullable = true)
 |-- analytics_uid: string (nullable = true)
 |-- web_geo_network_city: string (nullable = true)
 |-- web_geo_network_metro_area: string (nullable = true)
 |-- web_geo_network_country: string (nullable = true)
 |-- web_geo_network_state: string (nullable = true)
 |-- eventcategory: string (nullable = true)
 |-- filename: string (nullable = true)

So let’s line these up in a table:

SNoOriginal Column NameNew Column NameSources and Notes
1visit_data_dim_idvisit_date_dimension_identifierEnsure that date type is right. Also may need to use a different uid
2visitNumbervisit_numberThis is the number of the visit.
3visitIdvisit_idThis is the unique identifier for the visit.
4visitStartTimevisit_date_dimension_identifierEnsure that date type is right. Also may need to use a different uid
5fullVisitorIdfullvisitoridThis is the unique identifier for the visitor.
6hits.hitNumberhit_numberThis is the number of the hit.
7customDimensions.elements.index</markcustom_dim_valueThis is the value of the custom dimension.
9hits.page.pagebase_website_nameThis is the page of the page.
10hits.page.hostnamehostnameThis is the hostname of the page.
11device.browserbrowserThis is the browser of the device.
12device.deviceCategorydevice_categoryThis is the device category.
13device.operatingSystemoperating_systemThis is the operating system of the device.
14trafficSource.mediummediumThis is the medium of the traffic source.
15hits.eCommerceAction.actionTypeaction_typeThis is the action type of the ecommerce action.
16hits.page.pagePathpagepathThis is the page path of the page.
17hits.isEntranceisentranceThis is the entrance indicator of the hit.
18hits.isExitisexitThis is the exit indicator of the hit.
19channelGroupingchannelgroupingThis is the channel grouping.
20totals.bouncesbouncesThis is the bounces.
21hits.eventInfo.eventCategoryhitseventcatThis is the event category of the event info.
22hits.eventInfo.eventActionhitseventactionThis is the event action of the event info.
23hits.contentGroup.contentGroup2contentgroup2This is the content group 2 of the content group.
24hits.eventInfo.eventLabelhitseventlabelThis is the event label of the event info.
25hits.transaction.transactionIdtransactionidThis is the transaction id of the transaction.
26totals.bouncesbounce_indicatorThis is the bounce indicator.
27totals.hitstotal_hitsThis is the total hits.
28totals.pageviewstotal_page_viewsThis is the total page views.
29totals.transactionRevenuetotal_session_cart_amountThis is the total session cart amount.
30customDimensions.elements.valueanalytics_uidThis is the analytics uid.
31geoNetwork.cityweb_geo_network_cityThis is the web geo network city.
32geoNetwork.metroweb_geo_network_metro_areaThis is the web geo network metro area.
33geoNetwork.countryweb_geo_network_countryThis is the web geo network country.
34geoNetwork.regionweb_geo_network_stateThis is the web geo network state.
35hits.eventInfo.eventCategoryeventcategoryThis is the event category.
36filenamefilenameThis is the filename.

JSON Manipulation

The first thing we need to do is to flatten the JSON. We can do this using the following code: