Background and Introduction

This is an analytics product sample for one of my internal teams that works with a hospitality client. The client launched a new campaign with a variety of creatives and wanted to find out which creative properties drive their website sales. In short, they asked me to predict website sales (conversions) from their historical creative data, and they will use the predictions to guide creative targeting in future campaigns.

Below is a quick demo & model prototype that I built for them based on 4 months of historical data.

Data Dictionary

Total Conversions: The target variable; the number of conversions, i.e., website purchases driven by the creative.
Date: A date variable in the format “2016-10-01”.
Creative: Unique creative name used to identify each creative look.
Creative Pixel Size: Size of the creative.
Creative Field 4: A field containing property information about the creative.
Platform Type: The platform on which the creative is shown.
State/Region: The state in which the creative is shown.
Impressions: The number of times the creative was shown to the audience (one impression is one exposure of the message to one consumer).

From the data dictionary, we know that we need to predict Total Conversions based on the other variables.

Required Packages:

require(readxl)    # For reading a specific sheet from an Excel file
require(lubridate) # For working with date variables
require(caTools)   # For stratified sampling
require(dplyr)     # For data wrangling
require(xgboost)   # For the XGBoost model
require(Matrix)    # For constructing binary variables from categorical variables (one-hot encoding)
require(h2o)       # For the neural network (deep learning) model
require(AER)       # For the dispersion test in Poisson GLMs
require(ggplot2)   # For data visualization

Algorithm Selection

The time period covered is not long, and each day has multiple observations broken down across dimensions such as size, platform, and language, so time series analysis is not feasible here.

The dependent variable, Total Conversions, is a discrete count variable, so this is a regression problem.

The majority of the independent variables are categorical; only Impressions is a numeric count variable. Even if we create dummy variables to represent the categorical variables, a pure linear regression may struggle with this dataset. It is better to use regularized regression, regression trees, or a neural network; tree- and network-based models are also fairly robust to multicollinearity.

In order to improve the accuracy and predictive power of the model, I would like to use ensemble methods, specifically XGBoost and H2O Deep Learning, because of their high performance and speed. I will also use regularized Poisson regression to create a baseline model and to interpret the relationships between variables, since it is easier to explain than the other two models, which are largely “black box” systems.

Data Preparation

# Read sample data from target sheet of excel file
Disney_SP = read_excel("C:/_Project/Segmentation Product/Disney_SP.xlsx", sheet = "Merge")

# Look at the size of the data and data type of each variable
str(Disney_SP)

# Take a look at typical value of each variable
summary(Disney_SP)

# Data cleaning: copy columns with spaces in their names to simpler names
Disney_SP$Size = Disney_SP$`Creative Pixel Size`
Disney_SP$Region = Disney_SP$`State/Region`
Disney_SP$Platform = Disney_SP$`Platform Type`
Disney_SP$Conversions = Disney_SP$`Total Conversions`
Disney_SP$Field = Disney_SP$`Creative Field 4`

Feature Engineering

Since XGBoost only accepts numeric inputs and the dataset contains many categorical variables, I need to do some feature transformations here:
1. For categorical variables with few levels, I will convert them into binary dummy variables;
2. For categorical variables with many levels, I will collapse them into a smaller set of dummy variables based on the distribution of Conversions;
3. For the date variable, I will create new variables to capture seasonality.

H2O Deep Learning and regularized Poisson regression can handle factor-type variables directly, so for those models I will convert the categorical variables to factors.

Create new variables

# Time related new variables
Disney_SP$Weekday = weekdays(Disney_SP$Date)
Disney_SP$Month = months(Disney_SP$Date)
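# Week of month: ceiling(day-of-month / 7); the partial 5th week (days 29-31) is folded into week 4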
Disney_SP$WeekofMonth = ifelse(ceiling(mday(Disney_SP$Date)/7)==5,4,ceiling(mday(Disney_SP$Date)/7))

# New language category
Disney_SP$Language = "ENG"
Disney_SP[grep("SPAN", Disney_SP$Field,fixed = TRUE),]["Language"] = "SPAN"

# New price symbol category
Disney_SP$PriceSymbol = "No"
Disney_SP[grep("$", Disney_SP$Field, fixed = TRUE),]["PriceSymbol"] = "Yes"

# New percentage symbol category
Disney_SP$PercentageSymbol = "No"
Disney_SP[grep("%", Disney_SP$Field, fixed = TRUE),]["PercentageSymbol"] = "Yes"

# New product symbol category
Disney_SP$ProductSymbol = "No"
Disney_SP[grep("Resort", Disney_SP$Field, fixed = TRUE),]["ProductSymbol"] = "Yes"
Disney_SP[grep("Hotel", Disney_SP$Field, fixed = TRUE),]["ProductSymbol"] = "Yes"

# New region category (CA accounts for the majority of the audience, so we only distinguish CA vs. not CA)
Disney_SP$RegionCA = "No"
Disney_SP[Disney_SP$Region == "CA",]["RegionCA"] = "Yes"

# New creative type category
Disney_SP$CreativeType = "Image"
Disney_SP[grep("html", Disney_SP$Creative, fixed = TRUE),]["CreativeType"] = "HTMLBanner"

Delete useless variables

# Drop variables that are no longer needed for modeling (raw originals and intermediate copies)
names(Disney_SP)
drops <- c("Date", "Creative", "Creative Pixel Size", "Creative Field 4", "Platform Type", "State/Region", 
           "Total Conversions", "Field", "Region")

Disney_SP <- Disney_SP[, !(names(Disney_SP) %in% drops)]

Convert categorical variables to factors

# Convert all categorical variables to factor type
Disney_SP$Month = as.factor(Disney_SP$Month)
Disney_SP$Weekday = as.factor(Disney_SP$Weekday)
Disney_SP$WeekofMonth = as.factor(Disney_SP$WeekofMonth)
Disney_SP$Size = as.factor(Disney_SP$Size)
Disney_SP$Platform = as.factor(Disney_SP$Platform)
Disney_SP$Language = as.factor(Disney_SP$Language)
Disney_SP$PriceSymbol = as.factor(Disney_SP$PriceSymbol)
Disney_SP$PercentageSymbol = as.factor(Disney_SP$PercentageSymbol)
Disney_SP$ProductSymbol = as.factor(Disney_SP$ProductSymbol)
Disney_SP$RegionCA = as.factor(Disney_SP$RegionCA)
Disney_SP$CreativeType = as.factor(Disney_SP$CreativeType)

Sampling

Before we build the model, we need to split the dataset into training (80%) and test (20%) sets. We can then use the test data for model evaluation after modeling.

# Stratify sampling based on time (month)
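# Fixing the RNG seed makes the 80/20 split reproducible (the seed value itself is arbitrary)
set.seed(123)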
train_rows = sample.split(Disney_SP$Month, SplitRatio = 0.80)
train = na.omit(Disney_SP[train_rows,])
test = na.omit(Disney_SP[!train_rows,])

Exploratory Analysis

Let’s describe our final dataset first.

# Describe train data
str(train)

summary(train)

Each variable has 5638 valid observations and their distributions look reasonable. A Poisson model assumes that, conditioned on the predictor variables, the mean and variance of the target are equal (or at least roughly so); we will check this later with a dispersion test. We can also display summary statistics using lookup tables by categorical variable.

# Make lookup tables for categorical variables
Size_Impressions_lookup = train %>% group_by(Size) %>% summarise(Impressions_avg = mean(Impressions))
Size_Conversions_lookup = train %>% group_by(Size) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(Size_Impressions_lookup,Size_Conversions_lookup)

Language_Impressions_lookup = train %>% group_by(Language) %>% summarise(Impressions_avg = mean(Impressions))
Language_Conversions_lookup = train %>% group_by(Language) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(Language_Impressions_lookup,Language_Conversions_lookup)

PriceSymbol_Impressions_lookup = train %>% group_by(PriceSymbol) %>% summarise(Impressions_avg = mean(Impressions))
PriceSymbol_Conversions_lookup = train %>% group_by(PriceSymbol) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(PriceSymbol_Impressions_lookup,PriceSymbol_Conversions_lookup)

PercentageSymbol_Impressions_lookup = train %>% group_by(PercentageSymbol) %>% summarise(Impressions_avg = mean(Impressions))
PercentageSymbol_Conversions_lookup = train %>% group_by(PercentageSymbol) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(PercentageSymbol_Impressions_lookup,PercentageSymbol_Conversions_lookup)

ProductSymbol_Impressions_lookup = train %>% group_by(ProductSymbol) %>% summarise(Impressions_avg = mean(Impressions))
ProductSymbol_Conversions_lookup = train %>% group_by(ProductSymbol) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(ProductSymbol_Impressions_lookup,ProductSymbol_Conversions_lookup)

RegionCA_Impressions_lookup = train %>% group_by(RegionCA) %>% summarise(Impressions_avg = mean(Impressions))
RegionCA_Conversions_lookup = train %>% group_by(RegionCA) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(RegionCA_Impressions_lookup,RegionCA_Conversions_lookup)

CreativeType_Impressions_lookup = train %>% group_by(CreativeType) %>% summarise(Impressions_avg = mean(Impressions))
CreativeType_Conversions_lookup = train %>% group_by(CreativeType) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(CreativeType_Impressions_lookup,CreativeType_Conversions_lookup)

Platform_Impressions_lookup = train %>% group_by(Platform) %>% summarise(Impressions_avg = mean(Impressions))
Platform_Conversions_lookup = train %>% group_by(Platform) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(Platform_Impressions_lookup,Platform_Conversions_lookup)

# Make lookup tables for time variables (Seasonality)
Month_Impressions_lookup = train %>% group_by(Month) %>% summarise(Impressions_avg = mean(Impressions))
Month_Conversions_lookup = train %>% group_by(Month) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(Month_Impressions_lookup,Month_Conversions_lookup)

Weekday_Impressions_lookup = train %>% group_by(Weekday) %>% summarise(Impressions_avg = mean(Impressions))
Weekday_Conversions_lookup = train %>% group_by(Weekday) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(Weekday_Impressions_lookup,Weekday_Conversions_lookup)

WeekofMonth_Impressions_lookup = train %>% group_by(WeekofMonth) %>% summarise(Impressions_avg = mean(Impressions))
WeekofMonth_Conversions_lookup = train %>% group_by(WeekofMonth) %>% summarise(Conversions_avg = mean(Conversions))
inner_join(WeekofMonth_Impressions_lookup,WeekofMonth_Conversions_lookup)

It looks like conversions are somewhat sensitive to month, weekday, and week of month, so these three variables should help capture that relationship.
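
As an aside, the repetitive lookup blocks above could be generated with a small helper. This is only a compact alternative sketch (the summarise_by helper is made up here, and it assumes a dplyr version that supports the .data pronoun); it returns the same per-level averages in a single table per variable:

# Hypothetical helper: average Impressions and Conversions for each level of one grouping variable
summarise_by = function(data, group_var) {
  data %>%
    group_by(.data[[group_var]]) %>%
    summarise(Impressions_avg = mean(Impressions),
              Conversions_avg = mean(Conversions))
}

# Apply it to every categorical predictor in one pass
cat_vars = c("Size", "Language", "PriceSymbol", "PercentageSymbol", "ProductSymbol",
             "RegionCA", "CreativeType", "Platform", "Month", "Weekday", "WeekofMonth")
lookups = lapply(cat_vars, function(v) summarise_by(train, v))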

# Distribution of conversions
ggplot(train, aes(Conversions)) + geom_histogram(binwidth=4)
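
As an additional informal check of the Poisson assumption (mean roughly equal to variance), we can compare the unconditional mean and variance of the target; a variance far above the mean is an early hint of the overdispersion that the dispersion test below examines formally. A minimal sketch:

# Informal overdispersion check: for a Poisson variable, the mean and variance should be close
mean(train$Conversions)
var(train$Conversions)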

We can see from the graph above that the target variable roughly follows a Poisson distribution. Now let’s build the model.

Build the model

Regularized Poisson Regression

# Fit a GLM model with poisson option
pr.fit = glm(Conversions~., data = train, family = poisson)

# Summarise the model
summary(pr.fit)

# Run a dispersion test for this model
dispersiontest(pr.fit, trafo = 1)

As the dispersion test above shows, the estimated dispersion parameter alpha is greater than 0, indicating overdispersion (the conditional variance exceeds the conditional mean). So we switch to quasi-Poisson regression.

# Fit a Quasi poisson regression model
qpr.fit = glm(Conversions~., data = train, family = quasipoisson)

# Summarise the quasi poisson regression model
summary(qpr.fit)

# Predict on test data (type = "response" gives predictions on the conversions scale)
qpr.fit_pred = predict(qpr.fit, newdata = test, type = "response")

# Round the predicted conversions to integers
qpr.fit_pred = round(qpr.fit_pred)

# Predicted conversions should not be less than 0
qpr.fit_pred <- ifelse(qpr.fit_pred > 0, qpr.fit_pred ,0)

# MSE
sum((qpr.fit_pred-test$Conversions)^2)/nrow(test)

# Pseudo R2
1-(qpr.fit$deviance/qpr.fit$null.deviance)

XGBoost

# Create the sparse matrix for both train and test datasets
train_sparse_matrix = sparse.model.matrix(Conversions~.-1, data = train)
test_sparse_matrix = sparse.model.matrix(Conversions~.-1, data = test)

# Create the dependent vector
train_output_vector = as.matrix(train[, "Conversions"])
test_output_vector = as.matrix(test[, "Conversions"])

# Set parameters
param = list("booster" = "gblinear",
             "objective" = "reg:linear",
             "nthread" = 4,
             "eta" = 0.1,
             "subsample" = 0.8,
             "min_child_weight" = 2,
             "max_depth" = 15
             )
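# Note: subsample, min_child_weight and max_depth are tree-booster parameters;
# with booster = "gblinear" they are effectively ignored.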

# Construct a watchlist: track test error during training to monitor model quality and guard against overfitting (similar in spirit to cross-validation)
dtrain <- xgb.DMatrix(data = train_sparse_matrix, label=train_output_vector)
dtest <- xgb.DMatrix(data = test_sparse_matrix, label=test_output_vector)
watchlist <- list(train=dtrain, test=dtest)

# Fit a xgboost model
bst.fit = xgb.train(params = param, data = dtrain, nrounds = 75, watchlist = watchlist)
## [1]  train-rmse:8.038461 test-rmse:6.845778 
## [2]  train-rmse:7.443459 test-rmse:6.229445 
## [3]  train-rmse:7.060273 test-rmse:5.832447 
## [4]  train-rmse:6.769723 test-rmse:5.528380 
## [5]  train-rmse:6.533261 test-rmse:5.278708 
## [6]  train-rmse:6.336672 test-rmse:5.069609 
## [7]  train-rmse:6.172209 test-rmse:4.896956 
## [8]  train-rmse:6.034286 test-rmse:4.753260 
## [9]  train-rmse:5.919357 test-rmse:4.635913 
## [10] train-rmse:5.823464 test-rmse:4.540049 
## [11] train-rmse:5.743324 test-rmse:4.462027 
## [12] train-rmse:5.676066 test-rmse:4.397252 
## [13] train-rmse:5.619929 test-rmse:4.346950 
## [14] train-rmse:5.572789 test-rmse:4.306294 
## [15] train-rmse:5.532988 test-rmse:4.273437 
## [16] train-rmse:5.499516 test-rmse:4.247403 
## [17] train-rmse:5.471227 test-rmse:4.226669 
## [18] train-rmse:5.447204 test-rmse:4.210104 
## [19] train-rmse:5.426787 test-rmse:4.197153 
## [20] train-rmse:5.409341 test-rmse:4.186851 
## [21] train-rmse:5.394389 test-rmse:4.178907 
## [22] train-rmse:5.381533 test-rmse:4.172609 
## [23] train-rmse:5.370416 test-rmse:4.167859 
## [24] train-rmse:5.360775 test-rmse:4.164186 
## [25] train-rmse:5.352318 test-rmse:4.161371 
## [26] train-rmse:5.344912 test-rmse:4.159219 
## [27] train-rmse:5.338461 test-rmse:4.157533 
## [28] train-rmse:5.332737 test-rmse:4.156409 
## [29] train-rmse:5.327681 test-rmse:4.155524 
## [30] train-rmse:5.323175 test-rmse:4.154906 
## [31] train-rmse:5.319144 test-rmse:4.154440 
## [32] train-rmse:5.315518 test-rmse:4.154153 
## [33] train-rmse:5.312245 test-rmse:4.153966 
## [34] train-rmse:5.309266 test-rmse:4.153908 
## [35] train-rmse:5.306546 test-rmse:4.153877 
## [36] train-rmse:5.304088 test-rmse:4.153821 
## [37] train-rmse:5.301851 test-rmse:4.153755 
## [38] train-rmse:5.299798 test-rmse:4.153764 
## [39] train-rmse:5.297899 test-rmse:4.153761 
## [40] train-rmse:5.296138 test-rmse:4.153625 
## [41] train-rmse:5.294523 test-rmse:4.153666 
## [42] train-rmse:5.293029 test-rmse:4.153648 
## [43] train-rmse:5.291622 test-rmse:4.153706 
## [44] train-rmse:5.290336 test-rmse:4.153630 
## [45] train-rmse:5.289135 test-rmse:4.153589 
## [46] train-rmse:5.288014 test-rmse:4.153506 
## [47] train-rmse:5.286959 test-rmse:4.153480 
## [48] train-rmse:5.285979 test-rmse:4.153386 
## [49] train-rmse:5.285059 test-rmse:4.153293 
## [50] train-rmse:5.284191 test-rmse:4.153233 
## [51] train-rmse:5.283377 test-rmse:4.153144 
## [52] train-rmse:5.282617 test-rmse:4.152998 
## [53] train-rmse:5.281898 test-rmse:4.152904 
## [54] train-rmse:5.281215 test-rmse:4.152833 
## [55] train-rmse:5.280577 test-rmse:4.152727 
## [56] train-rmse:5.279967 test-rmse:4.152658 
## [57] train-rmse:5.279397 test-rmse:4.152562 
## [58] train-rmse:5.278861 test-rmse:4.152422 
## [59] train-rmse:5.278352 test-rmse:4.152334 
## [60] train-rmse:5.277871 test-rmse:4.152236 
## [61] train-rmse:5.277419 test-rmse:4.152082 
## [62] train-rmse:5.276991 test-rmse:4.151937 
## [63] train-rmse:5.276586 test-rmse:4.151806 
## [64] train-rmse:5.276201 test-rmse:4.151687 
## [65] train-rmse:5.275837 test-rmse:4.151593 
## [66] train-rmse:5.275491 test-rmse:4.151470 
## [67] train-rmse:5.275161 test-rmse:4.151410 
## [68] train-rmse:5.274848 test-rmse:4.151340 
## [69] train-rmse:5.274548 test-rmse:4.151299 
## [70] train-rmse:5.274262 test-rmse:4.151248 
## [71] train-rmse:5.273993 test-rmse:4.151193 
## [72] train-rmse:5.273736 test-rmse:4.151170 
## [73] train-rmse:5.273494 test-rmse:4.151066 
## [74] train-rmse:5.273265 test-rmse:4.150946 
## [75] train-rmse:5.273047 test-rmse:4.150868
# Predict on test data
bst.fit_pred = predict(bst.fit, test_sparse_matrix)

# Round the predicted conversions to integers
bst.fit_pred = round(bst.fit_pred)

# Predicted conversions should not be less than 0
bst.fit_pred <- ifelse(bst.fit_pred > 0, bst.fit_pred ,0)

# MSE
sum((bst.fit_pred-test$Conversions)^2)/nrow(test)
## [1] 17.17885
r2 = sum((bst.fit_pred-mean(test$Conversions))^2)/sum((test$Conversions-mean(test$Conversions))^2)
print(paste("R square is ", r2))
## [1] "R square is  0.828201588297557"
# Dump the fitted model (with booster = "gblinear" this shows the bias and coefficient weights rather than trees)
model = xgb.dump(bst.fit, with_stats = T)

# Print the first 10 lines of the model dump
model[1:10]
##  [1] "booster[0]"  "bias:"       "0.77103"     "weight:"     "0.000493209"
##  [6] "0.0715889"   "0.488492"    "-0.0189073"  "0.428137"    "-2.77031"
# Get the feature names from the sparse model matrix
names = dimnames(train_sparse_matrix)[[2]]

# Compute feature importance matrix
importance_matrix = xgb.importance(names, model=bst.fit)

# Plot features
xgb.plot.importance(importance_matrix, rel_to_first = F, main = "Relative importance")