This blog post is a write-up of a project I undertook as part of the Udacity Data Science Nanodegree. The project covers data exploration and model building for Starbucks rewards app data. Starbucks provided a simulated dataset that mimics customer behavior on the Starbucks rewards mobile app. According to the project’s introduction, once every few days Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free).
The objective of this project is to understand what influences a customer to purchase or respond to an offer. The analysis is based on customers’ transaction data, demographic profiles, and offer types. The goals are to combine all three datasets, explore the data, provide an analysis, and build a model that identifies which demographic is most likely to respond to and complete an offer. It is important to note that the dataset provided for this project is a limited, simplified version of the real Starbucks app data. The simulated data has only one product, whereas Starbucks sells dozens of products.
In this project, I attempted to understand which demographic group completes an offer, and whether offer type or day of the week influences offer completion. I did a feature analysis to identify the demographics of customers who completed an offer. Additionally, I used supervised learning to build a model that identifies the features that lead to completion of an offer. To achieve that goal, I created two simple models using Sklearn’s Random Forest Classifier and Gradient Boosting Classifier algorithms.
The sklearn.ensemble.RandomForestClassifier is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
The sklearn.ensemble.GradientBoostingClassifier builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. The classifier’s score method returns the mean accuracy on the given test data and labels.
To complete this project, I followed the steps of the CRISP-DM process (Cross-Industry Standard Process for Data Mining). The steps are:
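As a minimal sketch of what "score" means here, the snippet below fits both classifiers on synthetic data (a stand-in, not the Starbucks data) and checks that score matches the accuracy computed by hand from the predictions. The sample sizes and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; not the Starbucks dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_models = {
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=25, random_state=0),
}

demo_scores = {}
manual_accuracy = {}
for name, model in demo_models.items():
    model.fit(X_demo, y_demo)
    # score() returns the mean accuracy on the data and labels passed in
    demo_scores[name] = model.score(X_demo, y_demo)
    # the same number, computed explicitly from the predictions
    manual_accuracy[name] = accuracy_score(y_demo, model.predict(X_demo))
    print(name, demo_scores[name], manual_accuracy[name])
```

Scoring on the training data, as done here for brevity, overstates real performance; the held-out split described later is what gives a meaningful number.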
1. Business Understanding: As stated above, the objective of this project is to understand what influences a customer to purchase or respond to an offer. The analysis is based on customers’ transaction data, demographic profiles, and offer types.
2. Data Understanding: In this phase, I collected the three datasets provided by Starbucks and explored them to understand the structure of each file, its contents, and its data types. I queried and visualized each dataset to identify relationships among the three datasets, possible outliers, and variables that need encoding.
3. Prepare Data: In this part, I used several methods to perform data assessment and cleanup. I encoded categorical variables using one-hot encoding and imputed any missing values. I restructured the data as necessary and created new attributes that help in the analysis and in building the model. Finally, I integrated the data by combining all three datasets into a new dataset for analysis and modeling.
4. Modeling: In this part, I chose two different algorithms to build the model. I used Sklearn’s train_test_split to randomly split the data into training and test sets. The split is composed of 70% train data and 30% test data.
5. Evaluation: After building two simple models, I looked at the scores produced by each algorithm. Based on the results of the two models, I made a recommendation for implementation. Additionally, the results of each step are evaluated and documented in the Jupyter Notebook.
6. Implement: In this final step, I developed and recommended a deployment plan.
1. Data Understanding
The data is contained in three files:
· portfolio.json: contains offer ids and metadata about each offer (duration, type, etc.)
· profile.json: demographic data for each customer
· transcript.json: records for transactions, offers received, offers viewed, and offers completed
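One way to read the three files is sketched below. It assumes each file is line-delimited JSON (one record per line), which the post does not state; the helper name load_datasets and the tiny sample records are made up so the snippet runs without the real data.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_datasets(data_dir):
    """Read the three files into DataFrames, assuming line-delimited JSON."""
    data_dir = Path(data_dir)
    names = ["portfolio", "profile", "transcript"]
    return {
        name: pd.read_json(data_dir / f"{name}.json", orient="records", lines=True)
        for name in names
    }


# Tiny made-up records so the sketch is runnable; the real files are much larger.
sample_rows = {
    "portfolio": [{"id": "offer_a", "offer_type": "bogo", "duration": 7}],
    "profile": [{"id": "cust_1", "gender": "F", "age": 45}],
    "transcript": [{"person": "cust_1", "event": "transaction", "time": 0}],
}

with tempfile.TemporaryDirectory() as tmp:
    for name, rows in sample_rows.items():
        text = "\n".join(json.dumps(row) for row in rows)
        Path(tmp, f"{name}.json").write_text(text)
    datasets = load_datasets(tmp)

for name, df in datasets.items():
    print(name, df.shape)
```

If the files turn out to be regular JSON arrays instead, dropping lines=True should be enough.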
2. Data Exploration
To understand the data, I started with basic methods to check the shape of each dataset. I also defined the following functions, which are reused throughout the project for efficiency and to better understand the data.
1. Function to count values and their corresponding percentages (pct_var)
2. Function to plot a bar chart of variable value counts (plot_var)
3. Function to plot the correlation of a variable to average amount spent (plot_ave)
4. Function to plot the correlation of one variable’s count to another variable’s count (plt_cnt)
5. Function to plot a crosstab of two variables (pltcrosstab)
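The first helper in the list above could look something like the following. Only the name pct_var comes from the post; the signature, the output column names, and the sample data are assumptions for illustration.

```python
import pandas as pd


def pct_var(df, column):
    """Return value counts and their percentage share for one column."""
    counts = df[column].value_counts(dropna=False)
    pct = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"count": counts, "pct": pct})


# Made-up sample data for demonstration.
demo = pd.DataFrame({"gender": ["M", "M", "F", "O", None, "M", "F", "M"]})
gender_summary = pct_var(demo, "gender")
print(gender_summary)
```

Keeping missing values in the count (dropna=False) makes gaps such as unknown gender visible early, before any imputation.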
Data Exploration Results
The following are the results of the data exploration. The actual data cleanup was performed in the Data Preparation section.
Portfolio: The dataset contains 10 records and 6 columns, consisting of offers with their corresponding attributes, such as type, reward, duration, and channel. The dataset does not contain missing values.
1. Channel is a categorical variable and will be encoded into numerical values.
2. Offer is a categorical variable and will be encoded into numerical values.
Profile: The dataset contains demographic information, with 17,000 records and 5 columns.
1. Became_member_on is an integer data type. It will be converted into a date and then into a number of days relative to the maximum became_member_on date, which is ‘08–26–2018’. Note that I used the maximum date in the dataset to compute membership length, instead of the current date [today()], so that membership length is relative to the recency of the data and is not inflated by the time elapsed since the data was collected.
2. Some age values equal 118; these will be converted to NaN.
3. Gender is a categorical variable and contains NaN values. It will be encoded into numerical values.
4. Income is encoded into income groups.
5. Age is encoded into age groups.
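The membership-length and age steps above can be sketched as follows. This assumes became_member_on is stored as an integer in YYYYMMDD form, which the post does not spell out; the sample rows and the new column names member_date and membership_days are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; became_member_on is assumed to be an integer like 20180826.
profile = pd.DataFrame({
    "age": [118, 45, 67],
    "became_member_on": [20170212, 20180826, 20160501],
})

# Age 118 is converted to NaN, as described in the list above.
profile["age"] = profile["age"].replace(118, np.nan)

# Integer -> string -> date.
profile["member_date"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d"
)

# Membership length in days, relative to the newest sign-up in the data
# rather than today's date.
reference_date = profile["member_date"].max()
profile["membership_days"] = (reference_date - profile["member_date"]).dt.days

print(profile)
```

Using the dataset's own maximum date as the reference keeps the newest member at zero days, no matter when the notebook is rerun.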
Transcript: The dataset contains about 300K records of transactions and 4 columns.
1. The value variable contains either an offer id or a transaction amount, depending on the record. The data will be extracted from the value variable based on the event and offer type.
2. Variables will be transformed or joined as needed, depending on each step and the results of the analysis.
3. Data Preparation
Before performing the data cleanup, I copied each original dataset into a new dataset. In this phase, the categorical variables were encoded into numeric values and missing values were imputed. This process is necessary because the learning algorithms used here expect numeric inputs with no null values.
Portfolio data cleanup
- Channels and offer type were encoded using one-hot encoding
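A rough sketch of this one-hot encoding is shown below. It assumes the channels column holds a list of channel names per offer and that the type column is called offer_type; both are assumptions, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; the list-valued "channels" column is an assumption.
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b", "offer_c"],
    "offer_type": ["bogo", "discount", "informational"],
    "channels": [["email", "mobile"], ["web", "email"], ["web", "email", "mobile"]],
})

# offer_type: one indicator column per category.
type_dummies = pd.get_dummies(portfolio["offer_type"], dtype=int)

# channels: one row per (offer, channel) after explode, encoded,
# then collapsed back to one row per offer.
channel_dummies = (
    pd.get_dummies(portfolio["channels"].explode(), dtype=int)
    .groupby(level=0)
    .max()
)

portfolio_encoded = pd.concat(
    [portfolio.drop(columns=["offer_type", "channels"]), type_dummies, channel_dummies],
    axis=1,
)
print(portfolio_encoded)
```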
Profile data cleanup
- Gender was encoded into numerical values
- Created a column for Income group
- Created a column for Age group, with the following values
- Became_member_on was converted into number of years as membership length
- As a result of the encoding, the largest customer groups are:
Age between 30 and 60, male, with income of more than $70,000, and a membership length of about 1 year.
Transcript data cleanup
- The value variable contains either an offer id or a transaction amount, depending on the record. The data was extracted from the value variable based on the event and offer type
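The extraction could be sketched like this, assuming each value entry is a dictionary. The key names ("offer id", "offer_id", "amount"), the event labels, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Made-up sample rows; the dictionary keys are assumptions.
transcript = pd.DataFrame({
    "person": ["cust_1", "cust_1", "cust_2"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [
        {"offer id": "offer_a"},
        {"amount": 12.5},
        {"offer_id": "offer_a", "reward": 5},
    ],
})


def extract_offer_id(value):
    """Return the offer id if present, allowing for two spellings of the key."""
    return value.get("offer id", value.get("offer_id"))


transcript["offer_id"] = transcript["value"].apply(extract_offer_id)
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))

print(transcript[["person", "event", "offer_id", "amount"]])
```

Rows that carry no offer id (transactions) or no amount (offer events) end up with missing values in the respective column, which makes the two record types easy to separate afterwards.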
Merging the data
The three datasets were merged and transformed so that the new dataset contains the encoded transaction data per customer, with corresponding offer attributes such as amount, reward, and day of event, i.e. the day the offer was completed or the day the transaction was made. The transformed data also contains the customer’s demographics.
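As an illustration only, a merge along these lines joins the transcript to the profile on the customer id and to the portfolio on the offer id. The column names and sample rows are assumptions, and the actual notebook may combine the data differently.

```python
import pandas as pd

# Made-up, already-cleaned sample frames.
transcript = pd.DataFrame({
    "person": ["cust_1", "cust_2", "cust_2"],
    "offer_id": ["offer_a", "offer_b", "offer_a"],
    "event": ["offer completed", "offer viewed", "offer received"],
})
profile = pd.DataFrame({
    "id": ["cust_1", "cust_2"],
    "gender": ["F", "M"],
    "age": [45, 33],
})
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "discount"],
    "reward": [5, 2],
})

# Left joins keep every transcript row, even if a lookup finds no match.
merged = (
    transcript
    .merge(profile.rename(columns={"id": "person"}), on="person", how="left")
    .merge(portfolio.rename(columns={"id": "offer_id"}), on="offer_id", how="left")
)
print(merged)
```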
In the following analysis, the re-usable functions were used to plot and understand the data.
1. Gender relationship to average transaction spend, and completed offer
The graph indicates that there are more Male customers than Female, and a small number of Others. However, the average spend of Female and Others is higher than that of Male. Male customers are more likely to complete an offer.
2. Age group relationship to average transaction spend, and completed offer
Age Group: 1 if x <= 30, 2 if x > 30 and x < 60, 3 if x >= 60, else 0
The graph indicates that most customers are 30–60 years old. The age group 60 and above has the highest average spend. The age group 30–60 is more likely to complete an offer.
3. Income group relationship to average transaction spend, and completed offer
Income Group: 1 if x <= 50000, 2 if x > 50000 and x < 70000, 3 if x >= 70000, else 0
The graph indicates that most customers belong to income groups 2 and 3, which translates to above 50K. The income group 70K and above has the highest average spend. Similarly, the income group 70K and above is more likely to complete an offer.
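The age and income group rules quoted above translate directly into two small functions. This sketch assumes the "else 0" branch is there for missing values; the function names and sample data are made up.

```python
import numpy as np
import pandas as pd


def age_group(age):
    """1 if age <= 30, 2 if 30 < age < 60, 3 if age >= 60, else 0 (assumed: missing)."""
    if pd.isna(age):
        return 0
    if age <= 30:
        return 1
    if age < 60:
        return 2
    return 3


def income_group(income):
    """1 if income <= 50000, 2 if 50000 < income < 70000, 3 if income >= 70000, else 0."""
    if pd.isna(income):
        return 0
    if income <= 50000:
        return 1
    if income < 70000:
        return 2
    return 3


# Made-up sample data.
customers = pd.DataFrame({
    "age": [25, 30, 45, 60, np.nan],
    "income": [40000, 50000, 65000, 70000, np.nan],
})
customers["age_group"] = customers["age"].apply(age_group)
customers["income_group"] = customers["income"].apply(income_group)
print(customers)
```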
4. Membership length relationship to average transaction spend, and completed offer
The graph indicates that most customers have been members for a year. Customers who have been members for 2 years have the highest average spend. Customers who have been members for up to a year are more likely to complete an offer.
5. Offer type relationship to completed offer
The graph indicates that there is a slight difference in completed offers between the BOGO and Discount offer types. As expected, the informational offer type has no completed offers.
6. Offer attributes reward, difficulty, and duration relationship to completed offer
The graph indicates that the highest reward of 10 has slightly fewer completed offers than rewards of 2 and 5. A minimum required spend of 10 has the most completed offers. A duration of 7 days has the most completed offers, while durations of 5 and 10 days have almost the same result.
7. Offer channel relationship to completed offer
The offer channel doesn’t indicate any difference in or relationship to completed offers. All three channels show the same trend.
8. Count of offers, viewed, received, completed on day of the week
Most offers are sent to customers on Monday and viewed on the same day. Offers are most likely to be completed on Monday and Tuesday. The fewest offers are completed during the weekend.
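A count of events per weekday like the one described above can be produced with a crosstab. The sketch below uses made-up rows and assumes a weekday column has already been derived; how the notebook derives the day of the week is not shown in this post.

```python
import pandas as pd

# Made-up sample rows with an already-derived weekday column.
events = pd.DataFrame({
    "event": [
        "offer received", "offer viewed", "offer completed",
        "offer received", "offer completed", "offer viewed",
    ],
    "weekday": ["Mon", "Mon", "Tue", "Mon", "Mon", "Sat"],
})

day_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Rows: weekday, columns: event type, cells: number of events.
weekday_counts = pd.crosstab(events["weekday"], events["event"]).reindex(
    day_order, fill_value=0
)
print(weekday_counts)
```

Calling weekday_counts.plot(kind="bar") would turn the table into a grouped bar chart if matplotlib is installed.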
4. Build the Model
In this part, I built two models so that I could compare the results and see whether any differences provide useful insights. Before building the models, I used Sklearn’s train_test_split function to split the data into 70% train data and 30% test data.
Model using Random Forest Classifier:
Model using Gradient Boosting Classifier:
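A minimal sketch of the split and the two models, using synthetic data in place of the merged Starbucks dataset. The hyperparameters are scikit-learn defaults and the random seed is arbitrary; the settings used in the notebook may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged feature matrix and the completed-offer label.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, random_state=42
)

# 70% train data, 30% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
gb_model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# score() reports mean accuracy on the held-out test set.
rf_score = rf_model.score(X_test, y_test)
gb_score = gb_model.score(X_test, y_test)
print("Random Forest test accuracy:", round(rf_score, 3))
print("Gradient Boosting test accuracy:", round(gb_score, 3))
```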
5. Model Evaluation
To better understand the results of the models and determine how to implement them, I used the feature_importances_ attribute to extract the features that are most significant to each predictive model.
Random Forest Classifier Result:
Both models produced a score of 0.97 and similar important features. The top 5 features of the Random Forest Classifier model are: the day of the week the offer was completed (wkday_completed), the day of the week the offer was viewed (wkday_viewed), rewardvalue, discount offer type, and bogo offer type. The top 5 features of the Gradient Boosting Classifier model are: the day of the week the offer was completed (wkday_completed), discount offer type, bogo offer type, the day of the week the offer was viewed (wkday_viewed), and income_code. Between the two models, the common features that are significant to the predictive model are:
(1) day of the week the offer was completed (wkday_completed)
(2) day of the week the offer was viewed (wkday_viewed)
(3) discount offer type
(4) bogo offer type
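Ranking features and finding the ones both models agree on can be sketched as below. The data is synthetic and the feature names are placeholders, so the ranking it prints says nothing about the Starbucks results; top_features is a made-up helper name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(10)]
X_arr, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X = pd.DataFrame(X_arr, columns=feature_names)

rf_model = RandomForestClassifier(random_state=0).fit(X, y)
gb_model = GradientBoostingClassifier(random_state=0).fit(X, y)


def top_features(model, names, n=5):
    """Return the n largest feature importances as a labeled Series."""
    importances = pd.Series(model.feature_importances_, index=names)
    return importances.sort_values(ascending=False).head(n)


rf_top = top_features(rf_model, feature_names)
gb_top = top_features(gb_model, feature_names)

# Features that appear in the top 5 of both models.
common_features = [name for name in rf_top.index if name in gb_top.index]

print(rf_top)
print(gb_top)
print("Common:", common_features)
```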
The implementation can be done in the following stages:
1. Save the data into a database (e.g. SQL or DB2)
2. Develop a Jupyter Notebook that will:
- Read the data from the database
- Define callable functions to perform the data cleanup and transformation as implemented in this Jupyter Notebook
- Develop callable functions to perform feature analysis and visualization, which can be deployed to an online dashboard
- Define callable functions that will train and execute the two predictive models
3. Deploy the model into a hosting environment where it can be called from applications
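The first two stages could start from something as small as the sketch below, which uses an in-memory SQLite database purely as a stand-in for whatever database is chosen. The table name offer_events, the helper read_offer_events, and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up sample of the merged data.
merged_sample = pd.DataFrame({
    "person": ["cust_1", "cust_2"],
    "offer_type": ["bogo", "discount"],
    "completed": [1, 0],
})

# Stage 1: save the data into a database (SQLite in memory as a stand-in).
conn = sqlite3.connect(":memory:")
merged_sample.to_sql("offer_events", conn, if_exists="replace", index=False)


# Stage 2: a callable that reads the data back for cleanup and modeling.
def read_offer_events(connection):
    return pd.read_sql("SELECT * FROM offer_events", connection)


loaded = read_offer_events(conn)
conn.close()
print(loaded)
```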
To recap, the objective of this project is to combine the provided transaction, demographic, and offer datasets to determine which demographic groups respond best to which offer type. To achieve this objective, I did a feature analysis to identify the demographics of customers who completed an offer, using visualization and several helper functions. I also built two models using supervised learning algorithms.
Based on the results of the feature analysis, male customers who are 30–60 years old, have an income of 70K or higher, and have been members for up to a year are most likely to complete an offer. Additionally, sending the offer to customers on Monday is effective because customers are most likely to complete the offer on Monday and Tuesday, regardless of demographic profile. This conclusion is supported by the models, as day of the week is among the top five important features.
As expected, data understanding and data preparation took longer than the actual analysis and model building. Also, even though I knew the goal and the business question I was trying to answer, it was challenging to determine which encodings, transformations, and calculated variables to use to build the model. I decided to do simple calculations, define functions, and build simple models so that I could complete the project. The data cleanup, transformation, and final dataframe structure can be refined further. For example, further analysis and transformation can be done on the difference between the time an offer was viewed and the time it was completed, to build a more insightful model.
I hope this blog post provides useful insights. Please check out the code and details here. This analysis is for an online learning project and doesn’t represent insights from the Starbucks company.
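The view-to-completion idea could be explored with a feature like the one sketched here. It assumes one row per customer and offer, with event times measured in hours; the column names and values are made up, and this was not part of the project.

```python
import numpy as np
import pandas as pd

# Made-up rows: one per (customer, offer), times assumed to be in hours.
offers = pd.DataFrame({
    "person": ["cust_1", "cust_2", "cust_3"],
    "offer_id": ["offer_a", "offer_a", "offer_b"],
    "time_viewed": [6.0, 24.0, np.nan],
    "time_completed": [30.0, 18.0, 48.0],
})

# Hours between viewing and completing; NaN if the offer was never viewed.
offers["hours_to_complete"] = offers["time_completed"] - offers["time_viewed"]

# True only when the offer was viewed before (or at) completion.
offers["completed_after_view"] = offers["hours_to_complete"] >= 0

print(offers)
```

A negative or missing value flags offers completed without being viewed first, which is the kind of case a refined model might want to treat separately.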