Overview

Ensemble methods combine the predictions of many models. For example, in Bagging (short for bootstrap aggregation), parallel models are constructed on m bootstrapped samples (e.g., m = 50), and the predictions from the m models are then averaged to obtain the prediction of the ensemble. In this tutorial we walk through the basics of three ensemble methods: Bagging, Random Forests, and Boosting.
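
To make the averaging concrete, here is a minimal sketch of bagging "by hand" with rpart on simulated data; the data frame dat, the predictors x1 and x2, the outcome y, and the other object names are illustrative stand-ins, not part of the Carseats example used below.

#minimal sketch: grow m trees on bootstrapped samples and average their predictions
library(rpart)

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(n) > 0, "Yes", "No"))

m <- 50  #number of bootstrapped samples
prob.mat <- sapply(1:m, function(i) {
  boot.rows <- sample(n, n, replace = TRUE)             #bootstrap sample of rows
  fit <- rpart(y ~ x1 + x2, data = dat[boot.rows, ])    #tree grown on that sample
  predict(fit, newdata = dat, type = "prob")[, "Yes"]   #predicted P(y = "Yes")
})
bagged.prob <- rowMeans(prob.mat)                       #average across the m trees
bagged.class <- ifelse(bagged.prob > .5, "Yes", "No")   #ensemble classification
table(predicted = bagged.class, observed = dat$y)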

Outline

In this session we cover …

  1. Introduction to Data
  2. Splitting Data into Training and Test sets
  3. Model 0: A Single Classification Tree
  4. Model 1: Bagging of ctrees
  5. Model 2: Random Forest for classification trees
  6. Model 2a: CForest for Conditional Inference Tree
  7. Model 3: Random Forest with Boosting
  8. Model Stacking (Not included yet)
  9. Model Comparison
  10. Conclusion

Prelim - Loading libraries used in this script.

library(psych)  #for general functions
library(ggplot2)  #for data visualization

# library(devtools)
# devtools::install_github('topepo/caret/pkg/caret') #May need the github version to correct a bug with parallelizing
library(caret)  #for training and cross validation (also calls other model libraries)
library(rpart)  #for trees
#library(rattle)  #Fancy tree plots; this library can be difficult to install (https://gist.github.com/zhiyzuo/a489ffdcc5da87f28f8589a55aa206dd)
library(rpart.plot)             # Enhanced tree plots
library(RColorBrewer)       # Color selection for fancy tree plot
library(party)                  # Alternative decision tree algorithm
library(partykit)               # Convert rpart object to BinaryTree
library(pROC)   #for ROC curves

library(ISLR)  #for the Carseat Data

1. Introduction to Data

Let's look at another data example: the Carseats data set from the ISLR package. This is a simulated data set containing sales of child car seats at 400 different stores. Sales can be predicted from 10 other variables.

#loading the data
data("Carseats")

Prelim - Descriptives

Let's have a quick look at the data file and the descriptives.

#data structure
head(Carseats,10)
##    Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1   9.50       138     73          11        276   120       Bad  42
## 2  11.22       111     48          16        260    83      Good  65
## 3  10.06       113     35          10        269    80    Medium  59
## 4   7.40       117    100           4        466    97    Medium  55
## 5   4.15       141     64           3        340   128       Bad  38
## 6  10.81       124    113          13        501    72       Bad  78
## 7   6.63       115    105           0         45   108    Medium  71
## 8  11.85       136     81          15        425   120      Good  67
## 9   6.54       132    110           0        108   124    Medium  76
## 10  4.69       132    113           0        131   124    Medium  76
##    Education Urban  US
## 1         17   Yes Yes
## 2         10   Yes Yes
## 3         12   Yes Yes
## 4         14   Yes Yes
## 5         13   Yes  No
## 6         16    No Yes
## 7         15   Yes  No
## 8         10   Yes Yes
## 9         10    No  No
## 10        17    No Yes

Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each location.

(Note again that there is no id variable. This is convenient for some tasks.)

Descriptives

#sample descriptives
describe(Carseats)
##             vars   n   mean     sd median trimmed    mad min    max  range
## Sales          1 400   7.50   2.82   7.49    7.43   2.87   0  16.27  16.27
## CompPrice      2 400 124.97  15.33 125.00  125.04  14.83  77 175.00  98.00
## Income         3 400  68.66  27.99  69.00   68.26  35.58  21 120.00  99.00
## Advertising    4 400   6.63   6.65   5.00    5.89   7.41   0  29.00  29.00
## Population     5 400 264.84 147.38 272.00  265.56 191.26  10 509.00 499.00
## Price          6 400 115.80  23.68 117.00  115.92  22.24  24 191.00 167.00
## ShelveLoc*     7 400   2.31   0.83   3.00    2.38   0.00   1   3.00   2.00
## Age            8 400  53.32  16.20  54.50   53.48  20.02  25  80.00  55.00
## Education      9 400  13.90   2.62  14.00   13.88   2.97  10  18.00   8.00
## Urban*        10 400   1.70   0.46   2.00    1.76   0.00   1   2.00   1.00
## US*           11 400   1.64   0.48   2.00    1.68   0.00   1   2.00   1.00
##              skew kurtosis   se
## Sales        0.18    -0.11 0.14
## CompPrice   -0.04     0.01 0.77
## Income       0.05    -1.10 1.40
## Advertising  0.63    -0.57 0.33
## Population  -0.05    -1.21 7.37
## Price       -0.12     0.41 1.18
## ShelveLoc*  -0.62    -1.28 0.04
## Age         -0.08    -1.14 0.81
## Education    0.04    -1.31 0.13
## Urban*      -0.90    -1.20 0.02
## US*         -0.60    -1.64 0.02
#histogram of outcome
ggplot(data=Carseats, aes(x=Sales)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") + 
  geom_vline(xintercept = 8, color="red", size=2) +
  labs(x = "Sales")

For convenience of didactic illustration, we create a new binary variable, HighSales, coded "No" if Sales <= 8 and "Yes" otherwise (the red line in the histogram above marks this cut point).

#creating new binary variable
Carseats$HighSales=ifelse(Carseats$Sales<=8,"No","Yes")
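
As a quick check on the new variable (not in the original script), we can tabulate how many stores fall into each category:

#class balance of the new binary outcome
table(Carseats$HighSales)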

Some Data cleanup

#remove old variable
Carseats$Sales <- NULL
#convert the ShelveLoc factor into a numeric variable
#(the levels Bad/Good/Medium become 1/2/3)
Carseats$ShelveLoc <- as.numeric(Carseats$ShelveLoc)

2. Splitting the Data into Training and Test Sets

We split the data in half: one half for training, one half for testing.

#random sample half the rows 
halfsample = sample(dim(Carseats)[1], dim(Carseats)[1]/2) # half of sample
#create training and test data sets
Carseats.train = Carseats[halfsample, ]
Carseats.test = Carseats[-halfsample, ]

We will use these data sets to evaluate a variety of classification approaches: a single classification tree, bagged trees, random forests, conditional inference forests (cforest), and boosted trees.

Setting up k-fold cross-validation with k = 10 folds.

First, we set up the cross-validation control.

#Setting the random seed for replication
set.seed(1234)

#setting up cross-validation
cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)

3. Model 0: A Single Classification Tree

We first optimize the fit of a single classification tree. Our objective with the cross-validation is to optimize the size of the tree by tuning the mincriterion parameter (the threshold that the test statistic for a candidate split must exceed; higher values produce smaller trees).

train.tree <- train(as.factor(HighSales) ~ ., 
                   data=Carseats.train,
                   method="ctree",
                   trControl=cvcontrol,
                   tuneLength = 10)
train.tree
## Conditional Inference Tree 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  Accuracy  Kappa     
##   0.0100000     0.570     0.11907115
##   0.1188889     0.570     0.11907115
##   0.2277778     0.560     0.09628222
##   0.3366667     0.560     0.09758445
##   0.4455556     0.570     0.11934915
##   0.5544444     0.570     0.11934915
##   0.6633333     0.580     0.14348169
##   0.7722222     0.585     0.15642361
##   0.8811111     0.600     0.19649796
##   0.9900000     0.560     0.09070466
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.8811111.
plot(train.tree)

We see that accuracy is maximized at mincriterion = 0.88, which corresponds to a relatively less complex tree.
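
To pull this information directly out of the train object (assuming the train.tree object fit above), caret stores the winning tuning value and the full resampling grid:

#winning tuning parameter value
train.tree$bestTune
#accuracy and kappa across the mincriterion grid
train.tree$results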

Let's look at the final tree.

# plot the final tree
plot(train.tree$finalModel,
    main="Classification Tree for Carseat High Sales")