Ensemble Methods combine the predictions of many models. For example, in Bagging (short for bootstrap aggregation), m parallel models (e.g., m = 50) are each fit to a bootstrapped sample of the data, and the predictions from the m models are then averaged to obtain the prediction of the ensemble. In this tutorial we walk through the basics of three Ensemble Methods: Bagging, Random Forests, and Boosting.
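As a quick illustration of the idea, here is a minimal sketch of bagging "by hand". This is not part of the Carseats analysis below; it uses the built-in iris data, rpart trees, and m = 50 simply as an example of averaging predictions across bootstrapped fits.
#a minimal bagging sketch: fit m trees to bootstrapped samples and average predictions
library(rpart)
set.seed(123)
m <- 50 #number of bootstrapped samples
iris2 <- droplevels(iris[iris$Species != "setosa", ]) #two-class subset for illustration
preds <- sapply(1:m, function(i) {
  boot_rows <- sample(nrow(iris2), replace = TRUE) #bootstrap sample (with replacement)
  fit <- rpart(Species ~ ., data = iris2[boot_rows, ]) #tree fit to that sample
  predict(fit, newdata = iris2)[, "virginica"] #predicted probability of 'virginica'
})
bagged_prob <- rowMeans(preds) #ensemble prediction = average across the m trees
head(bagged_prob)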
In this session we cover …
library(psych) #for general functions
library(ggplot2) #for data visualization
# library(devtools)
# devtools::install_github('topepo/caret/pkg/caret') #May need the github version to correct a bug with parallelizing
library(caret) #for training and cross-validation (also calls other model libraries)
library(rpart) #for trees
#library(rattle) # Fancy tree plots; this library can be difficult to install (https://gist.github.com/zhiyzuo/a489ffdcc5da87f28f8589a55aa206dd)
library(rpart.plot) # Enhanced tree plots
library(RColorBrewer) # Color selection for fancy tree plot
library(party) # Alternative decision tree algorithm
library(partykit) # Convert rpart object to BinaryTree
library(pROC) #for ROC curves
library(ISLR) #for the Carseat Data
Let's look at another data example.
Reading in the Carseats Data
This is a simulated data set (from the ISLR package) containing sales of child car seats at 400 different stores. Sales can be predicted from 10 other variables.
#loading the data
data("Carseats")
Let's have a quick look at the data file and the descriptives.
#data structure
head(Carseats,10)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1 9.50 138 73 11 276 120 Bad 42
## 2 11.22 111 48 16 260 83 Good 65
## 3 10.06 113 35 10 269 80 Medium 59
## 4 7.40 117 100 4 466 97 Medium 55
## 5 4.15 141 64 3 340 128 Bad 38
## 6 10.81 124 113 13 501 72 Bad 78
## 7 6.63 115 105 0 45 108 Medium 71
## 8 11.85 136 81 15 425 120 Good 67
## 9 6.54 132 110 0 108 124 Medium 76
## 10 4.69 132 113 0 131 124 Medium 76
## Education Urban US
## 1 17 Yes Yes
## 2 10 Yes Yes
## 3 12 Yes Yes
## 4 14 Yes Yes
## 5 13 Yes No
## 6 16 No Yes
## 7 15 Yes No
## 8 10 Yes Yes
## 9 10 No No
## 10 17 No Yes
Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each location. (Note again that there is no id variable. This is convenient for some tasks.)
Descriptives
#sample descriptives
describe(Carseats)
## vars n mean sd median trimmed mad min max range
## Sales 1 400 7.50 2.82 7.49 7.43 2.87 0 16.27 16.27
## CompPrice 2 400 124.97 15.33 125.00 125.04 14.83 77 175.00 98.00
## Income 3 400 68.66 27.99 69.00 68.26 35.58 21 120.00 99.00
## Advertising 4 400 6.63 6.65 5.00 5.89 7.41 0 29.00 29.00
## Population 5 400 264.84 147.38 272.00 265.56 191.26 10 509.00 499.00
## Price 6 400 115.80 23.68 117.00 115.92 22.24 24 191.00 167.00
## ShelveLoc* 7 400 2.31 0.83 3.00 2.38 0.00 1 3.00 2.00
## Age 8 400 53.32 16.20 54.50 53.48 20.02 25 80.00 55.00
## Education 9 400 13.90 2.62 14.00 13.88 2.97 10 18.00 8.00
## Urban* 10 400 1.70 0.46 2.00 1.76 0.00 1 2.00 1.00
## US* 11 400 1.64 0.48 2.00 1.68 0.00 1 2.00 1.00
## skew kurtosis se
## Sales 0.18 -0.11 0.14
## CompPrice -0.04 0.01 0.77
## Income 0.05 -1.10 1.40
## Advertising 0.63 -0.57 0.33
## Population -0.05 -1.21 7.37
## Price -0.12 0.41 1.18
## ShelveLoc* -0.62 -1.28 0.04
## Age -0.08 -1.14 0.81
## Education 0.04 -1.31 0.13
## Urban* -0.90 -1.20 0.02
## US* -0.60 -1.64 0.02
#histogram of outcome
ggplot(data=Carseats, aes(x=Sales)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") +
  geom_vline(xintercept = 8, color="red", size=2) +
  labs(x = "Sales")
For didactic convenience, we create a new binary variable HighSales: "No" if Sales <= 8, and "Yes" otherwise.
#creating new binary variable
Carseats$HighSales <- ifelse(Carseats$Sales <= 8, "No", "Yes")
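A quick tabulation confirms the class balance of the new outcome (output not shown here):
#check the class balance of HighSales
table(Carseats$HighSales)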
Some data cleanup
#remove old variable
Carseats$Sales <- NULL
#convert a factor variable into a numeric variable
Carseats$ShelveLoc <- as.numeric(Carseats$ShelveLoc)
We split the data: half for training, half for testing.
#random sample half the rows
halfsample = sample(dim(Carseats)[1], dim(Carseats)[1]/2) # half of sample
#create training and test data sets
Carseats.train = Carseats[halfsample, ]
Carseats.test = Carseats[-halfsample, ]
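As an aside, here is a sketch of an alternative splitting approach (not used below): caret's createDataPartition() stratifies the split on the outcome, and setting a seed beforehand makes the split reproducible. The .alt object names are just for illustration.
#alternative: stratified, reproducible half-split (sketch only; not used below)
set.seed(1234)
train_idx <- createDataPartition(Carseats$HighSales, p = 0.5, list = FALSE)
Carseats.train.alt <- Carseats[train_idx, ]
Carseats.test.alt <- Carseats[-train_idx, ]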
We will use these to evaluate a variety of different classification algorithms: Bagging, Random Forests, CForests, and Boosting.
First, we set up the cross-validation control.
#Setting the random seed for replication
set.seed(1234)
#setting up cross-validation
cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)
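Note that with method="repeatedcv" and no repeats argument, the default is a single repeat (the output below reports "repeated 1 times"). If repeated cross-validation is wanted, the control could instead be set up as in this sketch (left commented out so the results below are unaffected):
#cvcontrol <- trainControl(method="repeatedcv", number = 10, repeats = 5,
#                          allowParallel=TRUE)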
We first optimize the fit of a classification tree. Our objective with the cross-validation is to optimize the size of the tree by tuning the mincriterion parameter of the conditional inference tree (the threshold that 1 - p-value of a split's test must exceed for the split to be made), which governs tree complexity.
train.tree <- train(as.factor(HighSales) ~ .,
                    data=Carseats.train,
                    method="ctree",
                    trControl=cvcontrol,
                    tuneLength = 10)
train.tree
## Conditional Inference Tree
##
## 200 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## mincriterion Accuracy Kappa
## 0.0100000 0.570 0.11907115
## 0.1188889 0.570 0.11907115
## 0.2277778 0.560 0.09628222
## 0.3366667 0.560 0.09758445
## 0.4455556 0.570 0.11934915
## 0.5544444 0.570 0.11934915
## 0.6633333 0.580 0.14348169
## 0.7722222 0.585 0.15642361
## 0.8811111 0.600 0.19649796
## 0.9900000 0.560 0.09070466
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.8811111.
plot(train.tree)
We see that accuracy is maximized at a relatively less complex tree (a larger mincriterion imposes a stricter splitting test and thus yields a smaller tree).
Look at the final tree
# plot tree
plot(train.tree$finalModel,
     main="Classification Tree for Carseat High Sales")