Overview

Ensemble methods combine the predictions of many models. For example, in bagging (short for bootstrap aggregation), models are fit in parallel on m bootstrapped samples (e.g., m = 50), and the predictions from the m models are then averaged to obtain the prediction of the ensemble. In this tutorial we walk through the basics of three ensemble methods: bagging, random forests, and boosting.
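
The bagging recipe described above can be sketched by hand. This is only an illustration under stated assumptions: the data are simulated (x1, x2 and y are hypothetical, not the Carseats variables), rpart is used as the base learner, and m = 50 is the example value from the text rather than a recommended setting.

```r
library(rpart)  # recommended package shipped with R

set.seed(1)
# simulated two-class data (hypothetical, for illustration only)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(n, sd = 0.5) > 0, "Yes", "No"))

m <- 50  # number of bootstrap samples
# each column of probs holds one tree's predicted probability of "Yes"
probs <- sapply(seq_len(m), function(b) {
  idx <- sample(n, n, replace = TRUE)            # bootstrap sample of rows
  fit <- rpart(y ~ x1 + x2, data = dat[idx, ], method = "class")
  predict(fit, newdata = dat, type = "prob")[, "Yes"]
})

bagged.prob  <- rowMeans(probs)                  # average over the m trees
bagged.class <- factor(ifelse(bagged.prob > 0.5, "Yes", "No"),
                       levels = levels(dat$y))
mean(bagged.class == dat$y)                      # accuracy on the fitting data
```

The accuracy printed at the end is computed on the same rows used for fitting, so it is optimistic; the tutorial uses a separate test set for that reason.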

Outline

In this session we cover …

  1. Introduction to Data
  2. Splitting Data into Training and Test sets
  3. Model 0: A Single Classification Tree
  4. Model 1: Bagging of ctrees
  5. Model 2: Random Forest for classification trees
  6. Model 2a: CForest for Conditional Inference Tree
  7. Model 3: Gradient Boosting
  8. Model Stacking (Not included yet)
  9. Model Comparison
  10. Conclusion

Prelim - Loading libraries used in this script.

library(psych)  #for general functions
library(ggplot2)  #for data visualization

# library(devtools)
# devtools::install_github('topepo/caret/pkg/caret') #May need the github version to correct a bug with parallelizing
library(caret)  #for training and cross validation (also calls other model libraries)
## Warning: Installed Rcpp (0.12.13) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
library(rpart)  #for trees
#library(rattle)    # Fancy tree plot This is a difficult library to install (https://gist.github.com/zhiyzuo/a489ffdcc5da87f28f8589a55aa206dd) 
library(rpart.plot)             # Enhanced tree plots
library(RColorBrewer)       # Color selection for fancy tree plot
library(party)                  # Alternative decision tree algorithm
library(partykit)               # Convert rpart object to BinaryTree
library(pROC)   #for ROC curves

library(ISLR)  #for the Carseat Data
## Warning: package 'ISLR' was built under R version 3.4.2

1. Introduction to Data

Let's look at another data example …

Reading in the Carseats data set. This is a simulated data set containing sales of child car seats at 400 different stores. Sales can be predicted by 10 other variables.

#loading the data
data("Carseats")

Prelim - Descriptives

Let's have a quick look at the data file and the descriptives.

#data structure
head(Carseats,10)
##    Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1   9.50       138     73          11        276   120       Bad  42
## 2  11.22       111     48          16        260    83      Good  65
## 3  10.06       113     35          10        269    80    Medium  59
## 4   7.40       117    100           4        466    97    Medium  55
## 5   4.15       141     64           3        340   128       Bad  38
## 6  10.81       124    113          13        501    72       Bad  78
## 7   6.63       115    105           0         45   108    Medium  71
## 8  11.85       136     81          15        425   120      Good  67
## 9   6.54       132    110           0        108   124    Medium  76
## 10  4.69       132    113           0        131   124    Medium  76
##    Education Urban  US
## 1         17   Yes Yes
## 2         10   Yes Yes
## 3         12   Yes Yes
## 4         14   Yes Yes
## 5         13   Yes  No
## 6         16    No Yes
## 7         15   Yes  No
## 8         10   Yes Yes
## 9         10    No  No
## 10        17    No Yes

Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each location.

(Note again that there is no id variable. This is convenient for some tasks.)

Descriptives

#sample descriptives
describe(Carseats)
##             vars   n   mean     sd median trimmed    mad min    max  range
## Sales          1 400   7.50   2.82   7.49    7.43   2.87   0  16.27  16.27
## CompPrice      2 400 124.97  15.33 125.00  125.04  14.83  77 175.00  98.00
## Income         3 400  68.66  27.99  69.00   68.26  35.58  21 120.00  99.00
## Advertising    4 400   6.63   6.65   5.00    5.89   7.41   0  29.00  29.00
## Population     5 400 264.84 147.38 272.00  265.56 191.26  10 509.00 499.00
## Price          6 400 115.80  23.68 117.00  115.92  22.24  24 191.00 167.00
## ShelveLoc*     7 400   2.31   0.83   3.00    2.38   0.00   1   3.00   2.00
## Age            8 400  53.32  16.20  54.50   53.48  20.02  25  80.00  55.00
## Education      9 400  13.90   2.62  14.00   13.88   2.97  10  18.00   8.00
## Urban*        10 400   1.70   0.46   2.00    1.76   0.00   1   2.00   1.00
## US*           11 400   1.64   0.48   2.00    1.68   0.00   1   2.00   1.00
##              skew kurtosis   se
## Sales        0.18    -0.11 0.14
## CompPrice   -0.04     0.01 0.77
## Income       0.05    -1.10 1.40
## Advertising  0.63    -0.57 0.33
## Population  -0.05    -1.21 7.37
## Price       -0.12     0.41 1.18
## ShelveLoc*  -0.62    -1.28 0.04
## Age         -0.08    -1.14 0.81
## Education    0.04    -1.31 0.13
## Urban*      -0.90    -1.20 0.02
## US*         -0.60    -1.64 0.02
#histogram of outcome
ggplot(data=Carseats, aes(x=Sales)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") + 
  geom_vline(xintercept = 8, color="red", size=2) +
  labs(x = "Sales")

For convenience of didactic illustration we create a new variable HighSales that is binary, “No” if Sales <= 8, and “Yes” otherwise.

#creating new binary variable
Carseats$HighSales=ifelse(Carseats$Sales<=8,"No","Yes")

Some Data cleanup

#remove old variable
Carseats$Sales <- NULL
#convert a factor variable into numeric codes
#(codes follow the alphabetical level order: Bad = 1, Good = 2, Medium = 3)
Carseats$ShelveLoc <- as.numeric(Carseats$ShelveLoc)
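
A small aside on the conversion above: as.numeric() on a factor returns the internal level codes, and levels are alphabetical by default, so the codes do not follow the Bad < Medium < Good ordering. The sketch below uses a short hypothetical vector (shelf) rather than the Carseats column, and shows one way to get ordered codes if that is wanted.

```r
shelf <- factor(c("Bad", "Medium", "Good", "Medium"))  # hypothetical values

levels(shelf)        # "Bad" "Good" "Medium" (alphabetical)
codes.default <- as.numeric(shelf)
codes.default        # 1 3 2 3

# setting the level order explicitly gives codes that follow shelf quality
shelf.ord <- factor(shelf, levels = c("Bad", "Medium", "Good"))
codes.ordered <- as.numeric(shelf.ord)
codes.ordered        # 1 2 3 2
```

Tree-based methods split on thresholds, so the alphabetical coding still works in this tutorial, but a single split can no longer separate Good from both Bad and Medium.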

2. Splitting the data into training and test sets

We split the data - half for Training, half for Testing

#random sample half the rows 
halfsample = sample(dim(Carseats)[1], dim(Carseats)[1]/2) # half of sample
#create training and test data sets
Carseats.train = Carseats[halfsample, ]
Carseats.test = Carseats[-halfsample, ]
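
Note that the split above is drawn before any random seed is set, so rerunning the script gives a different split (and different results). A minimal sketch of a reproducible split follows; it uses a hypothetical stand-in data frame (dat) and the seed value 1234 is arbitrary.

```r
# toy stand-in for the Carseats data frame (hypothetical values)
dat <- data.frame(id = 1:400, x = seq_len(400))

set.seed(1234)                           # seed first, then sample
halfsample <- sample(nrow(dat), nrow(dat) / 2)
dat.train <- dat[halfsample, ]
dat.test  <- dat[-halfsample, ]

set.seed(1234)                           # the same seed reproduces the same rows
same.split <- identical(halfsample, sample(nrow(dat), nrow(dat) / 2))
same.split                               # TRUE
```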

We will use these to evaluate a variety of different classification algorithms: a single classification tree, bagging, random forests, cforest, and gradient boosting.

Setting up k-fold cross-validation with k = 10 folds.

First, we set up the cross validation control

#Setting the random seed for replication
set.seed(1234)

#setting up cross-validation
cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)
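
To make clear what the cross-validation control asks caret to do, here is a hand-rolled sketch of 10-fold cross-validation. It is not caret's internal code: the data are simulated (x1, x2, y are hypothetical) and a default rpart tree stands in for the model being evaluated.

```r
library(rpart)  # recommended package shipped with R

set.seed(1234)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))          # simulated predictors
dat$y <- factor(ifelse(dat$x1 - dat$x2 + rnorm(n) > 0, "Yes", "No"))

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label for each row

# for each fold: fit on the other 9 folds (180 rows), score on the held-out 20
cv.acc <- sapply(seq_len(k), function(i) {
  fit  <- rpart(y ~ ., data = dat[fold != i, ], method = "class")
  pred <- predict(fit, newdata = dat[fold == i, ], type = "class")
  mean(pred == dat$y[fold == i])
})

mean(cv.acc)   # cross-validated accuracy estimate
```

With 200 training rows and 10 folds, each model is fit on 180 rows, which matches the “Summary of sample sizes: 180, 180, ...” lines in the caret output.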

3. Model 0: A Single Classification Tree

We first optimize the fit of a classification tree. Our objective with the cross-validation is to optimize the size of the tree by tuning mincriterion, the ctree parameter that controls how much evidence is required before a split is made.

train.tree <- train(as.factor(HighSales) ~ ., 
                   data=Carseats.train,
                   method="ctree",
                   trControl=cvcontrol,
                   tuneLength = 10)
train.tree
## Conditional Inference Tree 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  Accuracy  Kappa     
##   0.0100000     0.570     0.11907115
##   0.1188889     0.570     0.11907115
##   0.2277778     0.560     0.09628222
##   0.3366667     0.560     0.09758445
##   0.4455556     0.570     0.11934915
##   0.5544444     0.570     0.11934915
##   0.6633333     0.580     0.14348169
##   0.7722222     0.585     0.15642361
##   0.8811111     0.600     0.19649796
##   0.9900000     0.560     0.09070466
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.8811111.
plot(train.tree)

We see that the accuracy is maximized at a relatively high value of mincriterion (0.88), that is, at a relatively less complex tree.
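
For ctree, mincriterion is the value of 1 - p-value that a candidate split must exceed, so larger values give smaller trees. The sketch below reproduces the ten grid values printed in the output and picks the one with the largest cross-validated accuracy. The accuracies are copied from the output above; that caret builds its grid with exactly this seq() call is an assumption.

```r
# ten equally spaced values between 0.01 and 0.99 (matches the printed grid)
mincriterion <- seq(0.01, 0.99, length.out = 10)

# cross-validated accuracies copied from the train.tree output
accuracy <- c(0.570, 0.570, 0.560, 0.560, 0.570,
              0.570, 0.580, 0.585, 0.600, 0.560)

best <- mincriterion[which.max(accuracy)]
best   # 0.8811111
```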

Look at the final tree

# plot tree
plot(train.tree$finalModel,
    main="Classification Tree for Carseat High Sales")

To evaluate the accuracy of the tree we can look at the confusion matrix for the Training data.

#obtaining class predictions
tree.classTrain <-  predict(train.tree, 
                          type="raw")
head(tree.classTrain)
## [1] Yes No  No  Yes Yes Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,tree.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  62  48
##        Yes  8  82
##                                          
##                Accuracy : 0.72           
##                  95% CI : (0.6523, 0.781)
##     No Information Rate : 0.65           
##     P-Value [Acc > NIR] : 0.02131        
##                                          
##                   Kappa : 0.4563         
##  Mcnemar's Test P-Value : 1.872e-07      
##                                          
##             Sensitivity : 0.8857         
##             Specificity : 0.6308         
##          Pos Pred Value : 0.5636         
##          Neg Pred Value : 0.9111         
##              Prevalence : 0.3500         
##          Detection Rate : 0.3100         
##    Detection Prevalence : 0.5500         
##       Balanced Accuracy : 0.7582         
##                                          
##        'Positive' Class : No             
## 

Some errors: the tree classifies 72% of the Training cases correctly.
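
A note on reading these tables. caret's confusionMatrix() takes the predicted classes as its first argument and the true classes as its second. The calls in this tutorial pass the true classes first, so in the printed tables the rows labelled “Prediction” are the true classes and the columns labelled “Reference” are the predictions. Accuracy and Kappa are unaffected, but Sensitivity and Pos Pred Value trade places (as do Specificity and Neg Pred Value). The arithmetic below checks this using the counts from the Training confusion matrix for the single tree.

```r
# counts copied from the Training confusion matrix for the single tree;
# rows are the true classes, columns are the predicted classes
cm <- matrix(c(62, 8, 48, 82), nrow = 2,
             dimnames = list(truth = c("No", "Yes"),
                             predicted = c("No", "Yes")))

acc <- sum(diag(cm)) / sum(cm)
acc                      # 0.72

# share of predicted "No" that are truly "No" (printed above as Sensitivity)
ppv.no <- cm["No", "No"] / sum(cm[, "No"])
round(ppv.no, 4)         # 0.8857

# share of true "No" that are predicted "No" (printed above as Pos Pred Value)
sens.no <- cm["No", "No"] / sum(cm["No", ])
round(sens.no, 4)        # 0.5636
```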

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
tree.classTest <-  predict(train.tree, 
                         newdata = Carseats.test,
                          type="raw")
head(tree.classTest)
## [1] Yes No  Yes Yes Yes Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,tree.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  63  63
##        Yes 14  60
##                                           
##                Accuracy : 0.615           
##                  95% CI : (0.5438, 0.6828)
##     No Information Rate : 0.615           
##     P-Value [Acc > NIR] : 0.5312          
##                                           
##                   Kappa : 0.2734          
##  Mcnemar's Test P-Value : 4.498e-08       
##                                           
##             Sensitivity : 0.8182          
##             Specificity : 0.4878          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.8108          
##              Prevalence : 0.3850          
##          Detection Rate : 0.3150          
##    Detection Prevalence : 0.6300          
##       Balanced Accuracy : 0.6530          
##                                           
##        'Positive' Class : No              
## 

Accuracy of 0.615, which is no better than the no-information rate (0.615).

When evaluating classification models, a few other functions may be useful. For example, the pROC package provides functions for obtaining and plotting ROC curves, along with the associated measures of sensitivity and specificity. We can look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
tree.probs=predict(train.tree,
                 newdata=Carseats.test,
                 type="prob")
head(tree.probs)
##          No       Yes
## 1 0.4473684 0.5526316
## 2 0.8709677 0.1290323
## 3 0.2962963 0.7037037
## 4 0.2962963 0.7037037
## 5 0.2962963 0.7037037
## 6 0.4473684 0.5526316
#Calculate ROC curve
rocCurve.tree <- roc(Carseats.test$HighSales,tree.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.tree,col=c(4))

#calculate the area under curve (bigger is better)
auc(rocCurve.tree)
## Area under the curve: 0.6714
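
The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen “Yes” case receives a higher predicted probability than a randomly chosen “No” case (ties count one half). The sketch below computes it directly in base R on simulated scores (truth and score are hypothetical); it is meant to illustrate the quantity, not to reproduce pROC's implementation.

```r
set.seed(1)
truth <- rep(c("No", "Yes"), each = 50)          # simulated true classes
score <- c(rnorm(50, mean = 0), rnorm(50, mean = 1))  # simulated scores

pos <- score[truth == "Yes"]
neg <- score[truth == "No"]

# compare every "Yes" score with every "No" score
auc.pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# equivalent rank-based (Mann-Whitney) formula
r <- rank(score)
auc.rank <- (sum(r[truth == "Yes"]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

c(auc.pairs, auc.rank)   # the two values agree
```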

4. Model 1: Bagging of ctrees

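Training the model using treebag

Note that caret's treebag method bags CART trees (the output below is labelled “Bagged CART”) rather than conditional inference trees; the commented-out bag() call sketches a ctree-based alternative. The treebag method has no tuning parameters, so here the cross-validation only provides an estimate of out-of-sample accuracy.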

#Fix data file for use in bag() function
# Carseats2 <- Carseats.train
# Carseats2$Urban <- as.factor(Carseats2$Urban)
# Carseats2$US <- as.factor(Carseats2$US)
# Carseats2$HighSales <- as.factor(Carseats2$HighSales)
# 
# train.bagg <- bag(Carseats2[,-11],Carseats2[,11], B = 10
#                    ,
#                    bagControl = bagControl(fit = ctreeBag$fit,
#                                         predict = ctreeBag$pred,
#                                         aggregate = ctreeBag$aggregate))


#Using treebag 
train.bagg <- train(as.factor(HighSales) ~ ., 
                   data=Carseats.train,
                   method="treebag",
                   trControl=cvcontrol,
                   importance=TRUE)

train.bagg
## Bagged CART 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.75      0.4963593
plot(varImp(train.bagg))

It is not yet clear how to extract more details from the fitted object in order to look at the collection of final trees.

To evaluate the accuracy of the Bagged Trees we can look at the confusion matrix for the Training data.

#obtaining class predictions
bagg.classTrain <-  predict(train.bagg, 
                          type="raw")
head(bagg.classTrain)
## [1] No  No  No  Yes Yes Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,bagg.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  110   0
##        Yes   0  90
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9817, 1)
##     No Information Rate : 0.55       
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##              Prevalence : 0.55       
##          Detection Rate : 0.55       
##    Detection Prevalence : 0.55       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : No         
## 

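The accuracy is perfect, but this is accuracy on the data the model was trained on, so it says little about how well the model generalizes.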

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
bagg.classTest <-  predict(train.bagg, 
                         newdata = Carseats.test,
                          type="raw")
head(bagg.classTest)
## [1] No  No  No  No  No  Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,bagg.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  107  19
##        Yes  16  58
##                                          
##                Accuracy : 0.825          
##                  95% CI : (0.7651, 0.875)
##     No Information Rate : 0.615          
##     P-Value [Acc > NIR] : 9.519e-11      
##                                          
##                   Kappa : 0.6277         
##  Mcnemar's Test P-Value : 0.7353         
##                                          
##             Sensitivity : 0.8699         
##             Specificity : 0.7532         
##          Pos Pred Value : 0.8492         
##          Neg Pred Value : 0.7838         
##              Prevalence : 0.6150         
##          Detection Rate : 0.5350         
##    Detection Prevalence : 0.6300         
##       Balanced Accuracy : 0.8116         
##                                          
##        'Positive' Class : No             
## 

Accuracy of 0.825

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
bagg.probs=predict(train.bagg,
                 newdata=Carseats.test,
                 type="prob")
head(bagg.probs)
##     No  Yes
## 1 0.96 0.04
## 2 0.60 0.40
## 3 0.96 0.04
## 4 0.52 0.48
## 5 0.72 0.28
## 6 0.04 0.96
#Calculate ROC curve
rocCurve.bagg <- roc(Carseats.test$HighSales,bagg.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.bagg,col=c(6))

#calculate the area under curve (bigger is better)
auc(rocCurve.bagg)
## Area under the curve: 0.8904

5. Model 2: Random Forest for classification trees

Training the model using random forest

train.rf <- train(as.factor(HighSales) ~ ., 
                  data=Carseats.train,
                  method="rf",
                  trControl=cvcontrol,
                  #tuneLength = 3,
                  importance=TRUE)
train.rf
## Random Forest 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa    
##    2    0.775     0.5397404
##    6    0.755     0.5059471
##   10    0.775     0.5441237
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

We can look at the confusion matrix for the Training data.

#obtaining class predictions
rf.classTrain <-  predict(train.rf, 
                          type="raw")
head(rf.classTrain)
## [1] No  No  No  Yes Yes Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,rf.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  110   0
##        Yes   0  90
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9817, 1)
##     No Information Rate : 0.55       
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##              Prevalence : 0.55       
##          Detection Rate : 0.55       
##    Detection Prevalence : 0.55       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : No         
## 

No errors on the Training data. As with bagging, perfect training accuracy mostly reflects how closely the forest fits the Training set.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
rf.classTest <-  predict(train.rf, 
                         newdata = Carseats.test,
                          type="raw")
head(rf.classTest)
## [1] No  No  No  No  No  Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,rf.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  116  10
##        Yes  22  52
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7817, 0.8879)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 9.004e-07       
##                                           
##                   Kappa : 0.6449          
##  Mcnemar's Test P-Value : 0.05183         
##                                           
##             Sensitivity : 0.8406          
##             Specificity : 0.8387          
##          Pos Pred Value : 0.9206          
##          Neg Pred Value : 0.7027          
##              Prevalence : 0.6900          
##          Detection Rate : 0.5800          
##    Detection Prevalence : 0.6300          
##       Balanced Accuracy : 0.8396          
##                                           
##        'Positive' Class : No              
## 

Accuracy of 0.84. A small improvement over bagging (0.825).

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
rf.probs=predict(train.rf,
                 newdata=Carseats.test,
                 type="prob")
head(rf.probs)
##       No   Yes
## 1  0.686 0.314
## 4  0.588 0.412
## 5  0.762 0.238
## 9  0.570 0.430
## 10 0.646 0.354
## 18 0.298 0.702
#Calculate ROC curve
rocCurve.rf <- roc(Carseats.test$HighSales,rf.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.rf,col=c(1))

#calculate the area under curve (bigger is better)
auc(rocCurve.rf)
## Area under the curve: 0.9021

6. Model 2a: CForest for Conditional Inference Tree

cforest is an implementation of the random forest and bagging ensemble algorithms that uses conditional inference trees as base learners (from the party package).

train.cf <- train(HighSales ~ .,   #cforest knows the outcome is binary (unlike rf)
                   data=Carseats.train,
                   method="cforest",
                   trControl=cvcontrol)  #Note that importance not available here 
train.cf
## Conditional Inference Random Forest 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa    
##    2    0.645     0.2429219
##    6    0.735     0.4504639
##   10    0.705     0.3909498
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.

We can look at the confusion matrix for the Training data.

#obtaining class predictions
cf.classTrain <-  predict(train.cf, 
                          type="raw")
head(cf.classTrain)
## [1] No  No  No  Yes Yes Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,cf.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  106   4
##        Yes  11  79
##                                           
##                Accuracy : 0.925           
##                  95% CI : (0.8793, 0.9574)
##     No Information Rate : 0.585           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8474          
##  Mcnemar's Test P-Value : 0.1213          
##                                           
##             Sensitivity : 0.9060          
##             Specificity : 0.9518          
##          Pos Pred Value : 0.9636          
##          Neg Pred Value : 0.8778          
##              Prevalence : 0.5850          
##          Detection Rate : 0.5300          
##    Detection Prevalence : 0.5500          
##       Balanced Accuracy : 0.9289          
##                                           
##        'Positive' Class : No              
## 

A few errors. The model learned the Training data fairly well.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
cf.classTest <-  predict(train.cf, 
                         newdata = Carseats.test,
                          type="raw")
head(cf.classTest)
## [1] No  No  No  No  No  Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,cf.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  119   7
##        Yes  22  52
##                                           
##                Accuracy : 0.855           
##                  95% CI : (0.7984, 0.9007)
##     No Information Rate : 0.705           
##     P-Value [Acc > NIR] : 5.477e-07       
##                                           
##                   Kappa : 0.6754          
##  Mcnemar's Test P-Value : 0.00933         
##                                           
##             Sensitivity : 0.8440          
##             Specificity : 0.8814          
##          Pos Pred Value : 0.9444          
##          Neg Pred Value : 0.7027          
##              Prevalence : 0.7050          
##          Detection Rate : 0.5950          
##    Detection Prevalence : 0.6300          
##       Balanced Accuracy : 0.8627          
##                                           
##        'Positive' Class : No              
## 

Accuracy of 0.855

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
cf.probs=predict(train.cf,
                 newdata=Carseats.test,
                 type="prob")
head(cf.probs)
##          No       Yes
## 1 0.5551222 0.4448778
## 2 0.6379772 0.3620228
## 3 0.7206398 0.2793602
## 4 0.5318676 0.4681324
## 5 0.5603060 0.4396940
## 6 0.3079523 0.6920477
#Calculate ROC curve
rocCurve.cf <- roc(Carseats.test$HighSales,cf.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.cf,col=c(2))

#calculate the area under curve (bigger is better)
auc(rocCurve.cf)
## Area under the curve: 0.9299

7. Model 3: Gradient Boosting

It is possible to use a variety of packages: “gbm”, “ada”, and “xgbLinear”, all of which can be accessed through caret. We can look up the various tuning parameters with modelLookup().

modelLookup("ada")
##   model parameter          label forReg forClass probModel
## 1   ada      iter         #Trees  FALSE     TRUE      TRUE
## 2   ada  maxdepth Max Tree Depth  FALSE     TRUE      TRUE
## 3   ada        nu  Learning Rate  FALSE     TRUE      TRUE
modelLookup("gbm")
##   model         parameter                   label forReg forClass
## 1   gbm           n.trees   # Boosting Iterations   TRUE     TRUE
## 2   gbm interaction.depth          Max Tree Depth   TRUE     TRUE
## 3   gbm         shrinkage               Shrinkage   TRUE     TRUE
## 4   gbm    n.minobsinnode Min. Terminal Node Size   TRUE     TRUE
##   probModel
## 1      TRUE
## 2      TRUE
## 3      TRUE
## 4      TRUE

Here, we use gradient boosting. Example tuning parameters for “gbm” are described at http://topepo.github.io/caret/training.html

Training with gradient boosting

train.gbm <- train(as.factor(HighSales) ~ ., 
                   data=Carseats.train,
                   method="gbm",
                   verbose=FALSE,
                   trControl=cvcontrol)
train.gbm
## Stochastic Gradient Boosting 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy  Kappa    
##   1                   50      0.730     0.4511090
##   1                  100      0.785     0.5654964
##   1                  150      0.805     0.6066137
##   2                   50      0.790     0.5702364
##   2                  100      0.820     0.6342360
##   2                  150      0.795     0.5864473
##   3                   50      0.785     0.5648166
##   3                  100      0.795     0.5864562
##   3                  150      0.810     0.6174929
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
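
To give a feel for what the tuning parameters mean, here is a simplified sketch of gradient boosting for a binary outcome. It is not the gbm algorithm: the data are simulated (x1, x2, y are hypothetical), small rpart regression trees are fit to the residuals y - p, and the leaf values are plain residual means. The names n.trees, shrinkage and depth mirror the roles of gbm's n.trees, shrinkage and interaction.depth, with values taken from the final model above.

```r
library(rpart)  # recommended package shipped with R

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))            # simulated predictors
y <- as.numeric(dat$x1 + dat$x2 + rnorm(n, sd = 0.5) > 0)  # 0/1 outcome

n.trees <- 100; shrinkage <- 0.1; depth <- 2
score <- rep(qlogis(mean(y)), n)       # start at the log-odds of the base rate

for (b in seq_len(n.trees)) {
  dat$resid <- y - plogis(score)       # negative gradient of the Bernoulli log-loss
  fit <- rpart(resid ~ x1 + x2, data = dat, method = "anova",
               control = rpart.control(maxdepth = depth, cp = 0))
  score <- score + shrinkage * predict(fit, newdata = dat)  # take a small step
}

boost.acc <- mean((plogis(score) > 0.5) == (y == 1))
boost.acc                              # accuracy on the fitting data
```

Unlike bagging, the trees are fit sequentially, each one correcting the errors left by the trees before it; shrinkage keeps each correction small so that many trees are needed.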

We can look at the confusion matrix for the Training data.

#obtaining class predictions
gbm.classTrain <-  predict(train.gbm, 
                          type="raw")
head(gbm.classTrain)
## [1] No  No  No  Yes Yes Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,gbm.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  105   5
##        Yes   6  84
##                                           
##                Accuracy : 0.945           
##                  95% CI : (0.9037, 0.9722)
##     No Information Rate : 0.555           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8888          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9459          
##             Specificity : 0.9438          
##          Pos Pred Value : 0.9545          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.5550          
##          Detection Rate : 0.5250          
##    Detection Prevalence : 0.5500          
##       Balanced Accuracy : 0.9449          
##                                           
##        'Positive' Class : No              
## 

A few errors. The model learned the Training data quite well.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
gbm.classTest <-  predict(train.gbm, 
                         newdata = Carseats.test,
                          type="raw")
head(gbm.classTest)
## [1] No  No  No  No  No  Yes
## Levels: No Yes
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,gbm.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  115  11
##        Yes  14  60
##                                          
##                Accuracy : 0.875          
##                  95% CI : (0.821, 0.9174)
##     No Information Rate : 0.645          
##     P-Value [Acc > NIR] : 1.627e-13      
##                                          
##                   Kappa : 0.7296         
##  Mcnemar's Test P-Value : 0.6892         
##                                          
##             Sensitivity : 0.8915         
##             Specificity : 0.8451         
##          Pos Pred Value : 0.9127         
##          Neg Pred Value : 0.8108         
##              Prevalence : 0.6450         
##          Detection Rate : 0.5750         
##    Detection Prevalence : 0.6300         
##       Balanced Accuracy : 0.8683         
##                                          
##        'Positive' Class : No             
## 

Accuracy of 0.875

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
gbm.probs=predict(train.gbm,
                 newdata=Carseats.test,
                 type="prob")
head(gbm.probs)
##           No       Yes
## 1 0.70521837 0.2947816
## 2 0.56658110 0.4334189
## 3 0.85531345 0.1446865
## 4 0.67297281 0.3270272
## 5 0.73232024 0.2676798
## 6 0.04450397 0.9554960
#Calculate ROC curve
rocCurve.gbm <- roc(Carseats.test$HighSales,gbm.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.gbm, col=c(3))

#calculate the area under curve (bigger is better)
auc(rocCurve.gbm)
## Area under the curve: 0.9453

8. Model Stacking

See …
https://machinelearningmastery.com/machine-learning-ensembles-with-r/
https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/

9. Model Comparisons

We can examine how the models do by looking at the ROC curves.

plot(rocCurve.tree,col=c(4))
plot(rocCurve.bagg,add=TRUE,col=c(6)) # color magenta is bagg
plot(rocCurve.rf,add=TRUE,col=c(1)) # color black is rf
plot(rocCurve.cf,add=TRUE,col=c(2)) # color red is cforest
plot(rocCurve.gbm,add=TRUE,col=c(3)) # color green is gbm

Tree = blue, Bagg = magenta, RF = black, CForest = red, gradient boosting = green
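
Alongside the plot, it can help to line up the Test-set numbers in one table. The values below are copied by hand from the outputs printed earlier in this tutorial; in a live session they could instead be collected from the fitted objects (for example by calling auc() on each ROC curve).

```r
# Test-set results copied from the outputs above
results <- data.frame(
  model    = c("tree", "bagging", "rf", "cforest", "gbm"),
  accuracy = c(0.615, 0.825, 0.840, 0.855, 0.875),
  auc      = c(0.6714, 0.8904, 0.9021, 0.9299, 0.9453)
)

# sort from best to worst AUC
results <- results[order(-results$auc), ]
results
```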

10. Conclusion

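For this example, random forests and boosting perform better on the Test data than the single tree and plain bagging: the Test AUC is 0.90 (random forest), 0.93 (cforest), and 0.95 (gradient boosting), versus 0.67 for the single tree and 0.89 for bagging. Comparing the variable importance metrics to the decision tree results is a way to see how likely the tree is to generalize.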

Thank you for playing!