What is Machine Learning?

One of many definitions: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E.” (Mitchell, T., 1997)

The Caret Package

See the original caret paper in the Journal of Statistical Software: Kuhn, M. (2008), “Building Predictive Models in R Using the caret Package”, JSS 28(5).

#Install the caret package and its dependencies if it is not already installed
if (!require("caret")) install.packages("caret", dependencies=TRUE)

#Load the Package
library("caret")

We will work with the iris and Glass data sets. Glass comes from the mlbench package (iris ships with base R).

This package contains a collection of artificial and real-world machine learning benchmark problems.

#Install the mlbench package and its dependencies if it is not already installed
if (!require("mlbench")) install.packages("mlbench", dependencies=TRUE)

#Load the Package
library("mlbench")

# Glass Identification Database
data(Glass)
dim(Glass)
## [1] 214  10
levels(Glass$Type)
## [1] "1" "2" "3" "5" "6" "7"
head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
# Iris Database
data(iris)
dim(iris)
## [1] 150   5
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

First step: Exploratory Data Analysis and Visualization

#Plot Glass dataset
featurePlot(x = Glass[, 1:9], 
            y = Glass$Type, 
            plot = "ellipse", 
            ## Add a key at the top
            auto.key = list(columns = 7))

#Plot Iris dataset
featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "ellipse",
            ## Add a key at the top
            auto.key = list(columns = 3))

The caret package provides many functions for visualization.

As you can see, in some cases it is possible to visually identify rules that allow us to discriminate between classes (e.g., if Petal.Length < 3, it is a setosa). But for more complex data we need better algorithms.
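For instance, we can sanity-check that hand-derived rule directly on the data; the cross-tabulation should show that the threshold isolates setosa:

#Cross-tabulate the hand-derived rule against the true class
table(iris$Petal.Length < 3, iris$Species)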

Data Pre-processing

Why?

#Introduce missing values: set a random 5x5 block of cells to NA
GlassNA <- Glass
random_row <- sample(1:nrow(Glass), 5)
random_col <- sample(1:ncol(Glass), 5)
GlassNA[random_row, random_col] <- NA
summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
summary(GlassNA)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.93   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.31   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.42   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.83   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##  NA's   :5       NA's   :5                                      
##        Si              K                Ca               Ba        
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.0000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.0000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.0000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.956   Mean   :0.1792  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.180   3rd Qu.:0.0000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.1500  
##                                   NA's   :5        NA's   :5       
##        Fe            Type   
##  Min.   :0.00000   1   :66  
##  1st Qu.:0.00000   2   :76  
##  Median :0.00000   3   :17  
##  Mean   :0.05701   5   :12  
##  3rd Qu.:0.10000   6   : 9  
##  Max.   :0.51000   7   :29  
##                    NA's: 5
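Before imputing, it helps to see exactly where the missing values are; a quick check with base R:

#Count missing values per column
colSums(is.na(GlassNA))

#Count rows with at least one missing value
sum(!complete.cases(GlassNA))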

Imputing missing values using k-nearest neighbors (KNN), while also centering and scaling the numeric columns:

GlasspreNA <- preProcess(GlassNA, method = c("knnImpute", "center", "scale"))
Glasspre_proNA <- predict(GlasspreNA, Glass)

Glasspre <- preProcess(Glass, method = c("center", "scale"))
Glasspre_pro <- predict(Glasspre, Glass)
summary(Glasspre_proNA)
##        RI                  Na                 Mg         
##  Min.   :-2.359632   Min.   :-3.30826   Min.   :-1.8611  
##  1st Qu.:-0.602831   1st Qu.:-0.63105   1st Qu.:-0.3948  
##  Median :-0.224330   Median :-0.14848   Median : 0.5515  
##  Mean   :-0.000199   Mean   :-0.01588   Mean   : 0.0000  
##  3rd Qu.: 0.258810   3rd Qu.: 0.49700   3rd Qu.: 0.6347  
##  Max.   : 5.089398   Max.   : 4.86782   Max.   : 1.2517  
##        Al                Si                K           
##  Min.   :-2.3132   Min.   :-3.6679   Min.   :-0.76213  
##  1st Qu.:-0.5106   1st Qu.:-0.4789   1st Qu.:-0.57430  
##  Median :-0.1701   Median : 0.1795   Median : 0.08884  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.3707   3rd Qu.: 0.5636   3rd Qu.: 0.17318  
##  Max.   : 4.1162   Max.   : 3.5622   Max.   : 8.75961  
##        Ca                  Ba                  Fe          Type  
##  Min.   :-2.470830   Min.   :-0.356749   Min.   :-0.5851   1:70  
##  1st Qu.:-0.501803   1st Qu.:-0.356749   1st Qu.:-0.5851   2:76  
##  Median :-0.249544   Median :-0.356749   Median :-0.5851   3:17  
##  Mean   : 0.000587   Mean   :-0.008335   Mean   : 0.0000   5:13  
##  3rd Qu.: 0.151619   3rd Qu.:-0.356749   3rd Qu.: 0.4412   6: 9  
##  Max.   : 5.068929   Max.   : 5.913031   Max.   : 4.6490   7:29
summary(Glasspre_pro)
##        RI                Na                Mg                Al         
##  Min.   :-2.3759   Min.   :-3.2793   Min.   :-1.8611   Min.   :-2.3132  
##  1st Qu.:-0.6068   1st Qu.:-0.6127   1st Qu.:-0.3948   1st Qu.:-0.5106  
##  Median :-0.2257   Median :-0.1321   Median : 0.5515   Median :-0.1701  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2608   3rd Qu.: 0.5108   3rd Qu.: 0.6347   3rd Qu.: 0.3707  
##  Max.   : 5.1252   Max.   : 4.8642   Max.   : 1.2517   Max.   : 4.1162  
##        Si                K                  Ca                Ba         
##  Min.   :-3.6679   Min.   :-0.76213   Min.   :-2.4783   Min.   :-0.3521  
##  1st Qu.:-0.4789   1st Qu.:-0.57430   1st Qu.:-0.5038   1st Qu.:-0.3521  
##  Median : 0.1795   Median : 0.08884   Median :-0.2508   Median :-0.3521  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5636   3rd Qu.: 0.17318   3rd Qu.: 0.1515   3rd Qu.:-0.3521  
##  Max.   : 3.5622   Max.   : 8.75961   Max.   : 5.0824   Max.   : 5.9832  
##        Fe          Type  
##  Min.   :-0.5851   1:70  
##  1st Qu.:-0.5851   2:76  
##  Median :-0.5851   3:17  
##  Mean   : 0.0000   5:13  
##  3rd Qu.: 0.4412   6: 9  
##  Max.   : 4.6490   7:29

Evaluating Several Algorithms

“Overfitting vs Generalizability”

Data Splitting

#Split the data set
inTrain <- createDataPartition(y=iris$Species, p=0.75, list=FALSE)

#Create training and testing sets
training.Iris <- iris[inTrain, ]
testing.Iris <- iris[-inTrain, ]

dim(training.Iris)
## [1] 114   5
dim(testing.Iris)
## [1] 36  5
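createDataPartition() samples within each class (stratified sampling), so the class proportions are preserved in both splits; a quick check:

#Class proportions should match in the training and testing sets
prop.table(table(training.Iris$Species))
prop.table(table(testing.Iris$Species))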

How does the training Iris data look? (Always explore the training set only; never do EDA on the testing set, to avoid biasing your modeling choices.)

boxplot(training.Iris[, -5], main="Raw Data")

We need to center and scale the training set.

#Center and scale all variables (except the class) == standardizing
preObj <- preProcess(training.Iris[,-5], method = c("center", "scale"))
preObjData <- predict(preObj, training.Iris[,-5])
boxplot(preObjData, main="Normalized data")

set.seed(123456)

#Linear Discriminant Analysis (LDA)
#the preProcess option allows us to center and scale the data
modelFit <- train(Species~., data=training.Iris, preProcess=c("center", "scale"), method="lda")

modelFit$finalModel
## Call:
## lda(x, grouping = y)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa       -0.9696408   0.8276933   -1.2898741  -1.2379443
## versicolor    0.0351469  -0.7205801    0.2615276   0.1403154
## virginica     0.9344940  -0.1071133    1.0283465   1.0976288
## 
## Coefficients of linear discriminants:
##                     LD1        LD2
## Sepal.Length  0.7688442  0.3794765
## Sepal.Width   0.6544428  0.8651571
## Petal.Length -4.0178459 -2.0063989
## Petal.Width  -1.9855373  2.1727546
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9881 0.0119

Now look at the testing data

#Predict new data with model fitted
predictions <- predict(modelFit, newdata=testing.Iris)

#Shows Confusion Matrix and performance metrics
confusionMatrix(predictions, testing.Iris$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         12         0
##   virginica       0          0        12
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9026, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000
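As a cross-check, the overall accuracy is simply the fraction of correct predictions:

#Should match the Accuracy reported by confusionMatrix()
mean(predictions == testing.Iris$Species)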

We did very well on the testing set, but was the testing set representative of possible “new” data sets?

K-folds, Resampling/Bootstrapping

#K-folds
folds <- createFolds(y=training.Iris$Species, k=10, list=TRUE)
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##     12     11     12     11     12     12     12     11     11     10
folds[[1]]   
##  [1]  6 12 30 38 51 63 69 72 79 83 85 87
#Bootstrapping
Resamples <- createResample(y=training.Iris$Species, times=10, list=TRUE)
sapply(Resamples, length)
## Resample01 Resample02 Resample03 Resample04 Resample05 Resample06 
##        114        114        114        114        114        114 
## Resample07 Resample08 Resample09 Resample10 
##        114        114        114        114
Resamples[[1]][1:20]    
##  [1]  2  4  4  4  6  6  7  7  8  8  9  9 10 11 11 11 13 16 17 17

Cross-Validation

There is always a trade-off between the bias and the variance of our model. The best cross-validation approach will depend on the size and type of the data; see Kohavi (1995).

kfoldcv <- trainControl(method="cv", number=10)
performance_metric <- "Accuracy"
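trainControl() supports several other resampling schemes, and the right choice depends on the data; a sketch of common alternatives (the variable names are just illustrative):

#Leave-one-out cross-validation (expensive, but uses all the data)
loocv <- trainControl(method="LOOCV")

#Repeated k-fold cross-validation (averages over several fold assignments)
repcv <- trainControl(method="repeatedcv", number=10, repeats=5)

#Bootstrap resampling (the default used by train())
boot <- trainControl(method="boot", number=25)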

Iris Dataset

set.seed(1234)

#Linear Discriminant Analysis (LDA)
lda.iris <- train(Species~., data=iris, method="lda", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

#Classification and Regression Trees (CART)
cart.iris <- train(Species~., data=iris, method="rpart", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

#Support Vector Machines (SVM)
svm.iris <- train(Species~., data=iris, method="svmRadial", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

#Random Forest
rf.iris <- train(Species~., data=iris, method="rf", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

Accuracy Summary of Iris Dataset

results.iris <- resamples(list(lda=lda.iris, cart=cart.iris,  svm=svm.iris, rf=rf.iris))
summary(results.iris)
## 
## Call:
## summary.resamples(object = results.iris)
## 
## Models: lda, cart, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## lda  0.9333333 0.9500000 1.0000000 0.9800000 1.0000000    1    0
## cart 0.8000000 0.8833333 0.9333333 0.9200000 0.9333333    1    0
## svm  0.8000000 0.9333333 0.9666667 0.9466667 1.0000000    1    0
## rf   0.8666667 0.9333333 0.9333333 0.9400000 0.9833333    1    0
## 
## Kappa 
##      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda   0.9   0.925   1.00 0.97   1.000    1    0
## cart  0.7   0.825   0.90 0.88   0.900    1    0
## svm   0.7   0.900   0.95 0.92   1.000    1    0
## rf    0.8   0.900   0.90 0.91   0.975    1    0
dotplot(results.iris)
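Because all four models were evaluated on the same folds, caret can also test the pairwise differences between them:

#Paired comparisons of models over the same resamples
summary(diff(results.iris))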

Glass Dataset

set.seed(1234)

#Linear Discriminant Analysis (LDA)
lda.glass <- train(Type~., data=Glass, method="lda", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

#Classification and Regression Trees (CART)
cart.glass <- train(Type~., data=Glass, method="rpart", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

#Support Vector Machines (SVM)
svm.glass <- train(Type~., data=Glass, method="svmRadial", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

#Random Forest
rf.glass <- train(Type~., data=Glass, method="rf", metric=performance_metric, trControl=kfoldcv, preProcess=c("center", "scale"))

Accuracy Summary of Glass Dataset

results.glass <- resamples(list(lda=lda.glass, cart=cart.glass,  svm=svm.glass, rf=rf.glass))
summary(results.glass)
## 
## Call:
## summary.resamples(object = results.glass)
## 
## Models: lda, cart, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda  0.5238095 0.5931818 0.6376812 0.6399172 0.7061688 0.7619048    0
## cart 0.5909091 0.6233766 0.6681922 0.6647245 0.7107143 0.7272727    0
## svm  0.4545455 0.5568182 0.6363636 0.6355206 0.7424812 0.7727273    0
## rf   0.6363636 0.7648810 0.8380435 0.8040806 0.8571429 0.8636364    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda  0.3312102 0.3930499 0.5004657 0.4946380 0.5949753 0.6546053    0
## cart 0.4142012 0.4598739 0.5304960 0.5238349 0.5755901 0.6312849    0
## svm  0.2095808 0.3249807 0.4657971 0.4586234 0.6083838 0.6839080    0
## rf   0.4927954 0.6833067 0.7783288 0.7302891 0.7975868 0.8125000    0
dotplot(results.glass)

Can we do better?

Parameter Tuning

Grid Search: parameter mtry

mtry: Number of variables randomly sampled as candidates at each split of the tree
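For reference, randomForest’s default for classification is mtry = floor(sqrt(p)); with the 9 Glass predictors that gives 3, so the grid below searches values from 1 to 15 around that default:

#Default mtry for classification: floor(sqrt(number of predictors))
floor(sqrt(ncol(Glass) - 1))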

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(mtry=c(1:15))
rf_gridsearch <- train(Type~., data=Glass, method="rf", metric=performance_metric, tuneGrid=tunegrid, trControl=control, preProcess=c("center", "scale"))
print(rf_gridsearch)
## Random Forest 
## 
## 214 samples
##   9 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 192, 192, 194, 192, 192, 193, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.7876150  0.7008625
##    2    0.8092076  0.7345026
##    3    0.7933057  0.7130053
##    4    0.7966526  0.7204068
##    5    0.7775039  0.6938497
##    6    0.7793086  0.6968253
##    7    0.7652384  0.6777026
##    8    0.7589281  0.6691727
##    9    0.7606939  0.6722052
##   10    0.7608978  0.6720817
##   11    0.7556876  0.6650425
##   12    0.7543240  0.6629915
##   13    0.7637314  0.6757493
##   14    0.7606145  0.6716818
##   15    0.7558391  0.6654074
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
plot(rf_gridsearch)
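As an alternative to an exhaustive grid, trainControl() also accepts search="random", which samples candidate mtry values at random, with tuneLength controlling how many are tried. A minimal sketch (results will vary with the seed):

#Random search over mtry instead of a full grid
control_rand <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
rf_randsearch <- train(Type~., data=Glass, method="rf", metric=performance_metric,
                       tuneLength=15, trControl=control_rand, preProcess=c("center", "scale"))
print(rf_randsearch)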