One of many definitions: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E.” (Mitchell, T., 1997)
Example Tasks T: Classification, Regression, Clustering…
Example Performance P: Accuracy, F1 score, RMSE…
Example Experience E: Unsupervised (no label), Supervised (label data)…
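As a tiny illustration of a performance measure P, accuracy for a classification task can be computed directly in base R (the labels here are made up for the example):

```r
# Task T: classification; Experience E: labeled examples; Performance P: accuracy
truth <- c("cat", "cat", "dog", "dog")
pred  <- c("cat", "dog", "dog", "dog")
accuracy <- mean(pred == truth)
accuracy  # 0.75
```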
#Install the caret package and its dependencies if not already installed
if (!require("caret")) install.packages("caret", dependencies = TRUE)
#Load the Package
library("caret")
The mlbench package contains a collection of artificial and real-world machine learning benchmark problems.
#Install the mlbench package and its dependencies if not already installed
if (!require("mlbench")) install.packages("mlbench", dependencies = TRUE)
#Load the Package
library("mlbench")
# Glass Identification Database
data(Glass)
dim(Glass)
## [1] 214 10
levels(Glass$Type)
## [1] "1" "2" "3" "5" "6" "7"
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
# Iris Database
data(iris)
dim(iris)
## [1] 150 5
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
#Plot Glass dataset
featurePlot(x = Glass[, 1:9],
y = Glass$Type,
plot = "ellipse",
## Add a key at the top
auto.key = list(columns = 7))
#Plot Iris dataset
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "ellipse",
## Add a key at the top
auto.key = list(columns = 3))
The caret package provides many functions for visualization.
As you can see, in some cases it is possible to visually identify rules that allow us to discriminate between classes (e.g., if Petal.Length < 3, it is a setosa). But for more complex data we need better algorithms.
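The simple rule mentioned above can be checked directly against the built-in iris data:

```r
# Check the visual rule: every setosa has Petal.Length < 3,
# and no other species does
setosa_below <- all(iris$Petal.Length[iris$Species == "setosa"] < 3)
others_above <- all(iris$Petal.Length[iris$Species != "setosa"] >= 3)
c(setosa_below, others_above)  # TRUE TRUE
```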
Why?
#Add NAs to randomly chosen rows and columns of the dataset
GlassNA<-Glass
random_row<-sample(c(1:nrow(Glass)), 5)
random_col<-sample(c(1:ncol(Glass)), 5)
GlassNA[random_row,random_col ]<-NA
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
summary(GlassNA)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.93 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.31 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.42 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.83 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## NA's :5 NA's :5
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.0000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.0000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.0000
## Mean :72.65 Mean :0.4971 Mean : 8.956 Mean :0.1792
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.180 3rd Qu.:0.0000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.1500
## NA's :5 NA's :5
## Fe Type
## Min. :0.00000 1 :66
## 1st Qu.:0.00000 2 :76
## Median :0.00000 3 :17
## Mean :0.05701 5 :12
## 3rd Qu.:0.10000 6 : 9
## Max. :0.51000 7 :29
## NA's: 5
#Learn kNN imputation, centering, and scaling parameters from the data with NAs
GlasspreNA <- preProcess(GlassNA, method = c("knnImpute", "center", "scale"))
Glasspre_proNA <- predict(GlasspreNA, Glass)
#Learn centering and scaling parameters from the complete data
Glasspre <- preProcess(Glass, method = c("center", "scale"))
Glasspre_pro <- predict(Glasspre, Glass)
summary(Glasspre_proNA)
## RI Na Mg
## Min. :-2.359632 Min. :-3.30826 Min. :-1.8611
## 1st Qu.:-0.602831 1st Qu.:-0.63105 1st Qu.:-0.3948
## Median :-0.224330 Median :-0.14848 Median : 0.5515
## Mean :-0.000199 Mean :-0.01588 Mean : 0.0000
## 3rd Qu.: 0.258810 3rd Qu.: 0.49700 3rd Qu.: 0.6347
## Max. : 5.089398 Max. : 4.86782 Max. : 1.2517
## Al Si K
## Min. :-2.3132 Min. :-3.6679 Min. :-0.76213
## 1st Qu.:-0.5106 1st Qu.:-0.4789 1st Qu.:-0.57430
## Median :-0.1701 Median : 0.1795 Median : 0.08884
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3707 3rd Qu.: 0.5636 3rd Qu.: 0.17318
## Max. : 4.1162 Max. : 3.5622 Max. : 8.75961
## Ca Ba Fe Type
## Min. :-2.470830 Min. :-0.356749 Min. :-0.5851 1:70
## 1st Qu.:-0.501803 1st Qu.:-0.356749 1st Qu.:-0.5851 2:76
## Median :-0.249544 Median :-0.356749 Median :-0.5851 3:17
## Mean : 0.000587 Mean :-0.008335 Mean : 0.0000 5:13
## 3rd Qu.: 0.151619 3rd Qu.:-0.356749 3rd Qu.: 0.4412 6: 9
## Max. : 5.068929 Max. : 5.913031 Max. : 4.6490 7:29
summary(Glasspre_pro)
## RI Na Mg Al
## Min. :-2.3759 Min. :-3.2793 Min. :-1.8611 Min. :-2.3132
## 1st Qu.:-0.6068 1st Qu.:-0.6127 1st Qu.:-0.3948 1st Qu.:-0.5106
## Median :-0.2257 Median :-0.1321 Median : 0.5515 Median :-0.1701
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2608 3rd Qu.: 0.5108 3rd Qu.: 0.6347 3rd Qu.: 0.3707
## Max. : 5.1252 Max. : 4.8642 Max. : 1.2517 Max. : 4.1162
## Si K Ca Ba
## Min. :-3.6679 Min. :-0.76213 Min. :-2.4783 Min. :-0.3521
## 1st Qu.:-0.4789 1st Qu.:-0.57430 1st Qu.:-0.5038 1st Qu.:-0.3521
## Median : 0.1795 Median : 0.08884 Median :-0.2508 Median :-0.3521
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5636 3rd Qu.: 0.17318 3rd Qu.: 0.1515 3rd Qu.:-0.3521
## Max. : 3.5622 Max. : 8.75961 Max. : 5.0824 Max. : 5.9832
## Fe Type
## Min. :-0.5851 1:70
## 1st Qu.:-0.5851 2:76
## Median :-0.5851 3:17
## Mean : 0.0000 5:13
## 3rd Qu.: 0.4412 6: 9
## Max. : 4.6490 7:29
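caret's knnImpute fills each NA by averaging the k nearest complete rows. A much simpler (and cruder) baseline is mean imputation, sketched here in base R on a toy vector:

```r
# Crude baseline: replace each NA with the mean of the observed values.
# knnImpute is smarter: it uses the k nearest complete rows instead.
x <- c(1, NA, 3, 5)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x  # 1 3 3 5
```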
“Overfitting vs Generalizability”
#Split data set
inTrain<-createDataPartition(y=iris$Species, p=0.75, list=FALSE)
#Create Training and Testing Sets
training.Iris<-iris[inTrain,]
testing.Iris<-iris[-inTrain,]
dim(training.Iris)
## [1] 114 5
dim(testing.Iris)
## [1] 36 5
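createDataPartition samples within each class so the species proportions are preserved. A rough base-R sketch of the same stratified 75/25 split (variable names here are illustrative):

```r
# Stratified 75/25 split: sample 75% of the row indices within each species
set.seed(1)
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(rows) sample(rows, round(0.75 * length(rows)))))
train <- iris[idx, ]
test  <- iris[-idx, ]
table(train$Species)  # 38 of each species, 114 rows in total
```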
How does the training Iris data look? (Always explore the training set only; never do EDA on the testing set, to avoid bias.)
boxplot(training.Iris[, -5], main="Raw Data")
Need to Center and Scale the Training set
#Center and scale all variables (not the class) == standardizing
preObj<-preProcess(training.Iris[,-5], method = c("center", "scale"))
preObjData<-predict(preObj,training.Iris[,-5])
boxplot(preObjData, main="Normalized data" )
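Per column, preProcess with method = c("center", "scale") computes the familiar z-score; the equivalent base-R arithmetic on a toy vector:

```r
# Standardizing by hand: z = (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)
round(mean(z), 12)  # 0
round(sd(z), 12)    # 1
```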
set.seed(123456)
#Linear Discriminant Analysis (LDA)
#The preProcess option allows us to center and scale the data
modelFit<-train(Species~., data=training.Iris, preProcess=c("center", "scale"), method="lda")
modelFit$finalModel
## Call:
## lda(x, grouping = y)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa -0.9696408 0.8276933 -1.2898741 -1.2379443
## versicolor 0.0351469 -0.7205801 0.2615276 0.1403154
## virginica 0.9344940 -0.1071133 1.0283465 1.0976288
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.7688442 0.3794765
## Sepal.Width 0.6544428 0.8651571
## Petal.Length -4.0178459 -2.0063989
## Petal.Width -1.9855373 2.1727546
##
## Proportion of trace:
## LD1 LD2
## 0.9881 0.0119
Now look at the testing data
#Predict new data with model fitted
predictions<-predict(modelFit, newdata=testing.Iris)
#Shows Confusion Matrix and performance metrics
confusionMatrix(predictions, testing.Iris$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 12 0
## virginica 0 0 12
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9026, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000 1.0000
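The overall accuracy reported above can be recomputed by hand: correct predictions sit on the diagonal of the confusion matrix.

```r
# Accuracy = sum of the diagonal / total count
cm <- matrix(c(12,  0,  0,
                0, 12,  0,
                0,  0, 12), nrow = 3, byrow = TRUE)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 1
```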
We did very well on the testing set, but was the testing set representative of possible new data?
#K-folds
folds<-createFolds(y=training.Iris$Species, k=10, list=T)
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 12 11 12 11 12 12 12 11 11 10
folds[[1]]
## [1] 6 12 30 38 51 63 69 72 79 83 85 87
#Bootstrapping
Resamples<-createResample(y=training.Iris$Species, times=10, list=T)
sapply(Resamples, length)
## Resample01 Resample02 Resample03 Resample04 Resample05 Resample06
## 114 114 114 114 114 114
## Resample07 Resample08 Resample09 Resample10
## 114 114 114 114
Resamples[[1]][1:20]
## [1] 2 4 4 4 6 6 7 7 8 8 9 9 10 11 11 11 13 16 17 17
There is always a trade-off between the bias and variance of our model. The best CV approach depends on the data size and type. See Kohavi, 1995.
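The essence of k-fold CV is just assigning each training row to one of k roughly equal folds; a base-R sketch for the 114 training rows used above:

```r
# Assign 114 rows to 10 folds of nearly equal size
set.seed(1)
fold_id <- sample(rep(1:10, length.out = 114))
table(fold_id)  # four folds of 12 rows, six folds of 11 rows
```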
kfoldcv <- trainControl(method="cv", number=10)
performance_metric <- "Accuracy"
set.seed(1234)
#Linear Discriminant Analysis (LDA)
lda.iris <- train(Species~., data=iris, method="lda", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
#Classification and Regression Trees (CART)
cart.iris <- train(Species~., data=iris, method="rpart", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
#Support Vector Machines (SVM)
svm.iris <- train(Species~., data=iris, method="svmRadial", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
# Random Forest
rf.iris <- train(Species~., data=iris, method="rf", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
Accuracy Summary of Iris Dataset
results.iris <- resamples(list(lda=lda.iris, cart=cart.iris, svm=svm.iris, rf=rf.iris))
summary(results.iris)
##
## Call:
## summary.resamples(object = results.iris)
##
## Models: lda, cart, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.9333333 0.9500000 1.0000000 0.9800000 1.0000000 1 0
## cart 0.8000000 0.8833333 0.9333333 0.9200000 0.9333333 1 0
## svm 0.8000000 0.9333333 0.9666667 0.9466667 1.0000000 1 0
## rf 0.8666667 0.9333333 0.9333333 0.9400000 0.9833333 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.9 0.925 1.00 0.97 1.000 1 0
## cart 0.7 0.825 0.90 0.88 0.900 1 0
## svm 0.7 0.900 0.95 0.92 1.000 1 0
## rf 0.8 0.900 0.90 0.91 0.975 1 0
dotplot(results.iris)
set.seed(1234)
#Linear Discriminant Analysis (LDA)
lda.glass <- train(Type~., data=Glass, method="lda", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
#Classification and Regression Trees (CART)
cart.glass <- train(Type~., data=Glass, method="rpart", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
#Support Vector Machines (SVM)
svm.glass <- train(Type~., data=Glass, method="svmRadial", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
# Random Forest
rf.glass <- train(Type~., data=Glass, method="rf", metric=performance_metric, trControl=kfoldcv,preProcess=c("center", "scale"))
Accuracy Summary of Glass Dataset
results.glass <- resamples(list(lda=lda.glass, cart=cart.glass, svm=svm.glass, rf=rf.glass))
summary(results.glass)
##
## Call:
## summary.resamples(object = results.glass)
##
## Models: lda, cart, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.5238095 0.5931818 0.6376812 0.6399172 0.7061688 0.7619048 0
## cart 0.5909091 0.6233766 0.6681922 0.6647245 0.7107143 0.7272727 0
## svm 0.4545455 0.5568182 0.6363636 0.6355206 0.7424812 0.7727273 0
## rf 0.6363636 0.7648810 0.8380435 0.8040806 0.8571429 0.8636364 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.3312102 0.3930499 0.5004657 0.4946380 0.5949753 0.6546053 0
## cart 0.4142012 0.4598739 0.5304960 0.5238349 0.5755901 0.6312849 0
## svm 0.2095808 0.3249807 0.4657971 0.4586234 0.6083838 0.6839080 0
## rf 0.4927954 0.6833067 0.7783288 0.7302891 0.7975868 0.8125000 0
dotplot(results.glass)
Grid Search: parameter mtry
mtry: Number of variables randomly sampled as candidates at each split of the tree
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(mtry=c(1:15))
rf_gridsearch <- train(Type~., data=Glass, method="rf", metric=performance_metric, tuneGrid=tunegrid, trControl=control,preProcess=c("center", "scale"))
print(rf_gridsearch)
## Random Forest
##
## 214 samples
## 9 predictor
## 6 classes: '1', '2', '3', '5', '6', '7'
##
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 192, 192, 194, 192, 192, 193, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7876150 0.7008625
## 2 0.8092076 0.7345026
## 3 0.7933057 0.7130053
## 4 0.7966526 0.7204068
## 5 0.7775039 0.6938497
## 6 0.7793086 0.6968253
## 7 0.7652384 0.6777026
## 8 0.7589281 0.6691727
## 9 0.7606939 0.6722052
## 10 0.7608978 0.6720817
## 11 0.7556876 0.6650425
## 12 0.7543240 0.6629915
## 13 0.7637314 0.6757493
## 14 0.7606145 0.6716818
## 15 0.7558391 0.6654074
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(rf_gridsearch)
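For context, the randomForest package's conventional default for classification is mtry = floor(sqrt(p)), where p is the number of predictors. With Glass's 9 predictors that default is 3; on these resamples the grid search selected a nearby value, mtry = 2.

```r
# Default mtry for classification: floor of the square root of
# the number of predictors (Glass has p = 9)
p <- 9
default_mtry <- floor(sqrt(p))
default_mtry  # 3
```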