Unsupervised Machine Learning: No label or ground truth data
Pros: “If we are interested in discovering what types of labels best explain the data rather than imposing a pre-determined set of labels on the data, then we must use unsupervised rather than supervised learning.”Libbrecht & Noble, 2015, pag 4
Cons:“No supervisor telling you if you are doing right or not”.Also some algorithms require us to provide some input parameter a priori.
Type of Unsupervised Machine Learning tasks (some):
#RUN ONLY ONCE
URL_FPC<-"https://www.researchgate.net/profile/Alfred_Ultsch/publication/281492504_Fundamental_Clustering_Problems_Suite_FCPS/data/55eafe1608ae21d099c5dd95/FCPS.zip"
download.file(URL_FPC, "./data.zip")
unzip ("data.zip")
We will used some of the data sets provided in the Fundamental Clustering Problems Suite (FCPS), as well as the Iris data set. The FCPS contain several data set designed to asses the performance of unsupervised machine learning algorithms on particular clustering challenges. All these data sets provide ground truth data yo help us evaluate the performance of the difference unsupervised ML algorithms.
setwd("./FCPS/01FCPSdata")
files<-list.files()
Hepta_data<-read.table("Hepta.lrn", skip=4, header = F)[,-1]
Hepta_class<-read.table("Hepta.cls", skip=1, header = F)[,-1]
Link_data<-read.table("Chainlink.lrn", skip=4, header = F)[,-1]
Link_class<-read.table("Chainlink.cls", skip=3, header = F)[,-1]
#As well as the iris dataset
data(iris)
Iris_data<-iris[, -5]
Iris_class<-iris[, 5]
#RGL and KNITR PACKAGE for Plotting
if (!require("rgl")) install.packages("rgl", dependencies=T)
library("rgl") #Load the Package
if (!require("knitr")) install.packages("knitr", dependencies=T)
library("knitr") #Load the Package
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 2, sharedMouse =F)
plot3d(Hepta_data,col=Hepta_class, type = "s",radius=0.2, main="Hepta Dataset")
plot3d(Link_data,col=Link_class, type = "s",radius=0.05, main="Link Dataset")
rglwidget()
As mentioned before, data pre-processing is key in ML, especially in unsupervised algorithms. It is recommended to remove any NAs and scale the data. This will reduce any bias due to the different range of values of our features.
summary(Hepta_data)
## V2 V3 V4
## Min. :-3.970394 Min. :-3.881493 Min. :-3.90929
## 1st Qu.:-0.407496 1st Qu.:-0.476732 1st Qu.:-0.42314
## Median :-0.003762 Median :-0.004597 Median : 0.02132
## Mean : 0.015418 Mean : 0.034183 Mean :-0.03563
## 3rd Qu.: 0.438729 3rd Qu.: 0.500722 3rd Qu.: 0.38436
## Max. : 3.747710 Max. : 3.774495 Max. : 3.89939
Hepta_data<-scale(Hepta_data)
summary(Hepta_data)
## V2 V3 V4
## Min. :-2.41899 Min. :-2.34401 Min. :-2.37820
## 1st Qu.:-0.25667 1st Qu.:-0.30584 1st Qu.:-0.23791
## Median :-0.01164 Median :-0.02321 Median : 0.03496
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.25691 3rd Qu.: 0.27928 3rd Qu.: 0.25785
## Max. : 2.26513 Max. : 2.23903 Max. : 2.41587
Link_data<-scale(Link_data)
Iris_data<-scale(Iris_data)
library("rgl")
library("knitr")
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 2, sharedMouse =F)
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Hepta Dataset")
plot3d((Link_data),col=Link_class, type = "s",radius=0.1, main="Link Dataset")
rglwidget()
Here we see that the Hepta and Link data set do not change significantly, since they already had features with similar range.
#stats PACKAGE
if (!require("stats")) install.packages("stats", dependencies=T)
library("stats") #Load the Package
This is one of the most used clustering algorithms. Why?: (i) can identify multiple cluster layers, and (ii) it can be easily visualized in a tree diagram (dendrogram).
The objective of any agglomerative hierarchical clustering algorithm is to cluster a set of n objects based on an n x n similarity matrix. The procedure for these clustering algorithms was described by Johnson, 1967 , as follows:
Hence, the first step is to calculate the similarity between the object of our data set.
There are several metric to evaluate the similarity between object. The selection will be driven by the type/scale of data (e.g., ratio, interval, ordinal, binary, categorical)see Clifford et al. (2011)
#euclidean
Hepta_d_euclidian <- dist(Hepta_data, method = "euclidean")
Link_d_euclidian<-dist(Link_data, method = "euclidean")
Iris_d_euclidian<-dist(Iris_data, method = "euclidean")
#manhattan
Hepta_d_manhattan<- dist(Hepta_data, method = "manhattan")
heatmap(as.matrix(Hepta_d_euclidian), Rowv = NA, Colv=NA)
heatmap(as.matrix(Hepta_d_manhattan), Rowv = NA, Colv=NA)
As you can see the heatmap is no exactly the same, but we can still seethe same “pattern” of relationship/similarities.
#Plot histogram of two distances
Hepta_d<-cbind(c(as.vector(Hepta_d_euclidian)),c(as.vector(Hepta_d_manhattan)))
colnames(Hepta_d)<-c("euclidean","manhattan")
library(reshape)
library(ggplot2)
Hepta_d2<-melt(Hepta_d)
ggplot(Hepta_d2, aes(value, fill=X2)) + geom_histogram(alpha = 0.5, position = 'identity')
Since the Manhattan will tend to have wider range of values than the Euclidian distance because the way it is calculated
For step 3.(“link clusters”“), there are different Linkage methods we can used. The performance of each one will depend on the underlying structure of the data set (i.e., noise, clear boundaries). see Clifford et al. (2011)
The Complete Linkage takes the maximum value between clusters to merge.
The Single Linkage takes the minimum value between clusters to merge.
# Hierarchical clustering using complete Linkage method
Hepta_complete <- hclust(Hepta_d_euclidian, method = "complete" )
# Hierarchical clustering using single Linkage method
Hepta_single <- hclust(Hepta_d_euclidian, method = "single" )
# Plot the obtained dendrogram
plot(Hepta_complete, cex = 0.6, hang = -1)
#Cut dendrogram
groups_complete <- cutree(Hepta_complete, k=length(unique(Hepta_class)))
# Draw dendogram with red borders around the clusters
rect.hclust(Hepta_complete, k=length(unique(Hepta_class)), border="red")
plot(Hepta_single, cex = 0.6, hang = -1)
#Cut dendrogram
groups_single <- cutree(Hepta_single, k=length(unique(Hepta_class)))
# Draw dendogram with red borders around the clusters
rect.hclust(Hepta_single, k=length(unique(Hepta_class)), border="red")
library(caret)
#Compare Both models
both<-confusionMatrix(groups_complete, groups_single)
both$table
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 32 0 0 0 0 0 0
## 2 0 30 0 0 0 0 0
## 3 0 0 30 0 0 0 0
## 4 0 0 0 30 0 0 0
## 5 0 0 0 0 30 0 0
## 6 0 0 0 0 0 30 0
## 7 0 0 0 0 0 0 30
head(both$overall,2)
## Accuracy Kappa
## 1 1
#Compare with ground truth data
true<-confusionMatrix(groups_complete, Hepta_class)
true$table
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 32 0 0 0 0 0 0
## 2 0 30 0 0 0 0 0
## 3 0 0 30 0 0 0 0
## 4 0 0 0 30 0 0 0
## 5 0 0 0 0 30 0 0
## 6 0 0 0 0 0 30 0
## 7 0 0 0 0 0 0 30
head(true$overall,2)
## Accuracy Kappa
## 1 1
Hepta_final<-groups_complete
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 3, sharedMouse =T)
plot3d((Hepta_data),col=groups_complete, type = "s",radius=0.1, main="Complete")
plot3d((Hepta_data),col=groups_single, type = "s",radius=0.1, main="Single")
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")
rglwidget()
In this case both linkage method performed very good. This data set had very well define cluster structures. What about if we dont have shows structure?
# Hierarchical clustering using complete Linkage method
Link_complete<-hclust(Link_d_euclidian, method = "complete" )
# Hierarchical clustering using single Linkage method
Link_single<-hclust(Link_d_euclidian, method = "single" )
# Plot the obtained dendrogram
plot(Link_complete, cex = 0.6, hang = -1)
#Cut dendrogram
groups_complete <- cutree(Link_complete, k=length(unique(Link_class)))
# Draw dendogram with red borders around the clusters
rect.hclust(Link_complete, k=length(unique(Link_class)), border="red")
plot(Link_single, cex = 0.6, hang = -1)
#Cut dendrogram
groups_single <- cutree(Link_single, k=length(unique(Link_class)))
# Draw dendogram with red borders around the clusters
rect.hclust(Link_single, k=length(unique(Link_class)), border="red")
#Compare Both models
both<-confusionMatrix(groups_complete, groups_single)
both$table
## Reference
## Prediction 1 2
## 1 214 214
## 2 286 286
head(both$overall,2)
## Accuracy Kappa
## 0.5 0.0
#Compare with ground truth data
true<-confusionMatrix(groups_single, Link_class)
true$table
## Reference
## Prediction 1 2
## 1 500 0
## 2 0 500
head(true$overall,2)
## Accuracy Kappa
## 1 1
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 3, sharedMouse =T)
plot3d((Link_data),col=groups_complete, type = "s",radius=0.1, main="Complete")
plot3d((Link_data),col=groups_single, type = "s",radius=0.1, main="Single")
plot3d((Link_data),col=Link_class, type = "s",radius=0.1, main="Ground Truth")
rglwidget()
In this case only the single linkage method performed well. This illustrate the difference between the methods.
# Hierarchical clustering using complete Linkage method
Iris_complete<-hclust(Iris_d_euclidian, method = "complete" )
# Hierarchical clustering using single Linkage method
Iris_single<-hclust(Iris_d_euclidian, method = "single" )
# Plot the obtained dendrogram
plot(Iris_complete, cex = 0.6, hang = -1)
#Cut dendrogram
groups_complete <- cutree(Iris_complete, k=length(unique(Iris_class)))
# Draw dendogram with red borders around the clusters
rect.hclust(Iris_complete, k=length(unique(Iris_class)), border="red")
plot(Iris_single, cex = 0.6, hang = -1)
#Cut dendrogram
groups_single <- cutree(Iris_single, k=length(unique(Iris_class)))
# Draw dendogram with red borders around the clusters
rect.hclust(Iris_single, k=length(unique(Iris_class)), border="red")
There a clear difference between the dendrograms.There are a lot of different packages to help visualized nice dendrograms, for example
if (!require("dendextend")) install.packages("dendextend", dependencies=T)
library("dendextend") #Load the Package
# Create two dendrograms
dend1 <- as.dendrogram (Iris_complete)
dend2 <- as.dendrogram (Iris_single)
tanglegram(dend1, dend2,
highlight_distinct_edges = F, # Turn-off dashed lines
common_subtrees_color_lines = T, # Turn-off line colors
common_subtrees_color_branches = T, # Color common branches
main = "Complete Single")
The colors indicate similar object and how they are cluster between the different methods.
#Compare Both models
both<-confusionMatrix(groups_complete, groups_single)
both$table
## Reference
## Prediction 1 2 3
## 1 49 0 0
## 2 0 1 23
## 3 0 0 77
head(both$overall,2)
## Accuracy Kappa
## 0.8466667 0.7212121
#Compare with ground truth data
true<-confusionMatrix(groups_single, as.numeric(Iris_class))
true$table
## Reference
## Prediction 1 2 3
## 1 49 0 0
## 2 1 0 0
## 3 0 50 50
head(true$overall,2)
## Accuracy Kappa
## 0.66 0.49
Based on the accuracy metric we did not performed very good. Let do some plots.
This package provides tools to visualize the results of clustering algorithms and multivariate analysis. Be aware that PCA with ordinal or categorical data is not meaningful.
if (!require("factoextra")) install.packages("factoextra", dependencies=T)
library("factoextra") #Load the Package
#With Single
complete_HCcluster_iris <- hcut(Iris_data,3,hc_method ="single" , hc_metric="euclidian")
fviz_cluster(complete_HCcluster_iris, data = Iris_data)
#With Complete
complete_HCcluster_iris <- hcut(Iris_data,3,hc_method ="complete" , hc_metric="euclidian")
fviz_cluster(complete_HCcluster_iris, data = Iris_data)
From the plots and results of the Complete Linkage method, it looks like we did not performed that bad. But why we had accuracy of 66%?
This package provides tools to evaluate clustering algorithms. Using clustering external validation indexes is more appropriate (e.g., what is the cluster names/number are not the same between methods or ground truth data)
if (!require("clusteval")) install.packages("clusteval", dependencies=T)
library("clusteval") #Load the Package
cluster_similarity(Iris_class, groups_complete, similarity="rand")
## [1] 0.8021477
cluster_similarity(Iris_class, groups_single, similarity="rand")
## [1] 0.7719016
The Factoextra Package can help.
#factoextra PACKAGE
if (!require("factoextra")) install.packages("factoextra", dependencies=T)
library("factoextra") #Load the Package
#Need to define function
single_HC<-function(x,k){hcut(x,k, hc_method ="single" , hc_metric="euclidian")}
#Hepta
fviz_nbclust(as.matrix(Hepta_data), single_HC, method = "silhouette")
#Link
fviz_nbclust(as.matrix(Link_data), single_HC, method = "silhouette")
For the Hepta and Link data sets, the average Silhouette index did provide he right number of cluster. But this might not be the case for all methods and data sets.
#CAREFUL interpreting results
#Iris
fviz_nbclust(as.matrix(Iris_data), single_HC, method = "silhouette")
complete_HC<-function(x,k){hcut(x,k, hc_method ="complete" , hc_metric="euclidian")}
fviz_nbclust(as.matrix(Iris_data), complete_HC, method = "silhouette")
We use the PVCLUST PACKAGE to perform bootstrapping (look at notes)
#PVCLUST PACKAGE
if (!require("pvclust")) install.packages("pvclust", dependencies=T)
library("pvclust") #Load the Package
#We can use multiple cores and change the no. of bootstraps. The more bootstraps the less prob. of clustering error, recommended for final evaluation: [10,000-100,000], for testing 1,000.
#pvclust reads col as the objects.
fit <- pvclust(as.data.frame(t(iris)), method.hclust="complete",
method.dist="euclidian",nboot = 1000, parallel=T)
## Creating a temporary cluster...done:
## socket cluster with 3 nodes on host 'localhost'
## Multiscale bootstrap... Done.
plot(fit) # dendogram with p values
pvrect(fit, alpha=.95)
#PVCLUST PACKAGE
if (!require("cluster")) install.packages("cluster", dependencies=T)
library("cluster") #Load the Package
Hepta_diana <- diana(Hepta_data)
Hepta_diana_cluster<-cutree(as.hclust(Hepta_diana), k = length(unique(Hepta_class)))
plot(as.dendrogram(Hepta_diana))
rect.hclust(Hepta_diana, k=length(unique(Hepta_class)), border="red")
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 2, sharedMouse =T)
plot3d((Hepta_data),col=Hepta_diana_cluster, type = "s",radius=0.1, main="DIANA")
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")
rglwidget()
Here is a clear example why we need to used clustering validation indexes (e.g., no accuracy)
both<-confusionMatrix(Hepta_diana_cluster, Hepta_class)
both$table
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 32 0 0 29 0 0 0
## 2 0 30 0 0 0 0 0
## 3 0 0 30 0 0 0 0
## 4 0 0 0 1 0 30 0
## 5 0 0 0 0 29 0 0
## 6 0 0 0 0 1 0 0
## 7 0 0 0 0 0 0 30
head(both$overall,2)
## Accuracy Kappa
## 0.7169811 0.6692840
cluster_similarity(Hepta_diana_cluster, Hepta_class, similarity="rand")
## [1] 0.9545739
### NEED TO SET SEED FOR RANDOM STARTS
set.seed(1234)
Hepta_Kmeans <- kmeans(Hepta_data, centers=length(unique(Hepta_class)), nstart=100)
cluster_similarity(Hepta_Kmeans$cluster, Hepta_class, similarity="rand")
## [1] 1
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 2, sharedMouse =T)
plot3d((Hepta_data),col=Hepta_Kmeans$cluster, type = "s",radius=0.1, main="K-means")
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")
rglwidget()
### NEED TO SET SEED FOR RANDOM STARTS
set.seed(1234)
Link_Kmeans <- kmeans(Link_data, centers=length(unique(Link_class)), nstart=100)
cluster_similarity(Link_Kmeans$cluster, Link_class, similarity="rand")
## [1] 0.4996276
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 2, sharedMouse =T)
plot3d((Link_data),col=Link_Kmeans$cluster, type = "s",radius=0.1, main="K-means")
plot3d((Link_data),col=Link_class, type = "s",radius=0.1, main="Ground Truth")
rglwidget()
No need to provide number of clusters
if (!require("mclust")) install.packages("mclust", dependencies=T)
library("mclust") #Load the Package
Hepta_EM<-mclustBIC(Hepta_data)
plot(Hepta_EM)
model <-Mclust(Hepta_data, x = Hepta_EM)
cluster_similarity(model$classification, Hepta_class, similarity="rand")
## [1] 1
Link_EM<-mclustBIC(Link_data)
plot(Link_EM)
model <-Mclust(Link_data, x = Link_EM)
cluster_similarity(model$classification, Link_class, similarity="rand")
## [1] 0.6146406
knit_hooks$set(webgl = hook_webgl)
mfrow3d(nr = 1, nc = 2, sharedMouse =T)
plot3d((Link_data),col=model$classification, type = "s",radius=0.1, main="EM")
plot3d((Link_data),col=Link_class, type = "s",radius=0.1, main="Ground Truth")
rglwidget()
No need to provide cluster numbers, but other parameters are required
if (!require("fpc")) install.packages("fpc", dependencies=T)
library("fpc") #Load the Package
set.seed(123456)
Hepta_dbs_15_5 <- dbscan(Hepta_data, eps = 0.15, MinPts = 5)
Hepta_dbs_5_10 <- dbscan(Hepta_data, eps = 0.5, MinPts = 5)
mfrow3d(nr = 1, nc = 3, sharedMouse =T)
plot3d((Hepta_data),col=(Hepta_dbs_15_5$cluster)+1, type = "s",radius=0.1, main="DNSCAN eps 0.15")
plot3d((Hepta_data),col=(Hepta_dbs_5_10$cluster)+1, type = "s",radius=0.1, main="DNSCAN eps 0.5")
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")
rglwidget()
You can play more with this values in this Website
More Tutorial about:
*Mclust