Unsupervised Machine Learning

Unsupervised Machine Learning: No label or ground truth data

Pros: "If we are interested in discovering what types of labels best explain the data rather than imposing a pre-determined set of labels on the data, then we must use unsupervised rather than supervised learning." (Libbrecht & Noble, 2015, p. 4)

Cons: there is "no supervisor telling you if you are doing it right or not". Also, some algorithms require us to provide input parameters a priori.

Types of unsupervised machine learning tasks (some):

  1. Clustering
  2. Outlier/Anomaly detection
  3. Latent variables / dimensionality reduction …
#RUN ONLY ONCE
URL_FPC<-"https://www.researchgate.net/profile/Alfred_Ultsch/publication/281492504_Fundamental_Clustering_Problems_Suite_FCPS/data/55eafe1608ae21d099c5dd95/FCPS.zip"

download.file(URL_FPC, "./data.zip")
unzip ("data.zip")

We will use some of the data sets provided in the Fundamental Clustering Problems Suite (FCPS), as well as the Iris data set. The FCPS contains several data sets designed to assess the performance of unsupervised machine learning algorithms on particular clustering challenges. All of these data sets provide ground truth labels to help us evaluate the performance of the different unsupervised ML algorithms.

setwd("./FCPS/01FCPSdata")
files<-list.files()

Hepta_data<-read.table("Hepta.lrn", skip=4, header = F)[,-1]
Hepta_class<-read.table("Hepta.cls", skip=1, header = F)[,-1]

Link_data<-read.table("Chainlink.lrn", skip=4, header = F)[,-1]
Link_class<-read.table("Chainlink.cls", skip=3, header = F)[,-1]

#As well as the iris dataset
data(iris)
Iris_data<-iris[, -5]
Iris_class<-iris[, 5]
#RGL and KNITR PACKAGE for Plotting
if (!require("rgl")) install.packages("rgl", dependencies=T)
library("rgl")  #Load the Package

if (!require("knitr")) install.packages("knitr", dependencies=T)
library("knitr")  #Load the Package


knit_hooks$set(webgl = hook_webgl)

mfrow3d(nr = 1, nc = 2, sharedMouse =F) 
plot3d(Hepta_data,col=Hepta_class, type = "s",radius=0.2, main="Hepta Dataset")  
plot3d(Link_data,col=Link_class, type = "s",radius=0.05, main="Link Dataset")  
rglwidget()

Data pre-processing (scale and center)

As mentioned before, data pre-processing is key in ML, especially for unsupervised algorithms. It is recommended to remove any NAs and to scale the data. This reduces bias due to the different ranges of values of our features.
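
A minimal sketch of those two steps (the FCPS data sets used here have no missing values, so the na.omit() call is illustrative only; the Hepta_clean name is just for this sketch):

#Drop rows with missing values, then centre and scale every feature
Hepta_clean <- scale(na.omit(Hepta_data), center = TRUE, scale = TRUE)
round(colMeans(Hepta_clean), 3)      #means are ~0 after centring
round(apply(Hepta_clean, 2, sd), 3)  #standard deviations are 1 after scaling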

summary(Hepta_data)
##        V2                  V3                  V4          
##  Min.   :-3.970394   Min.   :-3.881493   Min.   :-3.90929  
##  1st Qu.:-0.407496   1st Qu.:-0.476732   1st Qu.:-0.42314  
##  Median :-0.003762   Median :-0.004597   Median : 0.02132  
##  Mean   : 0.015418   Mean   : 0.034183   Mean   :-0.03563  
##  3rd Qu.: 0.438729   3rd Qu.: 0.500722   3rd Qu.: 0.38436  
##  Max.   : 3.747710   Max.   : 3.774495   Max.   : 3.89939
Hepta_data<-scale(Hepta_data)
summary(Hepta_data)
##        V2                 V3                 V4          
##  Min.   :-2.41899   Min.   :-2.34401   Min.   :-2.37820  
##  1st Qu.:-0.25667   1st Qu.:-0.30584   1st Qu.:-0.23791  
##  Median :-0.01164   Median :-0.02321   Median : 0.03496  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.25691   3rd Qu.: 0.27928   3rd Qu.: 0.25785  
##  Max.   : 2.26513   Max.   : 2.23903   Max.   : 2.41587
Link_data<-scale(Link_data)
Iris_data<-scale(Iris_data)


library("rgl")
library("knitr")  

knit_hooks$set(webgl = hook_webgl)

mfrow3d(nr = 1, nc = 2, sharedMouse =F)  
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Hepta Dataset")  
plot3d((Link_data),col=Link_class, type = "s",radius=0.1, main="Link Dataset")  
rglwidget()

Here we see that the Hepta and Link data sets do not change significantly, since their features already had similar ranges.

The hclust function from the stats package

#stats PACKAGE
if (!require("stats")) install.packages("stats", dependencies=T)
library("stats") #Load the Package

Hierarchical clustering: Agglomerative

This is one of the most widely used clustering algorithms. Why? (i) It can identify multiple cluster layers, and (ii) its results can be easily visualized as a tree diagram (dendrogram).

The objective of any agglomerative hierarchical clustering algorithm is to cluster a set of n objects based on an n x n similarity matrix. The procedure for these clustering algorithms was described by Johnson (1967) as follows:

  1. Assign each of the n objects to its own cluster, each containing just one object; hence a total of n clusters are generated.
  2. Set the similarity values between the clusters to be the same as the similarities between the objects they contain.
  3. Find the most similar pair of clusters and merge them into a single new cluster; there is now one cluster fewer.
  4. Calculate the similarity between the new cluster and each of the old clusters.
  5. Repeat steps 3 and 4 until all items are clustered into a single cluster that contains all n objects.

Hence, the first step is to calculate the similarity between the objects of our data set.
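
To make the procedure concrete, here is a tiny, purely illustrative single-linkage implementation of the steps above on a made-up data set (hclust does this far more efficiently; all the variable names here are hypothetical):

#Steps 1-2: start with one cluster per object and the pairwise distance matrix
set.seed(1)
toy <- matrix(rnorm(10), ncol = 2)        #5 objects, 2 features
dtoy <- as.matrix(dist(toy))
clusters <- as.list(seq_len(nrow(toy)))   #each object is its own cluster

while (length(clusters) > 1) {
  #Step 3: find the most similar (closest) pair of clusters
  best <- c(Inf, NA, NA)
  for (i in seq_along(clusters)) for (j in seq_len(i - 1)) {
    dij <- min(dtoy[clusters[[i]], clusters[[j]]])  #single-linkage distance
    if (dij < best[1]) best <- c(dij, i, j)
  }
  #Steps 4-5: merge that pair, then repeat until one cluster remains
  merged <- c(clusters[[best[2]]], clusters[[best[3]]])
  clusters <- c(clusters[-c(best[2], best[3])], list(merged))
  cat("merged at height", round(best[1], 3), "\n")
}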

Dissimilarity/Similarity matrix

There are several metrics to evaluate the similarity between objects. The choice is driven by the type/scale of the data (e.g., ratio, interval, ordinal, binary, categorical); see Clifford et al. (2011).

#euclidean
Hepta_d_euclidian <- dist(Hepta_data, method = "euclidean")
Link_d_euclidian<-dist(Link_data, method = "euclidean")
Iris_d_euclidian<-dist(Iris_data, method = "euclidean")

#manhattan
Hepta_d_manhattan<- dist(Hepta_data, method = "manhattan")

heatmap(as.matrix(Hepta_d_euclidian), Rowv = NA, Colv=NA)

heatmap(as.matrix(Hepta_d_manhattan), Rowv = NA, Colv=NA)

As you can see, the two heatmaps are not exactly the same, but we can still see the same "pattern" of relationships/similarities.

#Plot histogram of two distances
Hepta_d<-cbind(c(as.vector(Hepta_d_euclidian)),c(as.vector(Hepta_d_manhattan)))
colnames(Hepta_d)<-c("euclidean","manhattan")

library(reshape)
library(ggplot2)
Hepta_d2<-melt(Hepta_d)

ggplot(Hepta_d2, aes(value, fill=X2)) + geom_histogram(alpha = 0.5, position = 'identity')

The Manhattan distance tends to have a wider range of values than the Euclidean distance because of the way it is calculated: it sums absolute coordinate differences instead of taking the square root of the summed squared differences.
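
A quick toy check of that claim, for a single made-up pair of points:

#For the same pair of points, |dx| + |dy| >= sqrt(dx^2 + dy^2)
p <- c(0, 0); q <- c(3, 4)
sum(abs(p - q))       #Manhattan distance: 7
sqrt(sum((p - q)^2))  #Euclidean distance: 5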

Hierarchical clustering using the Complete vs. Single Linkage methods

For step 3 ("link clusters"), there are different linkage methods we can use. The performance of each one depends on the underlying structure of the data set (i.e., noise, clear boundaries); see Clifford et al. (2011).

The Complete Linkage method uses the maximum distance between the members of two clusters as the merging criterion (farthest neighbour).

The Single Linkage method uses the minimum distance between the members of two clusters as the merging criterion (nearest neighbour).
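
A toy comparison of the two criteria, for two made-up clusters A = {a1, a2} and B = {b1}:

#Pairwise distances between members of cluster A (rows 1-2) and cluster B (row 3)
A <- rbind(c(0, 0), c(0, 1))
B <- rbind(c(3, 0))
d_AB <- as.matrix(dist(rbind(A, B)))[1:2, 3]
max(d_AB)  #complete linkage would merge A and B at this (farthest-pair) distance
min(d_AB)  #single linkage would merge A and B at this (closest-pair) distance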

Hepta Dataset

# Hierarchical clustering using complete Linkage method
Hepta_complete <- hclust(Hepta_d_euclidian, method = "complete" )

# Hierarchical clustering using single Linkage method
Hepta_single <- hclust(Hepta_d_euclidian, method = "single" )


# Plot the obtained dendrogram
plot(Hepta_complete, cex = 0.6, hang = -1)
#Cut dendrogram
groups_complete <- cutree(Hepta_complete, k=length(unique(Hepta_class))) 
# Draw dendrogram with red borders around the clusters 
rect.hclust(Hepta_complete, k=length(unique(Hepta_class)), border="red")

plot(Hepta_single, cex = 0.6, hang = -1)
#Cut dendrogram
groups_single <- cutree(Hepta_single, k=length(unique(Hepta_class))) 
# Draw dendrogram with red borders around the clusters 
rect.hclust(Hepta_single, k=length(unique(Hepta_class)), border="red")

How well did we do?

library(caret)
#Compare Both models
both<-confusionMatrix(groups_complete, groups_single)
both$table
##           Reference
## Prediction  1  2  3  4  5  6  7
##          1 32  0  0  0  0  0  0
##          2  0 30  0  0  0  0  0
##          3  0  0 30  0  0  0  0
##          4  0  0  0 30  0  0  0
##          5  0  0  0  0 30  0  0
##          6  0  0  0  0  0 30  0
##          7  0  0  0  0  0  0 30
head(both$overall,2)
## Accuracy    Kappa 
##        1        1
#Compare with  ground truth data
true<-confusionMatrix(groups_complete, Hepta_class)
true$table
##           Reference
## Prediction  1  2  3  4  5  6  7
##          1 32  0  0  0  0  0  0
##          2  0 30  0  0  0  0  0
##          3  0  0 30  0  0  0  0
##          4  0  0  0 30  0  0  0
##          5  0  0  0  0 30  0  0
##          6  0  0  0  0  0 30  0
##          7  0  0  0  0  0  0 30
head(true$overall,2)
## Accuracy    Kappa 
##        1        1
Hepta_final<-groups_complete
knit_hooks$set(webgl = hook_webgl)

mfrow3d(nr = 1, nc = 3, sharedMouse =T)  
plot3d((Hepta_data),col=groups_complete, type = "s",radius=0.1, main="Complete")  
plot3d((Hepta_data),col=groups_single, type = "s",radius=0.1, main="Single")  
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")  
rglwidget()

In this case both linkage methods performed very well; this data set has very well-defined cluster structures. What happens if the data set does not have such a clear structure?
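
Link Dataset

The code chunk that clusters the Link (Chainlink) data set is not shown here; the confusion matrices below presumably come from a chunk analogous to the Hepta one, roughly along these lines:

# Hierarchical clustering of the Link data set (reconstructed sketch, mirroring the Hepta workflow above)
Link_complete <- hclust(Link_d_euclidian, method = "complete" )
Link_single <- hclust(Link_d_euclidian, method = "single" )

#Cut dendrograms at the number of ground-truth classes
groups_complete <- cutree(Link_complete, k=length(unique(Link_class))) 
groups_single <- cutree(Link_single, k=length(unique(Link_class))) 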

How well did we do?

#Compare Both models
both<-confusionMatrix(groups_complete, groups_single)
both$table
##           Reference
## Prediction   1   2
##          1 214 214
##          2 286 286
head(both$overall,2)
## Accuracy    Kappa 
##      0.5      0.0
#Compare with  ground truth data
true<-confusionMatrix(groups_single, Link_class)
true$table
##           Reference
## Prediction   1   2
##          1 500   0
##          2   0 500
head(true$overall,2)
## Accuracy    Kappa 
##        1        1
knit_hooks$set(webgl = hook_webgl)

mfrow3d(nr = 1, nc = 3, sharedMouse =T)  
plot3d((Link_data),col=groups_complete, type = "s",radius=0.1, main="Complete")  
plot3d((Link_data),col=groups_single, type = "s",radius=0.1, main="Single")  
plot3d((Link_data),col=Link_class, type = "s",radius=0.1, main="Ground Truth")  
rglwidget()

In this case only the single linkage method performed well. This illustrates the difference between the two methods.

Iris dataset

# Hierarchical clustering using complete Linkage method
Iris_complete<-hclust(Iris_d_euclidian, method = "complete" )

# Hierarchical clustering using single Linkage method
Iris_single<-hclust(Iris_d_euclidian, method = "single" )

# Plot the obtained dendrogram
plot(Iris_complete, cex = 0.6, hang = -1)
#Cut dendrogram
groups_complete <- cutree(Iris_complete, k=length(unique(Iris_class))) 
# Draw dendrogram with red borders around the clusters 
rect.hclust(Iris_complete, k=length(unique(Iris_class)), border="red")

plot(Iris_single, cex = 0.6, hang = -1)
#Cut dendrogram
groups_single <- cutree(Iris_single, k=length(unique(Iris_class))) 
# Draw dendrogram with red borders around the clusters 
rect.hclust(Iris_single, k=length(unique(Iris_class)), border="red")

There is a clear difference between the two dendrograms. There are many different packages to help visualize dendrograms nicely, for example:

The dendextend package

if (!require("dendextend")) install.packages("dendextend", dependencies=T)
library("dendextend") #Load the Package

# Create two dendrograms
dend1 <- as.dendrogram (Iris_complete)
dend2 <- as.dendrogram (Iris_single)

tanglegram(dend1, dend2,
  highlight_distinct_edges = F, # Turn-off dashed lines
  common_subtrees_color_lines = T, # Color the lines of common subtrees
  common_subtrees_color_branches = T, # Color common branches 
  main = "Complete       Single")

The colors indicate which objects are similar and how they are clustered by the two different methods.
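
dendextend can also quantify how well the two trees align; a short sketch using its entanglement measure (0 means perfectly aligned leaves, 1 means fully entangled):

#Lower entanglement = fewer crossing lines in the tanglegram above
entanglement(dend1, dend2)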

How well did we do?

#Compare Both models
both<-confusionMatrix(groups_complete, groups_single)
both$table
##           Reference
## Prediction  1  2  3
##          1 49  0  0
##          2  0  1 23
##          3  0  0 77
head(both$overall,2)
##  Accuracy     Kappa 
## 0.8466667 0.7212121
#Compare with  ground truth data
true<-confusionMatrix(groups_single, as.numeric(Iris_class))
true$table
##           Reference
## Prediction  1  2  3
##          1 49  0  0
##          2  1  0  0
##          3  0 50 50
head(true$overall,2)
## Accuracy    Kappa 
##     0.66     0.49

Based on the accuracy metric we did not perform very well. Let's do some plots.

To plot data with more than three dimensions (here, the four iris features) we can use PCA and the factoextra package.

This package provides tools to visualize the results of clustering algorithms and multivariate analyses. Be aware that PCA on ordinal or categorical data is not meaningful.
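
fviz_cluster (used below) reduces the data to the first two principal components before plotting; here is a quick sketch of that projection done by hand, to see how much variance those two components actually capture:

#PCA on the (already scaled) iris features; fviz_cluster plots the scores of the first two components
iris_pca <- prcomp(Iris_data)
summary(iris_pca)        #proportion of variance explained by each component
head(iris_pca$x[, 1:2])  #the 2-D coordinates behind the cluster plot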

if (!require("factoextra")) install.packages("factoextra", dependencies=T)
library("factoextra") #Load the Package

#With Single Linkage
single_HCcluster_iris <- hcut(Iris_data, 3, hc_method ="single", hc_metric="euclidean")

fviz_cluster(single_HCcluster_iris, data = Iris_data)

#With Complete Linkage
complete_HCcluster_iris <- hcut(Iris_data, 3, hc_method ="complete", hc_metric="euclidean")

fviz_cluster(complete_HCcluster_iris, data = Iris_data)

From the plots and results of the Complete Linkage method, it looks like we did not perform that badly. But why did we get an accuracy of only 66%?

Is Accuracy the right metric?

The Rand Index from the clusteval package (you might need to use the Adjusted Rand Index).

This package provides tools to evaluate clustering algorithms. Using external cluster validation indexes is more appropriate here (e.g., when the cluster labels/numbers do not match between methods, or between a method and the ground truth data).

if (!require("clusteval")) install.packages("clusteval", dependencies=T)
library("clusteval") #Load the Package

cluster_similarity(Iris_class, groups_complete, similarity="rand")
## [1] 0.8021477
cluster_similarity(Iris_class, groups_single, similarity="rand")
## [1] 0.7719016
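
The plain Rand index does not correct for chance agreement, which is why the adjusted version is often preferred. One way to compute it (a sketch assuming the mclust package, which is installed later in this tutorial):

#Adjusted Rand Index: ~0 for random labellings, 1 for perfect agreement
mclust::adjustedRandIndex(Iris_class, groups_complete)
mclust::adjustedRandIndex(Iris_class, groups_single)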

What if we don't know how many clusters are in the data set (i.e., k)?

The factoextra package can help.

#factoextra PACKAGE
if (!require("factoextra")) install.packages("factoextra", dependencies=T)
library("factoextra") #Load the Package

#Need to define a wrapper function for fviz_nbclust
single_HC<-function(x,k){hcut(x, k, hc_method ="single", hc_metric="euclidean")}

#Hepta
fviz_nbclust(as.matrix(Hepta_data), single_HC, method = "silhouette")

#Link
fviz_nbclust(as.matrix(Link_data), single_HC, method = "silhouette")

For the Hepta and Link data sets, the average silhouette index did suggest the right number of clusters. But this might not be the case for all methods and data sets.

#CAREFUL interpreting results
#Iris
fviz_nbclust(as.matrix(Iris_data), single_HC, method = "silhouette")

complete_HC<-function(x,k){hcut(x, k, hc_method ="complete", hc_metric="euclidean")}

fviz_nbclust(as.matrix(Iris_data), complete_HC, method = "silhouette")
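
The average silhouette only tells part of the story; per-observation silhouette widths can be inspected with the cluster package (loaded later for diana). A sketch for the complete-linkage, k = 3 solution:

#Negative silhouette widths flag observations that sit closer to a neighbouring cluster
library(cluster)
sil <- silhouette(cutree(Iris_complete, k = 3), Iris_d_euclidian)
summary(sil)
plot(sil)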

Are these clusters statistically significant?

We use the pvclust package to perform bootstrapping (see the notes).

#PVCLUST PACKAGE
if (!require("pvclust")) install.packages("pvclust", dependencies=T)
library("pvclust")  #Load the Package


#We can use multiple cores and change the number of bootstraps. The more
#bootstraps, the lower the probability of clustering error; recommended for a
#final evaluation: 10,000-100,000, for testing: 1,000.


#pvclust clusters the COLUMNS of the data, so we transpose; the Species
#factor is dropped so that the Euclidean distance is meaningful
fit <- pvclust(as.data.frame(t(iris[, -5])), method.hclust="complete",
   method.dist="euclidean", nboot = 1000, parallel=T)
## Creating a temporary cluster...done:
## socket cluster with 3 nodes on host 'localhost'
## Multiscale bootstrap... Done.
plot(fit) # dendrogram with p-values
pvrect(fit, alpha=.95)
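
Besides drawing rectangles, pvclust can return the clusters whose approximately unbiased (AU) p-values exceed the threshold; a short sketch:

#Clusters strongly supported by the multiscale bootstrap (AU > 0.95)
significant <- pvpick(fit, alpha = 0.95)
significant$clusters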

Hierarchical clustering: Divisive

The diana function from the cluster package

#cluster PACKAGE
if (!require("cluster")) install.packages("cluster", dependencies=T)
library("cluster")  #Load the Package


Hepta_diana <- diana(Hepta_data)

 
Hepta_diana_cluster<-cutree(as.hclust(Hepta_diana), k = length(unique(Hepta_class)))
plot(as.dendrogram(Hepta_diana))
rect.hclust(as.hclust(Hepta_diana), k=length(unique(Hepta_class)), border="red")

knit_hooks$set(webgl = hook_webgl)

mfrow3d(nr = 1, nc = 2, sharedMouse =T)  
plot3d((Hepta_data),col=Hepta_diana_cluster, type = "s",radius=0.1, main="DIANA")  
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")  
rglwidget()

Here is a clear example of why we need to use cluster validation indexes (e.g., the Rand index) instead of accuracy.

both<-confusionMatrix(Hepta_diana_cluster, Hepta_class)
both$table
##           Reference
## Prediction  1  2  3  4  5  6  7
##          1 32  0  0 29  0  0  0
##          2  0 30  0  0  0  0  0
##          3  0  0 30  0  0  0  0
##          4  0  0  0  1  0 30  0
##          5  0  0  0  0 29  0  0
##          6  0  0  0  0  1  0  0
##          7  0  0  0  0  0  0 30
head(both$overall,2)
##  Accuracy     Kappa 
## 0.7169811 0.6692840
cluster_similarity(Hepta_diana_cluster, Hepta_class, similarity="rand")
## [1] 0.9545739

Centroid-based clustering: K-means

Hepta Dataset

### NEED TO SET SEED FOR RANDOM STARTS
set.seed(1234)

Hepta_Kmeans <- kmeans(Hepta_data, centers=length(unique(Hepta_class)), nstart=100)


cluster_similarity(Hepta_Kmeans$cluster, Hepta_class, similarity="rand")
## [1] 1
knit_hooks$set(webgl = hook_webgl)

mfrow3d(nr = 1, nc = 2, sharedMouse =T)  
plot3d((Hepta_data),col=Hepta_Kmeans$cluster, type = "s",radius=0.1, main="K-means")  
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")  
rglwidget()
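
Because k-means needs k up front (plus random starts), it is worth checking how the total within-cluster sum of squares changes with k. A sketch using fviz_nbclust from factoextra (loaded earlier):

#Elbow plot: look for the k where adding clusters stops reducing the within-cluster sum of squares much
set.seed(1234)
fviz_nbclust(as.matrix(Hepta_data), kmeans, method = "wss", nstart = 25)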

Model-based clustering: Expectation Maximization

Using the mclust package

There is no need to provide the number of clusters a priori; mclust selects it via BIC.

if (!require("mclust")) install.packages("mclust", dependencies=T)
library("mclust") #Load the Package

Hepta Dataset

Hepta_EM<-mclustBIC(Hepta_data)
plot(Hepta_EM)

model <-Mclust(Hepta_data, x = Hepta_EM)

cluster_similarity(model$classification, Hepta_class, similarity="rand")
## [1] 1
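
To see how many components and which covariance structure BIC selected, we can inspect the fitted object (a quick sketch):

#Number of clusters and covariance model chosen by BIC
summary(model)
model$G          #number of mixture components selected
model$modelName  #covariance model (e.g. "EII", "VVV", ...)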

Density-based clustering: DBSCAN

Using the fpc package

There is no need to provide the number of clusters, but other parameters (eps and MinPts) are required.

if (!require("fpc")) install.packages("fpc", dependencies=T)
library("fpc") #Load the Package

set.seed(123456)
#eps = 0.15 and eps = 0.5, both with MinPts = 5
Hepta_dbs_15_5 <- dbscan(Hepta_data, eps = 0.15, MinPts = 5)
Hepta_dbs_50_5 <- dbscan(Hepta_data, eps = 0.5, MinPts = 5)

mfrow3d(nr = 1, nc = 3, sharedMouse =T)  
#+1 so that noise points (cluster 0) get a visible colour
plot3d((Hepta_data),col=(Hepta_dbs_15_5$cluster)+1, type = "s",radius=0.1, main="DBSCAN eps 0.15")  

plot3d((Hepta_data),col=(Hepta_dbs_50_5$cluster)+1, type = "s",radius=0.1, main="DBSCAN eps 0.5")  
plot3d((Hepta_data),col=Hepta_class, type = "s",radius=0.1, main="Ground Truth")  
rglwidget()

You can play more with these parameter values on this website.
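
A common heuristic for choosing eps is the k-nearest-neighbour distance plot; a sketch assuming the separate dbscan package (note: it is a different package from fpc, and its own dbscan() will mask fpc::dbscan once loaded):

#Sorted distance of every point to its 5th nearest neighbour; a knee in the curve suggests a reasonable eps for MinPts = 5
if (!require("dbscan")) install.packages("dbscan", dependencies=T)
library("dbscan")
kNNdistplot(Hepta_data, k = 5)
abline(h = 0.5, lty = 2)  #the larger eps value tried above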

More tutorials about:
*Mclust

*Hierarchical Clustering

*pvclust