Overview

Continuing our exploration of how Data Mining Methods can be applied in the analysis of intensive longitudinal data obtained in experience sampling studies (daily diary, EMA, ambulatory assessment, etc.), we illustrate the use of Unsupervised Learning methods for identifying structure in such data. In particular, we illustrate how cluster analysis methods can be used to identify groups of individuals whose time-series data are similar in some way (i.e., exhibit similar dynamic characteristics of some sort). The exploration is facilitated by the functions in the TSclust package for calculating dissimilarity/distance between time-series.

We make use of two data sets …
(1) The Cortisol Data, 9-occasion data that exhibit strong shapes, and
(2) The AMIB (Phase 2) Data, 21-occasion data that exhibit fluctuations.

Importantly, in both data sets, the time variable is aligned at the same t = 0 for everyone.

Outline

This script covers …

A. Reading in The Cortisol Data
B. Clustering the time-series data and plotting
C. Reading in The AMIB Data (Phase 2 daily)
D. Clustering the time-series data and plotting

Preliminaries

Loading libraries used in this script.

#general packages
library(psych)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
#cluster packages
library(cluster) #clustering
library(TSclust) #time series clustering

A. Reading in The Cortisol Data

Reading in the time-series data

Loading the Cortisol Data - public data.

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/TheCortisolData.csv"
#read in the .csv file using the url() function
cortisol_wide <- read.csv(file=url(filepath),header=TRUE)

Looking at the top few rows of the wide data.

head(cortisol_wide,6)
##   id cort_0 cort_1 cort_2 cort_3 cort_4 cort_5 cort_6 cort_7 cort_8
## 1  1    4.2    4.1    9.7   14.0   19.0   18.0   20.0   23.0   24.0
## 2  2    5.5    5.6   14.0   16.0   19.0   17.0   18.0   20.0   19.0
## 3  3    4.0    3.8    7.5   12.0   14.0   13.0    9.1    8.2    7.9
## 4  4    6.1    5.6   14.0   20.0   26.0   23.0   26.0   25.0   26.0
## 5  5    4.6    4.4    7.2   12.3   15.8   16.1   17.0   17.8   19.1
## 6  6    6.8    9.5   14.2   19.6   19.0   13.9   13.4   12.5   11.7

Note that the data are in wide format.

The TSclust procedures require that the data are structured either as (a) a matrix object with all the individual time-series in separate rows (“wide data”) OR (b) a data.frame object with all the individual time-series in separate columns (“non-standard long data”) - in both cases without an id or time index. For convenience, we also create standard long data that can be used for plotting.
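
As a minimal base-R sketch of the two layouts just described, the following builds both from a small wide data set. The data frame `toy_wide` and its column names are hypothetical and are not part of the tutorial data.

```r
# hypothetical toy data: 3 persons, 4 occasions, wide format with an id column
toy_wide <- data.frame(id  = c(101, 102, 103),
                       y_0 = c(1.0, 2.0, 3.0),
                       y_1 = c(1.5, 2.5, 2.0),
                       y_2 = c(2.0, 2.0, 1.0),
                       y_3 = c(2.5, 3.0, 0.5))

# (a) matrix with one time-series per row, id column dropped
toy_matrix <- as.matrix(toy_wide[ , -1])
rownames(toy_matrix) <- toy_wide$id   # ids kept only as row labels

# (b) data.frame with one time-series per column, no id or time column
toy_columns <- as.data.frame(t(toy_matrix))

dim(toy_matrix)    # 3 series (rows) x 4 occasions (columns)
dim(toy_columns)   # 4 occasions (rows) x 3 series (columns)
```

Either object carries the same numbers; only the orientation differs, so it is worth checking the dimensions before computing distances.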

Reshaping the time-series data

Generally, two main data schemas are used to accommodate repeated measures data - “Wide Format” and “Long Format”. Different functions work with different kinds of data input. We already have the wide format data, so we now make a set of long format data.

Reshape from wide to long

#reshaping wide to long
cortisol_long <- reshape(data=cortisol_wide, 
                         timevar=c("time"), 
                         idvar="id",
                         varying=c("cort_0","cort_1","cort_2","cort_3",
                                   "cort_4","cort_5","cort_6","cort_7","cort_8"),
                         direction="long", sep="_")
#sorting for easy viewing
# order by id and time
cortisol_long <- cortisol_long[order(cortisol_long$id,cortisol_long$time), ]

Looking at the top few rows of the long data.

head(cortisol_long,18)
##     id time cort
## 1.0  1    0  4.2
## 1.1  1    1  4.1
## 1.2  1    2  9.7
## 1.3  1    3 14.0
## 1.4  1    4 19.0
## 1.5  1    5 18.0
## 1.6  1    6 20.0
## 1.7  1    7 23.0
## 1.8  1    8 24.0
## 2.0  2    0  5.5
## 2.1  2    1  5.6
## 2.2  2    2 14.0
## 2.3  2    3 16.0
## 2.4  2    4 19.0
## 2.5  2    5 17.0
## 2.6  2    6 18.0
## 2.7  2    7 20.0
## 2.8  2    8 19.0

Plotting the time-series data

Examination of individual-level longitudinal plots provides for “intuition” about how individuals’ time-series differ.

#intraindividual change trajectories by id
ggplot(data = cortisol_long, aes(x = time, y = cort, group = id)) +
  geom_point(color="black") + 
  geom_line(color="black")  +
  xlab("Time") + 
  ylab("Cortisol") + ylim(0,30) +
  scale_x_continuous(breaks=seq(0,8,by=1)) +
  facet_wrap(vars(id))

Humans’ inherent ability for visual pattern recognition facilitates identification of a few different kinds of patterns.

B. Clustering the time-series data and plotting

The objective in time-series clustering is to do the same: identify subgroups of persons based on the structure of the repeated measures (univariate time-series). The task is to group a set of objects in such a way that objects in the same group/cluster are more similar (in some sense or another) to each other than to those in other groups/clusters. To do so, we must define what it means for two or more observations to be similar or different. This is defined mathematically as a dissimilarity metric (distance metric).

The TSclust package implements a variety of different distance metrics for time series. For details, see Montero, P., & Vilar, J. A. (2014). TSclust: An R package for time series clustering. Journal of Statistical Software, 62(1), 1-43.

5 Steps for Clustering

1. Prepare data for TSclust

(Note that there are no missing data)

#making data into a matrix (without id variable)
cortmatrix <- as.matrix(cortisol_wide[ ,-1])
cortmatrix
##       cort_0 cort_1 cort_2 cort_3 cort_4 cort_5 cort_6 cort_7 cort_8
##  [1,]    4.2    4.1    9.7   14.0   19.0   18.0   20.0   23.0   24.0
##  [2,]    5.5    5.6   14.0   16.0   19.0   17.0   18.0   20.0   19.0
##  [3,]    4.0    3.8    7.5   12.0   14.0   13.0    9.1    8.2    7.9
##  [4,]    6.1    5.6   14.0   20.0   26.0   23.0   26.0   25.0   26.0
##  [5,]    4.6    4.4    7.2   12.3   15.8   16.1   17.0   17.8   19.1
##  [6,]    6.8    9.5   14.2   19.6   19.0   13.9   13.4   12.5   11.7
##  [7,]    7.4    9.2   14.0   18.0   19.0   16.0   16.0   18.0   18.0
##  [8,]    9.2   10.0   16.0   21.0   24.0   21.0   19.0   21.0   25.0
##  [9,]    3.9    3.3    9.4   16.0   18.1   14.3   13.7   13.8   13.9
## [10,]    9.3    8.5   11.5   17.0   21.6   23.1   23.7   22.6   24.7
## [11,]    6.0    5.8   12.2   17.5   22.6   19.4   14.3   16.3   14.1
## [12,]    5.1    5.5   12.4   17.2   19.9   16.9   13.8   12.8   13.5
## [13,]    6.3    6.0   16.7   24.9   27.1   18.5   15.9   14.1   12.2
## [14,]    3.8    3.6   12.7   16.4   15.1   11.2   12.1   13.8   11.8
## [15,]    5.1    5.2    9.9   12.2   13.4   12.0   12.6   15.3   13.6
## [16,]    5.8    5.2   11.9   17.9   20.5   20.2   17.7   20.1   17.4
## [17,]    5.5    4.6   10.1   19.5   23.0   21.7   19.7   19.6   19.8
## [18,]   11.8   11.3   12.4   18.6   19.2   14.4   13.0   12.2   10.7
## [19,]    3.0    3.0    9.0   19.0   21.0   15.0   11.0   12.0   12.0
## [20,]    3.0    3.0   10.0   18.0   18.0   13.0   10.0   10.0   10.0
## [21,]    6.0    5.0    8.0   15.0   19.0   11.0   10.0    9.0   10.0
## [22,]    3.0    2.0    7.0   15.0   21.0   18.0   16.0   17.0   21.0
## [23,]    3.0    3.0    7.0   15.0   17.0   17.0   14.0   17.0   16.0
## [24,]    3.0    3.0   10.0   17.0   21.0   26.0   21.0   23.0   21.0
## [25,]    3.0    2.0    6.0    7.0   13.0    9.0    7.0    8.0    7.0
## [26,]    7.0    6.0    9.0   15.0   15.0   13.0   14.0   15.0   16.0
## [27,]    5.0    5.0   14.0   21.0   25.0   20.0   15.0   16.0   17.0
## [28,]    4.0    3.0   16.0   24.0   25.0   19.0   17.0   18.0   17.0
## [29,]    5.0    4.0   13.0   18.0   21.0   20.0   22.0   20.0   23.0
## [30,]    3.0    3.0   14.0   23.0   21.0   18.0   15.0   17.0   24.0
## [31,]    2.0    2.0   10.0   20.0   21.0   14.0   10.0    9.0    8.0
## [32,]   12.0   11.0    8.0   22.0   18.0   12.0   11.0   10.0   12.0
## [33,]    3.0    3.0    6.0   14.0   18.0   17.0   17.0   17.0   17.0
## [34,]    3.0    3.0    6.0   17.0   12.0    8.0    6.0    7.0    6.0

2. Calculate distances/dissimilarities using chosen metric

#calculating dissimilarity matrix using TSclust diss() function
cort_dist <- diss(SERIES=cortmatrix, METHOD="DTW") #DTW = Dynamic Time Warping

#adding informative column names
names(cort_dist) <- cortisol_wide$id

#examine distance/dissimilarity matrix (only first 5 time-series for space reasons)
as.matrix(cort_dist)[1:5,1:5]
##      1    2     3     4    5
## 1  0.0 22.3  89.4  24.9 28.2
## 2 22.3  0.0  73.6  41.6 13.8
## 3 89.4 73.6   0.0 127.8 58.1
## 4 24.9 41.6 127.8   0.0 51.4
## 5 28.2 13.8  58.1  51.4  0.0
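
To give some intuition for what Dynamic Time Warping measures, here is a simplified base-R sketch of the classic DTW recursion. It is an illustration under our own assumptions (absolute-difference local cost, equal weight for all three steps), and the function name `dtw_sketch` and the toy series are hypothetical. It is not the routine that diss() calls, so its values need not match the matrix above.

```r
# simplified DTW distance between two numeric series (classic recursion)
# local cost = absolute difference; allowed steps = x only, y only, or both
dtw_sketch <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # accumulated-cost matrix with an extra border of Inf
  acc <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  acc[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- abs(x[i] - y[j])
      acc[i + 1, j + 1] <- cost + min(acc[i, j + 1],   # advance in x only
                                      acc[i + 1, j],   # advance in y only
                                      acc[i, j])       # advance in both
    }
  }
  acc[n + 1, m + 1]
}

# hypothetical toy series: b is a time-shifted copy of a
a <- c(0, 0, 1, 2, 1, 0, 0)
b <- c(0, 1, 2, 1, 0, 0, 0)

dtw_sketch(a, b)   # warping absorbs the one-occasion shift
sum(abs(a - b))    # occasion-by-occasion comparison penalizes the shift
```

Because the warping lets one series "wait" for the other, two series with the same shape but slightly different timing are treated as close, which is the property that motivates using DTW for shape-based clustering.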

3. Cluster based on the distances/dissimilarities using method of choice (e.g., hierarchical clustering)

#perform hierarchical clustering on the diss object
cort_hclust <- hclust(cort_dist, method="complete",members=names(cort_dist))
#show the resulting dendrogram
plot(cort_hclust, main="Cortisol Clustering")

##same procedure using agnes in the cluster package
#perform agglomerative nesting (hierarchical clustering) on the diss object
# Compute linkages
cort_agnesclust <- agnes(cort_dist, diss=TRUE, method="complete")
#show the resulting dendrogram
plot(cort_agnesclust, which.plot = 2, main = "Cortisol Clustering")

4. Determine the number of clusters and obtain cluster assignments

We demonstrate two different cuts (for 2 and 4 clusters).

##primary solution
#imposing number of clusters
cort_cluster2 <- cutree(cort_hclust, k = 2)
#examine cluster assignments
cort_cluster2
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  1  1  2  1  1  2  1  1  2  1  2  2  2  2  2  1  1  2  2  2  2  1  1  1  2 
## 26 27 28 29 30 31 32 33 34 
##  1  1  1  1  1  2  2  1  2
#examine number of individuals in each cluster
table(cort_cluster2)
## cort_cluster2
##  1  2 
## 18 16
##alternative solution
#imposing number of clusters
cort_cluster4 <- cutree(cort_agnesclust, k = 4)
#examine cluster assignments
cort_cluster4
##  [1] 1 1 2 3 1 4 1 3 4 3 4 4 4 4 4 1 1 4 4 4 4 1 1 1 2 1 1 1 1 1 4 4 1 2
#examine number of individuals in each cluster
table(cort_cluster4)
## cort_cluster4
##  1  2  3  4 
## 15  3  3 13

There are a variety of ways to determine the ideal number of clusters and to justify those choices (e.g., using statistical metrics). We do not go into those details here. Please see our Cluster Analysis tutorial for more detailed information about those steps.
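
As one hedged illustration of such a statistical metric, the sketch below computes the average silhouette width for several candidate numbers of clusters, using base R only. The toy data and the helper `avg_silhouette` are hypothetical. The cluster package loaded above also provides a silhouette() function for the same purpose.

```r
# hypothetical toy data: 6 short series, 3 rising and 3 falling
toy <- rbind(c(1.0, 2.0, 3.0, 4.0),
             c(1.1, 2.1, 3.2, 4.1),
             c(0.9, 1.8, 3.1, 3.9),
             c(4.0, 3.0, 2.0, 1.0),
             c(4.2, 3.1, 2.1, 0.9),
             c(3.9, 2.9, 1.8, 1.1))
toy_dist <- dist(toy)                              # Euclidean distances
toy_hclust <- hclust(toy_dist, method = "complete")

# average silhouette width for a given cluster assignment
avg_silhouette <- function(d, clusters) {
  dmat <- as.matrix(d)
  widths <- sapply(seq_along(clusters), function(i) {
    own <- clusters == clusters[i]
    if (sum(own) == 1) return(0)                   # convention for singletons
    a <- sum(dmat[i, own]) / (sum(own) - 1)        # mean distance within own cluster
    others <- setdiff(unique(clusters), clusters[i])
    b <- min(sapply(others, function(k) mean(dmat[i, clusters == k])))
    (b - a) / max(a, b)
  })
  mean(widths)
}

# compare candidate numbers of clusters
candidate_k <- 2:4
avg_width <- sapply(candidate_k, function(k) {
  avg_silhouette(toy_dist, cutree(toy_hclust, k = k))
})
names(avg_width) <- candidate_k
avg_width
best_k <- candidate_k[which.max(avg_width)]
best_k
```

Larger average widths indicate tighter, better-separated clusters. The same comparison could be run on the tutorial's distance object and tree, but the index is only one of several criteria and should be weighed alongside interpretability.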

5. Merge cluster assignments into original data, reshape, and visualize/interpret

Merging and reshaping.

#binding cluster solutions with the wide data
cortisol_widecluster <- cbind(cortisol_wide,cort_cluster2,cort_cluster4)

#reshaping to long data for plotting
#reshaping wide to long
cortisol_longcluster <- reshape(data=cortisol_widecluster, 
                         timevar=c("time"), 
                         idvar=c("id","cort_cluster2","cort_cluster4"),
                         varying=c("cort_0","cort_1","cort_2","cort_3",
                                   "cort_4","cort_5","cort_6","cort_7","cort_8"),
                         direction="long", sep="_")
#ordering by id and time for easy viewing
cortisol_longcluster <- cortisol_longcluster[order(cortisol_longcluster$id,cortisol_longcluster$time), ]

Plotting cluster solutions by ID.

#2 cluster solution
#intraindividual change trajectories by ID
ggplot(data = cortisol_longcluster, aes(x = time, y = cort, group = id)) +
  #geom_point(aes(x = time, y = cort,color=factor(cort_cluster2))) + 
  geom_line(aes(x = time, y = cort,color=factor(cort_cluster2))) +
  xlab("Time") + 
  ylab("Cortisol") + ylim(0,30) +
  scale_x_continuous(breaks=seq(0,8,by=1)) +
  facet_wrap(vars(id)) +
  guides(color="none")

#4 cluster solution
#intraindividual change trajectories by ID
ggplot(data = cortisol_longcluster, aes(x = time, y = cort, group = id)) +
  #geom_point(aes(x = time, y = cort,color=factor(cort_cluster4))) + 
  geom_line(aes(x = time, y = cort,color=factor(cort_cluster4))) +
  xlab("Time") + 
  ylab("Cortisol") + ylim(0,30) +
  scale_x_continuous(breaks=seq(0,8,by=1)) +
  facet_wrap(vars(id)) +
  guides(color="none")

Plotting clusters by group and color.

#intraindividual change trajectories by cluster
ggplot(data = cortisol_longcluster, aes(x = time, y = cort, group = id)) +
  #geom_point(aes(x = time, y = cort,color=factor(cort_cluster4))) + 
  geom_line(aes(x = time, y = cort,color=factor(cort_cluster4))) +
  xlab("Time") + 
  ylab("Cortisol") + ylim(0,30) +
  scale_x_continuous(breaks=seq(0,8,by=1)) +
  facet_wrap(vars(cort_cluster2)) +
  guides(color="none")
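
To support interpretation of the plotted clusters, it can also help to summarize each cluster by its mean trajectory. The following is a minimal sketch with hypothetical toy data (`toy_long` and its columns are ours, not the tutorial's).

```r
# hypothetical long data: 4 persons, 3 occasions, with a cluster label per person
toy_long <- data.frame(id      = rep(1:4, each = 3),
                       time    = rep(0:2, times = 4),
                       y       = c(1, 2, 3,   2, 3, 4,   6, 5, 4,   8, 7, 6),
                       cluster = rep(c(1, 1, 2, 2), each = 3))

# mean of y at each occasion within each cluster
cluster_means <- aggregate(y ~ cluster + time, data = toy_long, FUN = mean)
# order by cluster and time for easy viewing
cluster_means <- cluster_means[order(cluster_means$cluster, cluster_means$time), ]
cluster_means
```

Assuming the merged long data created above, the analogous call would be aggregate(cort ~ cort_cluster2 + time, data = cortisol_longcluster, FUN = mean).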

That was pretty cool! And we learned some new things about potential subgroups in these data!

Importantly, choosing other dissimilarity/distance metrics will lead to different clusters! Given that the specifics captured by the chosen dissimilarity/distance metric are what facilitate interpretation of individual differences, it is extremely useful to inform the choice with theory and/or examine a variety of metrics.
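
A small base-R sketch makes this point concrete. The toy series below are hypothetical, and the two metrics (Euclidean distance and one minus the Pearson correlation) are simple stand-ins chosen for illustration rather than TSclust's own implementations.

```r
# hypothetical toy data: rows are series; two levels (low/high) x two shapes (rising/falling)
toy <- rbind(low_rising   = c( 1,  2,  3,  4),
             high_rising  = c(11, 12, 13, 14),
             low_falling  = c( 4,  3,  2,  1),
             high_falling = c(14, 13, 12, 11))

# metric 1: Euclidean distance (sensitive to level)
d_eucl <- dist(toy, method = "euclidean")
# metric 2: correlation-based distance (sensitive to shape, ignores level)
d_cor <- as.dist(1 - cor(t(toy)))

# same clustering method, two different metrics
cl_eucl <- cutree(hclust(d_eucl, method = "complete"), k = 2)
cl_cor  <- cutree(hclust(d_cor,  method = "complete"), k = 2)

cl_eucl   # groups by level: low vs high
cl_cor    # groups by shape: rising vs falling
```

The same four series are split by level under one metric and by shape under the other, so the substantive meaning of "similar" is set by the metric, not by the clustering algorithm.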

C. Reading in The AMIB Data (Phase 2 daily)

Reading in the time-series data

Loading the AMIB N = 30 21-day data.

#set filepath for data file
filename <- "/Users/nilam/Dropbox/APA-ATI ILDWorkshop/AMIBdata/AMIB_datashare_2019_0501/AMIBshare_phase2_daily_2019_0501.csv"
#read in the .csv file from the local file path
AMIB_daily <- read.csv(file=filename,header=TRUE)

Selecting a few variables of potential interest.

#list of select variables
selectvars <- c("id", "day", 
                "slphrs", "lteq", "evalday", "posaff", "negaff")
#subsetting to the selected variables (the data are already in long format)
daily_long <- AMIB_daily[ ,selectvars]

# order by id and time for easy viewing
daily_long <- daily_long[order(daily_long$id,daily_long$day), ]

Looking at the top few rows of the long data.

head(daily_long,21)
##     id day slphrs lteq evalday posaff negaff
## 1  203   0    7.0    3       1    5.8    1.2
## 2  203   1    7.0   NA       0    5.5    1.1
## 3  203   2    9.0    3       1    4.8    1.1
## 4  203   3    9.0    0       1    4.7    1.1
## 5  203   4    8.0    9       0    3.2    1.6
## 6  203   5    6.5   12       0    2.4    3.6
## 7  203   6    8.0   12       1    4.2    1.3
## 8  203   7    6.5   18       0    4.4    1.0
## 9  203   8    8.0   27       1    4.7    1.0
## 10 203   9    9.0    9       1    4.5    1.0
## 11 203  10    9.0   12       1    4.6    1.0
## 12 203  11    7.5    9       1    4.5    1.3
## 13 203  12    7.0    9       1    5.1    1.2
## 14 203  13    4.0    9       0    4.0    1.2
## 15 203  14    7.0    9       1    5.6    1.0
## 16 203  15    7.0    6       1    4.9    1.0
## 17 203  16    9.0   15       1    4.4    1.0
## 18 203  17   10.0   15       1    4.8    1.3
## 19 203  18    7.5    6       0    3.9    1.1
## 20 203  19    9.0    6       0    2.7    1.3
## 21 203  20   10.0    6       1    3.9    1.3

Plotting the time-series data

Examination of individual-level longitudinal plots provides for “intuition” about how individuals’ time-series differ.

#intraindividual change trajectories by ID
#Negative Affect
ggplot(data = daily_long, aes(x = day, group = id)) +
  geom_line(aes(y = negaff), color="black") +
  xlab("Day") + 
  ylab("Negative Affect") + #ylim(0,30) +
  scale_x_continuous(breaks=seq(0,21,by=7)) +
  facet_wrap(vars(id)) +
  guides(color="none")
## Warning: Removed 1 rows containing missing values (geom_path).

Reshaping to wide data

#reshaping long to wide
daily_wide_negaff <- reshape(data=daily_long, 
                    timevar=c("day"), 
                    idvar=c("id"),
                    v.names=c("negaff"),
                    direction="wide", sep="_",
                    drop=c("slphrs","lteq","evalday","posaff"))

Removing cases with missing data.

#checking for missing
nomiss <- complete.cases(daily_wide_negaff) 
table(nomiss)
## nomiss
## FALSE  TRUE 
##     7    23
#new data set with only complete cases (with a general naming)
daily_wide_univar <- daily_wide_negaff[which(nomiss==TRUE), ]

Looking at the top few rows of the wide data.

head(daily_wide_univar,6)
##      id negaff_0 negaff_1 negaff_2 negaff_3 negaff_4 negaff_5 negaff_6
## 1   203      1.2      1.1      1.1      1.1      1.6      3.6      1.3
## 23  204      3.6      3.0      3.2      5.6      2.8      1.9      3.1
## 45  205      3.3      3.0      3.6      3.6      4.2      2.5      4.2
## 67  208      1.6      1.3      2.1      1.5      1.9      1.2      1.2
## 89  211      2.2      1.8      1.6      1.5      2.1      2.0      1.6
## 126 218      1.1      1.4      1.0      1.0      1.1      1.1      1.1
##     negaff_7 negaff_8 negaff_9 negaff_10 negaff_11 negaff_12 negaff_13
## 1        1.0      1.0      1.0       1.0       1.3       1.2       1.2
## 23       3.9      5.3      1.8       1.7       2.3       1.4       3.0
## 45       3.1      2.2      2.8       2.9       2.3       3.0       3.2
## 67       1.3      1.2      1.3       1.4       1.2       1.7       1.8
## 89       1.3      1.7      1.2       1.7       1.5       2.7       1.4
## 126      1.1      1.1      1.1       1.1       1.1       1.4       1.1
##     negaff_14 negaff_15 negaff_16 negaff_17 negaff_18 negaff_19 negaff_20
## 1         1.0       1.0       1.0       1.3       1.1       1.3       1.3
## 23        2.5       1.5       2.8       5.7       2.5       4.0       5.0
## 45        2.7       3.6       3.2       3.4       3.6       3.1       2.6
## 67        1.3       1.2       1.6       1.3       1.3       1.8       1.3
## 89        1.4       1.4       1.3       1.9       2.2       1.6       1.2
## 126       1.2       1.0       1.1       1.1       1.1       1.1       1.1
##     negaff_21
## 1       1.000
## 23      3.700
## 45      2.275
## 67      1.400
## 89      1.300
## 126     1.100

Note that the data are in wide format.

The TSclust procedures require that the data are structured either as (a) a matrix object with all the individual time-series in separate rows (“wide data”) OR (b) a data.frame object with all the individual time-series in separate columns (“non-standard long data”) - in both cases without an id or time index.

D. Clustering the time-series data and plotting

The objective in time-series clustering is to do the same: identify subgroups of persons based on the structure of the repeated measures (univariate time-series). The task is to group a set of objects in such a way that objects in the same group/cluster are more similar (in some sense or another) to each other than to those in other groups/clusters. To do so, we must define what it means for two or more observations to be similar or different. This is defined mathematically as a dissimilarity metric (distance metric).

5 Steps for Clustering

1. Prepare data for TSclust

(Note that cases with missing data were removed above, so there are no missing data)

#making data into a matrix (without id variable)
datamatrix <- as.matrix(daily_wide_univar[ ,-1])
datamatrix
##     negaff_0 negaff_1 negaff_2 negaff_3 negaff_4 negaff_5 negaff_6
## 1       1.20      1.1     1.10      1.1      1.6      3.6      1.3
## 23      3.60      3.0     3.20      5.6      2.8      1.9      3.1
## 45      3.30      3.0     3.60      3.6      4.2      2.5      4.2
## 67      1.60      1.3     2.10      1.5      1.9      1.2      1.2
## 89      2.20      1.8     1.60      1.5      2.1      2.0      1.6
## 126     1.10      1.4     1.00      1.0      1.1      1.1      1.1
## 162     5.50      3.2     4.40      5.2      2.9      4.4      3.5
## 184     3.20      2.4     3.90      3.0      3.5      3.0      3.2
## 206     1.30      2.2     2.55      1.6      1.8      1.9      3.2
## 228     2.40      2.0     3.90      2.2      2.3      2.3      1.8
## 250     1.60      1.5     3.00      1.9      1.0      1.2      1.1
## 294     3.70      2.7     3.10      2.1      2.3      1.5      2.6
## 316     3.60      2.8     1.50      3.3      2.6      1.8      1.7
## 360     2.00      1.5     3.80      1.4      1.7      1.3      2.5
## 418     4.00      2.3     2.70      2.1      1.1      1.7      3.6
## 440     2.45      2.9     3.40      4.0      2.0      3.3      2.2
## 462     2.60      3.0     3.00      3.2      3.6      2.9      2.7
## 484     5.80      3.1     4.30      2.2      2.3      2.1      2.2
## 506     2.90      2.4     2.80      2.2      2.5      3.7      2.8
## 528     2.30      1.5     1.70      3.1      2.9      1.6      1.6
## 550     2.70      2.0     2.90      2.8      1.8      1.6      2.9
## 572     3.10      1.5     1.80      2.3      1.3      2.0      1.7
## 608     4.00      4.0     2.00      2.0      2.2      2.3      2.3
##     negaff_7 negaff_8 negaff_9 negaff_10 negaff_11 negaff_12 negaff_13
## 1        1.0    1.000      1.0       1.0       1.3     1.200       1.2
## 23       3.9    5.300      1.8       1.7       2.3     1.400       3.0
## 45       3.1    2.200      2.8       2.9       2.3     3.000       3.2
## 67       1.3    1.200      1.3       1.4       1.2     1.700       1.8
## 89       1.3    1.700      1.2       1.7       1.5     2.700       1.4
## 126      1.1    1.100      1.1       1.1       1.1     1.400       1.1
## 162      2.4    4.700      5.0       5.7       4.7     2.200       2.6
## 184      4.3    2.900      1.9       2.7       3.7     2.300       2.4
## 206      1.8    1.300      1.3       2.3       1.7     2.000       1.5
## 228      2.8    3.400      1.9       1.2       1.4     1.300       2.1
## 250      2.6    2.400      3.0       2.2       1.4     1.300       1.6
## 294      2.2    1.900      1.3       2.6       2.5     2.200       1.2
## 316      1.4    1.775      2.8       1.0       3.5     1.900       1.7
## 360      2.8    3.500      2.3       1.0       1.7     1.200       2.0
## 418      2.6    1.600      1.8       1.7       2.0     2.600       2.6
## 440      3.0    2.600      2.8       3.6       2.0     3.100       3.9
## 462      2.0    2.400      2.3       2.1       1.9     1.800       2.0
## 484      2.0    2.100      3.4       2.5       3.0     2.325       1.5
## 506      2.5    3.300      4.2       3.4       2.4     2.000       1.0
## 528      2.0    1.800      1.1       2.5       3.0     2.100       2.0
## 550      2.6    3.300      2.5       1.5       3.7     2.700       2.7
## 572      1.4    2.000      2.8       2.1       2.5     1.700       4.1
## 608      2.3    3.200      2.6       3.0       3.4     5.000       3.7
##     negaff_14 negaff_15 negaff_16 negaff_17 negaff_18 negaff_19 negaff_20
## 1        1.00       1.0       1.0       1.3       1.1       1.3      1.30
## 23       2.50       1.5       2.8       5.7       2.5       4.0      5.00
## 45       2.70       3.6       3.2       3.4       3.6       3.1      2.60
## 67       1.30       1.2       1.6       1.3       1.3       1.8      1.30
## 89       1.40       1.4       1.3       1.9       2.2       1.6      1.20
## 126      1.20       1.0       1.1       1.1       1.1       1.1      1.10
## 162      1.70       1.8       1.3       3.3       3.3       3.5      1.90
## 184      2.20       1.9       2.0       5.0       2.9       2.9      3.00
## 206      1.70       1.6       1.8       1.4       2.4       1.7      1.40
## 228      2.00       2.6       2.6       2.6       1.9       2.1      2.20
## 250      2.00       2.0       1.4       3.0       1.4       1.2      2.10
## 294      2.30       2.7       1.7       1.4       1.4       1.7      2.30
## 316      1.50       2.8       2.6       1.6       1.0       1.3      2.80
## 360      1.90       2.5       2.1       1.4       1.3       1.1      1.20
## 418      2.90       2.7       1.1       1.0       1.6       1.0      1.70
## 440      2.70       4.6       2.6       2.2       2.1       3.1      2.70
## 462      1.95       3.4       3.0       3.1       1.9       2.0      2.20
## 484      2.30       2.1       1.8       1.8       2.0       1.8      2.30
## 506      1.00       1.2       1.7       2.5       2.4       3.0      1.70
## 528      1.40       1.0       2.2       2.1       3.3       1.1      1.50
## 550      3.00       1.3       1.3       1.2       1.4       2.5      1.80
## 572      2.20       2.0       2.0       1.2       2.5       1.7      4.15
## 608      3.40       2.7       1.9       2.1       2.7       1.9      2.20
##     negaff_21
## 1       1.000
## 23      3.700
## 45      2.275
## 67      1.400
## 89      1.300
## 126     1.100
## 162     2.100
## 184     3.300
## 206     1.500
## 228     1.800
## 250     3.400
## 294     1.700
## 316     2.300
## 360     1.800
## 418     1.800
## 440     3.700
## 462     1.800
## 484     1.900
## 506     2.200
## 528     1.100
## 550     1.400
## 572     1.500
## 608     2.200

2. Calculate distances/dissimilarities using chosen metric

#calculating dissimilarity matrix using TSclust diss() function
data_diss <- diss(SERIES=datamatrix, METHOD="DTW") #DTW = Dynamic Time Warping

#adding informative column names
names(data_diss) <- daily_wide_univar$id

#examine distance/dissimilarity matrix (only first 5 time-series for space reasons)
as.matrix(data_diss)[1:5,1:5]
##        203    204    205  208  211
## 203  0.000 46.300 41.375  8.4 11.6
## 204 46.300  0.000 23.025 39.3 34.7
## 205 41.375 23.025  0.000 41.9 28.4
## 208  8.400 39.300 41.900  0.0  5.5
## 211 11.600 34.700 28.400  5.5  0.0

3. Cluster based on the distances/dissimilarities using method of choice (e.g., hierarchical clustering)

#perform hierarchical clustering on the diss object
data_hclust <- hclust(data_diss, method="ward.D",members=names(data_diss))
#show the resulting dendrogram
plot(data_hclust, main="TS Clustering")

##same procedure using agnes in the cluster package
#perform agglomerative nesting (hierarchical clustering) on the diss object
# Compute linkages
data_agnesclust <- agnes(data_diss, diss=TRUE, method="complete")
#show the resulting dendrogram
plot(data_agnesclust, which.plot = 2, main = "TS Clustering")

4. Determine the number of clusters and obtain cluster assignments

We demonstrate two different cuts (for 2 and 4 clusters).

##primary solution
#imposing number of clusters
data_cluster2 <- cutree(data_hclust, k = 2)
#examine cluster assignments
data_cluster2
## 203 204 205 208 211 218 226 239 244 301 306 316 321 324 340 341 344 402 
##   1   2   2   1   1   1   2   2   1   2   1   2   2   1   2   2   2   2 
## 403 409 418 424 439 
##   2   1   2   2   2
#examine number of individuals in each cluster
table(data_cluster2)
## data_cluster2
##  1  2 
##  8 15
##alternative solution
#imposing number of clusters
data_cluster4 <- cutree(data_agnesclust, k = 4)
#examine cluster assignments
data_cluster4
##  [1] 1 2 3 1 1 1 3 3 1 4 1 4 4 1 4 3 3 3 4 1 4 4 3
#examine number of individuals in each cluster
table(data_cluster4)
## data_cluster4
## 1 2 3 4 
## 8 1 7 7

There are a variety of ways to determine the ideal number of clusters and to justify those choices (e.g., using statistical metrics). We do not go into those details here. Please see our Cluster Analysis tutorial for more detailed information about those steps.

5. Merge cluster assignments into original data, reshape, and visualize/interpret

Merging and reshaping.

#binding cluster solutions with the wide data
daily_widecluster <- cbind(daily_wide_univar,data_cluster2,data_cluster4)

#reshaping to long data for plotting
#reshaping wide to long
#names of the 22 repeated-measures columns (negaff_0 to negaff_21)
varnames <- names(daily_widecluster)[2:23]
daily_longcluster <- reshape(data=daily_widecluster, 
                         timevar=c("day"), 
                         idvar=c("id","data_cluster2","data_cluster4"),
                         varying=varnames,
                         direction="long", sep="_")
#ordering by id and day for easy viewing
daily_longcluster <- daily_longcluster[order(daily_longcluster$id,daily_longcluster$day), ]

Plotting cluster solutions by ID.

#2 cluster solution
#intraindividual change trajectories by ID
ggplot(data = daily_longcluster, aes(x = day, group = id)) +
  geom_line(aes(y = negaff,color=factor(data_cluster2))) +
  xlab("Day") + 
  ylab("Negative Affect") + #ylim(0,30) +
  scale_x_continuous(breaks=seq(0,21,by=7)) +
  facet_wrap(vars(id)) +
  guides(color="none")

#4 cluster solution
#intraindividual change trajectories by ID
ggplot(data = daily_longcluster, aes(x = day, group = id)) +
  geom_line(aes(y = negaff,color=factor(data_cluster4))) +
  xlab("Day") + 
  ylab("Negative Affect") + #ylim(0,30) +
  scale_x_continuous(breaks=seq(0,21,by=7)) +
  facet_wrap(vars(id)) +
  guides(color="none")

Plotting clusters by group and color.

#intraindividual change trajectories by cluster
ggplot(data = daily_longcluster, aes(x = day, group = id)) +
  geom_line(aes(y = negaff,color=factor(data_cluster2))) +
  xlab("Day") + 
  ylab("Negative Affect") + #ylim(0,30) +
  scale_x_continuous(breaks=seq(0,21,by=7)) +
  facet_wrap(vars(data_cluster4)) +
  guides(color="none")

That was pretty cool! And we learned some new things about potential subgroups in these data!

Importantly, choosing other dissimilarity/distance metrics will lead to different clusters! Given that the specifics captured by the chosen dissimilarity/distance metric are what facilitate interpretation of individual differences, it is extremely useful to inform the choice with theory and/or examine a variety of metrics.