Overview

Continuing our exploration of how data mining methods can be applied in the analysis of intensive longitudinal data obtained in experience sampling studies (daily diary, EMA, ambulatory assessment, etc.), we illustrate the use of unsupervised learning methods for identifying structure in such data. In particular, we illustrate how cluster analysis can be used to identify groups of individuals whose time-series are similar in some way (i.e., exhibit similar dynamic characteristics). The exploration is facilitated by the functions in the TSclust package for calculating dissimilarity (distance) between time-series.

We make use of two data sets …
(1) The Cortisol Data, 9-occasion data that exhibit strong shapes, and
(2) The AMIB (Phase 2) Data, 21-occasion data that exhibit fluctuations.

Importantly, in both data sets, the time variable is aligned at the same t = 0 for everyone.

Outline

This script covers …

A. Reading in The Cortisol Data
B. Clustering the time-series data and plotting
C. Reading in The AMIB Data (Phase 2 daily)
D. Clustering the time-series data and plotting

Preliminaries

Loading libraries used in this script.

#general packages
library(psych)
library(ggplot2)
#cluster packages
library(cluster) #clustering
library(TSclust) #time series clustering
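
Because library() stops with an error when a package is not installed, it can help to check availability first. The following is a minimal sketch using base R only; the helper name check_packages is hypothetical and not part of the original script.

```r
# Hypothetical helper: returns the names of packages that are NOT installed.
# requireNamespace() returns TRUE/FALSE without attaching the package.
check_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this script
needed <- c("psych", "ggplot2", "cluster", "TSclust")
missing_pkgs <- check_packages(needed)

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "),
          ". Consider install.packages() before calling library().")
} else {
  message("All packages are available.")
}
```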

A. Reading in The Cortisol Data

Reading in the time-series data

Loading the Cortisol Data - public data.

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/TheCortisolData.csv"
#read in the .csv file using the url() function
cortisol_wide <- read.csv(file=url(filepath),header=TRUE)

Looking at the top few rows of the wide data.

head(cortisol_wide,6)
##   id cort_0 cort_1 cort_2 cort_3 cort_4 cort_5 cort_6 cort_7 cort_8
## 1  1    4.2    4.1    9.7   14.0   19.0   18.0   20.0   23.0   24.0
## 2  2    5.5    5.6   14.0   16.0   19.0   17.0   18.0   20.0   19.0
## 3  3    4.0    3.8    7.5   12.0   14.0   13.0    9.1    8.2    7.9
## 4  4    6.1    5.6   14.0   20.0   26.0   23.0   26.0   25.0   26.0
## 5  5    4.6    4.4    7.2   12.3   15.8   16.1   17.0   17.8   19.1
## 6  6    6.8    9.5   14.2   19.6   19.0   13.9   13.4   12.5   11.7

Note that the data are in wide format.

The TSclust procedures require that the data be structured as either (a) a matrix with each individual time-series in a separate row (“wide data”) or (b) a data.frame with each individual time-series in a separate column (“non-standard long data”), in both cases without an id or time index. For convenience, we also create standard long data that can be used for plotting.
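
As a rough illustration of the two layouts, the sketch below rebuilds a small piece of the wide data from the head() output above (first three persons, first four occasions) and converts it to both forms. The object names cortisol_demo, series_mat, and series_df are illustrative assumptions. Base R dist() is used here only as a stand-in for a dissimilarity calculation, so the block runs without TSclust.

```r
# Toy wide data: values copied from the head() output above
cortisol_demo <- data.frame(
  id     = 1:3,
  cort_0 = c(4.2, 5.5, 4.0),
  cort_1 = c(4.1, 5.6, 3.8),
  cort_2 = c(9.7, 14.0, 7.5),
  cort_3 = c(14.0, 16.0, 12.0)
)

# (a) matrix with one time-series per row; the id column is dropped
#     and kept only as row names so results can be traced back to persons
series_mat <- as.matrix(cortisol_demo[, -1])
rownames(series_mat) <- cortisol_demo$id

# (b) data.frame with one time-series per column
series_df <- as.data.frame(t(series_mat))

# Stand-in dissimilarity: Euclidean distance between rows (persons).
# With TSclust loaded, diss(series_mat, "EUCL") should be the analogous call
# (an assumption here, since this block does not load the package).
d <- dist(series_mat, method = "euclidean")
print(d)
```

Keeping the ids as row names rather than as a column means the distance object stays labeled by person without the id values being treated as data.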

Reshaping the time-series data

Generally, two main data schemas are used to accommodate repeated measures data: “Wide Format” and “Long Format”. Different functions require different formats. We already have the wide format data, so we now create a long format version.

Reshape from wide to long

#reshaping wide to long
cortisol_long <- reshape(data=cortisol_wide, 
                         timevar=c("time"), 
                         idvar="id",
                         varying=c("cort_0","cort_1","cort_2","cort_3",
                                   "cort_4","cort_5","cort_6","cort_7","cort_8"),
                         direction="long", sep="_")
#sorting for easy viewing
# order by id and time
cortisol_long <- cortisol_long[order(cortisol_long$id,cortisol_long$time), ]

Looking at the top few rows of the long data.

head(cortisol_long,18)
##     id time cort
## 1.0  1    0  4.2
## 1.1  1    1  4.1
## 1.2  1    2  9.7
## 1.3  1    3 14.0
## 1.4  1    4 19.0
## 1.5  1    5 18.0
## 1.6  1    6 20.0
## 1.7  1    7 23.0
## 1.8  1    8 24.0
## 2.0  2    0  5.5
## 2.1  2    1  5.6
## 2.2  2    2 14.0
## 2.3  2    3 16.0
## 2.4  2    4 19.0
## 2.5  2    5 17.0
## 2.6  2    6 18.0
## 2.7  2    7 20.0
## 2.8  2    8 19.0
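
A quick sanity check on the reshaped data can catch a mis-specified varying argument. The sketch below repeats the same reshape() call on a small wide data set rebuilt from the first three persons and first three occasions shown above, then confirms that every person contributes one row per occasion. The names cortisol_demo and long_demo are illustrative and not part of the original script.

```r
# Toy wide data: values copied from the head() output above
cortisol_demo <- data.frame(
  id     = 1:3,
  cort_0 = c(4.2, 5.5, 4.0),
  cort_1 = c(4.1, 5.6, 3.8),
  cort_2 = c(9.7, 14.0, 7.5)
)

# Same reshape pattern as above: "cort_0" is split at sep into
# the variable name "cort" and the time value 0
long_demo <- reshape(data = cortisol_demo,
                     timevar = "time",
                     idvar = "id",
                     varying = c("cort_0", "cort_1", "cort_2"),
                     direction = "long", sep = "_")
long_demo <- long_demo[order(long_demo$id, long_demo$time), ]

# Check 1: persons x occasions rows in total
n_occasions <- 3
stopifnot(nrow(long_demo) == nrow(cortisol_demo) * n_occasions)

# Check 2: the same number of observations for every id (balanced data)
obs_per_id <- table(long_demo$id)
print(obs_per_id)
stopifnot(all(obs_per_id == n_occasions))
```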

Plotting the time-series data

Examining individual-level longitudinal plots provides intuition about how individuals’ time-series differ.

#intraindividual change trajectories by id
ggplot(data = cortisol_long, aes(x = time, y = cort, group = id)) +
  geom_point(color="black") + 
  geom_line(color="black")  +
  xlab("Time") + 
  ylab("Cortisol") + ylim(0,30) +
  scale_x_continuous(breaks=seq(0,8,by=1)) +
  facet_wrap(vars(id))