Sequence Analysis: Clustering on Edit Distance

Overview

Sequence analysis utilizes repeated-measures data to examine patterns within and across categorical time series.

Outline

0. Introduction to Sequence Analysis.
1. Data Management and Descriptives.
2. Creating Sequences.
3. Establishing a Cost Matrix and Sequence Analysis.
4. Cluster Determination.
5. Examine Group Differences among Clusters.

0. Introduction to Sequence Analysis.

Sequence analysis is a descriptive analytic technique to capture within-sequence patterns and allow for between-sequence comparisons. This analytic technique has previously been used in biology to identify and group similar DNA sequences (i.e., categorical sequences depicting the order of the four nucleotides: A, C, T, and G), and in sociology to examine occupational trajectories (e.g., Halpin & Chan, 1998), dance rituals (MacIndoe & Abbott, 2004), and residential mobility (Stovel & Bolan, 2004). In sum, sequence analysis is suitable for series of categorical data to identify potential patterns, to group participants based upon the similarity of their patterns, and to examine differences across pattern groups (e.g., in age, personality).

In this tutorial, we walk through an example that examines the order of the type of social interaction partner (e.g., co-worker, friend, romantic partner) a participant has over the course of a week, and whether the resulting clusters of interaction patterns are associated with any of the Big 5 personality characteristics (openness, conscientiousness, extraversion, agreeableness, and neuroticism). Data come from one of our example data sets, in which 184 individuals reported on up to seven social interactions they had each day over the course of a week.

Load libraries and read in data.

#loading needed libraries
library(cluster)
library(dplyr)
library(ggplot2)
library(psych)
library(reshape)
library(reshape2)
library(stats)
library(TraMineR)
library(TraMineRextras)

#read in repeated measures data
data <- read.csv(file = url("https://quantdev.ssri.psu.edu/sites/qdev/files/amib_partners.csv"), header = TRUE, sep = ",")

#read in between-person personality data
btwn <- read.csv(file = url("https://quantdev.ssri.psu.edu/sites/qdev/files/amib_personality.csv"), header = TRUE, sep = ",")

1. Data Management and Descriptives.

Management.

Depending on the format of your data set, some data management may be necessary. The final product should be two data sets:

“data” contains repeated measures of the variable of interest (in this case, interaction partner category). There should be a column that contains a participant ID variable, a column that contains a continuous measure of time (“occasion”), and a column for the type of interaction partner.

Let’s take a quick peek at the data.

#repeated measures data
head(data)

##    id partner_status occasion
## 1 101           <NA>        1
## 2 101         friend        2
## 3 101         friend        3
## 4 101         friend        4
## 5 101       roommate        5
## 6 101         friend        6

names(data)

## [1] "id"             "partner_status" "occasion"

str(data)

## 'data.frame':    7568 obs. of  3 variables:
##  $ id            : int  101 101 101 101 101 101 101 101 101 101 ...
##  $ partner_status: Factor w/ 9 levels "acquaintance",..: NA 3 3 3 6 3 3 3 1 3 ...
##  $ occasion      : int  1 2 3 4 5 6 7 8 9 10 ...

We see that we have a column indicating participant id, an integer variable that indicates measurement occasion, and a variable that indicates type of interaction partner.

“btwn” contains participant-level, time-invariant variables. These are the variables on which you will test between-group differences in Step 5 of sequence analysis. This data file should include a column for participant ID and columns with the between-person variables of interest (in this case, the Big 5 personality characteristics).

Let’s take a look at our data.

#between-person personality data
head(btwn)

##    id bfi_e bfi_a bfi_c bfi_n bfi_o
## 1 101   3.5   1.5   4.0   2.0   4.0
## 2 103   4.0   4.5   3.5   2.5   5.0
## 3 104   3.0   4.5   4.5   2.5   3.0
## 4 105   3.5   3.5   3.0   3.5   4.5
## 5 106   3.0   3.5   5.0   1.5   3.0
## 6 107   5.0   4.0   5.0   1.5   4.0

names(btwn)

## [1] "id"    "bfi_e" "bfi_a" "bfi_c" "bfi_n" "bfi_o"

str(btwn)

## 'data.frame':    184 obs. of  6 variables:
##  $ id   : int  101 103 104 105 106 107 108 109 110 111 ...
##  $ bfi_e: num  3.5 4 3 3.5 3 5 3.5 3.5 3 3 ...
##  $ bfi_a: num  1.5 4.5 4.5 3.5 3.5 4 3 3 3.5 3.5 ...
##  $ bfi_c: num  4 3.5 4.5 3 5 5 5 3 3.5 3 ...
##  $ bfi_n: num  2 2.5 2.5 3.5 1.5 1.5 4.5 4 3.5 3.5 ...
##  $ bfi_o: num  4 5 3 4.5 3 4 3 3 5 2 ...

All looks good!

Descriptives.

We begin by getting a feel for our data. Let’s examine:
(1) how many participants we have in the data set,
(2) how many “occasions” there are for each participant, and
(3) the frequency of interaction with a type of interaction partner across all participants.

Number of participants.

length(unique(data$id))

## [1] 184

length(unique(btwn$id))

## [1] 184

There are 184 participants in both data sets.

Number of occasions (i.e., social interactions) for each participant.

num_occ <- data %>%
           group_by(id) %>%
           summarise(count=n(), occasion = first(occasion))

describe(num_occ$count)

##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 184 41.13 13.62     43   42.45 17.79  10  56    46 -0.53    -0.94
##    se
## X1  1

#plot
ggplot(data = num_occ, aes(x = count)) +
  geom_histogram(binwidth = 5, fill = "white", color="black") + 
  xlim(0, 60) +
  ylim(0, 60) +
  labs(x = "Number of Social Interactions")

The average participant had approximately 41 social interactions (M = 41.13, SD = 13.62), with participants ranging from 10 to 56 social interactions over the course of a week.

The number of total interactions for each interaction partner type.

partner_table <- table(data$partner_status)
partner_table

## 
##     acquaintance         coworker           friend           parent 
##              608              118             4137              159 
## romantic_partner         roommate          sibling       supervisee 
##              457             1248               93               18 
##       supervisor 
##              202

We can see that participants overall had the most interactions with friends and the fewest interactions with supervisees. Conceptually, this makes sense given that we are analyzing data from college students.
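Raw counts can be easier to compare as proportions. The sketch below re-enters the counts printed above by hand (the names partner_counts, partner_props, and n_missing are ours, not part of the original code) and converts them with prop.table(). Note that table() drops NA values by default, so the interactions counted here are fewer than the 7568 rows in the data.

```r
# Counts copied by hand from the table printed above
partner_counts <- c(acquaintance = 608, coworker = 118, friend = 4137,
                    parent = 159, romantic_partner = 457, roommate = 1248,
                    sibling = 93, supervisee = 18, supervisor = 202)

# Share of all non-missing interactions for each partner type
partner_props <- round(prop.table(partner_counts), 3)
sort(partner_props, decreasing = TRUE)

# Rows with a missing partner type (7568 rows were reported by str(data))
n_missing <- 7568 - sum(partner_counts)
n_missing
```

With the real data loaded, prop.table(partner_table) should give the same proportions without retyping the counts.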

2. Creating Sequences.

In this step, we:
(1) re-format the repeated measures data from long to wide,
(2) create an “alphabet” that represents each of our categories,
(3) formally create and plot the categorical sequences.

Re-formatting the data from long to wide.

data_wide <- dcast(data, id ~ occasion, value.var = "partner_status")

#add "occ_" to each column heading
colnames(data_wide)[2:57] <- paste("occ", colnames(data_wide[, 2:57]), sep = "_")
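The hard-coded 2:57 above assumes exactly 56 occasion columns. As a minimal sketch of the same long-to-wide step, the block below uses a made-up data set and base R’s reshape() instead of dcast(), and renames columns by pattern rather than by position. The objects toy_long and toy_wide are illustrative and are not used elsewhere in this tutorial.

```r
# Made-up long data: participant 1 has three occasions, participant 2 has two
toy_long <- data.frame(
  id = c(1, 1, 1, 2, 2),
  occasion = c(1, 2, 3, 1, 2),
  partner_status = c("friend", "friend", "roommate", "parent", "friend"),
  stringsAsFactors = FALSE
)

# One row per participant, one column per occasion;
# occasions a participant never reached are filled with NA
toy_wide <- reshape(toy_long,
                    idvar = "id",
                    timevar = "occasion",
                    v.names = "partner_status",
                    direction = "wide",
                    sep = "_")

# Rename "partner_status_1" to "occ_1", and so on, without counting columns
names(toy_wide) <- sub("^partner_status_", "occ_", names(toy_wide))
toy_wide
```

Shorter sequences end in NA, which is why the sequence object created later reports missing values and a range of sequence lengths.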

Create alphabet.
These characters represent each possible category within the variable of interest. The actual naming of these values is not important, but we are going to name them in such a way that facilitates interpretation.

#this object contains the state names (i.e., categories) that appear in the data set.
partner_alphabet <- c("supervisor", "coworker", "supervisee", "friend", "acquaintance", "romantic_partner", "parent", "sibling", "roommate")

#this object allows for more helpful labels if applicable 
partner_labels <- c("supervisor", "coworker", "supervisee", "friend", "acquaintance", "romantic_partner", "parent", "sibling", "roommate")

Formally create sequences.
First, we assign colors to each of the categories (this is not necessary since there is a default color palette, but it gives us more control).

supervisor <- "#FF0000"       #red
coworker <- "#FFA500"         #orange
supervisee <- "#FFFF00"       #yellow
friend <- "#008000"           #green
acquaintance <- "#0000FF"     #blue
romantic_partner <- "#800080" #purple
parent <- "#FFC0CB"           #pink
sibling <- "#000000"          #black
roommate <- "#40E0D0"         #turquoise

Next, we create an object that contains all of the sequences.

partner_seq <- seqdef(data_wide,                      #data   
                      var = 2:57,                     #columns containing repeated measures data
                      alphabet = partner_alphabet,    #alphabet  
                      labels = partner_labels,        #labels
                      xtstep = 6,                     #steps between tick marks
                      cpal=c(supervisor, coworker, 
                             supervisee, friend, 
                             acquaintance, 
                             romantic_partner, 
                             parent, sibling, 
                             roommate))               #color palette

##  [>] found missing values ('NA') in sequence data

##  [>] preparing 184 sequences

##  [>] coding void elements with '%' and missing values with '*'

##  [>] 9 distinct states appear in the data:

##      1 = acquaintance

##      2 = coworker

##      3 = friend

##      4 = parent

##      5 = romantic_partner

##      6 = roommate

##      7 = sibling

##      8 = supervisee

##      9 = supervisor

##  [>] state coding:

##        [alphabet]       [label]          [long label]

##      1  supervisor       supervisor       supervisor

##      2  coworker         coworker         coworker

##      3  supervisee       supervisee       supervisee

##      4  friend           friend           friend

##      5  acquaintance     acquaintance     acquaintance

##      6  romantic_partner romantic_partner romantic_partner

##      7  parent           parent           parent

##      8  sibling          sibling          sibling

##      9  roommate         roommate         roommate

##  [>] 184 sequences in the data set

##  [>] min/max sequence length: 10/56

Plot the sequences.

seqIplot(partner_seq, with.legend = FALSE, main = "Type of Social Interaction Partner")

3. Establishing a Cost Matrix and Sequence Analysis.

Sequence analysis relies on an optimal matching algorithm, which finds the minimum “cost” of transforming one sequence into another. There are costs for inserting, deleting, and substituting letters, as well as costs for missingness. The researcher establishes a cost matrix, often using conventional values such as insertion/deletion costs of 1.0 and missingness costs of half the highest cost within the matrix.

There are a number of ways to determine substitution costs. Typically, substitution costs are established as the distance between states. However, we do not have an ordinal scale for the categories (although we could order social interaction partners by inferred closeness, e.g., stranger, …, spouse). In this case, we use a constant cost matrix (i.e., the cost of substituting any type of social interaction partner for any other is the same). If we were to use a theoretical rationale to order interaction partner types by similarity, we could use Manhattan (city-block) distance or Euclidean distance. Finally, the substitution cost matrix will be (n+1) by (n+1), with n = number of states in the alphabet (here, 9), since we add a right-most column and a bottom row to represent missingness costs (half of the highest cost, which in this case is half of 2).
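To illustrate the distance-based alternative, here is a minimal sketch of Manhattan substitution costs for an ordered set of states. The closeness ranks below are purely hypothetical and chosen only for illustration; they are not used in this tutorial and are not an empirical ordering.

```r
# Hypothetical closeness ranks (an assumption for illustration only)
closeness_rank <- c(acquaintance = 1, coworker = 2, friend = 3, romantic_partner = 4)

# Manhattan (city-block) distance between ranks: the further apart two
# partner types are in the ordering, the more a substitution costs
rank_costs <- as.matrix(dist(closeness_rank, method = "manhattan"))
rank_costs
```

A matrix like this, extended to all states, could in principle be passed to seqdist() through its sm argument in place of the constant matrix.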

Here, we establish our cost matrix.

costmatrix <- seqsubm(partner_seq, 
                      method="CONSTANT", 
                      cval = 2, 
                      with.missing=TRUE,
                      miss.cost=1, 
                      time.varying=FALSE, 
                      weighted=TRUE,
                      transition="both", 
                      lag=1)

##  [>] creating 10x10 substitution-cost matrix using 2 as constant value

costmatrix

##                    supervisor-> coworker-> supervisee-> friend->
## supervisor->                  0          2            2        2
## coworker->                    2          0            2        2
## supervisee->                  2          2            0        2
## friend->                      2          2            2        0
## acquaintance->                2          2            2        2
## romantic_partner->            2          2            2        2
## parent->                      2          2            2        2
## sibling->                     2          2            2        2
## roommate->                    2          2            2        2
## *->                           1          1            1        1
##                    acquaintance-> romantic_partner-> parent-> sibling->
## supervisor->                    2                  2        2         2
## coworker->                      2                  2        2         2
## supervisee->                    2                  2        2         2
## friend->                        2                  2        2         2
## acquaintance->                  0                  2        2         2
## romantic_partner->              2                  0        2         2
## parent->                        2                  2        0         2
## sibling->                       2                  2        2         0
## roommate->                      2                  2        2         2
## *->                             1                  1        1         1
##                    roommate-> *->
## supervisor->                2   1
## coworker->                  2   1
## supervisee->                2   1
## friend->                    2   1
## acquaintance->              2   1
## romantic_partner->          2   1
## parent->                    2   1
## sibling->                   2   1
## roommate->                  0   1
## *->                         1   0
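The structure of this matrix is simple enough to rebuild by hand, which can help when checking that the costs are what you intended. The sketch below uses base R only; the object names states, all_states, and toy_costs are ours, and the row and column labels omit the "->" suffix that TraMineR adds.

```r
# The nine states in the order of the alphabet defined earlier
states <- c("supervisor", "coworker", "supervisee", "friend", "acquaintance",
            "romantic_partner", "parent", "sibling", "roommate")

# Add the missing-value state, which TraMineR codes as "*"
all_states <- c(states, "*")

# Start from the constant substitution cost of 2 everywhere
toy_costs <- matrix(2,
                    nrow = length(all_states),
                    ncol = length(all_states),
                    dimnames = list(all_states, all_states))

# Substituting to or from a missing value costs half the highest cost
toy_costs["*", ] <- 1
toy_costs[, "*"] <- 1

# Replacing a state with itself costs nothing
diag(toy_costs) <- 0

dim(toy_costs)
```

The result has one row and one column more than the number of states, matching the (n+1) by (n+1) description above.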

Next, we use an optimal matching technique for sequence analysis. The output of sequence analysis is an n x n (n = number of participants) dissimilarity matrix, in which each cell contains the cost of transforming one participant’s sequence into another’s.

dist_om <- seqdist(partner_seq,         #sequence object
                   method = "OM",       #optimal matching
                   indel = 1.0,         #insert/deletion costs set to 1
                   sm = costmatrix,     #substitution cost matrix
                   with.missing = TRUE)

##  [>] including missing values as an additional state

##  [>] 184 sequences with 10 distinct states

##  [>] checking 'sm' (one value for each state, triangle inequality)

##  [>] 184 distinct sequences

##  [>] min/max sequence length: 10/56

##  [>] computing distances using the OM metric

##  [>] elapsed time: 0.122 secs

#printing out the top left corner of the dissimilarity matrix
dist_om[1:10, 1:10]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0   55   71   34   57   40   39   41   45    39
##  [2,]   55    0   59   31   37   24   54   25   25    33
##  [3,]   71   59    0   60   55   50   81   61   52    49
##  [4,]   34   31   60    0   38   23   34   18   26    19
##  [5,]   57   37   55   38    0   35   59   34   36    38
##  [6,]   40   24   50   23   35    0   41   19   23    22
##  [7,]   39   54   81   34   59   41    0   39   45    37
##  [8,]   41   25   61   18   34   19   39    0   24    23
##  [9,]   45   25   52   26   36   23   45   24    0    26
## [10,]   39   33   49   19   38   22   37   23   26     0
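To make the numbers in this matrix less opaque, the following base R sketch computes an optimal matching distance between two short sequences by dynamic programming. It is a simplified illustration, not TraMineR’s implementation: the function om_distance is hypothetical, it assumes a single constant substitution cost rather than a full cost matrix, and it does not treat missing values specially.

```r
om_distance <- function(seq_a, seq_b, indel = 1, sub_cost = 2) {
  n_a <- length(seq_a)
  n_b <- length(seq_b)

  # cost[i + 1, j + 1] holds the cheapest way to turn the first i states
  # of seq_a into the first j states of seq_b
  cost <- matrix(0, nrow = n_a + 1, ncol = n_b + 1)
  cost[, 1] <- (0:n_a) * indel   # delete everything in seq_a
  cost[1, ] <- (0:n_b) * indel   # insert everything in seq_b

  for (i in seq_len(n_a)) {
    for (j in seq_len(n_b)) {
      # matching states are free; differing states pay the substitution cost
      step_cost <- if (seq_a[i] == seq_b[j]) 0 else sub_cost
      cost[i + 1, j + 1] <- min(
        cost[i, j + 1] + indel,     # delete a state from seq_a
        cost[i + 1, j] + indel,     # insert a state from seq_b
        cost[i, j] + step_cost      # substitute (or match)
      )
    }
  }
  cost[n_a + 1, n_b + 1]
}

# One deletion turns the first sequence into the second
om_distance(c("friend", "friend", "roommate"), c("friend", "roommate"))

# Swapped order: one deletion plus one insertion is cheaper than two substitutions
om_distance(c("friend", "parent"), c("parent", "friend"))
```

With an indel cost of 1 and a substitution cost of 2, a substitution never costs less than a deletion followed by an insertion, so these settings put shifts and substitutions on an equal footing.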

4. Cluster Determination.

We next take the distance matrix obtained in step three to determine an appropriate number of clusters. Several clustering techniques are available, but we use hierarchical cluster analysis with Ward’s minimum-variance method. Other possible methods include k-medoids clustering or latent mixture models. After determining the number of clusters that works for the data, we create an object that contains cluster membership for each participant (which will be used in the final step) and plot the clusters.

Conduct hierarchical cluster analysis.

clusterward1 <- agnes(dist_om, diss = TRUE, method = "ward")
plot(clusterward1, which.plot = 2)

In this example, the resulting dendrogram indicated three clusters. We reached this conclusion by examining the length of the vertical lines (a longer vertical line indicates a greater difference between groups) and the number of participants within each group (we didn’t want a group with too few participants). After selecting a three-cluster solution, we plotted the sequences of the three clusters for visual comparison.

#cutting dendrogram (or tree) by the number of determined groups (in this case, 3)
cl3 <- cutree(clusterward1, k = 3) 

#turning cut points into a factor variable and labeling them
cl3fac <- factor(cl3, labels = paste("Type", 1:3)) 

#plot
seqplot(partner_seq, group = cl3fac, type="I", sortv = "from.start",with.legend = FALSE, border = NA)

It appears that “Type 1” participants interact primarily with friends, “Type 2” participants interact with a variety of social partner types, and “Type 3” participants usually interact with friends and roommates. In the next step, we will formally test whether the participants within these clusters differ on any theoretically meaningful variables.
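The cluster-and-cut procedure used above can be tried on a small example where the answer is obvious. The sketch below uses a made-up 6 x 6 dissimilarity matrix and base R’s hclust() with method = "ward.D2", which the hclust() documentation describes as corresponding to agnes(*, method = "ward"). The objects toy_dist, toy_tree, and toy_groups are illustrative.

```r
# Made-up dissimilarities: sequences 1-3 are close to each other,
# sequences 4-6 are close to each other, and the two sets are far apart
toy_dist <- matrix(c(
   0,  1,  2,  9, 10,  9,
   1,  0,  1, 10,  9, 10,
   2,  1,  0,  9, 10,  9,
   9, 10,  9,  0,  1,  2,
  10,  9, 10,  1,  0,  1,
   9, 10,  9,  2,  1,  0), nrow = 6, byrow = TRUE)

# Hierarchical clustering with Ward's criterion on the dissimilarities
toy_tree <- hclust(as.dist(toy_dist), method = "ward.D2")

# Cut the tree into two groups and count members per group
toy_groups <- cutree(toy_tree, k = 2)
table(toy_groups)
```

Checking group sizes with table() in this way is also a quick guard against solutions that contain a cluster with too few participants.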

5. Examine Group Differences among Clusters.

The final step of sequence analysis is to examine group differences among the clusters. One can use a variety of methods to examine group differences, and the choice of method will depend on the number of clusters chosen and the research question. For example, if only two clusters are chosen and one wants to predict cluster membership from other variables, then one would use the cluster membership variable as the outcome in a logistic regression. In this case, we use analysis of variance (ANOVA) to examine group differences.

We examined whether the Big 5 characteristics differed by cluster membership, which represented type of social interaction partner patterns. As you can see below, the only significant difference between clusters in this sample was in levels of extraversion.

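#adding grouping variable to participant-level data set
#(matching on id so that row order in btwn does not need to equal row order in data_wide)
btwn$cl3 <- cl3[match(btwn$id, data_wide$id)]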

#examining differences in openness
open_results <- aov(btwn$bfi_o ~ factor(btwn$cl3)) 
summary(open_results)
TukeyHSD(open_results) #post hoc test if needed

#examining differences in conscientiousness
con_results <- aov(btwn$bfi_c ~ factor(btwn$cl3)) 
summary(con_results)
TukeyHSD(con_results) #post hoc test if needed

#examining differences in extraversion
ext_results <- aov(btwn$bfi_e ~ factor(btwn$cl3)) 
summary(ext_results)
TukeyHSD(ext_results) #post hoc test if needed

#examining differences in agreeableness
agree_results <- aov(btwn$bfi_a ~ factor(btwn$cl3)) 
summary(agree_results)
TukeyHSD(agree_results) #post hoc test if needed

#examining differences in neuroticism
neuro_results <- aov(btwn$bfi_n ~ factor(btwn$cl3)) 
summary(neuro_results)
TukeyHSD(neuro_results) #post hoc test if needed

Although there are not many differences, we do see that the clusters differ with respect to extraversion! That makes some sense.
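The five ANOVAs above repeat the same call with a different outcome each time. A more compact way to run them is sketched below on simulated data. The values in toy_btwn are random numbers, not the study data, so the p-values are meaningless. The names toy_btwn, traits, and anova_p are ours.

```r
# Simulated stand-in for the participant-level data (NOT the real data)
set.seed(123)
toy_btwn <- data.frame(
  cl3   = factor(rep(1:3, each = 20)),
  bfi_o = rnorm(60, mean = 3.5),
  bfi_c = rnorm(60, mean = 3.5),
  bfi_e = rnorm(60, mean = 3.5),
  bfi_a = rnorm(60, mean = 3.5),
  bfi_n = rnorm(60, mean = 3.5)
)

traits <- c("bfi_o", "bfi_c", "bfi_e", "bfi_a", "bfi_n")

# Fit one ANOVA per trait and keep the p-value for the cluster effect
anova_p <- sapply(traits, function(trait) {
  fit <- aov(reformulate("cl3", response = trait), data = toy_btwn)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

round(anova_p, 3)
```

Because five tests are run, it may be worth adjusting the p-values for multiple comparisons, for example with p.adjust(anova_p, method = "holm").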

Cautions

Although there are several distinct advantages of sequence analysis, there are several limitations and considerations to the process, which include:

The length of time series needed (dependent on the process under examination, but could be lengthy).
The need for an ordinal or categorical variable.
The determination of the cost matrices (which in turn affects the prioritization of left/right shifts vs. substitution of letters in the sequence).
The extent of missingness.
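The extent of missingness, in particular, is easy to quantify before running the analysis. A minimal sketch on made-up long-format data follows; toy_long and prop_missing are illustrative names.

```r
# Made-up long data with some missing partner types
toy_long <- data.frame(
  id = c(101, 101, 101, 101, 102, 102),
  occasion = c(1, 2, 3, 4, 1, 2),
  partner_status = c(NA, "friend", "friend", "roommate", "parent", NA),
  stringsAsFactors = FALSE
)

# Proportion of occasions with a missing partner type, per participant
prop_missing <- tapply(is.na(toy_long$partner_status), toy_long$id, mean)
prop_missing
```

With the tutorial data, the analogous call would be tapply(is.na(data$partner_status), data$id, mean). Participants with a high proportion of missing occasions may deserve a closer look before clustering.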

Conclusion

Theories of interpersonal dynamics and social interaction emphasize the need to study within-person dynamics. Sequence analysis is an approach that allows researchers to capture within-person dynamics and to make between-person comparisons using repeated-measures data.

Testing out the missing data/shorter sequences issue.

I’m going to only examine the first 20 interactions for each participant and see if we get the same results.