Overview

This tutorial continues our walk through calculation of intraindividual variability metrics. Specifically, this tutorial demonstrates calculation of intraindividual summary statistics using equations that get translated into functions. As with other measures, the derived metrics are useful for articulating a variety of dynamic characteristics from experience sampling data, and as demonstrated at the end of this session, examining how those characteristics are related to other individual differences variables.

Outline

In this session we cover …

A. Review description of data and calculation of basic intaindividual variability (IIV) metrics.
B. Calculating some additional IIV metrics (MSSD, PAC) based on their formulas.
C. Examining how the IIV metrics are related to other individual differences variables using regressions.

Prelim - Loading libraries used in this script.

library(psych)
library(ggplot2)
library(data.table)
library(entropy)
library(plyr)
library(zoo)

Prelim - Reading in Repeated Measures Data

Note that we are working from a long file. For your own data, there may be some steps to get to this point.

#Setting the working directory
setwd("~/Desktop/Fall_2017")  #Person 1 Computer
#setwd("~/Desktop/Fall_2017")  #Person 2 Computer

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBbrief_raw_daily1.csv"
#read in the .csv file using the url() function
daily <- read.csv(file=url(filepath),header=TRUE)

#Little bit of clean-up
var.names.daily <- tolower(colnames(daily))
colnames(daily)<-var.names.daily

Everything we do today uses a long file.

A: Looking at distributions and within-person time-series

Lets look at some descriptions of today’s variables.

#getting a list of the variable names
names(daily)
##  [1] "id"      "day"     "date"    "slphrs"  "weath"   "lteq"    "pss"    
##  [8] "se"      "swls"    "evalday" "posaff"  "negaff"  "temp"    "hum"    
## [15] "wind"    "bar"     "prec"
#sample descriptives
describe(daily$posaff)
##    vars    n mean  sd median trimmed  mad min max range  skew kurtosis
## X1    1 1441 4.12 1.1    4.2    4.15 1.19   1   7     6 -0.25    -0.33
##      se
## X1 0.03
describe(daily$negaff)
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 1441 2.45 1.04    2.2    2.34 1.04   1 6.9   5.9 0.96     0.77
##      se
## X1 0.03
#histograms
ggplot(data=daily, aes(x=posaff)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "Positive Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (stat_bin).

ggplot(data=daily, aes(x=negaff)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "Negative Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (stat_bin).

We can see the differences in the sample-level distributions (all persons AND all occasions) between the two variables.

And lets look at the longitudinal (“spaghetti”) plots …
1. Positive Affect

#ggplot version .. see also http://ggplot.yhathq.com/docs/index.html
ggplot(data = daily[which(daily$id >=527),], aes(x = day, y = posaff, group = id, color=factor(id))) +
  geom_point() + 
  geom_line(data=daily[which(daily$id >=527 & daily$posaff !="NA"),]) +
  xlab("Day") + 
  ylab("Positive Affect") + #ylim(1,7) +
  scale_x_continuous(breaks=seq(0,7,by=1)) 

That looks nice. Although note that the specification for the y-axis is not the full 1 to 7 range of the variable.

  1. Negative Affect
ggplot(data = daily[which(daily$id >=527),], aes(x = day, y = negaff, group = id, color=factor(id))) +
  geom_point() + 
  geom_line(data=daily[which(daily$id >=527 & daily$se !="NA"),]) +
  xlab("Day") + 
  ylab("Negative Affect") + #ylim(1,5) +
  scale_x_continuous(breaks=seq(0,7,by=1)) 

That looks nice. Although note that the specification for the y-axis is not the full 1 to 7 range of the variable.

Lets look at the within-person distributions of posaff and negaff.

#Density distribution by id
ggplot(data=daily[which(daily$id >=527),], aes(x=posaff)) + 
  geom_density(aes(group=factor(id), colour=factor(id))) +
  xlab("Positive Afect") 

#Density distribution by id
ggplot(data=daily[which(daily$id >=527),], aes(x=negaff)) + 
  geom_density(aes(group=factor(id), colour=factor(id))) +
  xlab("Negative Afect") 

Was expecting that the negaff distributions would show more skew, like in the lecture examples, but perhaps we don’t have enough data (i.e., only 8 occasions) to really get a “distribution”. The number of within-person samples may be too low.

Nonetheless, lets go ahead and calculate the variety of “IIV” summaries as we did last time. We use the data.table package. We use the entropy() function from the entropy package.

#converting to a data.table object
daily.dt <- data.table(daily)

#Computing Individual parameters
#collecting summary info by id
#note: posaff is a continuous variable; evalday is a binary variable
indiv.statbasic <- daily.dt[,list(imean.posaff=mean(posaff, na.rm=TRUE),     #continuous variable
                                  isd.posaff=sd(posaff, na.rm=TRUE),
                                  imean.negaff=mean(negaff, na.rm=TRUE),     #continuous variable
                                  isd.negaff=sd(negaff, na.rm=TRUE),
                                  icount.posaff=sum(!is.na(posaff)),         #data coverage
                                  icount.negaff=sum(!is.na(negaff)),
                                  prop.evalday=mean(evalday, na.rm=TRUE),    #binary variable
                                  icount.evalday=sum(!is.na(evalday)),
                                  ientropy.se=entropy(table(se, useNA="no")) #categorical variable
                                  ),
                                  by=id]

#An alternative using the plyr package
#count.statbasic <- ddply(daily, "id", summarize, icount=sum(!is.na(posaff)))

And lets look at the sample-level descriptives of the individual-level summaries.

describe(indiv.statbasic)
##                vars   n   mean     sd median trimmed    mad    min    max
## id                1 190 318.29 130.44 321.50  318.99 151.23 101.00 532.00
## imean.posaff      2 190   4.10   0.73   4.10    4.10   0.63   2.23   6.04
## isd.posaff        3 187   0.81   0.35   0.78    0.79   0.33   0.18   2.17
## imean.negaff      4 190   2.48   0.73   2.41    2.43   0.72   1.11   5.09
## isd.negaff        5 187   0.74   0.36   0.68    0.70   0.30   0.12   2.83
## icount.posaff     6 190   7.58   1.22   8.00    7.89   0.00   1.00   8.00
## icount.negaff     7 190   7.58   1.22   8.00    7.89   0.00   1.00   8.00
## prop.evalday      8 190   0.69   0.19   0.71    0.69   0.21   0.00   1.00
## icount.evalday    9 190   7.48   1.33   8.00    7.83   0.00   1.00   8.00
## ientropy.se      10 190   0.75   0.35   0.69    0.76   0.42   0.00   1.56
##                 range  skew kurtosis   se
## id             431.00 -0.04    -1.09 9.46
## imean.posaff     3.81  0.04     0.08 0.05
## isd.posaff       1.98  0.84     1.08 0.03
## imean.negaff     3.98  0.68     0.45 0.05
## isd.negaff       2.70  1.51     5.23 0.03
## icount.posaff    7.00 -3.99    16.55 0.09
## icount.negaff    7.00 -3.99    16.55 0.09
## prop.evalday     1.00 -0.42     0.33 0.01
## icount.evalday   7.00 -3.45    12.32 0.10
## ientropy.se      1.56 -0.16    -0.11 0.03

There are clearly individual differences, even if the intraindividual summary variables are calculated on “small ensembles”.

B: Calculating some additional IIV metrics (MSSD, PAC) based on their formulas.

We can think of the process of calculating intraindividual variability metrics as a form of feature extraction. The features (i.e., new variables) are extracted from each individuals’ ensemble of scores.

Lets calculate some more features.

Mean Square of Successive Differences (MSSD)

\[\delta_{i}^2 = \frac{1}{T-1} \sum_{t=2}^{T} (y_{i(t)}-y_{i(t-1)})^2 \]

Sidebar: How to write equations in R Markdown - a helpful little guide … http://www.statpower.net/Content/310/R%20Stuff/SampleMarkdown.html

The psych package has mssd() and rmssd() functions that are wrappers of functions in the matrixStats package. They do a few weird things on their own - so along the way we wrote our own function.

#Here is an example of what an mssd function might look like 
my.mssd <- function(data)
{
    diffToNext<-data[2:length(data)]-data[1:(length(data)-1)] #this computes the difference between each value and the next
    diffToNext2<-diffToNext^2                  #this squares the difference
    SSdiff<- sum(diffToNext2,na.rm=TRUE)       #this takes the sum of the squared differences
    denominator<-sum(!is.na(diffToNext))       #this computes the number of non-missing elements (denominator)
                                               #which corresponds to the t-1 value
    mssd<-SSdiff/denominator                   #this computes the MSSD
    return(mssd)
}

We can apply it to one person’s data.

my.mssd(daily[which(daily$id == 101),]$negaff)
## [1] 0.3385714

Lets check it against the mssd() function in the psych package.

mssd(daily[which(daily$id == 101),]$negaff)
## [1] 0.395

This used to work, and we would get the same answer. After some investigation, the difference is that in our formula we divide by \(T-1\). However, in themssd function in the psych package, they divide by \(T-2\).

Ok - lets apply to all person’s data

#using the plyr package
mssd.stats <- ddply(daily, "id", summarize, imssd.posaff=mssd(posaff),
                                            imssd.negaff=mssd(negaff),
                                            my.mssd.posaff=my.mssd(posaff),
                                            my.mssd.negaff=my.mssd(negaff))

#Correction for values of infinity 
mssd.stats$imssd.posaff <- ifelse(mssd.stats$imssd.posaff =="Inf",NA, mssd.stats$imssd.posaff)
mssd.stats$imssd.negaff <- ifelse(mssd.stats$imssd.negaff =="Inf",NA, mssd.stats$imssd.negaff)

Lets merge those scores into the indiv.statbasic data, and look at the correlations with the sd.

indiv.stats <- merge(indiv.statbasic,mssd.stats,by="id")
cor(indiv.stats[,c("isd.posaff","isd.negaff","imssd.posaff","imssd.negaff","my.mssd.negaff")], use="pairwise.complete.obs")
##                isd.posaff isd.negaff imssd.posaff imssd.negaff
## isd.posaff      1.0000000  0.5542910    0.8133892    0.5291318
## isd.negaff      0.5542910  1.0000000    0.5413603    0.8399991
## imssd.posaff    0.8133892  0.5413603    1.0000000    0.6073310
## imssd.negaff    0.5291318  0.8399991    0.6073310    1.0000000
## my.mssd.negaff  0.5338433  0.8453501    0.6149334    0.9978299
##                my.mssd.negaff
## isd.posaff          0.5338433
## isd.negaff          0.8453501
## imssd.posaff        0.6149334
## imssd.negaff        0.9978299
## my.mssd.negaff      1.0000000

Here, the MSSD is not that different from SD (correlations > .8), but it is a bit different.
Remember that we only have time-series of up to length 8. Things may look more different when the time-series are longer. With short time-series we cannot really discern the feature that MSSD helps with (see lecture notes).

Also Note that even after removing the Inf values the psych package’s mssd() function is not perfectly correlated with our my.mssd function.
Why?

Ah … the pesky bits with NAs and how to treat them.
Conclusion = always good to check what the canned functions do as well!

One other check on the distribution.

#histogram
ggplot(data=mssd.stats, aes(x=imssd.posaff)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "MSSD Positive Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

Reminder to be careful when using these variables as outcome measures. Check the distributional assumptions of your model.

Probability of Acute Change (PAC)

\[PAC_i = \frac{1}{T-1} \sum_{t=1}^{T-1} AC_{i(t)} \] where \(AC_{i(t)} = 1\) when \(y_{i(t+1)}-y{i(t) \ge c}\), \(c\) is a predetermined cutpoint (i.e., threshold),
and \(AC_{i(t)} = 0\) otherwise

Lets code that up.
We work up the code in steps.
1. We make a small problem. Subselect to one person’s data.

d1 <- daily[which(daily$id == 101),c("day","negaff")]
  1. Create the lag variable
    One would think it is lag()
d1$lag.negaff <- lag(d1$negaff,k=1)
d1
##   day negaff lag.negaff
## 1   0    3.0        3.0
## 2   1    2.3        2.3
## 3   2    1.0        1.0
## 4   3    1.3        1.3
## 5   4    1.1        1.1
## 6   5    1.0        1.0
## 7   6    1.2        1.2
## 8   7    1.1        1.1

But that is not what we want. It turns out that the lag() function does not work the way we expect.
After some searching we find a couple of ways to do it. One of which requires the zoo time-series package.

dzoo <- zoo(d1)
dzoo$lag.negaff <- lag(dzoo$negaff,-1,na.pad = TRUE)
dzoo
##   day negaff lag.negaff
## 1   0    3.0         NA
## 2   1    2.3        3.0
## 3   2    1.0        2.3
## 4   3    1.3        1.0
## 5   4    1.1        1.3
## 6   5    1.0        1.1
## 7   6    1.2        1.0
## 8   7    1.1        1.2

Yes! That is more like it.

  1. Calculate the AC variable.
c = .3
dzoo$ac.negaff = ifelse((dzoo$negaff-dzoo$lag.negaff) >= c,1,0)
c
## [1] 0.3
dzoo$ac.negaff
##  1  2  3  4  5  6  7  8 
## NA  0  0  1  0  0  0  0
  1. And finally calculate the PAC summary.
pac <- sum(dzoo$ac.negaff/(length(dzoo$ac.negaff)-1),na.rm=TRUE)
pac
## [1] 0.1428571

Yay!

Now lets make it into a function

#Lines for testing
#data <- daily[which(daily$id == 219),c("negaff")]
#data
my.pac <- function(data, c)
{
    data.zoo <- zoo(data)                          #this converts to zoo object
    data.zoo.lag <- lag(data.zoo,-1,na.pad = TRUE) #this creates the lag variable
    ac = ifelse((data.zoo-data.zoo.lag) >= c,1,0)  #this calculates the binary acute change (deals with NA)
    denominator<-sum(!is.na(ac))         #this computes the number of non-missing elements (denominator)
    pac <- sum(ac/denominator,na.rm=TRUE)     #this calculates the pac
    return(pac)
}
#Test on prior problem
my.pac(d1$negaff,.3)
## [1] 0.1428571

Same answer. Yay!

Now lets try to use it to summarize every individual’s data (using the plyr package).

pac.stats <- ddply(daily, "id", summarize, ipac.negaff=my.pac(negaff,c=.3))

Looking pretty good.
Not quite sure that the formula treats missing observations in the exact way that is desired. So should be a bit careful in interpreting the pac = 0.000. To be clear, this is not a problem with the function per se, but with conceptualization of how to treat missing. We may want to add a “correction” in that aligns with our notion of what a pac = 0.000 is in the absence of data.

And we merge theipac.negaff scores in with the other summaries.

indiv.stats <- merge(indiv.stats,pac.stats,by="id")

And check the distribution.

#histogram
ggplot(data=indiv.stats, aes(x=ipac.negaff)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "PAC Negative Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looks like the distirbution may fill in more when working with longer time-series.

We have expanded the set of tools that we can used to measure intraindividual variability constructs - or what Ram & Gerstorf called dynamic characteristics.