This tutorial continues our walk through calculation of intraindividual variability metrics. Specifically, this tutorial demonstrates calculation of intraindividual summary statistics using equations that get translated into functions. As with other measures, the derived metrics are useful for articulating a variety of dynamic characteristics from experience sampling data, and as demonstrated at the end of this session, examining how those characteristics are related to other individual differences variables.
In this session we cover …
A. Review description of data and calculation of basic intaindividual variability (IIV) metrics.
B. Calculating some additional IIV metrics (MSSD, PAC) based on their formulas.
C. Examining how the IIV metrics are related to other individual differences variables using regressions.
library(psych)
library(ggplot2)
library(data.table)
library(entropy)
library(plyr)
library(zoo)
Note that we are working from a long file. For your own data, there may be some steps to get to this point.
#Setting the working directory
setwd("~/Desktop/Fall_2017") #Person 1 Computer
#setwd("~/Desktop/Fall_2017") #Person 2 Computer
#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBbrief_raw_daily1.csv"
#read in the .csv file using the url() function
daily <- read.csv(file=url(filepath),header=TRUE)
#Little bit of clean-up
var.names.daily <- tolower(colnames(daily))
colnames(daily)<-var.names.daily
Everything we do today uses a long file.
Lets look at some descriptions of today’s variables.
#getting a list of the variable names
names(daily)
## [1] "id" "day" "date" "slphrs" "weath" "lteq" "pss"
## [8] "se" "swls" "evalday" "posaff" "negaff" "temp" "hum"
## [15] "wind" "bar" "prec"
#sample descriptives
describe(daily$posaff)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1441 4.12 1.1 4.2 4.15 1.19 1 7 6 -0.25 -0.33
## se
## X1 0.03
describe(daily$negaff)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1441 2.45 1.04 2.2 2.34 1.04 1 6.9 5.9 0.96 0.77
## se
## X1 0.03
#histograms
ggplot(data=daily, aes(x=posaff)) +
geom_histogram(fill="white", color="black") +
labs(x = "Positive Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (stat_bin).
ggplot(data=daily, aes(x=negaff)) +
geom_histogram(fill="white", color="black") +
labs(x = "Negative Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (stat_bin).
We can see the differences in the sample-level distributions (all persons AND all occasions) between the two variables.
And lets look at the longitudinal (“spaghetti”) plots …
1. Positive Affect
#ggplot version .. see also http://ggplot.yhathq.com/docs/index.html
ggplot(data = daily[which(daily$id >=527),], aes(x = day, y = posaff, group = id, color=factor(id))) +
geom_point() +
geom_line(data=daily[which(daily$id >=527 & daily$posaff !="NA"),]) +
xlab("Day") +
ylab("Positive Affect") + #ylim(1,7) +
scale_x_continuous(breaks=seq(0,7,by=1))
That looks nice. Although note that the specification for the y-axis is not the full 1 to 7 range of the variable.
ggplot(data = daily[which(daily$id >=527),], aes(x = day, y = negaff, group = id, color=factor(id))) +
geom_point() +
geom_line(data=daily[which(daily$id >=527 & daily$se !="NA"),]) +
xlab("Day") +
ylab("Negative Affect") + #ylim(1,5) +
scale_x_continuous(breaks=seq(0,7,by=1))
That looks nice. Although note that the specification for the y-axis is not the full 1 to 7 range of the variable.
Lets look at the within-person distributions of posaff
and negaff
.
#Density distribution by id
ggplot(data=daily[which(daily$id >=527),], aes(x=posaff)) +
geom_density(aes(group=factor(id), colour=factor(id))) +
xlab("Positive Afect")
#Density distribution by id
ggplot(data=daily[which(daily$id >=527),], aes(x=negaff)) +
geom_density(aes(group=factor(id), colour=factor(id))) +
xlab("Negative Afect")
Was expecting that the negaff
distributions would show more skew, like in the lecture examples, but perhaps we don’t have enough data (i.e., only 8 occasions) to really get a “distribution”. The number of within-person samples may be too low.
Nonetheless, lets go ahead and calculate the variety of “IIV” summaries as we did last time. We use the data.table
package. We use the entropy()
function from the entropy
package.
#converting to a data.table object
daily.dt <- data.table(daily)
#Computing Individual parameters
#collecting summary info by id
#note: posaff is a continuous variable; evalday is a binary variable
indiv.statbasic <- daily.dt[,list(imean.posaff=mean(posaff, na.rm=TRUE), #continuous variable
isd.posaff=sd(posaff, na.rm=TRUE),
imean.negaff=mean(negaff, na.rm=TRUE), #continuous variable
isd.negaff=sd(negaff, na.rm=TRUE),
icount.posaff=sum(!is.na(posaff)), #data coverage
icount.negaff=sum(!is.na(negaff)),
prop.evalday=mean(evalday, na.rm=TRUE), #binary variable
icount.evalday=sum(!is.na(evalday)),
ientropy.se=entropy(table(se, useNA="no")) #categorical variable
),
by=id]
#An alternative using the plyr package
#count.statbasic <- ddply(daily, "id", summarize, icount=sum(!is.na(posaff)))
And lets look at the sample-level descriptives of the individual-level summaries.
describe(indiv.statbasic)
## vars n mean sd median trimmed mad min max
## id 1 190 318.29 130.44 321.50 318.99 151.23 101.00 532.00
## imean.posaff 2 190 4.10 0.73 4.10 4.10 0.63 2.23 6.04
## isd.posaff 3 187 0.81 0.35 0.78 0.79 0.33 0.18 2.17
## imean.negaff 4 190 2.48 0.73 2.41 2.43 0.72 1.11 5.09
## isd.negaff 5 187 0.74 0.36 0.68 0.70 0.30 0.12 2.83
## icount.posaff 6 190 7.58 1.22 8.00 7.89 0.00 1.00 8.00
## icount.negaff 7 190 7.58 1.22 8.00 7.89 0.00 1.00 8.00
## prop.evalday 8 190 0.69 0.19 0.71 0.69 0.21 0.00 1.00
## icount.evalday 9 190 7.48 1.33 8.00 7.83 0.00 1.00 8.00
## ientropy.se 10 190 0.75 0.35 0.69 0.76 0.42 0.00 1.56
## range skew kurtosis se
## id 431.00 -0.04 -1.09 9.46
## imean.posaff 3.81 0.04 0.08 0.05
## isd.posaff 1.98 0.84 1.08 0.03
## imean.negaff 3.98 0.68 0.45 0.05
## isd.negaff 2.70 1.51 5.23 0.03
## icount.posaff 7.00 -3.99 16.55 0.09
## icount.negaff 7.00 -3.99 16.55 0.09
## prop.evalday 1.00 -0.42 0.33 0.01
## icount.evalday 7.00 -3.45 12.32 0.10
## ientropy.se 1.56 -0.16 -0.11 0.03
There are clearly individual differences, even if the intraindividual summary variables are calculated on “small ensembles”.
We can think of the process of calculating intraindividual variability metrics as a form of feature extraction. The features (i.e., new variables) are extracted from each individuals’ ensemble of scores.
Lets calculate some more features.
\[\delta_{i}^2 = \frac{1}{T-1} \sum_{t=2}^{T} (y_{i(t)}-y_{i(t-1)})^2 \]
Sidebar: How to write equations in R Markdown - a helpful little guide … http://www.statpower.net/Content/310/R%20Stuff/SampleMarkdown.html
The psych
package has mssd()
and rmssd()
functions that are wrappers of functions in the matrixStats
package. They do a few weird things on their own - so along the way we wrote our own function.
#Here is an example of what an mssd function might look like
my.mssd <- function(data)
{
diffToNext<-data[2:length(data)]-data[1:(length(data)-1)] #this computes the difference between each value and the next
diffToNext2<-diffToNext^2 #this squares the difference
SSdiff<- sum(diffToNext2,na.rm=TRUE) #this takes the sum of the squared differences
denominator<-sum(!is.na(diffToNext)) #this computes the number of non-missing elements (denominator)
#which corresponds to the t-1 value
mssd<-SSdiff/denominator #this computes the MSSD
return(mssd)
}
We can apply it to one person’s data.
my.mssd(daily[which(daily$id == 101),]$negaff)
## [1] 0.3385714
Lets check it against the mssd()
function in the psych
package.
mssd(daily[which(daily$id == 101),]$negaff)
## [1] 0.395
This used to work, and we would get the same answer. After some investigation, the difference is that in our formula we divide by \(T-1\). However, in themssd
function in the psych
package, they divide by \(T-2\).
Ok - lets apply to all person’s data
#using the plyr package
mssd.stats <- ddply(daily, "id", summarize, imssd.posaff=mssd(posaff),
imssd.negaff=mssd(negaff),
my.mssd.posaff=my.mssd(posaff),
my.mssd.negaff=my.mssd(negaff))
#Correction for values of infinity
mssd.stats$imssd.posaff <- ifelse(mssd.stats$imssd.posaff =="Inf",NA, mssd.stats$imssd.posaff)
mssd.stats$imssd.negaff <- ifelse(mssd.stats$imssd.negaff =="Inf",NA, mssd.stats$imssd.negaff)
Lets merge those scores into the indiv.statbasic
data, and look at the correlations with the sd.
indiv.stats <- merge(indiv.statbasic,mssd.stats,by="id")
cor(indiv.stats[,c("isd.posaff","isd.negaff","imssd.posaff","imssd.negaff","my.mssd.negaff")], use="pairwise.complete.obs")
## isd.posaff isd.negaff imssd.posaff imssd.negaff
## isd.posaff 1.0000000 0.5542910 0.8133892 0.5291318
## isd.negaff 0.5542910 1.0000000 0.5413603 0.8399991
## imssd.posaff 0.8133892 0.5413603 1.0000000 0.6073310
## imssd.negaff 0.5291318 0.8399991 0.6073310 1.0000000
## my.mssd.negaff 0.5338433 0.8453501 0.6149334 0.9978299
## my.mssd.negaff
## isd.posaff 0.5338433
## isd.negaff 0.8453501
## imssd.posaff 0.6149334
## imssd.negaff 0.9978299
## my.mssd.negaff 1.0000000
Here, the MSSD is not that different from SD (correlations > .8), but it is a bit different.
Remember that we only have time-series of up to length 8. Things may look more different when the time-series are longer. With short time-series we cannot really discern the feature that MSSD helps with (see lecture notes).
Also Note that even after removing the Inf values the psych
package’s mssd()
function is not perfectly correlated with our my.mssd
function.
Why?
Ah … the pesky bits with NAs and how to treat them.
Conclusion = always good to check what the canned functions do as well!
One other check on the distribution.
#histogram
ggplot(data=mssd.stats, aes(x=imssd.posaff)) +
geom_histogram(fill="white", color="black") +
labs(x = "MSSD Positive Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Reminder to be careful when using these variables as outcome measures. Check the distributional assumptions of your model.
\[PAC_i = \frac{1}{T-1} \sum_{t=1}^{T-1} AC_{i(t)} \] where \(AC_{i(t)} = 1\) when \(y_{i(t+1)}-y{i(t) \ge c}\), \(c\) is a predetermined cutpoint (i.e., threshold),
and \(AC_{i(t)} = 0\) otherwise
Lets code that up.
We work up the code in steps.
1. We make a small problem. Subselect to one person’s data.
d1 <- daily[which(daily$id == 101),c("day","negaff")]
lag()
d1$lag.negaff <- lag(d1$negaff,k=1)
d1
## day negaff lag.negaff
## 1 0 3.0 3.0
## 2 1 2.3 2.3
## 3 2 1.0 1.0
## 4 3 1.3 1.3
## 5 4 1.1 1.1
## 6 5 1.0 1.0
## 7 6 1.2 1.2
## 8 7 1.1 1.1
But that is not what we want. It turns out that the lag()
function does not work the way we expect.
After some searching we find a couple of ways to do it. One of which requires the zoo
time-series package.
dzoo <- zoo(d1)
dzoo$lag.negaff <- lag(dzoo$negaff,-1,na.pad = TRUE)
dzoo
## day negaff lag.negaff
## 1 0 3.0 NA
## 2 1 2.3 3.0
## 3 2 1.0 2.3
## 4 3 1.3 1.0
## 5 4 1.1 1.3
## 6 5 1.0 1.1
## 7 6 1.2 1.0
## 8 7 1.1 1.2
Yes! That is more like it.
c = .3
dzoo$ac.negaff = ifelse((dzoo$negaff-dzoo$lag.negaff) >= c,1,0)
c
## [1] 0.3
dzoo$ac.negaff
## 1 2 3 4 5 6 7 8
## NA 0 0 1 0 0 0 0
pac <- sum(dzoo$ac.negaff/(length(dzoo$ac.negaff)-1),na.rm=TRUE)
pac
## [1] 0.1428571
Yay!
Now lets make it into a function
#Lines for testing
#data <- daily[which(daily$id == 219),c("negaff")]
#data
my.pac <- function(data, c)
{
data.zoo <- zoo(data) #this converts to zoo object
data.zoo.lag <- lag(data.zoo,-1,na.pad = TRUE) #this creates the lag variable
ac = ifelse((data.zoo-data.zoo.lag) >= c,1,0) #this calculates the binary acute change (deals with NA)
denominator<-sum(!is.na(ac)) #this computes the number of non-missing elements (denominator)
pac <- sum(ac/denominator,na.rm=TRUE) #this calculates the pac
return(pac)
}
#Test on prior problem
my.pac(d1$negaff,.3)
## [1] 0.1428571
Same answer. Yay!
Now lets try to use it to summarize every individual’s data (using the plyr
package).
pac.stats <- ddply(daily, "id", summarize, ipac.negaff=my.pac(negaff,c=.3))
Looking pretty good.
Not quite sure that the formula treats missing observations in the exact way that is desired. So should be a bit careful in interpreting the pac = 0.000. To be clear, this is not a problem with the function per se, but with conceptualization of how to treat missing. We may want to add a “correction” in that aligns with our notion of what a pac = 0.000 is in the absence of data.
And we merge theipac.negaff
scores in with the other summaries.
indiv.stats <- merge(indiv.stats,pac.stats,by="id")
And check the distribution.
#histogram
ggplot(data=indiv.stats, aes(x=ipac.negaff)) +
geom_histogram(fill="white", color="black") +
labs(x = "PAC Negative Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looks like the distirbution may fill in more when working with longer time-series.