This tutorial walks through calculation of a few intraindividual variability metrics. Specifically, this tutorial demonstrates calculation of intraindividual means, standard deviations, counts, proportions, and entropy. These metrics are useful for articulating a variety of dynamic characteristics from experience sampling data - a sometimes useful set of interindividual differences constructs
In this session we cover …
A. Looking at distributions and within-person time-series (continuous, categorical, binary)
B. Calculating within-person summaries (imean, isd, etc.)
C. Calculating using a formula - (ientropy)
library(psych)
library(ggplot2)
library(data.table)
library(plyr)
library(entropy)
Note that we are working from a long file. For your own data, there may be some steps to get to this point.
#Setting the working directory
setwd("~/Desktop/Fall_2017") #Person 1 Computer
#setwd("~/Desktop/Fall_2017") #Person 2 Computer
#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBbrief_raw_daily1.csv"
#read in the .csv file using the url() function
daily <- read.csv(file=url(filepath),header=TRUE)
#Little bit of clean-up
var.names.daily <- tolower(colnames(daily))
colnames(daily)<-var.names.daily
Everything we do today uses a long file
A reminder …
1. How many unique persons are there in these data?
#Make a vector of the unique ids
idlist <- unique(daily$id)
#length of the id vector
length(idlist)
## [1] 190
Ok so, N = 190.
#Make a vector of the unique ids
daylist <- unique(daily$day)
#length of the id vector
length(daylist)
## [1] 8
Ok so, T = 8.
Lets look at some descriptions of today’s variables.
#getting a list of the variable names
names(daily)
## [1] "id" "day" "date" "slphrs" "weath" "lteq" "pss"
## [8] "se" "swls" "evalday" "posaff" "negaff" "temp" "hum"
## [15] "wind" "bar" "prec"
#sample descriptives
describe(daily$posaff)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1441 4.12 1.1 4.2 4.15 1.19 1 7 6 -0.25 -0.33
## se
## X1 0.03
describe(daily$se)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1445 3.43 0.99 3 3.47 1.48 1 5 4 -0.41 -0.11
## se
## X1 0.03
describe(daily$evalday)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1421 0.69 0.46 1 0.73 0 0 1 1 -0.8 -1.36 0.01
#histograms
ggplot(data=daily, aes(x=posaff)) +
geom_histogram(fill="white", color="black") +
labs(x = "Positive Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (stat_bin).
ggplot(data=daily, aes(x=se)) +
geom_histogram(fill="white", color="black") +
labs(x = "Self Esteem")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).
ggplot(data=daily, aes(x=evalday)) +
geom_histogram(fill="white", color="black") +
labs(x = "Evaluation Day")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 37 rows containing non-finite values (stat_bin).
The histograms display the various measurement scales of these variables (interval, ordinal, binary).
And lets look at the longitudinal (“spaghetti”) plots …
1. Positive Affect
#ggplot version .. see also http://ggplot.yhathq.com/docs/index.html
ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = posaff, group = id, color=factor(id))) +
geom_point() +
geom_line(data=daily[which(daily$id <= 105 & daily$posaff !="NA"),]) +
xlab("Day") +
ylab("Positive Affect") + #ylim(1,7) +
scale_x_continuous(breaks=seq(0,7,by=1))
That looks nice.
ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = se, group = id, color=factor(id))) +
geom_point() +
geom_line(data=daily[which(daily$id <= 105 & daily$se !="NA"),]) +
xlab("Day") +
ylab("Self-Esteem") + ylim(1,5) +
scale_x_continuous(breaks=seq(0,7,by=1))
That looks eh. Ordinal/Nominal measures = harder to manage (see the density attempt from last time).
ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = evalday, group = id, color=factor(id))) +
geom_point() +
geom_line(data=daily[which(daily$id <= 105 & daily$evalday !="NA"),]) +
xlab("Day") +
ylab("Evaluation Day") + #ylim(1,7) +
scale_x_continuous(breaks=seq(0,7,by=1))
## Warning: Removed 4 rows containing missing values (geom_point).
Ugh. That does not look nice. Lets try again.
ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = id, fill=factor(evalday))) +
geom_tile() +
xlab("Day") +
ylab("ID") + #ylim(1,7) +
scale_x_continuous(breaks=seq(0,7,by=1))
That’s a more informative plot! Conclusion = Categorical time-series need to be plotted differently.
Intraindividual Variability Summaries
Lets look at the within-person distributions of posaff
.
#Density distribution by id
ggplot(data=daily[which(daily$id <=105),], aes(x=posaff)) +
geom_density(aes(group=factor(id), colour=factor(id))) +
xlab("Positive Afect")
These distributions might be characterized by their “moments” (mean, variance, skew, kurtosis).
So, each indvidual can get a mean, variance etc. score.
Lets calculate those summaries.
To do so, we need some “by id” processing.
The fastest way (which is good for big data sets) is using the data.table
package.
We also use the skew()
and kurtosi()
functions from the psych
package (can also use library(moments)
).
#Computing Individual parameters
#converting to a data.table object
daily.dt <- data.table(daily)
#collecting summary info by id
#note: posaff is a continuous variable; evalday is a binary variable
indiv.statbasic <- daily.dt[,list(imean.posaff=mean(posaff, na.rm=TRUE),
isd.posaff=sd(posaff, na.rm=TRUE),
iskew.posaff=skew(posaff, na.rm=TRUE),
ikurt.posaff=kurtosi(posaff, na.rm=TRUE),
icount.posaff=sum(!is.na(posaff)),
prop.evalday=mean(evalday, na.rm=TRUE),
icount.evalday=sum(!is.na(evalday))
),
by=id]
#An alternative
#library(plyr)
#count.statbasic <- ddply(daily, "id", summarize, icount=sum(!is.na(posaff)))
And lets look at the sample-level descriptives of the individual-level summaries.
describe(indiv.statbasic)
## vars n mean sd median trimmed mad min max
## id 1 190 318.29 130.44 321.50 318.99 151.23 101.00 532.00
## imean.posaff 2 190 4.10 0.73 4.10 4.10 0.63 2.23 6.04
## isd.posaff 3 187 0.81 0.35 0.78 0.79 0.33 0.18 2.17
## iskew.posaff 4 187 -0.12 0.51 -0.12 -0.11 0.50 -1.59 1.44
## ikurt.posaff 5 187 -1.24 0.61 -1.36 -1.31 0.44 -2.75 1.10
## icount.posaff 6 190 7.58 1.22 8.00 7.89 0.00 1.00 8.00
## prop.evalday 7 190 0.69 0.19 0.71 0.69 0.21 0.00 1.00
## icount.evalday 8 190 7.48 1.33 8.00 7.83 0.00 1.00 8.00
## range skew kurtosis se
## id 431.00 -0.04 -1.09 9.46
## imean.posaff 3.81 0.04 0.08 0.05
## isd.posaff 1.98 0.84 1.08 0.03
## iskew.posaff 3.04 -0.23 0.38 0.04
## ikurt.posaff 3.85 1.27 2.67 0.04
## icount.posaff 7.00 -3.99 16.55 0.09
## prop.evalday 1.00 -0.42 0.33 0.01
## icount.evalday 7.00 -3.45 12.32 0.10
Ok, we’ve got standard intraindividual variability measures for …
continuous variables = iMean, iSD, iSkew, iKurtosis;
binary variables = iCount, iProportion.
What about nominal variables with more than 2 categories categories? (That self-esteem problem we have =:-)
Variability for nominal data = Shannon’s entropy
A formula for Shannon’s entorpy is …
\(H_{i} = - \sum_{j=1}^m p_{ij}lnp_{ij}\)
A few steps in calculation.
1. What is our nominal variable of interest?
se
(daily self-esteem)
2. How many categories are there?
m = number of categories
m <- length(unique(daily$se))
m
## [1] 6
Note that NA (i.e., missing) may or may not be considered a category. For clarity, lets look at the table of counts …
table(daily$se, useNA = "always")
##
## 1 2 3 4 5 <NA>
## 62 160 503 532 188 13
table(daily$se)
##
## 1 2 3 4 5
## 62 160 503 532 188
For our purposes, we will not consider NA as a unique category, but as non-observation.
#using library(plyr)
#First, we get the frequency counts across categories for each person (numerator)
counts <- ddply(daily, "id", summarize, icounts.se=table(se))
head(counts,7)
## id icounts.se
## 1 101 1
## 2 101 4
## 3 101 3
## 4 102 1
## 5 102 1
## 6 102 2
## 7 102 4
#Second, we get the total number of days observed for each person (denominator)
numobs <- ddply(daily, "id", summarize, inumobs.se=sum(!is.na(se)))
head(numobs)
## id inumobs.se
## 1 101 8
## 2 102 8
## 3 103 6
## 4 104 8
## 5 105 8
## 6 106 8
#putting them together - by id
freqs <- merge(counts,numobs,by="id")
freqs$iproportions <- freqs$icounts.se/freqs$inumobs.se
head(freqs,7)
## id icounts.se inumobs.se iproportions
## 1 101 1 8 0.125
## 2 101 4 8 0.500
## 3 101 3 8 0.375
## 4 102 1 8 0.125
## 5 102 1 8 0.125
## 6 102 2 8 0.250
## 7 102 4 8 0.500
entropies <- ddply(freqs, "id", summarize, ientropy.se=-sum(iproportions*log(iproportions)))
#can also scale this to be between zero and 1 if like.
head(entropies)
## id ientropy.se
## 1 101 0.9743148
## 2 102 1.2130076
## 3 103 1.0114043
## 4 104 0.9743148
## 5 105 0.9743148
## 6 106 0.9002561
Lets check if we got it right. There is an entropy()
function in the entropy
library.
#using library(entropy)
#the entropy() function needs a vector of counts
entropy(table(daily[which(daily$id==101),]$se))
## [1] 0.9743148
entropy(table(daily[which(daily$id==102),]$se))
## [1] 1.213008
entropy(table(daily[which(daily$id==103),]$se))
## [1] 1.011404
Looking good! The first three match perfectly.
For more detailed information on calculation and use of diversity (e.g., entropy) metrics, see the more detailed tutorial here … https://quantdev.ssri.psu.edu/tutorials/diversity-type-intraindividual-variability-calculation
Lets merge the entropies into our collection of istats.
indiv.statbasic <- merge(indiv.statbasic,entropies,by="id")
names(indiv.statbasic)
## [1] "id" "imean.posaff" "isd.posaff" "iskew.posaff"
## [5] "ikurt.posaff" "icount.posaff" "prop.evalday" "icount.evalday"
## [9] "ientropy.se"
And lets look at the pairs
pairs.panels(indiv.statbasic[,c("imean.posaff","isd.posaff","ientropy.se","prop.evalday")],
scale=FALSE, hist.col="gray", main=paste("Intraindividual Summaries"))
From there we can look more closely at the distribution of our new intraindividual variability (imean, isd, iproportion, icount, etc.) variables. This is important if the constructs measured by the variables are out “outcomes” (which we usually like to to be normally distirbuted). And we can engage transformation or reformulation as needed.
Histogram of ientropy.se
#histogram
ggplot(data=indiv.statbasic, aes(x=ientropy.se)) +
geom_histogram(fill="white", color="black") +
labs(x = "Entropy (self-esteem)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A bit of an interesting distribution (bit of zero inflation, cannot go below zero). But, hey. We just created a new outcome variable, self-esteem lability! And it turns out there are only a few papers on it (see e.g., Molloy, Ram, & Gest, 2011) - so, totally worth exploring further.
Next steps would be to examine how the intraindivdual variability metrics are related to other interindividual differences variables (e.g., using ANOVA, regression, etc.).
Yay for Intraindividual Variability and the discovery of new dynamic characteristics!!