Overview

This tutorial walks through calculation of a few intraindividual variability metrics. Specifically, this tutorial demonstrates calculation of intraindividual means, standard deviations, counts, proportions, and entropy. These metrics are useful for articulating a variety of dynamic characteristics from experience sampling data - a sometimes useful set of interindividual differences constructs

Outline

In this session we cover …

A. Looking at distributions and within-person time-series (continuous, categorical, binary)
B. Calculating within-person summaries (imean, isd, etc.)
C. Calculating using a formula - (ientropy)

Prelim - Loading libraries used in this script.

library(psych)
library(ggplot2)
library(data.table)
library(plyr)
library(entropy)

Prelim - Reading in Repeated Measures Data

Note that we are working from a long file. For your own data, there may be some steps to get to this point.

#Setting the working directory
setwd("~/Desktop/Fall_2017")  #Person 1 Computer
#setwd("~/Desktop/Fall_2017")  #Person 2 Computer

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBbrief_raw_daily1.csv"
#read in the .csv file using the url() function
daily <- read.csv(file=url(filepath),header=TRUE)

#Little bit of clean-up
var.names.daily <- tolower(colnames(daily))
colnames(daily)<-var.names.daily

Everything we do today uses a long file

A reminder …
1. How many unique persons are there in these data?

#Make a vector of the unique ids
idlist <- unique(daily$id)
#length of the id vector
length(idlist)
## [1] 190

Ok so, N = 190.

  1. How many observations are there in these data?
#Make a vector of the unique ids
daylist <- unique(daily$day)
#length of the id vector
length(daylist)
## [1] 8

Ok so, T = 8.

A: Looking at distributions and within-person time-series (continuous, categorical, binary)

Lets look at some descriptions of today’s variables.

#getting a list of the variable names
names(daily)
##  [1] "id"      "day"     "date"    "slphrs"  "weath"   "lteq"    "pss"    
##  [8] "se"      "swls"    "evalday" "posaff"  "negaff"  "temp"    "hum"    
## [15] "wind"    "bar"     "prec"
#sample descriptives
describe(daily$posaff)
##    vars    n mean  sd median trimmed  mad min max range  skew kurtosis
## X1    1 1441 4.12 1.1    4.2    4.15 1.19   1   7     6 -0.25    -0.33
##      se
## X1 0.03
describe(daily$se)
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 1445 3.43 0.99      3    3.47 1.48   1   5     4 -0.41    -0.11
##      se
## X1 0.03
describe(daily$evalday)
##    vars    n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 1421 0.69 0.46      1    0.73   0   0   1     1 -0.8    -1.36 0.01
#histograms
ggplot(data=daily, aes(x=posaff)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "Positive Affect")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (stat_bin).

ggplot(data=daily, aes(x=se)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "Self Esteem")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).

ggplot(data=daily, aes(x=evalday)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "Evaluation Day")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 37 rows containing non-finite values (stat_bin).

The histograms display the various measurement scales of these variables (interval, ordinal, binary).

And lets look at the longitudinal (“spaghetti”) plots …
1. Positive Affect

#ggplot version .. see also http://ggplot.yhathq.com/docs/index.html
ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = posaff, group = id, color=factor(id))) +
  geom_point() + 
  geom_line(data=daily[which(daily$id <= 105 & daily$posaff !="NA"),]) +
  xlab("Day") + 
  ylab("Positive Affect") + #ylim(1,7) +
  scale_x_continuous(breaks=seq(0,7,by=1)) 

That looks nice.

  1. Self Esteem
ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = se, group = id, color=factor(id))) +
  geom_point() + 
  geom_line(data=daily[which(daily$id <= 105 & daily$se !="NA"),]) +
  xlab("Day") + 
  ylab("Self-Esteem") + ylim(1,5) +
  scale_x_continuous(breaks=seq(0,7,by=1)) 

That looks eh. Ordinal/Nominal measures = harder to manage (see the density attempt from last time).

  1. Evaluation Day
ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = evalday, group = id, color=factor(id))) +
  geom_point() + 
  geom_line(data=daily[which(daily$id <= 105 & daily$evalday !="NA"),]) +
  xlab("Day") + 
  ylab("Evaluation Day") + #ylim(1,7) +
  scale_x_continuous(breaks=seq(0,7,by=1))
## Warning: Removed 4 rows containing missing values (geom_point).

Ugh. That does not look nice. Lets try again.

ggplot(data = daily[which(daily$id <=105),], aes(x = day, y = id, fill=factor(evalday))) +
  geom_tile() +
  xlab("Day") + 
  ylab("ID") + #ylim(1,7) +
  scale_x_continuous(breaks=seq(0,7,by=1))

That’s a more informative plot! Conclusion = Categorical time-series need to be plotted differently.

B: Calculating within-person summaries (imean, isd, etc.)

Intraindividual Variability Summaries

Lets look at the within-person distributions of posaff.

#Density distribution by id
ggplot(data=daily[which(daily$id <=105),], aes(x=posaff)) + 
  geom_density(aes(group=factor(id), colour=factor(id))) +
  xlab("Positive Afect") 

These distributions might be characterized by their “moments” (mean, variance, skew, kurtosis).
So, each indvidual can get a mean, variance etc. score.

Lets calculate those summaries.
To do so, we need some “by id” processing.
The fastest way (which is good for big data sets) is using the data.table package.

We also use the skew() and kurtosi() functions from the psych package (can also use library(moments)).

#Computing Individual parameters

#converting to a data.table object
daily.dt <- data.table(daily)

#collecting summary info by id
#note: posaff is a continuous variable; evalday is a binary variable
indiv.statbasic <- daily.dt[,list(imean.posaff=mean(posaff, na.rm=TRUE), 
                                  isd.posaff=sd(posaff, na.rm=TRUE),
                                  iskew.posaff=skew(posaff, na.rm=TRUE),
                                  ikurt.posaff=kurtosi(posaff, na.rm=TRUE),
                                  icount.posaff=sum(!is.na(posaff)),
                                  prop.evalday=mean(evalday, na.rm=TRUE),
                                  icount.evalday=sum(!is.na(evalday))
                                  ),
                                  by=id]
#An alternative
#library(plyr)
#count.statbasic <- ddply(daily, "id", summarize, icount=sum(!is.na(posaff)))

And lets look at the sample-level descriptives of the individual-level summaries.

describe(indiv.statbasic)
##                vars   n   mean     sd median trimmed    mad    min    max
## id                1 190 318.29 130.44 321.50  318.99 151.23 101.00 532.00
## imean.posaff      2 190   4.10   0.73   4.10    4.10   0.63   2.23   6.04
## isd.posaff        3 187   0.81   0.35   0.78    0.79   0.33   0.18   2.17
## iskew.posaff      4 187  -0.12   0.51  -0.12   -0.11   0.50  -1.59   1.44
## ikurt.posaff      5 187  -1.24   0.61  -1.36   -1.31   0.44  -2.75   1.10
## icount.posaff     6 190   7.58   1.22   8.00    7.89   0.00   1.00   8.00
## prop.evalday      7 190   0.69   0.19   0.71    0.69   0.21   0.00   1.00
## icount.evalday    8 190   7.48   1.33   8.00    7.83   0.00   1.00   8.00
##                 range  skew kurtosis   se
## id             431.00 -0.04    -1.09 9.46
## imean.posaff     3.81  0.04     0.08 0.05
## isd.posaff       1.98  0.84     1.08 0.03
## iskew.posaff     3.04 -0.23     0.38 0.04
## ikurt.posaff     3.85  1.27     2.67 0.04
## icount.posaff    7.00 -3.99    16.55 0.09
## prop.evalday     1.00 -0.42     0.33 0.01
## icount.evalday   7.00 -3.45    12.32 0.10

Ok, we’ve got standard intraindividual variability measures for …
continuous variables = iMean, iSD, iSkew, iKurtosis;
binary variables = iCount, iProportion.

What about nominal variables with more than 2 categories categories? (That self-esteem problem we have =:-)

C: Calculating using a formula - (ientropy)

Variability for nominal data = Shannon’s entropy

A formula for Shannon’s entorpy is …
\(H_{i} = - \sum_{j=1}^m p_{ij}lnp_{ij}\)

A few steps in calculation.
1. What is our nominal variable of interest?
se (daily self-esteem)
2. How many categories are there?
m = number of categories

m <- length(unique(daily$se))
m
## [1] 6

Note that NA (i.e., missing) may or may not be considered a category. For clarity, lets look at the table of counts …

table(daily$se, useNA = "always")
## 
##    1    2    3    4    5 <NA> 
##   62  160  503  532  188   13
table(daily$se)
## 
##   1   2   3   4   5 
##  62 160 503 532 188

For our purposes, we will not consider NA as a unique category, but as non-observation.

  1. We need proportions to put in our formula.
    proportion(m) = count(m)/numobs
#using library(plyr)
#First, we get the frequency counts across categories for each person (numerator)
counts <- ddply(daily, "id", summarize, icounts.se=table(se))
head(counts,7)
##    id icounts.se
## 1 101          1
## 2 101          4
## 3 101          3
## 4 102          1
## 5 102          1
## 6 102          2
## 7 102          4
#Second, we get the total number of days observed for each person (denominator)
numobs <- ddply(daily, "id", summarize, inumobs.se=sum(!is.na(se)))
head(numobs)
##    id inumobs.se
## 1 101          8
## 2 102          8
## 3 103          6
## 4 104          8
## 5 105          8
## 6 106          8
#putting them together - by id
freqs <- merge(counts,numobs,by="id")
freqs$iproportions <- freqs$icounts.se/freqs$inumobs.se
head(freqs,7)
##    id icounts.se inumobs.se iproportions
## 1 101          1          8        0.125
## 2 101          4          8        0.500
## 3 101          3          8        0.375
## 4 102          1          8        0.125
## 5 102          1          8        0.125
## 6 102          2          8        0.250
## 7 102          4          8        0.500
  1. Calculate entropy from the proportions.
entropies <- ddply(freqs, "id", summarize, ientropy.se=-sum(iproportions*log(iproportions)))
#can also scale this to be between zero and 1 if like.
head(entropies)
##    id ientropy.se
## 1 101   0.9743148
## 2 102   1.2130076
## 3 103   1.0114043
## 4 104   0.9743148
## 5 105   0.9743148
## 6 106   0.9002561

Lets check if we got it right. There is an entropy() function in the entropy library.

#using library(entropy) 
#the entropy() function needs a vector of counts
entropy(table(daily[which(daily$id==101),]$se))
## [1] 0.9743148
entropy(table(daily[which(daily$id==102),]$se))
## [1] 1.213008
entropy(table(daily[which(daily$id==103),]$se))
## [1] 1.011404

Looking good! The first three match perfectly.

This example (ientropy) illustrates how we can identify a specific feature of variability, write out the formula, and then calculate/quantify it – from scratch! Like baking a cake!

For more detailed information on calculation and use of diversity (e.g., entropy) metrics, see the more detailed tutorial here … https://quantdev.ssri.psu.edu/tutorials/diversity-type-intraindividual-variability-calculation

Lets merge the entropies into our collection of istats.

indiv.statbasic <- merge(indiv.statbasic,entropies,by="id")
names(indiv.statbasic)
## [1] "id"             "imean.posaff"   "isd.posaff"     "iskew.posaff"  
## [5] "ikurt.posaff"   "icount.posaff"  "prop.evalday"   "icount.evalday"
## [9] "ientropy.se"

We are developing a little library of measures that can used to measure intraindividual variability - or what Ram & Gerstorf (2009) called dynamic characteristics.

And lets look at the pairs

pairs.panels(indiv.statbasic[,c("imean.posaff","isd.posaff","ientropy.se","prop.evalday")], 
             scale=FALSE, hist.col="gray", main=paste("Intraindividual Summaries"))

From there we can look more closely at the distribution of our new intraindividual variability (imean, isd, iproportion, icount, etc.) variables. This is important if the constructs measured by the variables are out “outcomes” (which we usually like to to be normally distirbuted). And we can engage transformation or reformulation as needed.

Histogram of ientropy.se

#histogram
ggplot(data=indiv.statbasic, aes(x=ientropy.se)) +
  geom_histogram(fill="white", color="black") + 
  labs(x = "Entropy (self-esteem)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A bit of an interesting distribution (bit of zero inflation, cannot go below zero). But, hey. We just created a new outcome variable, self-esteem lability! And it turns out there are only a few papers on it (see e.g., Molloy, Ram, & Gest, 2011) - so, totally worth exploring further.

Next steps would be to examine how the intraindivdual variability metrics are related to other interindividual differences variables (e.g., using ANOVA, regression, etc.).

Yay for Intraindividual Variability and the discovery of new dynamic characteristics!!