Overview

Repeated measures data, often obtained from experience sampling or daily diary studies, require analytic methods that accommodate the inherent nesting in the data. Multilevel and structural equation modeling approaches are typically used to analyze repeated measures data. However, data mining and machine learning methods, particularly decision trees, are now being modified to accommodate repeated measures data (e.g., SEM trees, Brandmaier et al., 2013; mixed-effects regression trees, Hajjem et al., 2011, 2014; nonlinear longitudinal recursive partitioning, Stegmann et al., 2018).

In this tutorial, we illustrate how random effects regression trees (Random Effects/EM (RE-EM) trees, Sela & Simonoff, 2012) can be applied to EMA-type data for knowledge discovery.

Random effects regression trees have a continuous outcome variable (e.g., negative affect measured on a Likert or slider scale). Random effects regression trees can accommodate a sizable number of predictor variables, and these predictors can be categorical, interval, or continuous.

Ultimately, the model aims to sort each instance (or data point) into a node, with the instances within a node being as similar as possible. Note that nodes are compromised of observations, not individuals. To determine the composition of a node, random effects regression trees use a recursive algorithm to minimize the sum of squares of the node.

In our example, we will use data from an experience sampling study that examines daily interactions, emotions, and behaviors. Specifically, we will use the AMIB data set that followed 190 adults for eight days - thus repeated measures across days are nested within individuals.

Outline

In this tutorial, we will cover…

  1. Introduction to the research questions, model, and data.
  2. An Example applying Random Effect Regression Trees to Daily Diary Data.
  3. Model Evaluation Random Effect Regression Tree.
  4. Cautions and considerations.
  5. Conclusion.

1. Introduction to the Research Questions and Data.

The Research Questions.

We are going to address:

  • Are people’s personality, patterns of daily behaviors (e.g., sleep hours and quality, perceived stress, physical activity), and positive affect predictive of their daily negative affect?
    • What variables are best at distinguishing differences in daily negative affect?
    • Furthermore, are these selected variables stable when changing the parameters that constrain tree growth (e.g., maximum tree depth)?

Notice that no pattern of association is hypothesized between the predictors and the outcome; this is one advantage of regression trees - the different splits allow for higher-order interactions and thus non-linear associations between the predictors and outcome variables.

The Data.

The data are organized as follows:

  • There are N (number of individuals) = 190 individuals, a majority of whom completed all eight days of the daily diary study. In total, we have 1,418 rows of data - an average of 7.46 reports per individual.

The data set has additional columns we will not use, so we will subset to only the variables we need, which are an ID variable, time variable, predictors variables, and outcome variable. Specifically…

  • Columns:
    • Participant ID (id)
    • Time (e.g., day within the daily diary study; day)
    • Big Five Inventory: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism
    • Sleep hours (slphrs)
    • Sleep quality (slpwell)
    • Physical activity (lteq)
    • Perceived stress (pss)
    • Positive affect (posaff)
    • Negative affect (negaff)

    • for a total of 8 columns in this data set

Preliminaries

Loading Libraries

Loading libraries used in this script.

library(psych) # for describing the data
library(plyr) #for data manipulation
library(ggplot2) # for data visualization
library(party) #for tree models
library(partykit) #for tree models
library(rattle) #for tree models
library(REEMtree) #for random effect tree models

Loading Data

Load and merge person-level and day-level data.

# Set filepath for person-level data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBshare_persons_2019_0501.csv"
# Read in the .csv file using the url() function
AMIB_persons <- read.csv(file = url(filepath), header = TRUE)
#names(amib_person)

# Set filepath for daily data file
filepath1 <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBshare_daily_2019_0501.csv" 
# Read in the .csv file using the url() function
AMIB_daily <- read.csv(file = url(filepath1), header = TRUE)
#names(amib_daily)

# Merge daily and person-level data
amib <- merge(AMIB_persons, AMIB_daily, by = "id")

Data Preparation

Reduce to the variables of interest.

amib <- amib[, c("id", "day", "bfi_e", "bfi_a", "bfi_c", "bfi_n", "bfi_o", "slphrs", "slpwell", "pss", "lteq", "posaff")]

#head(amib)

Reverse code pss so higher values indicate higher perceived stress.

#reverse coding the pss variable into a new stress variable
amib$stress <- 4 - amib$pss

Split the variables into “trait” (between-person differences) and “state” (within-person deviations) components. Specifically, the time-varying predictors, sleep hours, sleep quality, perceived stress, physical activity, and positive affect are split into two varaibles: VARIABLE_trait is the sample-mean centered between-person component, and VARIABLE_state is the person-centered within-person component. FOr convenience of plotting we also split the outcome variable, posaff.

Calculate the trait variables

# Calculate intraindividual means
amib_imeans <- ddply(amib, "id", summarize,
                     posaff_trait = mean(posaff, na.rm=TRUE),
                     slphrs_trait = mean(slphrs, na.rm=TRUE),
                     slpwell_trait = mean(slpwell, na.rm=TRUE),
                     stress_trait = mean(stress, na.rm=TRUE),
                     lteq_trait = mean(lteq, na.rm=TRUE))

# Merge into long data set
amib <- merge(amib, amib_imeans, by="id")   

Calculate state variables.

#calculate state variables
amib$posaff_state <- amib$posaff - amib$posaff_trait
amib$slphrs_state <- amib$slphrs - amib$slphrs_trait
amib$slpwell_state <- amib$slpwell - amib$slpwell_trait
amib$stress_state <- amib$stress - amib$stress_trait
amib$lteq_state <- amib$lteq - amib$lteq_trait

# Describe data
describe(amib)
##               vars    n   mean     sd median trimmed    mad    min    max
## id               1 1458 322.53 129.08 324.00  324.20 151.23 101.00 532.00
## day              2 1458   3.48   2.30   3.00    3.47   2.97   0.00   7.00
## bfi_e            3 1458   3.39   1.00   3.50    3.41   0.74   1.00   5.00
## bfi_a            4 1458   3.61   0.88   3.50    3.69   0.74   1.00   5.00
## bfi_c            5 1458   3.76   0.85   4.00    3.78   0.74   1.50   5.00
## bfi_n            6 1458   2.97   0.96   3.00    2.98   1.48   1.00   5.00
## bfi_o            7 1458   3.60   0.96   3.50    3.65   0.74   1.00   5.00
## slphrs           8 1428   7.17   1.81   7.00    7.18   1.48   0.00  18.00
## slpwell          9 1439   2.49   1.09   3.00    2.56   1.48   0.00   4.00
## pss             10 1445   2.61   0.68   2.75    2.64   0.74   0.00   4.00
## lteq            11 1433  12.44  10.37   9.00   11.18   8.90   0.00  58.00
## posaff          12 1441   4.12   1.10   4.20    4.15   1.19   1.00   7.00
## stress          13 1445   1.39   0.68   1.25    1.36   0.74   0.00   4.00
## posaff_trait    14 1458   4.11   0.73   4.10    4.11   0.63   2.23   6.04
## slphrs_trait    15 1458   7.16   0.94   7.20    7.20   0.91   4.12   9.31
## slpwell_trait   16 1458   2.49   0.65   2.50    2.50   0.56   0.38   4.00
## stress_trait    17 1458   1.39   0.47   1.41    1.39   0.51   0.19   2.56
## lteq_trait      18 1458  12.60   8.35  10.75   11.88   8.52   0.00  45.33
## posaff_state    19 1441   0.00   0.82   0.05    0.01   0.70  -3.17   3.29
## slphrs_state    20 1428   0.00   1.55   0.00   -0.01   1.30  -8.75   9.25
## slpwell_state   21 1439   0.00   0.88   0.12    0.03   0.74  -3.29   2.57
## stress_state    22 1445   0.00   0.49  -0.03   -0.02   0.46  -1.75   2.12
## lteq_state      23 1433   0.00   6.44  -0.25   -0.11   5.37 -23.75  28.88
##                range  skew kurtosis   se
## id            431.00 -0.07    -1.06 3.38
## day             7.00  0.01    -1.24 0.06
## bfi_e           4.00 -0.20    -0.58 0.03
## bfi_a           4.00 -0.69     0.10 0.02
## bfi_c           3.50 -0.11    -0.90 0.02
## bfi_n           4.00 -0.08    -0.79 0.03
## bfi_o           4.00 -0.36    -0.43 0.03
## slphrs         18.00  0.10     1.83 0.05
## slpwell         4.00 -0.52    -0.36 0.03
## pss             4.00 -0.35     0.13 0.02
## lteq           58.00  1.07     0.94 0.27
## posaff          6.00 -0.25    -0.33 0.03
## stress          4.00  0.35     0.13 0.02
## posaff_trait    3.81  0.05     0.09 0.02
## slphrs_trait    5.19 -0.40     0.28 0.02
## slpwell_trait   3.62 -0.31     0.55 0.02
## stress_trait    2.38 -0.08    -0.21 0.01
## lteq_trait     45.33  0.85     0.60 0.22
## posaff_state    6.46 -0.14     0.82 0.02
## slphrs_state   18.00  0.19     3.31 0.04
## slpwell_state   5.86 -0.31     0.34 0.02
## stress_state    3.88  0.36     0.79 0.01
## lteq_state     52.62  0.18     1.52 0.17

Note that we did not sample-center the person-level variables. Tree models generally work with the raw data. The state variables are by definition person-centered.

However, the data cannot have missingness. Thus, we reduce to only the complete cases.

Reduce to complete cases.

amib <- amib[complete.cases(amib), ]
head(amib)
##    id day bfi_e bfi_a bfi_c bfi_n bfi_o slphrs slpwell  pss lteq posaff
## 1 101   0   3.5   1.5     4     2     4    6.0       1 2.50   10    3.9
## 2 101   1   3.5   1.5     4     2     4    2.0       1 2.75   10    3.8
## 3 101   2   3.5   1.5     4     2     4    9.0       3 3.50   10    5.1
## 4 101   3   3.5   1.5     4     2     4    7.5       3 3.00    9    5.6
## 5 101   4   3.5   1.5     4     2     4    8.0       4 2.75   18    4.3
## 6 101   5   3.5   1.5     4     2     4    8.0       3 2.75   19    3.9
##   stress posaff_trait slphrs_trait slpwell_trait stress_trait lteq_trait
## 1   1.50       4.5625       6.9375          2.75       1.0625     13.875
## 2   1.25       4.5625       6.9375          2.75       1.0625     13.875
## 3   0.50       4.5625       6.9375          2.75       1.0625     13.875
## 4   1.00       4.5625       6.9375          2.75       1.0625     13.875
## 5   1.25       4.5625       6.9375          2.75       1.0625     13.875
## 6   1.25       4.5625       6.9375          2.75       1.0625     13.875
##   posaff_state slphrs_state slpwell_state stress_state lteq_state
## 1      -0.6625      -0.9375         -1.75       0.4375     -3.875
## 2      -0.7625      -4.9375         -1.75       0.1875     -3.875
## 3       0.5375       2.0625          0.25      -0.5625     -3.875
## 4       1.0375       0.5625          0.25      -0.0625     -4.875
## 5      -0.2625       1.0625          1.25       0.1875      4.125
## 6      -0.6625       1.0625          0.25       0.1875      5.125

Plotting the Data.

Before we begin running our models, it is always a good idea to look at our data.

Our random effects regression tree has a number of predictors, but let’s focus on examining the association of positive affect and perceived stress on a subset of participants as an example. Plotting the outcome variable, posaff.

#Intraindividual change plot
ggplot(data = amib, 
       aes(x = day, group = id, color = factor(id), legend = FALSE)) +
  geom_line(aes(x = day, y = posaff), lty = 1, size = 1, alpha = .5) +
  xlab("Day") + 
  ylab("Positive Affect (outcome)") + 
  scale_x_continuous(breaks=seq(0,7,by=1)) +
  theme_classic() +
  theme(legend.position = "none")