Overview

Repeated measures data, often obtained from experience sampling or daily diary studies, require analytic methods that accommodate the inherent nesting in the data. Multilevel and structural equation modeling approaches are typically used to analyze repeated measures data. However, data mining and machine learning methods, particularly decision trees, are now being modified to accommodate repeated measures data (e.g., SEM trees, Brandmaier et al., 2013; mixed-effects regression trees, Hajjem et al., 2011, 2014; nonlinear longitudinal recursive partitioning, Stegmann et al., 2018).

In this tutorial, we illustrate how random effects regression trees (Random Effects/EM (RE-EM) trees, Sela & Simonoff, 2012) can be applied to EMA-type data for knowledge discovery.

Random effects regression trees model a continuous outcome variable (e.g., affect measured on a Likert-type or slider scale) and can accommodate a sizable number of predictor variables, which may be categorical, interval, or continuous.

Ultimately, the model aims to sort each instance (or data point) into a node, with the instances within a node being as similar as possible on the outcome. Note that nodes are composed of observations, not individuals. To determine the composition of the nodes, random effects regression trees use a recursive partitioning algorithm that selects splits to minimize the within-node sum of squared errors.
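
To make this criterion concrete, below is a minimal base-R sketch (with made-up x and y vectors) of how a single split point is chosen for one predictor: every candidate cut point is scored by the total within-node sum of squared errors it produces, and the cut with the smallest total is selected. The RE-EM algorithm alternates this kind of tree fitting with estimation of the random effects (Sela & Simonoff, 2012).

# Minimal sketch of the split criterion (hypothetical x and y vectors)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)   # one candidate predictor
y <- c(2, 1, 2, 3, 6, 7, 6, 8)   # the outcome

# Candidate cut points: midpoints between adjacent (sorted) predictor values
cuts <- (head(sort(x), -1) + tail(sort(x), -1)) / 2

# Total within-node sum of squared errors for each candidate cut
sse <- sapply(cuts, function(cut_pt) {
  left  <- y[x <  cut_pt]
  right <- y[x >= cut_pt]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
})

cuts[which.min(sse)]  # best cut for these toy data: 4.5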

In our example, we will use data from an experience sampling study that examines daily interactions, emotions, and behaviors. Specifically, we will use the AMIB data set that followed 190 adults for eight days - thus repeated measures across days are nested within individuals.

Outline

In this tutorial, we will cover…

  1. Introduction to the research questions, model, and data.
  2. An example applying random effects regression trees to daily diary data.
  3. Evaluation of the random effects regression tree model.
  4. Cautions and considerations.
  5. Conclusion.

1. Introduction to the Research Questions and Data.

The Research Questions.

We are going to address:

  • Are people’s personality and patterns of daily behaviors (e.g., sleep hours and quality, perceived stress, physical activity) predictive of their daily positive affect?
    • What variables are best at distinguishing differences in daily positive affect?
    • Furthermore, are these selected variables stable when changing the parameters that constrain tree growth (e.g., maximum tree depth)?

Notice that no particular pattern of association is hypothesized between the predictors and the outcome; this is one advantage of regression trees - the sequence of splits allows for higher-order interactions and thus non-linear associations between the predictors and the outcome variable.

The Data.

The data are organized as follows:

  • There are N = 190 individuals, a majority of whom completed all eight days of the daily diary study. In total, we have 1,458 rows of data - an average of about 7.7 reports per individual.

The data set has additional columns we will not use, so we will subset to only the variables we need: an ID variable, a time variable, predictor variables, and the outcome variable. Specifically…

  • Columns:
    • Participant ID (id)
    • Time (e.g., day within the daily diary study; day)
    • Big Five Inventory: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism (bfi_o, bfi_c, bfi_e, bfi_a, bfi_n)
    • Sleep hours (slphrs)
    • Sleep quality (slpwell)
    • Physical activity (lteq)
    • Perceived stress (pss)
    • Positive affect (posaff; the outcome)

    • for a total of 12 columns in this data set

Preliminaries

Loading Libraries

Loading libraries used in this script.

library(psych)    # for describing the data
library(plyr)     # for data manipulation
library(ggplot2)  # for data visualization
library(party)    # for tree models
library(partykit) # for tree models
library(rattle)   # for plotting tree models (fancyRpartPlot)
library(REEMtree) # for random effects tree models

Loading Data

Load and merge person-level and day-level data.

# Set filepath for person-level data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBshare_persons_2019_0501.csv"
# Read in the .csv file using the url() function
AMIB_persons <- read.csv(file = url(filepath), header = TRUE)
#names(AMIB_persons)

# Set filepath for daily data file
filepath1 <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBshare_daily_2019_0501.csv" 
# Read in the .csv file using the url() function
AMIB_daily <- read.csv(file = url(filepath1), header = TRUE)
#names(AMIB_daily)

# Merge daily and person-level data
amib <- merge(AMIB_persons, AMIB_daily, by = "id")

Data Preparation

Reduce to the variables of interest.

amib <- amib[, c("id", "day", "bfi_e", "bfi_a", "bfi_c", "bfi_n", "bfi_o", "slphrs", "slpwell", "pss", "lteq", "posaff")]

#head(amib)

Reverse code pss so higher values indicate higher perceived stress.

#reverse coding the pss variable into a new stress variable
amib$stress <- 4 - amib$pss

Split the time-varying variables into “trait” (between-person differences) and “state” (within-person deviations) components. Specifically, each time-varying predictor - sleep hours, sleep quality, perceived stress, and physical activity - is split into two variables: VARIABLE_trait is the individual’s mean across days (the between-person component), and VARIABLE_state is the daily deviation from that mean (the person-centered within-person component). For convenience of plotting, we also split the outcome variable, posaff.

Calculate the trait variables.

# Calculate intraindividual means
amib_imeans <- ddply(amib, "id", summarize,
                     posaff_trait = mean(posaff, na.rm=TRUE),
                     slphrs_trait = mean(slphrs, na.rm=TRUE),
                     slpwell_trait = mean(slpwell, na.rm=TRUE),
                     stress_trait = mean(stress, na.rm=TRUE),
                     lteq_trait = mean(lteq, na.rm=TRUE))

# Merge into long data set
amib <- merge(amib, amib_imeans, by="id")   

Calculate state variables.

#calculate state variables
amib$posaff_state <- amib$posaff - amib$posaff_trait
amib$slphrs_state <- amib$slphrs - amib$slphrs_trait
amib$slpwell_state <- amib$slpwell - amib$slpwell_trait
amib$stress_state <- amib$stress - amib$stress_trait
amib$lteq_state <- amib$lteq - amib$lteq_trait

# Describe data
describe(amib)
##               vars    n   mean     sd median trimmed    mad    min    max
## id               1 1458 322.53 129.08 324.00  324.20 151.23 101.00 532.00
## day              2 1458   3.48   2.30   3.00    3.47   2.97   0.00   7.00
## bfi_e            3 1458   3.39   1.00   3.50    3.41   0.74   1.00   5.00
## bfi_a            4 1458   3.61   0.88   3.50    3.69   0.74   1.00   5.00
## bfi_c            5 1458   3.76   0.85   4.00    3.78   0.74   1.50   5.00
## bfi_n            6 1458   2.97   0.96   3.00    2.98   1.48   1.00   5.00
## bfi_o            7 1458   3.60   0.96   3.50    3.65   0.74   1.00   5.00
## slphrs           8 1428   7.17   1.81   7.00    7.18   1.48   0.00  18.00
## slpwell          9 1439   2.49   1.09   3.00    2.56   1.48   0.00   4.00
## pss             10 1445   2.61   0.68   2.75    2.64   0.74   0.00   4.00
## lteq            11 1433  12.44  10.37   9.00   11.18   8.90   0.00  58.00
## posaff          12 1441   4.12   1.10   4.20    4.15   1.19   1.00   7.00
## stress          13 1445   1.39   0.68   1.25    1.36   0.74   0.00   4.00
## posaff_trait    14 1458   4.11   0.73   4.10    4.11   0.63   2.23   6.04
## slphrs_trait    15 1458   7.16   0.94   7.20    7.20   0.91   4.12   9.31
## slpwell_trait   16 1458   2.49   0.65   2.50    2.50   0.56   0.38   4.00
## stress_trait    17 1458   1.39   0.47   1.41    1.39   0.51   0.19   2.56
## lteq_trait      18 1458  12.60   8.35  10.75   11.88   8.52   0.00  45.33
## posaff_state    19 1441   0.00   0.82   0.05    0.01   0.70  -3.17   3.29
## slphrs_state    20 1428   0.00   1.55   0.00   -0.01   1.30  -8.75   9.25
## slpwell_state   21 1439   0.00   0.88   0.12    0.03   0.74  -3.29   2.57
## stress_state    22 1445   0.00   0.49  -0.03   -0.02   0.46  -1.75   2.12
## lteq_state      23 1433   0.00   6.44  -0.25   -0.11   5.37 -23.75  28.88
##                range  skew kurtosis   se
## id            431.00 -0.07    -1.06 3.38
## day             7.00  0.01    -1.24 0.06
## bfi_e           4.00 -0.20    -0.58 0.03
## bfi_a           4.00 -0.69     0.10 0.02
## bfi_c           3.50 -0.11    -0.90 0.02
## bfi_n           4.00 -0.08    -0.79 0.03
## bfi_o           4.00 -0.36    -0.43 0.03
## slphrs         18.00  0.10     1.83 0.05
## slpwell         4.00 -0.52    -0.36 0.03
## pss             4.00 -0.35     0.13 0.02
## lteq           58.00  1.07     0.94 0.27
## posaff          6.00 -0.25    -0.33 0.03
## stress          4.00  0.35     0.13 0.02
## posaff_trait    3.81  0.05     0.09 0.02
## slphrs_trait    5.19 -0.40     0.28 0.02
## slpwell_trait   3.62 -0.31     0.55 0.02
## stress_trait    2.38 -0.08    -0.21 0.01
## lteq_trait     45.33  0.85     0.60 0.22
## posaff_state    6.46 -0.14     0.82 0.02
## slphrs_state   18.00  0.19     3.31 0.04
## slpwell_state   5.86 -0.31     0.34 0.02
## stress_state    3.88  0.36     0.79 0.01
## lteq_state     52.62  0.18     1.52 0.17

Note that we did not sample-center the person-level variables. Tree models generally work with the raw data. The state variables are by definition person-centered.

However, the estimation routine cannot handle missing data. Thus, we reduce to only the complete cases.

Reduce to complete cases.

amib <- amib[complete.cases(amib), ]
head(amib)
##    id day bfi_e bfi_a bfi_c bfi_n bfi_o slphrs slpwell  pss lteq posaff
## 1 101   0   3.5   1.5     4     2     4    6.0       1 2.50   10    3.9
## 2 101   1   3.5   1.5     4     2     4    2.0       1 2.75   10    3.8
## 3 101   2   3.5   1.5     4     2     4    9.0       3 3.50   10    5.1
## 4 101   3   3.5   1.5     4     2     4    7.5       3 3.00    9    5.6
## 5 101   4   3.5   1.5     4     2     4    8.0       4 2.75   18    4.3
## 6 101   5   3.5   1.5     4     2     4    8.0       3 2.75   19    3.9
##   stress posaff_trait slphrs_trait slpwell_trait stress_trait lteq_trait
## 1   1.50       4.5625       6.9375          2.75       1.0625     13.875
## 2   1.25       4.5625       6.9375          2.75       1.0625     13.875
## 3   0.50       4.5625       6.9375          2.75       1.0625     13.875
## 4   1.00       4.5625       6.9375          2.75       1.0625     13.875
## 5   1.25       4.5625       6.9375          2.75       1.0625     13.875
## 6   1.25       4.5625       6.9375          2.75       1.0625     13.875
##   posaff_state slphrs_state slpwell_state stress_state lteq_state
## 1      -0.6625      -0.9375         -1.75       0.4375     -3.875
## 2      -0.7625      -4.9375         -1.75       0.1875     -3.875
## 3       0.5375       2.0625          0.25      -0.5625     -3.875
## 4       1.0375       0.5625          0.25      -0.0625     -4.875
## 5      -0.2625       1.0625          1.25       0.1875      4.125
## 6      -0.6625       1.0625          0.25       0.1875      5.125
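
As a quick check, we can confirm how many observations and persons remain after listwise deletion (the model output below reports n = 1406):

# Count remaining person-day observations and persons after listwise deletion
nrow(amib)
length(unique(amib$id))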

Plotting the Data.

Before we begin running our models, it is always a good idea to look at our data.

Our random effects regression tree includes a number of predictors; as examples, we plot the outcome across days and a few select predictor-outcome relations. First, plotting the outcome variable, posaff.

#Intraindividual change plot
ggplot(data = amib, 
       aes(x = day, group = id, color = factor(id))) +
  geom_line(aes(x = day, y = posaff), lty = 1, size = 1, alpha = .5) +
  xlab("Day") + 
  ylab("Positive Affect (outcome)") + 
  scale_x_continuous(breaks=seq(0,7,by=1)) +
  theme_classic() +
  theme(legend.position = "none")

Plotting select predictor-outcome relations

#time-invariant predictor
ggplot(data = amib, #subset(amib, id <= 120), 
       aes(x = bfi_n, y = posaff_trait,
           group = id, color = factor(id))) +
  geom_point(alpha = .5) +
  geom_smooth(aes(group=1), method=lm, se=FALSE, fullrange=TRUE, lty=1, size=2, color="black") +
  xlab("Neuroticism") + 
  ylab("Positive Affect (outcome)") + 
  theme_classic() +
  theme(legend.position = "none")

#time-invariant predictor
ggplot(data = amib, #subset(amib, id <= 120), 
       aes(x = slphrs_trait, y = posaff_trait,
           group = id, color = factor(id))) +
  geom_point(alpha = .5) +
  geom_smooth(aes(group=1), method=lm, se=FALSE, fullrange=TRUE, lty=1, size=2, color="black") +
  xlab("Sleep Hours (Trait)") + 
  ylab("Positive Affect (outcome)") + 
  theme_classic() +
  theme(legend.position = "none")

#time-varying predictor
ggplot(data = amib, 
       aes(x = stress_state, y = posaff,
           group = id, color = factor(id))) +
  geom_point(color="gray40",alpha = .2) +
  geom_smooth(method=lm, se=FALSE, fullrange=FALSE, lty=1, size=.5, color="gray40") +
  #geom_smooth(aes(group=1), method=loess, se=FALSE, fullrange=FALSE, lty=1, size=2, color="blue") +
  xlab("Stress (State)") + 
  ylab("Positive Affect (outcome)") + 
  theme_classic() +
  theme(legend.position = "none")

#time-varying predictor
ggplot(data = amib, 
       aes(x = lteq_state, y = posaff,
           group = id, color = factor(id))) +
  geom_point(color="gray40",alpha = .2) +
  geom_smooth(method=lm, se=FALSE, fullrange=FALSE, lty=1, size=.5, color="gray40") +
  #geom_smooth(aes(group=1), method=loess, se=FALSE, fullrange=FALSE, lty=1, size=2, color="blue") +
  xlab("Physical Activity (State)") + 
  ylab("Positive Affect (outcome)") + 
  theme_classic() +
  theme(legend.position = "none")

There are many more predictors, both time-invariant and time-varying, that could be plotted. All will be used as predictors in the tree model.

2. Random Effects Regression Tree Model.

We’ll construct a random effects regression tree examining how people’s personality (as measured by the Big Five Inventory) and daily behaviors (sleep hours and quality, perceived stress, physical activity) predict their daily positive affect.

The general form used to specify the model is:

model_name <- REEMtree(outcome ~ predictor1 + predictor2 + … + predictorK,
                       data = data_set,
                       random = ~ 1 | grouping_id)  # random slopes: ~ 1 + predictor | grouping_id

Now, let’s run REEMtree() on the AMIB data and save the results to model_tree1.

#setting seed for replication
set.seed(1234)

#fitting tree model
model_tree1 <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o + 
                              slphrs_trait +  slpwell_trait + stress_trait + lteq_trait +
                              slphrs_state +  slpwell_state + stress_state + lteq_state, 
                       data = amib, 
                       random = ~ 1 | id)

# Check if output object is a REEMtree (if it isn't, that means something did not run properly)
is.REEMtree(model_tree1)
## [1] TRUE
# Print statistical output
print(model_tree1)
## [1] "*** RE-EM Tree ***"
## n= 1406 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1406 994.13080 4.098919  
##    2) stress_state>=0.3169643 322 221.04880 3.389975  
##      4) stress_state>=0.9114583 56  28.24047 2.606309 *
##      5) stress_state< 0.9114583 266 150.99770 3.555312  
##       10) day>=3.5 137  74.19569 3.293816 *
##       11) day< 3.5 129  57.40909 3.838430 *
##    3) stress_state< 0.3169643 1084 563.17120 4.309509  
##      6) stress_state>=-0.4508929 861 378.13540 4.183755  
##       12) day>=2.5 547 216.51340 4.040918 *
##       13) day< 2.5 314 131.93690 4.422332 *
##      7) stress_state< -0.4508929 223 118.84920 4.798873 *
## [1] "Estimated covariance matrix of random effects:"
##             (Intercept)
## (Intercept)   0.4603121
## [1] "Estimated variance of errors: 0.487691800778259"
## [1] "Log likelihood:  -1693.51847760945"
# Print fit statistics (AIC and BIC) for the model
AIC(model_tree1)
## [1] 3403.037
BIC(model_tree1)
## [1] 3444.991

The model is much easier to interpret in visual format.

Plotting tree model

# Plot the rpart tree extracted from the model object using tree()
fancyRpartPlot(tree(model_tree1),
               digits = 2,
               sub="",
               main="")

Based upon the above plot, it appears that daily perceived stress (stress_state) and day (which may proxy for day of the week) are predictive of daily positive affect. The values in the terminal nodes represent the average positive affect of the observations in that node.

Also, note this tree has a depth of 3 (the root node is counted as depth zero) and 6 terminal nodes.
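
Because tree() returns the underlying rpart object, the fitted tree can also be inspected programmatically. As a hedged sketch (the output depends on the fitted tree), rpart stores improvement-based variable importance scores:

# Extract the underlying rpart object from the RE-EM tree
rpart_tree1 <- tree(model_tree1)

# Variable importance scores accumulated by rpart (present when the tree has splits)
rpart_tree1$variable.importance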

We can also run the same tree with explicit specifications on the tree growth parameters, such as the complexity parameter, cp (the minimum improvement in overall model fit a split must provide to be retained). Let’s lower cp below its default of 0.01, which permits additional splits, and see how the resulting tree differs.

model_tree2 <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o + 
                              slphrs_trait +  slpwell_trait + stress_trait + lteq_trait +
                              slphrs_state +  slpwell_state + stress_state + lteq_state, 
                       data = amib, 
                       tree.control = rpart.control(cp = .008),  # below the default of .01
                       random = ~ 1 | id)  # an AR(1) error structure could be added via correlation = corAR1()

# Print statistical output
print(model_tree2)
## [1] "*** RE-EM Tree ***"
## n= 1406 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1406 994.13080 4.098919  
##    2) stress_state>=0.3169643 322 221.04880 3.389975  
##      4) stress_state>=0.9114583 56  28.24047 2.606092 *
##      5) stress_state< 0.9114583 266 150.99770 3.555312  
##       10) day>=3.5 137  74.19569 3.294439 *
##       11) day< 3.5 129  57.40909 3.834876 *
##    3) stress_state< 0.3169643 1084 563.17120 4.309509  
##      6) stress_state>=-0.4508929 861 378.13540 4.183755  
##       12) day>=2.5 547 216.51340 4.043073  
##         24) slpwell_trait< 3.1875 481 186.68680 3.998788 *
##         25) slpwell_trait>=3.1875 66  13.88846 4.351871 *
##       13) day< 2.5 314 131.93690 4.426232 *
##      7) stress_state< -0.4508929 223 118.84920 4.797659 *
## [1] "Estimated covariance matrix of random effects:"
##             (Intercept)
## (Intercept)   0.4368402
## [1] "Estimated variance of errors: 0.487992231838285"
## [1] "Log likelihood:  -1690.34928771704"
# Print fit statistics (AIC and BIC) for the model
AIC(model_tree2)
## [1] 3398.699
BIC(model_tree2)
## [1] 3445.89
# Plot tree
fancyRpartPlot(tree(model_tree2),
               digits = 2,
               sub="",
               main="")

Also, note this tree has a depth of 4 (the root node is counted as depth zero) and 7 terminal nodes.
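
Other tree-growth constraints can be passed through rpart.control() in the same way. As an illustrative sketch (the maxdepth and minbucket values here are arbitrary choices for demonstration, not recommendations, and this model is not examined further):

# Illustrative only: constrain maximum depth and minimum terminal-node size
model_tree3 <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o + 
                              slphrs_trait +  slpwell_trait + stress_trait + lteq_trait +
                              slphrs_state +  slpwell_state + stress_state + lteq_state, 
                       data = amib, 
                       tree.control = rpart.control(maxdepth = 3,    # at most 3 levels of splits
                                                    minbucket = 50), # at least 50 observations per terminal node
                       random = ~ 1 | id)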

3. Model Evaluation.

As we can see in the example above, the results of the regression tree can change depending on the constraints put on the tree. The available data may also affect the structure of the tree. Creating training and test sets is one way of examining how well the model - and the underlying “processes” it captures - may generalize to other data. The hope is that the structure learned from the training data results in comparable error/fit when applied to the test data.

Given that we are working with repeated measures nested within persons, we will separate persons into training and test sets (rather than separating on occasions).

Selecting a random sub-sample of participants for the training data.

# Select random sample of 80% (i.e., 152 of 190 people) of AMIB data
set.seed(1234)
#making id list
ids_train <- sample(unique(amib$id), size=152)
#making training data set
train_amib <- amib[amib$id %in% ids_train, ]

Estimate REEMtree model using ONLY the train_amib data set.

#estimating tree model
set.seed(1267)
model_tree1cv <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o + 
                              slphrs_trait +  slpwell_trait + stress_trait + lteq_trait +
                              slphrs_state +  slpwell_state + stress_state + lteq_state,
                          data = train_amib, 
                          tree.control = rpart.control(cp = 0.01),
                          random = ~ 1 | id)

Make predictions for the entire data set using the model that was trained on the training subset of the data (we place the predictions directly into the original data frame).

amib$posaff_predict1 <- predict(model_tree1cv, amib, EstimateRandomEffects = FALSE)

We then calculate the squared residuals for all observations - both the test (out-of-sample) and train (the data used to fit the model) sets.

# Calculate the squared residual by subtracting the actual value from the predicted value and then squaring the difference
amib$posaff_sqresidual1 <- (amib$posaff_predict1 - amib$posaff) ^2

Separate the squared residuals for the “outside” (test) and “inside” (train) subsets of the data.

# Subset the in-sample (i.e., training set) residuals using the same filter (ids_train)
sqresidual_inside <- amib[amib$id %in% ids_train, c("posaff_sqresidual1")] 

# Subset the out-of-sample (i.e., test set) residuals using the negation (!) of that filter
sqresidual_outside <- amib[!c(amib$id %in% ids_train), c("posaff_sqresidual1")]

Finally, compute and compare the prediction error for the training and test sets. Note that averaging the square roots of the squared residuals, as done below, yields a mean absolute residual; a root mean squared error would instead be sqrt(mean(sqresidual)).

# Calculate the mean absolute residual for each subset
maeTree_inside <- mean(sqresidual_inside^.5)
maeTree_outside <- mean(sqresidual_outside^.5)

# Print values to console
maeTree_inside
## [1] 0.6381801
maeTree_outside
## [1] 0.818699

This gives us some information for evaluating how well the tree may generalize to new data. As expected, the out-of-sample error (0.82) is somewhat larger than the in-sample error (0.64).
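
A single train/test split depends on which persons happen to be sampled. A more thorough evaluation could repeat the procedure across several person-level folds. Below is a hypothetical sketch of such a k-fold cross-validation (the number of folds, the seed, and the RMSE summary are illustrative choices, not part of the analysis above):

# Hypothetical person-level k-fold cross-validation (not run in this tutorial)
set.seed(4321)
ids <- unique(amib$id)
k <- 5
fold <- sample(rep(1:k, length.out = length(ids)))  # assign each person to a fold

cv_rmse <- numeric(k)
for (i in 1:k) {
  train <- amib[amib$id %in% ids[fold != i], ]
  test  <- amib[amib$id %in% ids[fold == i], ]
  fit <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o + 
                        slphrs_trait + slpwell_trait + stress_trait + lteq_trait +
                        slphrs_state + slpwell_state + stress_state + lteq_state, 
                  data = train, 
                  random = ~ 1 | id)
  pred <- predict(fit, test, EstimateRandomEffects = FALSE)
  cv_rmse[i] <- sqrt(mean((pred - test$posaff)^2))  # out-of-sample RMSE for fold i
}
mean(cv_rmse)  # average out-of-sample error across folds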

4. Cautions and Considerations.

While we only provide a brief description of the random effects regression tree, we want to highlight a few considerations when using this model.

  1. The model demonstrated here allowed for random intercepts between persons - i.e., differences in overall level of positive affect. In principle, other random effects (e.g., random slopes) can be included (see the first sketch after this list).
  2. The model demonstrated here did not allow for an autoregressive error structure - i.e., correlations between error terms from day to day within persons were ignored. However, autocorrelation can be accommodated within the REEMtree function (see the second sketch after this list).
  3. The model demonstrated here evaluated fit using a single training set and a single test set. Cross-validation - iteratively subsampling the data many times - may be useful (see the k-fold sketch at the end of Section 3).
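
For points 1 and 2, hedged sketches of how these extensions might be specified are below. These follow the nlme conventions that REEMtree passes through to lme; the reduced predictor list is only to keep the display short, and neither model is fit in this tutorial.

# (1) Random intercept plus a random slope for day (predictor list shortened for display)
model_tree_slopes <- REEMtree(posaff ~ day + stress_state + slpwell_state, 
                              data = amib, 
                              random = ~ 1 + day | id)

# (2) First-order autoregressive (AR1) errors within person, via nlme's corAR1()
model_tree_ar1 <- REEMtree(posaff ~ day + stress_state + slpwell_state, 
                           data = amib, 
                           random = ~ 1 | id,
                           correlation = corAR1())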

5. Conclusion.

In sum, this tutorial outlines how to run a random effects regression tree using the REEMtree package. We provide a brief explanation of (1) the underlying model, (2) the code to run the model in R, and (3) the interpretation of the results. More detailed information about these (and related) analyses can be found in Sela and Simonoff (2012).