Repeated measures data, often obtained from experience sampling or daily diary studies, require analytic methods that accommodate the inherent nesting in the data. Multilevel and structural equation modeling approaches are typically used to analyze repeated measures data. However, data mining and machine learning methods, particularly decision trees, are now being modified to accommodate repeated measures data (e.g., SEM trees, Brandmaier et al., 2013; mixed-effects regression trees, Hajjem et al., 2011, 2014; nonlinear longitudinal recursive partitioning, Stegmann et al., 2018).
In this tutorial, we illustrate how random effects regression trees (Random Effects/EM (RE-EM) trees, Sela & Simonoff, 2012) can be applied to EMA-type data for knowledge discovery.
Random effects regression trees model a continuous outcome variable (e.g., negative affect measured on a Likert or slider scale) and can accommodate a sizable number of predictor variables; these predictors can be categorical, interval, or continuous.
Ultimately, the model aims to sort each instance (or data point) into a node, with the instances within a node being as similar as possible. Note that nodes are composed of observations, not individuals. To determine the composition of each node, random effects regression trees use a recursive algorithm that minimizes the within-node sum of squares.
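To make the split criterion concrete, here is a minimal sketch in R on toy data (an illustration only, not the REEMtree internals): each candidate cut point on a predictor is scored by the combined within-node sum of squares of the outcome, and the cut with the smallest total wins.

```r
# Toy illustration of the split criterion used by regression trees:
# choose the cut point on x that minimizes the combined within-node
# sum of squares of the outcome y. (Sketch only; REEMtree additionally
# iterates between tree fitting and random-effects estimation.)
y <- c(1, 2, 2, 8, 9, 10)
x <- c(1, 2, 3, 4, 5, 6)

node_ss <- function(v) sum((v - mean(v))^2)

# Candidate cuts: midpoints between adjacent x values
cuts <- (head(sort(x), -1) + tail(sort(x), -1)) / 2
total_ss <- sapply(cuts, function(cut) node_ss(y[x < cut]) + node_ss(y[x >= cut]))
best_cut <- cuts[which.min(total_ss)]
best_cut  # 3.5 -- separates the low-y cluster from the high-y cluster
```

In a fitted tree this search is repeated within each resulting node, over all predictors, until a stopping rule is met.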
In our example, we will use data from an experience sampling study that examined daily interactions, emotions, and behaviors. Specifically, we will use the AMIB data set, which followed 190 adults for eight days; thus, repeated measures across days are nested within individuals.
In this tutorial, we address how people's personality (as measured by the Big 5 Inventory) and daily behaviors (sleep hours and quality, perceived stress, physical activity) relate to their daily positive affect.
Notice that no pattern of association is hypothesized between the predictors and the outcome; this is one advantage of regression trees: the successive splits allow for higher-order interactions and thus non-linear associations between the predictors and the outcome variable.
The data are organized as follows:
The data set has additional columns we will not use, so we subset to only the variables we need: an ID variable, a time variable, the predictor variables, and the outcome variable. Specifically:

- ID (id)
- Day of study (day)
- Big Five personality scores (bfi_e, bfi_a, bfi_c, bfi_n, bfi_o)
- Sleep hours (slphrs)
- Sleep quality (slpwell)
- Perceived stress (pss)
- Physical activity (lteq)
- Positive affect (posaff)

for a total of 12 columns in this data set.
Loading Libraries
Loading libraries used in this script.
library(psych)    # for describing the data
library(plyr)     # for data manipulation
library(ggplot2)  # for data visualization
library(party)    # for tree models
library(partykit) # for tree models
library(rattle)   # for tree models
library(REEMtree) # for random effects tree models
Loading Data
Load and merge person-level and day-level data.
# Set filepath for person-level data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBshare_persons_2019_0501.csv"
# Read in the .csv file using the url() function
AMIB_persons <- read.csv(file = url(filepath), header = TRUE)
#names(AMIB_persons)
# Set filepath for daily data file
filepath1 <- "https://quantdev.ssri.psu.edu/sites/qdev/files/AMIBshare_daily_2019_0501.csv"
# Read in the .csv file using the url() function
AMIB_daily <- read.csv(file = url(filepath1), header = TRUE)
#names(AMIB_daily)
# Merge daily and person-level data
amib <- merge(AMIB_persons, AMIB_daily, by = "id")
Data Preparation
Reduce to the variables of interest.
amib <- amib[, c("id", "day", "bfi_e", "bfi_a", "bfi_c", "bfi_n", "bfi_o", "slphrs", "slpwell", "pss", "lteq", "posaff")]
#head(amib)
Reverse code pss so that higher values indicate higher perceived stress.
#reverse coding the pss variable into a new stress variable
amib$stress <- 4 - amib$pss
Split the variables into “trait” (between-person differences) and “state” (within-person deviations) components. Specifically, each of the time-varying predictors (sleep hours, sleep quality, perceived stress, physical activity) is split into two variables: VARIABLE_trait is the individual's mean across days (the between-person component), and VARIABLE_state is the daily deviation from that mean (the person-centered within-person component). For convenience of plotting, we also split the outcome variable, posaff.
Calculate the trait variables
# Calculate intraindividual means
amib_imeans <- ddply(amib, "id", summarize,
posaff_trait = mean(posaff, na.rm=TRUE),
slphrs_trait = mean(slphrs, na.rm=TRUE),
slpwell_trait = mean(slpwell, na.rm=TRUE),
stress_trait = mean(stress, na.rm=TRUE),
lteq_trait = mean(lteq, na.rm=TRUE))
# Merge into long data set
amib <- merge(amib, amib_imeans, by="id")
Calculate state variables.
#calculate state variables
amib$posaff_state <- amib$posaff - amib$posaff_trait
amib$slphrs_state <- amib$slphrs - amib$slphrs_trait
amib$slpwell_state <- amib$slpwell - amib$slpwell_trait
amib$stress_state <- amib$stress - amib$stress_trait
amib$lteq_state <- amib$lteq - amib$lteq_trait
# Describe data
describe(amib)
## vars n mean sd median trimmed mad min max
## id 1 1458 322.53 129.08 324.00 324.20 151.23 101.00 532.00
## day 2 1458 3.48 2.30 3.00 3.47 2.97 0.00 7.00
## bfi_e 3 1458 3.39 1.00 3.50 3.41 0.74 1.00 5.00
## bfi_a 4 1458 3.61 0.88 3.50 3.69 0.74 1.00 5.00
## bfi_c 5 1458 3.76 0.85 4.00 3.78 0.74 1.50 5.00
## bfi_n 6 1458 2.97 0.96 3.00 2.98 1.48 1.00 5.00
## bfi_o 7 1458 3.60 0.96 3.50 3.65 0.74 1.00 5.00
## slphrs 8 1428 7.17 1.81 7.00 7.18 1.48 0.00 18.00
## slpwell 9 1439 2.49 1.09 3.00 2.56 1.48 0.00 4.00
## pss 10 1445 2.61 0.68 2.75 2.64 0.74 0.00 4.00
## lteq 11 1433 12.44 10.37 9.00 11.18 8.90 0.00 58.00
## posaff 12 1441 4.12 1.10 4.20 4.15 1.19 1.00 7.00
## stress 13 1445 1.39 0.68 1.25 1.36 0.74 0.00 4.00
## posaff_trait 14 1458 4.11 0.73 4.10 4.11 0.63 2.23 6.04
## slphrs_trait 15 1458 7.16 0.94 7.20 7.20 0.91 4.12 9.31
## slpwell_trait 16 1458 2.49 0.65 2.50 2.50 0.56 0.38 4.00
## stress_trait 17 1458 1.39 0.47 1.41 1.39 0.51 0.19 2.56
## lteq_trait 18 1458 12.60 8.35 10.75 11.88 8.52 0.00 45.33
## posaff_state 19 1441 0.00 0.82 0.05 0.01 0.70 -3.17 3.29
## slphrs_state 20 1428 0.00 1.55 0.00 -0.01 1.30 -8.75 9.25
## slpwell_state 21 1439 0.00 0.88 0.12 0.03 0.74 -3.29 2.57
## stress_state 22 1445 0.00 0.49 -0.03 -0.02 0.46 -1.75 2.12
## lteq_state 23 1433 0.00 6.44 -0.25 -0.11 5.37 -23.75 28.88
## range skew kurtosis se
## id 431.00 -0.07 -1.06 3.38
## day 7.00 0.01 -1.24 0.06
## bfi_e 4.00 -0.20 -0.58 0.03
## bfi_a 4.00 -0.69 0.10 0.02
## bfi_c 3.50 -0.11 -0.90 0.02
## bfi_n 4.00 -0.08 -0.79 0.03
## bfi_o 4.00 -0.36 -0.43 0.03
## slphrs 18.00 0.10 1.83 0.05
## slpwell 4.00 -0.52 -0.36 0.03
## pss 4.00 -0.35 0.13 0.02
## lteq 58.00 1.07 0.94 0.27
## posaff 6.00 -0.25 -0.33 0.03
## stress 4.00 0.35 0.13 0.02
## posaff_trait 3.81 0.05 0.09 0.02
## slphrs_trait 5.19 -0.40 0.28 0.02
## slpwell_trait 3.62 -0.31 0.55 0.02
## stress_trait 2.38 -0.08 -0.21 0.01
## lteq_trait 45.33 0.85 0.60 0.22
## posaff_state 6.46 -0.14 0.82 0.02
## slphrs_state 18.00 0.19 3.31 0.04
## slpwell_state 5.86 -0.31 0.34 0.02
## stress_state 3.88 0.36 0.79 0.01
## lteq_state 52.62 0.18 1.52 0.17
Note that we did not sample-center the person-level variables. Tree models generally work with the raw data. The state variables are by definition person-centered.
However, the data cannot contain missing values, so we retain only the complete cases.
Reduce to complete cases.
amib <- amib[complete.cases(amib), ]
head(amib)
## id day bfi_e bfi_a bfi_c bfi_n bfi_o slphrs slpwell pss lteq posaff
## 1 101 0 3.5 1.5 4 2 4 6.0 1 2.50 10 3.9
## 2 101 1 3.5 1.5 4 2 4 2.0 1 2.75 10 3.8
## 3 101 2 3.5 1.5 4 2 4 9.0 3 3.50 10 5.1
## 4 101 3 3.5 1.5 4 2 4 7.5 3 3.00 9 5.6
## 5 101 4 3.5 1.5 4 2 4 8.0 4 2.75 18 4.3
## 6 101 5 3.5 1.5 4 2 4 8.0 3 2.75 19 3.9
## stress posaff_trait slphrs_trait slpwell_trait stress_trait lteq_trait
## 1 1.50 4.5625 6.9375 2.75 1.0625 13.875
## 2 1.25 4.5625 6.9375 2.75 1.0625 13.875
## 3 0.50 4.5625 6.9375 2.75 1.0625 13.875
## 4 1.00 4.5625 6.9375 2.75 1.0625 13.875
## 5 1.25 4.5625 6.9375 2.75 1.0625 13.875
## 6 1.25 4.5625 6.9375 2.75 1.0625 13.875
## posaff_state slphrs_state slpwell_state stress_state lteq_state
## 1 -0.6625 -0.9375 -1.75 0.4375 -3.875
## 2 -0.7625 -4.9375 -1.75 0.1875 -3.875
## 3 0.5375 2.0625 0.25 -0.5625 -3.875
## 4 1.0375 0.5625 0.25 -0.0625 -4.875
## 5 -0.2625 1.0625 1.25 0.1875 4.125
## 6 -0.6625 1.0625 0.25 0.1875 5.125
Before we begin running our models, it is always a good idea to look at our data. Our random effects regression tree has a number of predictors; as examples, let's examine how a few of them relate to positive affect. First, we plot the outcome variable, posaff.
#Intraindividual change plot
ggplot(data = amib,
aes(x = day, group = id, color = factor(id), legend = FALSE)) +
geom_line(aes(x = day, y = posaff), lty = 1, size = 1, alpha = .5) +
xlab("Day") +
ylab("Positive Affect (outcome)") +
scale_x_continuous(breaks=seq(0,7,by=1)) +
theme_classic() +
theme(legend.position = "none")
Plotting select predictor-outcome relations
#time-invariant predictor
ggplot(data = amib, #subset(amib, id <= 120),
aes(x = bfi_n, y = posaff_trait,
group = id, color = factor(id), legend = FALSE)) +
geom_point(alpha = .5) +
geom_smooth(aes(group=1), method=lm, se=FALSE, fullrange=TRUE, lty=1, size=2, color="black") +
xlab("Neuroticism") +
ylab("Positive Affect (outcome)") +
theme_classic() +
theme(legend.position = "none")
#time-invariant predictor
ggplot(data = amib, #subset(amib, id <= 120),
aes(x = slphrs_trait, y = posaff_trait,
group = id, color = factor(id), legend = FALSE)) +
geom_point(alpha = .5) +
geom_smooth(aes(group=1), method=lm, se=FALSE, fullrange=TRUE, lty=1, size=2, color="black") +
xlab("Sleep Hours (Trait)") +
ylab("Positive Affect (outcome)") +
theme_classic() +
theme(legend.position = "none")
#time-varying predictor
ggplot(data = amib,
aes(x = stress_state, y = posaff,
group = id, color = factor(id), legend = FALSE)) +
geom_point(color="gray40",alpha = .2) +
geom_smooth(method=lm, se=FALSE, fullrange=FALSE, lty=1, size=.5, color="gray40") +
#geom_smooth(aes(group=1), method=loess, se=FALSE, fullrange=FALSE, lty=1, size=2, color="blue") +
xlab("Stress (State)") +
ylab("Positive Affect (outcome)") +
theme_classic() +
theme(legend.position = "none")
#time-varying predictor
ggplot(data = amib,
aes(x = lteq_state, y = posaff,
group = id, color = factor(id), legend = FALSE)) +
geom_point(color="gray40",alpha = .2) +
geom_smooth(method=lm, se=FALSE, fullrange=FALSE, lty=1, size=.5, color="gray40") +
#geom_smooth(aes(group=1), method=loess, se=FALSE, fullrange=FALSE, lty=1, size=2, color="blue") +
xlab("Physical Activity (State)") +
ylab("Positive Affect (outcome)") +
theme_classic() +
theme(legend.position = "none")
There are many more predictors, both time-invariant and time-varying, that could be plotted. All will be used as predictors in the tree model.
We’ll construct a random effects regression tree examining how people’s personality (as measured by the Big 5 Inventory) and daily behaviors (sleep hours and quality, perceived stress, physical activity) predict their daily positive affect.
The general form used to specify the model is REEMtree(formula, data, random = ~1 | id). A few optional arguments are worth knowing:

- the tree.control argument passes rpart.control() settings that govern tree growth (e.g., the complexity parameter cp);
- the cv argument controls whether cross-validation is used when growing the tree;
- the correlation argument allows for autocorrelated within-person errors (e.g., corAR1()).

Now, let's run REEMtree on the AMIB data and save the results to model_tree1.
#setting seed for replication
set.seed(1234)
#fitting tree model
model_tree1 <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o +
slphrs_trait + slpwell_trait + stress_trait + lteq_trait +
slphrs_state + slpwell_state + stress_state + lteq_state,
data = amib,
random = ~ 1 | id)
# Check if output object is a REEMtree (if it isn't, that means something did not run properly)
is.REEMtree(model_tree1)
## [1] TRUE
# Print statistical output
print(model_tree1)
## [1] "*** RE-EM Tree ***"
## n= 1406
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1406 994.13080 4.098919
## 2) stress_state>=0.3169643 322 221.04880 3.389975
## 4) stress_state>=0.9114583 56 28.24047 2.606309 *
## 5) stress_state< 0.9114583 266 150.99770 3.555312
## 10) day>=3.5 137 74.19569 3.293816 *
## 11) day< 3.5 129 57.40909 3.838430 *
## 3) stress_state< 0.3169643 1084 563.17120 4.309509
## 6) stress_state>=-0.4508929 861 378.13540 4.183755
## 12) day>=2.5 547 216.51340 4.040918 *
## 13) day< 2.5 314 131.93690 4.422332 *
## 7) stress_state< -0.4508929 223 118.84920 4.798873 *
## [1] "Estimated covariance matrix of random effects:"
## (Intercept)
## (Intercept) 0.4603121
## [1] "Estimated variance of errors: 0.487691800778259"
## [1] "Log likelihood: -1693.51847760945"
# Print fit statistics (AIC and BIC) for the model
AIC(model_tree1)
## [1] 3403.037
BIC(model_tree1)
## [1] 3444.991
The model is much easier to interpret in visual format.
Plotting tree model
# Plot the model object after converting it into a tree using tree()
fancyRpartPlot(tree(model_tree1),
digits = 2,
sub="",
main="")
Based upon the above plot, it appears that daily perceived stress (stress_state) and day (a proxy for day of the week) are predictive of daily positive affect. The values in the terminal nodes represent the average positive affect for the observations in that node.
Also note that this tree has a depth of 3 (the root node is counted as zero) and 6 terminal nodes.
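A useful way to think about the fitted model: a prediction for an observation is the mean of the terminal node it falls into, plus (when random effects are used) the person's estimated random intercept. A toy sketch, using the node 7 mean from the printed output above and a hypothetical random intercept (not an estimate from the model):

```r
# Sketch of how an RE-EM tree prediction is composed.
node7_mean <- 4.798873   # yval for terminal node 7 (stress_state < -0.45), from the printed tree
b_i <- -0.25             # hypothetical random intercept for one person
y_hat <- node7_mean + b_i
y_hat
```

This is why two people with identical daily predictors can still receive different predictions: the tree part is shared, but the intercept is person-specific.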
We can also run the same tree with different specifications of the tree growth parameters, such as the complexity parameter (i.e., the threshold for whether a candidate split sufficiently improves the overall fit of the model to be retained). Let's try a modification and see how the resulting tree differs.
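To see what the complexity parameter does, we can apply rpart's rule to the root split of the tree printed above: a split is retained only if it reduces the tree's relative error by at least cp. (REEMtree refits the tree within an EM loop and can also prune by cross-validation, so the mapping from cp to the final tree is not always this direct.)

```r
# rpart's cp rule, applied to the root split of the tree printed above:
# keep a split only if it reduces relative error by at least cp.
root_dev  <- 994.1308             # deviance at the root node
child_dev <- 221.0488 + 563.1712  # combined deviance of the two child nodes
improvement <- (root_dev - child_dev) / root_dev
improvement  # about 0.21, so this split easily survives the default cp of 0.01
```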
model_tree2 <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o +
slphrs_trait + slpwell_trait + stress_trait + lteq_trait +
slphrs_state + slpwell_state + stress_state + lteq_state,
data = amib,
tree.control = rpart.control(cp = .8),
random = ~ 1 | id)   # could also add correlation = corAR1()
# Print statistical output
print(model_tree2)
## [1] "*** RE-EM Tree ***"
## n= 1406
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1406 994.13080 4.098919
## 2) stress_state>=0.3169643 322 221.04880 3.389975
## 4) stress_state>=0.9114583 56 28.24047 2.606092 *
## 5) stress_state< 0.9114583 266 150.99770 3.555312
## 10) day>=3.5 137 74.19569 3.294439 *
## 11) day< 3.5 129 57.40909 3.834876 *
## 3) stress_state< 0.3169643 1084 563.17120 4.309509
## 6) stress_state>=-0.4508929 861 378.13540 4.183755
## 12) day>=2.5 547 216.51340 4.043073
## 24) slpwell_trait< 3.1875 481 186.68680 3.998788 *
## 25) slpwell_trait>=3.1875 66 13.88846 4.351871 *
## 13) day< 2.5 314 131.93690 4.426232 *
## 7) stress_state< -0.4508929 223 118.84920 4.797659 *
## [1] "Estimated covariance matrix of random effects:"
## (Intercept)
## (Intercept) 0.4368402
## [1] "Estimated variance of errors: 0.487992231838285"
## [1] "Log likelihood: -1690.34928771704"
# Print fit statistics (AIC and BIC) for the model
AIC(model_tree2)
## [1] 3398.699
BIC(model_tree2)
## [1] 3445.89
# Plot tree
fancyRpartPlot(tree(model_tree2),
digits = 2,
sub="",
main="")
Also note that this tree has a depth of 4 (the root node is counted as zero) and 7 terminal nodes.
As we can see in the example above, the results of the regression tree can change depending on the constraints placed on tree growth. The available data may also affect the structure of the tree. Creating training and test sets is one way of examining how well the model, and the underlying “processes” it captures, may generalize to other data. The hope is that the structure learned from the training data yields comparable error/fit when applied to the test data.
Given that we are working with repeated measures nested within persons, we separate persons into training and test sets (rather than separating on occasions).
Selecting a random sub-sample of participants for the training data.
# Select a random sample of participants (here, 20 of the 190 people) from the AMIB data
set.seed(1234)
#making id list
ids_train <- sample(unique(amib$id), size=20)
#making training data set
train_amib <- amib[amib$id %in% ids_train, ]
Estimate the REEMtree model using ONLY the train_amib data set.
#estimating tree model
set.seed(1267)
model_tree1cv <- REEMtree(posaff ~ day + bfi_a + bfi_c + bfi_n + bfi_o +
slphrs_trait + slpwell_trait + stress_trait + lteq_trait +
slphrs_state + slpwell_state + stress_state + lteq_state,
data = train_amib,
tree.control = rpart.control(cp = 0.01),
random = ~ 1 | id)
Make predictions for the entire data set using the model that was trained on the training subset of the data (we place the predictions directly into the original data frame).
amib$posaff_predict1 <- predict(model_tree1cv, amib, EstimateRandomEffects = FALSE)
We then calculate the squared residuals for both the test (out-of-sample) and train (in-sample) sets (i.e., for all the data).
# Calculate the squared residual by subtracting the actual value from the predicted value and then squaring the difference
amib$posaff_sqresidual1 <- (amib$posaff_predict1 - amib$posaff) ^2
Separate the squared residuals for the “outside” (test) and “inside” (train) subsets of the data.
# Subset the in-sample (i.e., training set) residuals using the same filter (ids_train)
sqresidual_inside <- amib[amib$id %in% ids_train, c("posaff_sqresidual1") ]
# Subset the out-of-sample (i.e., test set) residuals as the not-filter (!)
sqresidual_outside <- amib[!c(amib$id %in% ids_train), c("posaff_sqresidual1")]
Finally, evaluate and compare the average prediction errors for the training and test sets.
# Average the root squared residuals for each subset
rmseTree_inside <- mean(sqresidual_inside^.5)
rmseTree_outside <- mean(sqresidual_outside^.5)
# Print values to console
rmseTree_inside
## [1] 0.6381801
rmseTree_outside
## [1] 0.818699
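One caution on naming: mean(sqresidual_inside^.5) averages the absolute residuals (a mean absolute error), even though the objects are named rmseTree_. A conventional root mean squared error instead takes the square root after averaging, and the two generally differ:

```r
# Mean of root squared residuals vs. conventional RMSE, on toy residuals
sq <- c(-2, 0.5, 1, -0.5)^2   # squared residuals
mae  <- mean(sq^.5)           # mean absolute error: 1.0
rmse <- sqrt(mean(sq))        # root mean squared error: about 1.17
c(mae = mae, rmse = rmse)
```

Either summary is fine for comparing training and test sets, as long as the same one is used on both sides.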
This gives us some information to use in evaluation of how well the tree may generalize to other new data.
While we have provided only a brief description of the random effects regression tree, we want to highlight that many additional options are available through the REEMtree function. In sum, this tutorial outlines how to run a random effects regression tree using the REEMtree package. We provide brief explanations of (1) the underlying model, (2) the code to run this model in R, and (3) the interpretation of the results. More detailed information about these (and related) analyses can be found in Sela and Simonoff (2012).