This tutorial is Part 5 of 5 showing how to do survival analysis with observational data (video recordings of participant behavior), using a study of children’s emotion regulation as an example (Cole et al., 2011). This collection of tutorials accompanies Lougheed, Benson, Cole, & Ram (in press).
In this tutorial, we will demonstrate how to conduct a recurring episode model with time-varying predictors. With this model, we examine within-child differences in children’s strategy use as time-varying predictors, to test if (a) children’s recurring anger expressions were more likely in the seconds they made bids compared to the seconds they did not, and (b) children’s recurring anger expressions were less likely in the seconds they distracted themselves compared to the seconds they did not.
Set working directory and read in data
setwd("~/Desktop/Data/Survival analysis")
wait36 <- read.csv("Recurring event_time-varying.csv",header=TRUE)
names(wait36)
## [1] "id" "second" "ang_event" "bids" "distraction"
## [6] "episode" "period" "start" "stop"
Load packages:
library(ggplot2)
library(survival)
library(eha)
library(plyr)
library(psych)
library(reshape)
Subset to relevant variables:
wait36d <- wait36[,c("id","second","ang_event","start", "stop", "bids", "distraction", "episode", "period")]
head(wait36d)
## id second ang_event start stop bids distraction episode period
## 1 1 1 0 0 1 0 0 1 1
## 2 1 2 0 1 2 0 0 1 2
## 3 1 3 0 2 3 0 0 1 3
## 4 1 4 0 3 4 0 0 1 4
## 5 1 5 0 4 5 0 0 1 5
## 6 1 6 0 5 6 0 0 1 6
Remove NAs from data:
wait36d <- na.omit(wait36d, cols= c("episode", "period", "start", "stop"))
It is important to first determine the kind of model that will be used to answer research questions, as other steps (e.g., data preparation) will depend on what type of model will be fit. Specifically, researchers need to decide whether models will:
This this model, we considered within-child differences in children’s strategies by examining if: (a) children’s recurring anger expressions were more likely in the seconds they made bids compared to the seconds they did not, and (b) children’s recurring anger expressions were less likely in the seconds they distracted themselves compared to the seconds they did not. The data were coded in 1-second intervals, which implies the use of a continuous time model. We will use a semi-parametric approach (Cox regression model), because we do not know the shape of the underlying distribution (which precludes the use of a parametric approach) and because we want to incorporate multiple predictors as time-varying predictors (which is for several reasons not ideal with a non-parametric approach).
Now, we need to examine several characteristics of the data before we start modeling. Specifically, we need to examine:
Calculate the number of anger episodes per child:
indiv.stats <- ddply(wait36d, "id", summarize, numepisodes = max(episode))
describe(indiv.stats$numepisodes)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 117 8.26 8.38 6 6.77 5.93 1 43 42 1.83 3.81 0.77
table(indiv.stats$numepisodes)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 15 16 17 18 19 20 23 24 25 33 34 41 43
## 21 12 10 12 3 7 2 10 2 3 5 7 6 2 2 1 1 2 1 2 2 1 1 1 1
Calculate the number of seconds of bids and focused distraction:
indiv.stats1 <- ddply(wait36d,"id",summarize,
bid_freq = sum(bids ==1, na.rm=TRUE),
distraction_freq = sum(distraction == 1, na.rm=TRUE),
anger_freq = sum(ang_event ==1, na.rm=TRUE))
describe(indiv.stats1[,-1])
## vars n mean sd median trimmed mad min max range
## bid_freq 1 117 34.74 24.43 32 32.25 19.27 0 157 157
## distraction_freq 2 117 101.75 84.41 84 93.57 90.44 0 321 321
## anger_freq 3 117 7.38 8.39 5 5.93 5.93 0 42 42
## skew kurtosis se
## bid_freq 1.49 4.26 2.26
## distraction_freq 0.74 -0.33 7.80
## anger_freq 1.87 4.09 0.78
Order by survival time and create an order variable:
indiv.stats3 <- ddply(wait36d,c("id","episode"),summarize,time2event = max(period))
indiv.stats3 <- indiv.stats3[order(indiv.stats3$episode, -indiv.stats3$time2event),]
indiv.stats3$order <- c(1:length(indiv.stats3$id))
indiv.stats3 <- na.omit(indiv.stats3)
Create ordered survival time plot:
#This finds the actual median episode time for each event (use these number for the vertical lines on the plot)
medianepi <- ddply(indiv.stats3,"episode",summarize, medep = median(time2event))
# Use this code for coloring the plot and figuring out where to put the episode numbers
medianepi <- ddply(indiv.stats3,"episode",summarize, medep = median(order))
medianepi <- na.omit(medianepi)
ggplot(data=indiv.stats3, aes(x=time2event, y=order)) +
geom_rect(mapping = aes(xmin=0,xmax=indiv.stats3$time2event,ymin=indiv.stats3$order,
ymax=indiv.stats3$order+1,fill=factor(episode))) +
geom_rect(xmin=56,xmax=56,ymin=1,ymax=117,colour="black",linetype="dashed") +
geom_rect(xmin=22,xmax=22,ymin=118,ymax=213,colour="black",linetype="dashed") +
geom_rect(xmin=17,xmax=17,ymin=214,ymax=297,colour="black",linetype="dashed") +
geom_rect(xmin=17.5,xmax=17.5,ymin=298,ymax=371,colour="black",linetype="dashed") +
geom_rect(xmin=19.5,xmax=19.5,ymin=372,ymax=433,colour="black",linetype="dashed") +
scale_y_continuous(breaks=medianepi$medep[c(1,5,10,15,20,25,43)], labels = c(1,5,10,15,20,25,43)) +
xlab("Time in Task (seconds)") + ylab("Episode") +
ggtitle("Survival Time by Episode") +
theme_classic() +
theme(panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
panel.background=element_blank(),
legend.position="none",
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"))
We will now run our Cox regression models. First, we will fit a baseline model (intercept only; no predictors included) to examine the survival and cumulative hazard functions (the steps for plotting the functions can be found in the code for Model 3). Next, we will include child behaviors as time-varying predictors.
The equation for Model 4 is specified as,
\(h_{ij}(t) = h_{0}(t) exp(v_{i}) exp (\beta_{1}Bids_{ij}(t) + \beta_{2}Distract_{ij}(t) )\)
where \(v_{i}\) is the random effect for the cluster (child).
Fit baseline/null model (no predictors) using the survival
package.
Note that for models using discrete time data, the “exact” option should be used for handling ties rather than the “efron” method, which we use (and is the default), below.
coxph4 <- coxph(Surv(start, stop, ang_event) ~ frailty(id),
ties = c ("efron"),
data = wait36d)
summary(coxph4)
## Call:
## coxph(formula = Surv(start, stop, ang_event) ~ frailty(id), data = wait36d,
## ties = c("efron"))
##
## n= 45968, number of events= 863
##
## coef se(coef) se2 Chisq DF p
## frailty(id) 492.2 94.68 0
##
## Iterations: 8 outer, 48 Newton-Raphson
## Variance of random effect= 0.7689404 I-likelihood = -5104.3
## Degrees of freedom for terms= 94.7
## Concordance= 0.682 (se = 0.012 )
## Likelihood ratio test= 423.3 on 94.68 df, p=0
## Plot baseline function, using work around with no frailty term
coxph4.0 <- coxph(Surv(start, stop, ang_event) ~ 1,
ties = c ("efron"),
data = wait36d)
summary(coxph4.0)
## Call: coxph(formula = Surv(start, stop, ang_event) ~ 1, data = wait36d,
## ties = c("efron"))
##
## Null model
## log likelihood= -5183.159
## n= 45968
modelfit <- survfit(coxph4.0)
#Create dataframe with baseline model information:
modeldata <- data.frame(cbind(modelfit$time,modelfit$surv,modelfit$cumhaz,modelfit$upper,modelfit$lower))
names(modeldata) <- c("time","surv","cumhaz","upperCI","lowerCI")
ggplot(data=NULL) +
geom_line(data=modeldata,aes(x=time, y=surv),colour="black",linetype="solid",size=1) +
geom_line(data=modeldata,aes(x=time, y=upperCI),colour="black",linetype="dashed",size=.5) +
geom_line(data=modeldata,aes(x=time, y=lowerCI),colour="black",linetype="dashed",size=.5) +
scale_x_continuous(breaks=seq(0,480,120)) +
scale_y_continuous(breaks=seq(0,1,.25)) +
expand_limits(x=c(0,480),y=c(0,1)) +
xlab("Time in Task (seconds)") + ylab("Survival Function") +
theme_classic() +
theme(axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"))
Fit the model, adding bids and distraction as predictors:
coxph4b <- coxph(Surv(start, stop, ang_event) ~bids + distraction + frailty(id),
ties = c ("efron"),
data = wait36d)
summary(coxph4b)
## Call:
## coxph(formula = Surv(start, stop, ang_event) ~ bids + distraction +
## frailty(id), data = wait36d, ties = c("efron"))
##
## n= 45968, number of events= 863
##
## coef se(coef) se2 Chisq DF p
## bids 0.9292 0.08548 0.08486 118.15 1.00 0.0e+00
## distraction -1.3222 0.17033 0.16940 60.25 1.00 8.3e-15
## frailty(id) 421.84 90.68 0.0e+00
##
## exp(coef) exp(-coef) lower .95 upper .95
## bids 2.5324 0.3949 2.1417 2.9943
## distraction 0.2666 3.7515 0.1909 0.3722
##
## Iterations: 8 outer, 51 Newton-Raphson
## Variance of random effect= 0.6454775 I-likelihood = -5000.4
## Degrees of freedom for terms= 1.0 1.0 90.7
## Concordance= 0.716 (se = 0.012 )
## Likelihood ratio test= 612.2 on 92.65 df, p=0
# Get AIC
extractAIC(coxph4b)
## [1] 92.64966 9939.44369
# Get log likelihood
coxph4b$loglik
## [1] -5183.159 -4877.072
The results of the likelihood ratio test for the model are reported above in the model summary.
Cox regression models with time-varying predictors do not require that the assumption be met (Willett & Singer, 1995) and so this assumption is not tested for this model.
See Step 5: Presentation and Interpretation of Results in the paper for the full report of these results. Our model shows that children have a higher hazard of anger during the seconds that they bid about the demands of the wait compared to seconds that they did not bid, whereas children have a lower hazard of anger in the seconds that they self-initiate focused distraction compared to the seconds they did not. See Lougheed, Benson, Cole, and Ram (in press) for suggestions on reporting these results in a manuscript.