This tutorial is Part 4 of 5 showing how to do survival analysis with observational data (video recordings of participant behavior), using a study of children’s emotion regulation as an example (Cole et al., 2011) . This collection of tutorials accompanies Lougheed, Benson, Cole, & Ram (in press).
In this tutorial, we will demonstrate how to conduct a recurring episode model with time-invariant predictors. With this model, we examine between-child differences in children’s temperament with a time-invariant predictor, to examine if the time until recurring anger expressions vary by children’s negative affectivty.
Set working directory and read in data
setwd("~/Desktop/Data/Survival analysis")
wait36 <- read.csv("Recurring event_time-invariant.csv",header=TRUE)
names(wait36)
## [1] "id" "ang_event"
## [3] "episode" "event_time"
## [5] "negative_affectivity30"
Load packages:
library(plyr)
library(psych)
library(ggplot2)
library(survival)
library(gridExtra)
library(grid)
library(reshape)
Subset to relevant variables:
wait36c <- wait36[,c("id","ang_event","episode", "event_time", "negative_affectivity30")]
head(wait36c)
## id ang_event episode event_time negative_affectivity30
## 1 1 1 1 39 2.79375
## 2 1 1 2 15 2.79375
## 3 1 1 3 39 2.79375
## 4 1 1 4 3 2.79375
## 5 1 1 5 346 2.79375
## 6 1 1 6 8 2.79375
Center predictor to facilitate interpretation:
wait36c$negative_affectivity30.c <- scale(wait36c$negative_affectivity30, scale=FALSE, center=TRUE)
describe(wait36c$negative_affectivity30)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 921 3.6 0.58 3.65 3.59 0.6 2.27 4.96 2.68 0.06 -0.7
## se
## X1 0.02
It is important to first determine the kind of model that will be used to answer research questions, as other steps (e.g., data preparation) will depend on what type of model will be fit. Specifically, researchers need to decide whether models will:
With this model, we considered between-child differences in children’s temperament by examining if the time until recurring anger expressions varied by children’s negative affectivity. The data were coded in 1-second intervals, which implies the use of a continuous time model. We will use a semi-parametric approach (Cox regression model), because we do not know the shape of the underlying distribution (which precludes the use of a parametric approach) and because we want to incorporate multiple predictors (which is for several reasons not ideal with a non-parametric approach).
Now, we need to examine several characteristics of the data before we start modeling. Specifically, we need to examine:
Calculate the number of anger episodes per child:
indiv.stats <- ddply(wait36c, "id", summarize, numepisodes = max(episode))
describe(indiv.stats$numepisodes)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 117 8.26 8.38 6 6.77 5.93 1 43 42 1.83 3.81 0.77
table(indiv.stats$numepisodes)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 15 16 17 18 19 20 23 24 25 33 34 41 43
## 21 12 10 12 3 7 2 10 2 3 5 7 6 2 2 1 1 2 1 2 2 1 1 1 1
Order by survival time and create an order variable:
indiv.stats3 <- ddply(wait36c,c("id","episode"),summarize,time2event = event_time)
indiv.stats3 <- indiv.stats3[order(indiv.stats3$episode, -indiv.stats3$time2event),]
indiv.stats3$order <- c(1:length(indiv.stats3$id))
indiv.stats3 <- na.omit(indiv.stats3)
Create ordered survival time plot:
# #This finds the actual median episode time for each event (use these number for the vertical lines on the plot)
medianepi <- ddply(indiv.stats3,"episode",summarize, medep = median(time2event))
# # Use this code for coloring the plot and figuring out where to put the episode numbers
medianepi <- ddply(indiv.stats3,"episode",summarize, medep = median(order))
medianepi <- na.omit(medianepi)
ggplot(data=indiv.stats3, aes(x=time2event, y=order)) +
geom_rect(mapping = aes(xmin=0,xmax=indiv.stats3$time2event,ymin=indiv.stats3$order,
ymax=indiv.stats3$order+1,fill=factor(episode))) +
geom_rect(xmin=56,xmax=56,ymin=1,ymax=117,colour="black",linetype="dashed") +
geom_rect(xmin=22,xmax=22,ymin=118,ymax=213,colour="black",linetype="dashed") +
geom_rect(xmin=17,xmax=17,ymin=214,ymax=297,colour="black",linetype="dashed") +
geom_rect(xmin=17.5,xmax=17.5,ymin=298,ymax=371,colour="black",linetype="dashed") +
geom_rect(xmin=19.5,xmax=19.5,ymin=372,ymax=433,colour="black",linetype="dashed") +
scale_y_continuous(breaks=medianepi$medep[c(1,5,10,15,20,25,43)], labels = c(1,5,10,15,20,25,43)) +
xlab("Time in Task (seconds)") + ylab("Episode") +
ggtitle("Survival Time by Episode") +
theme_classic() +
theme(panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
panel.background=element_blank(),
legend.position="none",
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"))
We will now run our Cox regression models. First, we will fit a baseline model (intercept only; no predictors included) to examine the survival and cumulative hazard functions. Next, we will include child negative affectivity as a time-invariant predictor.
The equation for this model is specified as,
\(h_{ij} (t)= h_{0} (t) exp(v_{i}) exp (\beta_{1}Negative Affectivity_{i})\)
where \(v_{i}\) is the random effect for the cluster (child).
First, let’s look at the baseline survival function for the null model (no predictors). Here we won’t actually include the random effect for children so that we can obtain the plot using the survival
package.
coxph3.0 <- coxph(Surv(event_time,ang_event) ~ 1,
data = wait36c)
summary(coxph3.0)
## Call: coxph(formula = Surv(event_time, ang_event) ~ 1, data = wait36c)
##
## Null model
## log likelihood= -5254.73
## n= 966
modelfit <- survfit(coxph3.0)
#Create dataframe with baseline model information:
modeldata <- data.frame(cbind(modelfit$time,modelfit$surv,modelfit$cumhaz,modelfit$upper,modelfit$lower))
names(modeldata) <- c("time","surv","cumhaz","upperCI","lowerCI")
ggplot(data=NULL) +
geom_line(data=modeldata,aes(x=time, y=surv),colour="black",linetype="solid",size=1) +
geom_line(data=modeldata,aes(x=time, y=upperCI),colour="black",linetype="dashed",size=.5) +
geom_line(data=modeldata,aes(x=time, y=lowerCI),colour="black",linetype="dashed",size=.5) +
scale_x_continuous(breaks=seq(0,480,120)) +
scale_y_continuous(breaks=seq(0,1,.25)) +
expand_limits(x=c(0,480),y=c(0,1)) +
xlab("Time in Task (seconds)") + ylab("Survival Function") +
theme_classic() +
theme(axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"))
Now let’s plot the cumulative hazard function:
ggplot(data=NULL) +
geom_line(data=modeldata,aes(x=time, y=cumhaz),colour="black",linetype="solid",size=1) +
scale_x_continuous(breaks=seq(0,480,120)) +
# scale_y_continuous(breaks=seq(0,1,.25)) +
#expand_limits(x=c(0,480),y=c(0,1)) +
xlab("Time in Task (seconds)") + ylab("Cumulative Hazard Function") +
theme_classic() +
theme(axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"))
Now we will fit baseline/null model (no predictors) with the frailty term for child.
Note that for models using discrete time data, the “exact” option should be used for handling ties rather than the “efron” method, which we use (and is the default), below.
coxph3 <- coxph(Surv(event_time,ang_event) ~ frailty(id),
data = wait36c)
summary(coxph3)
## Call:
## coxph(formula = Surv(event_time, ang_event) ~ frailty(id), data = wait36c)
##
## n= 966, number of events= 873
##
## coef se(coef) se2 Chisq DF p
## frailty(id) 501.7 95.03 0
##
## Iterations: 8 outer, 48 Newton-Raphson
## Variance of random effect= 0.7760718 I-likelihood = -5174
## Degrees of freedom for terms= 95
## Concordance= 0.682 (se = 0.012 )
## Likelihood ratio test= 428.9 on 95.03 df, p=0
##### Commented section below out because it is causing problems knitting that do not have problems being run
# modelfit <- survfit(coxph3)
#
# #Create dataframe with baseline model information:
# modeldata <- data.frame(cbind(modelfit$time,modelfit$surv,modelfit$cumhaz,modelfit$upper,modelfit$lower))
# names(modeldata) <- c("time","surv","cumhaz","upperCI","lowerCI")
#
# ggplot(data=NULL) +
# geom_line(data=modeldata,aes(x=time, y=surv),colour="black",linetype="solid",size=1) +
# geom_line(data=modeldata,aes(x=time, y=upperCI),colour="black",linetype="dashed",size=.5) +
# geom_line(data=modeldata,aes(x=time, y=lowerCI),colour="black",linetype="dashed",size=.5) +
# scale_x_continuous(breaks=seq(0,480,120)) +
# scale_y_continuous(breaks=seq(0,1,.25)) +
# expand_limits(x=c(0,480),y=c(0,1)) +
# xlab("Time in Task (seconds)") + ylab("Survival Function") +
# theme_classic() +
# theme(axis.line.x = element_line(color = "black"),
# axis.line.y = element_line(color = "black"))
Fit the model, adding negative_affectivity30.c as the predictor, using the survival
package:
# Re run model using survival package and a frailty term. (http://r.789695.n4.nabble.com/conditional-gap-time-frailty-cox-model-for-recurrent-events-td4724335.html)
coxph3b <- coxph(Surv(event_time,ang_event) ~ negative_affectivity30.c + frailty(id),
data = wait36c)
summary(coxph3b)
## Call:
## coxph(formula = Surv(event_time, ang_event) ~ negative_affectivity30.c +
## frailty(id), data = wait36c)
##
## n= 921, number of events= 833
## (45 observations deleted due to missingness)
##
## coef se(coef) se2 Chisq DF p
## negative_affectivity30.c 0.009804 0.1775 0.0 1.0 0.96
## frailty(id) 417.2 90.2 0.00
##
## exp(coef) exp(-coef) lower .95 upper .95
## negative_affectivity30.c 1.01 0.9902 0.7131 1.43
##
## Iterations: 8 outer, 56 Newton-Raphson
## Variance of random effect= 0.8047803 I-likelihood = -4896.1
## Degrees of freedom for terms= -0.1 90.2
## Concordance= 0.681 (se = 0.012 )
## Likelihood ratio test= 414.7 on 90.14 df, p=0
# Get AIC
extractAIC(coxph3b)
## [1] 90.14065 9715.39210
# Get log likelihood
coxph3b$loglik
## [1] -4974.912 -4767.555
Now plot the results at high and low negative affectivity. First, get summary statsistics for the variable:
sum.stats <- data.frame(describe(wait36c[,c("negative_affectivity30.c")]))
Create objects to use for representing prototypical children:
NA.minus1 <- as.numeric(sum.stats[1,3] - sum.stats[1,4])
NA.plus1 <- as.numeric(sum.stats[1,3] + sum.stats[1,4])
Isolate data for prototypical cases:
# First need to do a work-around, re-run model without random effects
coxph3b.2 <- coxph(Surv(event_time,ang_event) ~ negative_affectivity30.c,
data = wait36c)
survfit(coxph3b.2, newdata=list(negative_affectivity30.c=NA.minus1))->cox1
survfit(coxph3b.2, newdata=list(negative_affectivity30.c=NA.plus1))->cox2
Merge back into one data frame:
coxmodel3b.2.surv <- data.frame(cbind(cox1$time,cox1$surv,cox2$surv))
colnames(coxmodel3b.2.surv) <- c("event_time","NA_minus1SD","NA_plus1SD")
Melt the data for plotting purposes:
coxmodel3b.2.surv.melt <- melt(coxmodel3b.2.surv, id="event_time")
colnames(coxmodel3b.2.surv.melt) <- c("event_time","predictor","survival")
PLot the survival function at different levels of the predictors. Note that, even with alpha set to .3 (transparency), the lines representing high and low negative affecvitity overlap so much that the plot only appears to show one line.
ggplot(coxmodel3b.2.surv.melt ,aes(x=event_time,y=survival,colour=factor(predictor))) +
geom_line(size=1, alpha = .3) +
scale_y_continuous(limits=c(0,1)) +
xlab("Time in Task (seconds)") + ylab("Survival") +
theme_classic() +
theme(axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"))