Overview

This tutorial is Part 3 of 5 showing how to do survival analysis with observational data (video recordings of participant behavior), using a study of children’s emotion regulation as an example (Cole et al., 2011). This collection of tutorials accompanies Lougheed, Benson, Cole, & Ram (in press).

In this tutorial, we will demonstrate how to conduct a single episode model with time-varying predictors. We examine intraindividual differences in children’s strategy use with time-varying predictors, to test if (a) children were more likely to express anger for the first time in the seconds they made bids compared to the seconds they did not make bids, and (b) children were less likely to express anger for the first time in the seconds they distracted themselves compared to the seconds they did not distract themselves.

Set-up

Set working directory and read in data

setwd("~/Desktop/Data/Survival analysis")

wait36 <- read.csv("wait36_Anger_Survival Analysis Formatted Data.csv",header=TRUE)
names(wait36)
##  [1] "id"           "second"       "anger"        "bids"        
##  [5] "distraction"  "timeOriginal" "period"       "episode"     
##  [9] "status"       "status2"      "status_time"  "event_time"  
## [13] "start"        "stop"         "ang_event"

Load packages:

library(plyr)
library(psych)
library(survival) 
library(eha)
library(ggplot2)

Subset to relevant variables:

wait36b <- wait36[which(wait36$episode==1),c("id","second","ang_event","bids","distraction","status_time",
                                             "start","stop")]

Outline

  • Step 1: Preliminary Considerations
  • Step 2: Data Preparation
  • Step 3: Data Description
  • Step 4: Model Building, Estimation, and Assessment of Fit to the Data
  • Step 5: Presentation and Interpretation of Results

Step 1: Preliminary Considerations

It is important to first determine the kind of model that will be used to answer research questions, as other steps (e.g., data preparation) will depend on what type of model will be fit. Specifically, researchers need to decide whether models will:

  • Be fit in discrete or continuous time
  • Include single or repeating occurrences of the dependent variable
  • Be fit with a non-parametric, semi-parametric, or parametric approach
  • Include predictors that are time invariant or time varying (or both)

With this model, we examine intraindividual differences in children’s strategies using time-varying predictors to test if (a) children were more likely to express anger for the first time in the seconds they made bids compared to the seconds they did not make bids, and (b) children were less likely to express anger for the first time in the seconds they distracted themselves compared to the seconds they did not distract themselves. The data were coded in 1-second intervals, which implies the use of a continuous time model. We will use a semi-parametric approach (Cox regression model), because we do not know the shape of the underlying distribution (which precludes the use of a parametric approach) and because we want to incorporate multiple predictors (which is for several reasons not ideal with a non-parametric approach).


Step 2: Data Preparation

See Part 1 for data preparation steps


Step 3: Data Description

Now we need to examine several characteristics of the data before we start modeling. Specifically, we need to examine:

  • How many cases in the data are right-censored. We already examined left-censoring in Step 2 so we do not need to examine that further in Step 3.
  • How many ties (cases with exactly the same survival time) are present in the data. If a researcher has data with continuous time, a large number of ties could be problematic but this is unlikely with observational data. If a large number of ties are present, the researcher could aggregate time into discrete values and use a discrete-time model to overcome the presence of many ties.
  • The median survival time. Standard descriptive statistics (mean, standard deviation) will not provide accurate information about survival analysis data because of censoring. The median survival time is used instead.
  • A plot of survival times to understand how survival times are distributed in the data.
  • Descriptive statistics for any predictors in the model.

See documentation for Part 2 for data description steps pertaining to anger.

Describing Time-Varying Predictors

We are using children’s bids about the task and focused distraction as time-varying predictors. We can examine the number of seconds during the task that each behavior is shown:

indiv.stats <- ddply(wait36b,"id",summarize,
                   bid_r_freq = sum(bids ==1, na.rm=TRUE),
                   distraction_freq = sum(distraction == 1, na.rm=TRUE),
                   ang_freq = sum(ang_event ==1, na.rm=TRUE))

describe(indiv.stats[,-1])
##                  vars   n  mean    sd median trimmed   mad min max range
## bid_r_freq          1 117 12.73 15.57      8   10.01 11.86   0  95    95
## distraction_freq    2 117 39.72 66.55      5   24.42  7.41   0 292   292
## ang_freq            3 117  0.83  0.38      1    0.91  0.00   0   1     1
##                   skew kurtosis   se
## bid_r_freq        2.05     6.02 1.44
## distraction_freq  1.89     2.75 6.15
## ang_freq         -1.73     0.99 0.03

Step 4: Model Building and Assessment of Fit to the Data

We will now run our Cox regression models. First, we will fit a baseline model (intercept only; no predictors included) to examine the survival and cumulative hazard functions. Next, we will include child behavior variables as predictors to test our research questions. Then, we will use diagnostics to examine model fit.

Cox Regression Model

The equation for the model is specified as,

\(h_{ik} = h_{0k} exp(\beta_{1}Bids_{ik} + \beta_{2}Distract_{ik})\)

Fit Cox Regression Model - Baseline

Note that for models using discrete time data, the “exact” option should be used for handling ties rather than the “efron” method, which we use, below.

Fit baseline/null model (no predictors) using the survival package:

wait36b2 <- na.omit(wait36b)

coxmodel2 <- coxph(Surv(start,stop,status_time) ~ 1,
                      method="efron",
                      data=wait36b2)
summary(coxmodel2)
## Call:  coxph(formula = Surv(start, stop, status_time) ~ 1, data = wait36b2, 
##     method = "efron")
## 
## Null model
##   log likelihood= -412.5931 
##   n= 15554

Fit Cox Regression Model - Adding Time-Varying Predictors

Fit the model, adding bids and distraction as predictors:

coxmodel2b <- coxph(Surv(start,stop,status_time) ~ bids + distraction,
                      method="efron",
                      data=wait36b2)
summary(coxmodel2b)
## Call:
## coxph(formula = Surv(start, stop, status_time) ~ bids + distraction, 
##     data = wait36b2, method = "efron")
## 
##   n= 15554, number of events= 102 
## 
##                coef exp(coef) se(coef)      z Pr(>|z|)    
## bids         1.2285    3.4161   0.2220  5.534 3.12e-08 ***
## distraction -1.5365    0.2151   0.4274 -3.595 0.000324 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## bids           3.4161     0.2927   2.21101    5.2781
## distraction    0.2151     4.6481   0.09309    0.4972
## 
## Concordance= 0.653  (se = 0.026 )
## Rsquare= 0.003   (max possible= 0.052 )
## Likelihood ratio test= 53.64  on 2 df,   p=2.256e-12
## Wald test            = 48.62  on 2 df,   p=2.765e-11
## Score (logrank) test = 60.71  on 2 df,   p=6.573e-14
extractAIC(coxmodel2b)
## [1]   2.0000 775.5512
coxmodel2b$loglik
## [1] -412.5931 -385.7756

Plot

We will plot the baseline hazard function for four hypothetical children who showed all possible combinations of time-varying predictors for the entire wait task: (1) Bids “off” and Distraction = “off”, (2) Bids “off” and Distraction “on”, (3) Bids “on” and Distraction “off”, and (4) Bids “on” and Distraction “off” for the entire task.

First, fit the model using the eha package to obtain the non-cumulative hazard function:

coxmodel2c <- coxreg(formula = Surv(start,stop,status_time) ~ bids + distraction,
data = wait36b,center=FALSE)
summary(coxmodel2c)
## Call:
## coxreg(formula = Surv(start, stop, status_time) ~ bids + distraction, 
##     data = wait36b, center = FALSE)
## 
## Covariate           Mean       Coef     Rel.Risk   S.E.    Wald p
## bids                0.096     1.229     3.416     0.222     0.000 
## distraction         0.299    -1.536     0.215     0.427     0.000 
## 
## Events                    102 
## Total time at risk         15554 
## Max. log. likelihood      -385.78 
## LR test statistic         53.64 
## Degrees of freedom        2 
## Overall p-value           2.25575e-12
modfit.df <- data.frame(matrix(unlist(coxmodel2c$hazards),ncol=2,nrow=71))
colnames(modfit.df) <- c("time","h0")

Generate the hypothetical cases:

#id = 1 is Bids "off" and Distraction = "off"; id = 2 is Bids "off" and Distraction "on";id = 3 is Bids "on" and Distraction "off"; and id = 4 is Bids "on" and Distraction "off"
id <- c(rep(1,71),rep(2,71),rep(3,71),rep(4,71)) 
time <- rep(modfit.df$time,4)
h0 <- rep(modfit.df$h0,4)
bids <- c(rep(0,71),rep(0,71),rep(1,71),rep(1,71))
distraction <- c(rep(0,71),rep(1,71),rep(0,71),rep(1,71))

proto <- data.frame(id,time,h0,bids,distraction)
surv.fun <- function(x, h0, bids, distraction) {
  #rename variable
  time <- x
  h0 <- h0
  #the parameters
  b1 = coxmodel2b$coefficients[1]  # bids
  b2 = coxmodel2b$coefficients[2]  # distraction
  #the model
  y = h0*exp(b1*bids+b2*distraction)
  return(y)
}

proto.pred <- surv.fun(proto$time, proto$h0, proto$bids, proto$distraction)

proto$outcome <- proto.pred

Plot the prototype curve. Note that in the resulting plot, “1”=“Bids Off, Distraction Off”, “2”=“Bids Off, Distraction On”, “3”=“Bids On, Distraction Off”, “4”= “Bids On, Distraction On”.

ggplot(data = proto, aes(x = time, y = outcome)) +
  ggtitle("Associations between time-varying predictors and hazard of anger") +
  geom_line(size = 1) +
  xlab("Time in Task (seconds)") + 
  ylab("Hazard Rate of Anger") + 
    expand_limits(y=c(0,.4)) +
  theme_classic() +
  theme(legend.title=element_blank(),
        legend.text=element_text(size=16,family="Times")) +
  facet_wrap(~id)

Evaluating Model Fit

Likelihood-Ratio Test

The results of the likelihood-ratio test can be found in the model summary above, but it can also be obtained this way:

anova(coxmodel2,coxmodel2b)
## Analysis of Deviance Table
##  Cox model: response is  Surv(start, stop, status_time)
##  Model 1: ~ 1
##  Model 2: ~ bids + distraction
##    loglik  Chisq Df P(>|Chi|)    
## 1 -412.59                        
## 2 -385.78 53.635  2 2.256e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 5: Presentation and Interpretation of Results

Our results show that children’s likelihood of expressing anger for the first time is higher in the seconds during which they make bids compared to the seconds they did not make bids. In addition, children’s likelihood of expressing anger for the first time is lower in the seconds children use focused distraction compared to the seconds they did not.