Overview

This tutorial is Part 2 of 5 showing how to do survival analysis with observational data (video recordings of participant behavior), using data from a study of children’s emotion regulation as an example (Cole et al., 2011). In this tutorial, we will demonstrate how to conduct a single episode model with time-invariant predictors. With this model, we examine interindividual differences in children’s negative affectivity, to test if (a) children higher in negative affectivity also expressed anger for the first time earlier in the wait task, and (b) children lower in negative affectivity also took longer to express anger for the first time.

Set-up

Set working directory and read in data

setwd("~/Desktop/Data/Survival analysis")

wait36 <- read.csv("Single event_time-invariant.csv",header=TRUE)
names(wait36)
## [1] "id"                     "status"                
## [3] "event_time"             "negative_affectivity30"

Load packages:

library(ggplot2)
library(psych)
library(survival)
library(eha)
library(reshape)

Subset to relevant variables:

wait36a <-wait36
head(wait36a)
##   id status event_time negative_affectivity30
## 1  1      1         39               2.793750
## 2  2      1        203               3.614583
## 3  3      1         80               4.347222
## 4  5      1         10               3.745960
## 5  6      1         14               3.933523
## 6  8      1         89               3.043750

Outline

  • Step 1: Preliminary Considerations
  • Step 2: Data Preparation
  • Step 3: Data Description
  • Step 4: Model Building and Assessment of Fit to the Data
  • Step 5: Presentation and Interpretation of Results

Step 1: Preliminary Considerations

It is important to first determine the kind of model that will be used to answer research questions, as other steps (e.g., data preparation) will depend on what type of model will be fit. Specifically, researchers need to decide whether models will:

  • Be fit in discrete or continuous time
  • Include single or repeating occurrences of the dependent variable
  • Be fit with a non-parametric, semi-parametric, or parametric approach
  • Include predictors that are time invariant or time varying (or both)

With this model, we examine interindividual differences in children’s negative affectivity, to test if (a) children higher in negative affectivity also expressed anger for the first time earlier in the wait task, and (b) children lower in negative affectivity also took longer to express anger for the first time. We will use a semi-parametric approach (Cox regression), because we do not know the shape of the underlying distribution (which precludes the use of a parametric approach) and because we want to incorporate multiple predictors as time-invariant predictors (which is for several reasons not ideal with a non-parametric approach).


Step 2: Data Preparation

See Part 1 for data preparation steps


Step 3: Data Description

Now, we need to examine several characteristics of the data before we start modeling. Specifically, we need to examine:

  • How many cases in the data are right-censored. We already examined left-censoring in Step 2 so we do not need to examine that further in Step 3. If there are a high number of right-censored cases, we will need to examine if right-censored cases differ from non-censored cases in systematic ways (e.g., by predictors to be used in model).
  • How many ties (cases with exactly the same survival time) are present in the data. If a researcher has data with continuous time, a large number of ties could be problematic but this is unlikely with observational data. If a large number of ties are present, the researcher could aggregate time into discrete values and use a discrete-time model to overcome the presence of many ties.
  • The median survival time. Standard descriptive statistics (mean, standard deviation) will not provide accurate information about survival analysis data because of censoring. The median survival time is used instead.
  • A plot of survival times to understand how survival times are distributed in the data.
  • Descriptive statistics for any predictors in the model. We will be looking at negative affectivity.

Determine how much of sample has right censoring

Some kids in these data are right-censored such that the period of observation expired before they expressed anger.

length(wait36a[ which(wait36a$status==0),]$id)
## [1] 12

In this case, 12 of the 117 children in the sample have right censoring, which is within the acceptable range.

Determine how much of sample has left censoring

See sections in step 2 (data preparation, Part 1) for this information. In these data, 7 cases were left censored, meaning that 7 children were expressing anger at the start of the observation. To overcome the issue of left censoring, we defined the start of the observation at the first second of the wait task during which children were not expressing anger.

Examine data for ties (how many cases have events at same point in time)

table(wait36a$event_time)
## 
##   2   4   5   6   7   8   9  10  13  14  15  16  18  19  20  23  24  25 
##   3   7   4   2   3   3   2   3   1   1   1   1   3   1   1   2   2   1 
##  26  27  32  34  35  38  39  42  44  47  48  53  55  56  60  63  65  66 
##   1   2   1   1   1   2   1   2   1   1   2   1   1   2   1   1   1   1 
##  72  73  75  79  80  89  91  98 109 121 130 131 133 142 162 176 181 202 
##   2   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 203 209 213 220 223 226 238 253 256 267 273 285 296 334 355 366 400 410 
##   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1 
## 427 476 480 
##   1   1  12

Calculate the median survival time

summary(wait36a$event_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    15.0    56.0   134.1   220.0   480.0

The median survival time is 56 seconds, indicating that half of the children had expressed anger for the first time by 56 seconds into the task.

Plot survival times and ties

Create new incremental count id:

wait36a$idcount <- c(1:length(wait36a$id))

Order by survival time and create an order variable:

wait36a <- wait36a[order(-wait36a$event_time,wait36a$status),]
wait36a$order <- c(1:length(wait36a$id))

Create an ordered survival time plot:

ggplot(data=wait36a, aes(x=event_time, y=order)) +
geom_rect(xmin=0,xmax=wait36a$event_time,ymin=wait36a$order,ymax=wait36a$order+1, colour="lightgray") +
geom_rect(xmin=wait36a$event_time-2,xmax=wait36a$event_time,ymin=wait36a$order,ymax=wait36a$order+1,
          fill=factor(wait36a$status+1)) +
geom_vline(xintercept= 56,linetype="solid") +
scale_x_continuous(breaks=seq(0,480,120)) +
geom_text(aes(140, 95, label="Median Survival Time")) +
xlab("Survival Time (Seconds)") + ylab("Participants (ordered by survival time)") +
ggtitle("Survival Times for Each Child") +
theme_classic() +
theme(legend.position="none",
      panel.grid.major=element_blank(),
      panel.grid.minor=element_blank(),
      panel.background=element_blank(),
      axis.line.x = element_line(color = "black"),
      axis.line.y = element_line(color = "black"))

Examine descriptive statistics of predictors

Center predictors:

wait36a$negative_affectivity30.c <- scale(wait36a$negative_affectivity30, center=TRUE, scale=FALSE)

#Multiply by 100 to make interpretation easier, centered for inclusion in model:
wait36a$negative_affectivity30.c100 <- wait36a$negative_affectivity30.c*100

#And uncentered for reporting:
wait36a$negative_affectivity30.100 <- wait36a$negative_affectivity30*100

Decribe model predictors and make sure transformations look good:

describe(wait36a[,c("negative_affectivity30")])
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 112 3.56 0.55    3.6    3.55 0.52 2.27 4.96  2.68 0.07    -0.33
##      se
## X1 0.05
describe(wait36a[,c("negative_affectivity30.100")])
##    vars   n   mean    sd median trimmed   mad    min    max  range skew
## X1    1 112 355.93 54.55 360.47   355.4 51.69 227.22 495.58 268.36 0.07
##    kurtosis   se
## X1    -0.33 5.15

Step 4: Model Building, Estimation, and Assessment of Fit to the Data

We will now run our Cox regression models. First, we will fit a baseline model (intercept only; no predictors included) to examine the survival and cumulative hazard functions. Next, we will include negative affectivity as a predictor to test our research question. Then, we will use diagnostics to examine model fit.

Cox Regression Model

The equation for the model is specified as,

\(h_{i} = h_{0} exp(\beta_{1}NegativeAffectivity_{i})\)

Fit Cox Regression Model - Baseline

Fit baseline/null model (no predictors) using the survival package:

coxmodel1 <- coxph(formula = Surv(event_time, status) ~ 1,
                data = wait36a)
summary(coxmodel1)
## Call:  coxph(formula = Surv(event_time, status) ~ 1, data = wait36a)
## 
## Null model
##   log likelihood= -423.4879 
##   n= 117
model1fit <- survfit(coxmodel1)

Fit baseline/null model (no predictors) using the eha package. We need to do this step in this package in order to plot the non-cumulative hazard function.

coxmodel1b <- coxreg(formula = Surv(event_time, status) ~ 1,
data = wait36a,center=FALSE)
summary(coxmodel1b)
## Call:
## coxreg(formula = Surv(event_time, status) ~ 1, data = wait36a, 
##     center = FALSE)
## 
## Null model
## NULL

Plot the Non-cumulative Hazard Function

Get \(H_0\) :

model1bdata <- data.frame(matrix(unlist(coxmodel1b$hazards),ncol=2,nrow=74))
colnames(model1bdata) <- c("time","basehaz")

Plot:

ggplot(model1bdata,(aes(x=time,y=basehaz))) +
  geom_line() +
  scale_x_continuous(breaks=seq(0,480,120)) +
  scale_y_continuous(breaks=seq(0,0.08,.02)) +
  expand_limits(y=c(0,0.08)) +
  xlab("Time in Task (seconds)") + ylab("Baseline Hazard") +
theme(panel.background=element_blank(),
      axis.line.x = element_line(color = "black"),
      axis.line.y = element_line(color = "black"))

Data Set-up

Create a dataframe with baseline model information so we can plot baseline survival and hazard functions:

model1data <- data.frame(cbind(model1fit$time,model1fit$surv,model1fit$cumhaz,model1fit$upper,model1fit$lower))
names(model1data) <- c("time","surv","cumhaz","upperCI","lowerCI")

Plot the Cumulative Hazard Function

ggplot(model1data,aes(x=time)) +
  geom_line(aes(y=cumhaz),size=1) +
  scale_x_continuous(breaks=seq(0,480,120)) +
  scale_y_continuous(breaks=seq(0,2.5,.5)) +
  expand_limits(y=c(0,2.5)) +
  xlab("Time in Task (seconds)") + ylab("Cumulative Hazard Function") +
  ggtitle("Cumulative Hazard Function for First Anger") +
theme(panel.background=element_blank(),
      axis.line.x = element_line(color = "black"),
      axis.line.y = element_line(color = "black"))

Plot the Survival Function for the Baseline Model

ggplot(model1data,aes(x=time)) +
  geom_line(aes(y=surv),size=1) +
  geom_line(aes(y=upperCI),colour="red",linetype="dashed",size=1) +
  geom_line(aes(y=lowerCI),colour="red",linetype="dashed",size=1) +
  scale_x_continuous(breaks=seq(0,480,120)) +
  xlab("Time in Task (seconds)") + ylab("Survival Function") +
  ggtitle("Baseline Survival Function") +
theme(panel.background=element_blank(),
      axis.line.x = element_line(color = "black"),
      axis.line.y = element_line(color = "black"))

Fit Cox Regression Model - Adding Time-Invariant Predictor

Fit the model, adding in negative_affectivity30.c100 as a predictor:

coxmodel1c <- coxph(formula = Surv(event_time,status) ~ negative_affectivity30.c100,
                      data = wait36a)
summary(coxmodel1c)
## Call:
## coxph(formula = Surv(event_time, status) ~ negative_affectivity30.c100, 
##     data = wait36a)
## 
##   n= 112, number of events= 100 
##    (5 observations deleted due to missingness)
## 
##                                   coef  exp(coef)   se(coef)      z
## negative_affectivity30.c100 -0.0008021  0.9991982  0.0020384 -0.394
##                             Pr(>|z|)
## negative_affectivity30.c100    0.694
## 
##                             exp(coef) exp(-coef) lower .95 upper .95
## negative_affectivity30.c100    0.9992      1.001    0.9952     1.003
## 
## Concordance= 0.495  (se = 0.033 )
## Rsquare= 0.001   (max possible= 0.999 )
## Likelihood ratio test= 0.16  on 1 df,   p=0.6936
## Wald test            = 0.15  on 1 df,   p=0.6939
## Score (logrank) test = 0.15  on 1 df,   p=0.6939
extractAIC(coxmodel1c)
## [1]   1.0000 801.3719
coxmodel1c$loglik
## [1] -399.7636 -399.6860

Plot

Now we will examine the model results by plotting the survival function at high (+1 SD above the mean) and low (-1 SD above the mean) levels of our predictor.

First get summary stats:

sum.stats <- data.frame(describe(wait36a[,c("negative_affectivity30.c100")]))

Create objects to use for representing prototypical children:

NA.minus1 <- as.numeric(sum.stats[1,3] - sum.stats[1,4])
NA.plus1 <- as.numeric(sum.stats[1,3] + sum.stats[1,4])

Isolate data for prototypical cases:

survfit(coxmodel1c ,newdata=list(negative_affectivity30.c100=NA.minus1))->cox1
survfit(coxmodel1c ,newdata=list(negative_affectivity30.c100=NA.plus1))->cox2

Merge back into one dataframe:

coxmodel1c.surv <- data.frame(cbind(cox1$time,cox1$surv,cox2$surv))
colnames(coxmodel1c.surv) <- c("event_time","NA_minus1SD","NA_plus1SD")

coxmodel1c.cumhaz <- data.frame(cbind(cox1$time,cox1$cumhaz,cox2$cumhaz))
colnames(coxmodel1c.cumhaz) <- c("event_time","NA_minus1SD","NA_plus1SD")

Melt data for plotting purposes:

coxmodel1c.surv.melt <- melt(coxmodel1c.surv, id="event_time") 
colnames(coxmodel1c.surv.melt) <- c("event_time","predictor","survival")

coxmodel1c.cumhaz.melt <- melt(coxmodel1c.cumhaz,id="event_time")
colnames(coxmodel1c.cumhaz.melt) <- c("event_time","predictor","hazard")

Plot the survival function at different levels of the predictors:

ggplot(coxmodel1c.surv.melt ,aes(x=event_time,y=survival,colour=factor(predictor))) +
  geom_line(size=1) +
  scale_y_continuous(limits=c(0,1)) +
  xlab("Time in Task (seconds)") + ylab("Survival") +
theme_classic() + 
  theme(axis.line.x = element_line(color = "black"),
        axis.line.y = element_line(color = "black"))

Evaluating Model Fit

Likelihood-Ratio Test

The results of the likelihood-ratio test can be found in the model summary above.

Examine Proportional Hazards Assumption

Calculate a statistical test of the proportional hazards assumption for the predictor and the global model using Schoenfeld residuals. Non-significant results indicate that the hazards areproportional. Here, the assumption has not been violated.

out.zph <- cox.zph(coxmodel1c ,transform="log")
out.zph
##                                 rho chisq     p
## negative_affectivity30.c100 -0.0655  0.55 0.458

Plot the Schoenfeld residuals to graphically examine proportional hazards assumption:

plot(out.zph)

Examine Influential Observations by Plotting Residuals

model1cresid <- data.frame(resid(coxmodel1c,type="dfbeta"))
model1cresid$id <- seq(1:112)
names(model1cresid) <- c("dfbeta","id")

ggplot(model1cresid,aes(x=id,y=dfbeta)) + 
  ylim(-.005, .005)+
  geom_point()

Test Overall Model Adequacy by Plotting Cox-Snell Residuals

Construct a plot of residuals on a unit exponential distribution with a hazard ratio of 1 (see Mills, 2011, pg. 149). Model adequacy is indicated by the extent to which the plot forms a line on a 45-degree angle.

r <- coxmodel1c$residuals

rr <- wait36a$status-r
## Warning in wait36a$status - r: longer object length is not a multiple of
## shorter object length
fit <- survfit(Surv(rr,wait36a$status)~1)

S <- fit$surv

T <- fit$time

plot(T,S,xlim=range(T),ylim=c(0,1),xlab="time",ylab="Cox-Snell Residual Values")
t0 <- seq(0,max(T),0.05)
s0 <- exp(-t0)
lines(t0,s0)


Step 5: Presentation and Interpretation of Results

Our results show that children’s negatie affectivity was not associated with time until children expressed anger.