Survival - Time to Failure
Analysis of survival data is more complex than than other methods we've seen so far; We can't just take the mean survival time a a confidence interval to predict when the last patient will die. Also, survival times are unlikely to follow a Normal distribution, so simple regression techniques are not applicable.
Censoring
Censoring of data can happen in a number of ways:
- Some subjects still have not experienced the event of interest by the end of the study. Most studies have a recruitment period followed by a pure follow-up period. Patients who are enrolled earlier have a higher chance of experiencing the event.
- Some participants drop out early, either due to dropping out or another event unrelated to the study.
In either case, you know that the subject has participated in a study to a certain time without witnessing the event, but have no information thereafter. Such incomplete information is said to be right censored. Note that we censor the data, not the subject. Throughout this lecture we will assume independent censoring - the participants remaining in the study are representative of the target population.
Methods of Analyzing Survival
Let's consider a dataset of rats testing a new cancer treatment.
Approach 1: Test for Difference in Outcome After a Set Time (Wrong)
We could pick a set number of days and consider the number of rats in each group. We can use a chi-square test to determine if the outcome between the two groups is the same. A 95% confidence interval for the population could also be determined.
The problem with this approach is it disregards exact death times and ignored censored observations before the cutpoint. Comparing survival at a single point in time is not informative, the results could change significantly at a different cutpoint.
Approach 2: Comparing Incidence Rates
Estimates can be obtained for the incidence rates, by taking the number of deaths and the population follow-up time in each group. We would expect the proportion of cases will be the same under the null hypothesis. The estimate of the rates can also be obtained by using a Poisson regression.
SAS Script
data one ;
input Group C Time ;
LogT = log ( Time ) ;
cards ;
1 17 4095
2 19 5023
run ;
proc genmod data = one ;
class Group ;
model C = Group / dist = poisson offset = LogT ;
estimate ’ Group 1 ’ Intercept 1 Group 1 0/ exp ;
estimate ’ Group 2 ’ Intercept 1 Group 0 1/ exp ;
Estimate ’ Group 2 vs 1 ’ Group -1 1/ exp ;
Run ;
Approach 3: Consider Follow-Up Time to be a Stratifying Variable
Compare rates between groups in the first 200 days, days 201-250 and 251-300. Dropouts/censoring occur at the end of each interval.
We could make a 2x2 table for each strata and use the Cochran - Mantel - Haenszel test statistic for Group 1 vs Group 2 survival.
/* Comparing survival using CMH test and 3 time strata */
data RCMH ;
do group =1 to 2;
Do time =1 to 3;
input yes no @@ ;
output ;
end ;
end ;
cards ;
6 13 9 4 1 1
4 17 9 8 5 2
run ;
proc sort data = RCMH ;
by group time ;
run ;
proc transpose data = RCMH out = RCMH2 prefix = count name = died ;
by group time ;
run ;
/* Comparing survival using CMH test and 3 time strata */
title1 ’ Comparing survival using CMH test and 3 time strata ’;
options pageno =1 ps =57 ls =91 center nodate ;
proc freq data = RCMH2 order = data ;
table time * group * died / cmh chisq relrisk ;
weight count1 ;
run ;
Approach 4: Log-Rank Test (Best)
Taking smaller intervals gives a more refined test. If the intervals are chosen based on the event times, that is each interval starts and ends with an event, the resulting test is the log-rank test.
Here the censored obesrvations are followed through the last interval they complete. In the data, to fully specify the each subject we must provide (1) a time of follow-up and (2) an indicator of whether censorship or the terminating event occurs at that time.
This can be performed by hand; however, it can get computationally intensive. The Log-Rank test can be directly obtained from SAS from PROC LIFETEST; the risk variable should be specified in the STRATA statement.
SAS Script
data ratc ;
input group day censor @@ ;
logt = log ( day ) ;
cards ;
1 143 0 1 220 0 2 156 0 2 239 0
1 164 0 1 227 0 2 163 0 2 240 0
1 188 0 1 230 0 2 198 0 2 261 0
1 188 0 1 234 0 2 205 0 2 280 0
1 190 0 1 246 0 2 232 0 2 280 0
1 192 0 1 265 0 2 232 0 2 296 0
1 206 0 1 304 0 2 233 0 2 296 0
1 209 0 1 216 1 2 233 0 2 323 0
1 213 0 1 244 1 2 233 0 2 204 1
1 216 0 2 142 0 2 233 0 2 344 1
run ;
title1 ’ Using Log - Rank test ’;
options pageno =1 ps =57 ls =91 centerPROC LIFETEST data = ratc notable ;
TIME day * censor (1) ;
STRATA group ;
run ;
Note that the censoring variable contains the label associated with observation being censored in parentheses.
Survival Function
Denote T as the time to event for an individual. The survival function (denoted by S(t)) is defined as:
$$ S(t) = P(T > t), 0 \leq t < \infty $$
In other words, S(t) is the probability the individual survives beyond time t.
The Cumulative Distribution Function (CDF) of the random variable T is defined as F(t) = P(T <= t) and is related to the survival function through: F(t) = 1 - S(t)
Suppose T is a continuous random variable. In this case, the hazard function λ(t), is defined as the instantaneous rate of failure at T = t conditional on surviving beyond T = t. Note the hazard is NOT a probability.
More precisely, since the probability of an event in the interval (t, t + δ) given the event has not occurred up to time t, is P(t < T <= t + δ|T > t), the hazard is obtained by dividing the probability by the length of the interval and defining a limit. In other words:
$$ \lambda(t) = \lim_{\delta \to 0+} {{P(t < T \leq t + \delta | T > t)} \over {\delta}} $$
Some simplification reduces this to F'(t) / S(t) or f(t) / S(t), where F'(t) or f(t) is the probability density function (the derivative of F(t)). From linking the survival function to the cumulative distribution function we know F'(t) = -S'(t), so we can further reduce the hazard to:
$$ \lambda(t) = d log S(t) / dt $$
and integrating both sides with respect to t and using the fact that S(0) = 1:
$$ S(t) = exp( - \int_0^t \lambda(s) ds) $$
Which gives us the cumulative hazard function:
$$ A(t) = \int_0^t \lambda (s) ds $$
Which we can simplify again to get the density function as a product of the hazard function:
$$ f(t) = \lambda(t) exp(-\int_0^t \lambda(s) ds) $$
All three parameters (survival, cumulative hazard and density) are related. If one parameter is specified the other 2 can be derived.
Life Table Estimates
Life tables are used to describe pattern of survival in populations. These methods can be applied to the study of any time to event endpoint.
The life table method requires a pre-specification of a set of intervals that span the follow-up time. The life-tables are computed by counting the numbers of censored and uncensored observations that fall into each of the time intervals.
The conditional probability of an event in the interval Ii = [ti,ti+1) given that the event did not occur up to time ti that is P(T ∈ Ii | T > ti) is estimated by:
$$ p_i = r_i / m_i $$
where r is the number of events in the interval, I and m = n - c/2 are called the effective sample size, equal to the number of patients at risk (n) in the interval m minus half the number of censored observations (c) in the interval I. This is because the assumption of the life-table is that any censored observation with an interval is treated as if it were censored at the midpoint of the interval. So, since the censored observations are at risk for only half the interval they count for only half in figuring out the number at risk.
The estimated variance of pi and its estimated standard error is:
$$ var(p_i) = {{ p_i (1 - p_i) } \over m_i} $$
The product limit estimate of the survival function at ti is obtained by piecing together the conditional probabilities pi estimated above.
$$ \hat S (t_0) = 1 $$
$$ \hat S (t_i) = P(T > t_i) = P(T < t_i | T > t_{i-1})P(T > t_{i-1}) = (1 - p_{i - 1})\hat S(t_{i-1}) =(1 - p_{i - 1})(1 - p_{i - 2})... (1 - p_{1})(1 - p_{0}) $$
$$ Var(\hat S(t_0)) = 0 $$ $$ Var(\hat S(t_i)) = \hat S(t_i)^2 \ssum_{sum_{j=1}^{i-1} {p_j \over {m_j (1-p_j}}$$