Interim Analysis and Data Monitoring
Clinical trials are often longitudinal in nature. Since it is usually impossible to enroll all subjects at the same time, a longitudinal study can take a long time to complete. Over the course of the trial one needs to consider administrative monitoring, safety monitoring, and efficacy monitoring.
Efficacy monitoring can be performed by taking interim looks at the primary endpoint data (before all subjects have been enrolled or have completed treatment). Interim looks are taken because they can:
- Stop the trial early if there is convincing evidence of benefit or harm from the new product
- Stop the trial early for futility, i.e., when the chance of a significant beneficial effect by the end of the study is small given the observed data
- Re-estimate the final sample size required to yield adequate power for a significant result
In short, interim analysis evaluates early efficacy, early futility, safety concerns, or adaptive changes to sample size or power.
Group Sequential Design
A common study design for interim analysis is the group sequential design (GSD), in which data are analyzed at regular intervals.
- Determine a priori the number of interim "looks"
- Let K = # of total planned analyses including final (K >= 2)
- For simplicity, assume 2 groups and that subjects are randomized in a 1:1 manner
- After every n = N/K subjects are enrolled and followed for a specific time period, perform an interim analysis on all subjects followed cumulatively
- If there is a significant treatment difference at any point, consider stopping the study
Due to multiple testing, if each interim analysis is performed at the nominal α = .05 level, the probability of observing at least one significant interim result is much greater than the overall α = .05; the interim analyses must therefore be adjusted to control the family-wise error rate. Note also that each interim analysis includes the data from all previous interims, so the test statistics are not independent.
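To get a feel for the size of the inflation, here is a minimal SAS sketch. It assumes the K looks are independent, which overstates the problem; the cumulative test statistics are positively correlated, and the true overall Type I error for K = 5 looks at α = .05 each is roughly .14 rather than .23:
/* Probability of at least one significant result among k looks,
   each tested at alpha = .05, assuming independent looks
   (an upper bound on the true group sequential inflation) */
data inflation;
  do k = 1 to 5;
    p_any = 1 - 0.95**k;   /* k = 5 gives ~.226 */
    output;
  end;
run;
proc print data=inflation noobs; run;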
In a group sequential design there is one critical value for each of the K analyses:
- First interim analysis: compare the test statistic to the critical value, |Z1| > c1. If significant, stop the trial.
- Second interim analysis: compare |Z2| > c2. If significant, stop the trial.
- ...
- Final analysis: compare |ZK| > cK.
Pocock Approach (1977)
Pocock derived a constant critical value across all stages that maintains the overall significance level at .05. The critical value depends on the number of planned analyses but is the same at each interim look.
Ex. When K=5, Z critical value = 2.413 for each interim and the final analysis. When K=4, Z critical value = 2.361.
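As a quick check, the nominal per-look significance level implied by a Pocock critical value can be recovered from the normal CDF; a short sketch using the values above:
/* Nominal two-sided level implied by the Pocock critical values */
data _null_;
  p5 = 2*(1 - probnorm(2.413));   /* K = 5: ~.0158 per look */
  p4 = 2*(1 - probnorm(2.361));   /* K = 4: ~.0182 per look */
  put p5= p4=;
run;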
O'Brien-Fleming Approach (1979)
O'Brien and Fleming proposed a sequential testing procedure in which the critical values (in absolute value) decrease over the stages. The critical Z value depends on the total number of planned analyses and on the stage of the current interim analysis.
Ex. For K = 5, after every 200 subjects complete follow-up:
- c = 2.040 (a constant determined by O'Brien and Fleming for K = 5)
- First look critical value is: sqrt(5/1)*2.04 = 4.56
- Second look critical value: sqrt(5/2)*2.04 = 3.23
- ...
- Critical value final look: sqrt(5/5)*2.04 = 2.04
This makes it more difficult to declare superiority at earlier looks but preserves most of the original alpha for the final look. The O'Brien-Fleming approach is more conservative than Pocock at early looks and is the approach recommended by the FDA in its 2010 draft Guidance on Adaptive Designs.
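The boundary arithmetic above is easy to reproduce; a small sketch:
/* O'Brien-Fleming critical values for K = 5: z_k = sqrt(K/k) * c */
data obf;
  c = 2.040;
  do k = 1 to 5;
    z_crit    = sqrt(5/k) * c;              /* 4.56, 3.23, ..., 2.04 */
    p_nominal = 2*(1 - probnorm(z_crit));   /* nominal two-sided level */
    output;
  end;
run;
proc print data=obf noobs; run;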
Controlling the Overall Significance Level
Issues with group sequential procedures:
- Need to specify the number of planned interim analyses a priori
- The timing of interim analyses is generally not exact calendar time but "information time" (i.e. based on sample size or number of events)
- Assumes interim analyses performed after every n subjects complete follow-up (or y events occur)
- Difficult to schedule formal review procedures for interim analyses after every n subjects (or y events); may want to allow more flexibility in scheduling
- The main issue, then, is: how do we allow more flexibility for unscheduled interim analyses?
Alpha-Spending
From Lan and DeMets (1983, Biometrika): adjust the significance levels via an "alpha-spending function". Think of it as each analysis spending a bit of the total alpha (Type I error).
- α(s) denotes the alpha-spending function
- s = proportion of information (sample size or events) accrued
- s = 0 at the start of the study (0% information); α(0) = 0
- s = 1 at the end of the study (100% information); α(1) = α, the overall significance level (e.g., .05)
- α(sk) = cumulative proportion of the Type I error one is willing to spend through look k
- α(sk) is not itself a significance level
An alpha-spending function can be chosen so that the resulting boundaries closely mimic the O'Brien-Fleming (or Pocock) boundaries.
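One common choice, the Lan-DeMets O'Brien-Fleming-type function, is α(s) = 2(1 − Φ(z_{α/2}/√s)), which spends almost no alpha early and spends the full α at s = 1. A minimal sketch of the cumulative spending at five equally spaced looks:
/* Lan-DeMets O'Brien-Fleming-type spending function:
   alpha(s) = 2*(1 - PHI(z/sqrt(s))), z = upper alpha/2 quantile */
data spend;
  z = probit(1 - 0.05/2);       /* 1.96 for two-sided alpha = .05 */
  do i = 1 to 5;
    s = i/5;                    /* information fraction .2, .4, ..., 1 */
    alpha_spent = 2*(1 - probnorm(z/sqrt(s)));   /* s = 1 gives .05 */
    output;
  end;
run;
proc print data=spend noobs; run;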
Interpretation:
- If the first interim analysis occurs after 20% of information, reject treatment equality if p-value < .000001
- If the third interim analysis occurs after 60% of the subjects are in, reject treatment equality if p-value < .0074
Focus on the significance levels; note that the alpha spent and the significance level at a given look are two distinct values and are not equal. Computing critical values and the corresponding significance levels requires the multivariate distribution of the sequential test statistics. There is no "simple" equation to get from α(s) to Z; the values are typically obtained via numerical integration.
/*
plots=boundary - graph observed standardized test statistics
at each interim
errspend - give the overall error spending
bscale=pvalue - print p-values instead of critical values
info=equal - equal information intervals
stop=reject - trying to reject the null (default)
*/
proc seqdesign errspend plots=boundary bscale=pvalue;
TwoSidedObrienFleming: design nstages=5 alpha=0.05
alt=twosided info=equal method=errfuncobf
stop=reject;
run;
Note that the bscale=pvalue option gives one tail of the distribution, so to get the two-sided significance level we need to multiply the upper or lower boundary p-value by 2.
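For instance, using the classic K = 5 O'Brien-Fleming final-look boundary z = 2.04 from earlier, the one-sided boundary p-value is 1 − Φ(2.04) ≈ .0207, and doubling gives a two-sided significance level of about .041 (the error-spending boundary printed by PROC SEQDESIGN will differ slightly).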
Pocock Approach in SAS
/* Pocock-type alpha-spending for a two-sided test with
a total two-sided alpha of .05 spent by the final analysis */
proc seqdesign errspend bscale=pvalue;
TwoSidedPocock: design nstages=5 alpha=0.05
alt=twosided info=equal method=errfuncpoc stop=reject;
run;
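PROC SEQDESIGN also accepts multiple labeled DESIGN statements in a single call, which makes the two spending approaches easy to compare side by side; a sketch assuming the same five-stage setup as above:
/* Compare O'Brien-Fleming-type and Pocock-type spending side by side */
proc seqdesign errspend bscale=pvalue;
TwoSidedObrienFleming: design nstages=5 alpha=0.05
alt=twosided info=equal method=errfuncobf stop=reject;
TwoSidedPocock: design nstages=5 alpha=0.05
alt=twosided info=equal method=errfuncpoc stop=reject;
run;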
Interim Analyses For Safety
Interim analysis is not necessarily easy from an administrative and study-conduct perspective. We need to determine whether the data are still recent enough to be included: weeks or months may pass between the last subject visit and the generation of interim results because of data entry and cleaning. Depending on the size of the study, the ideal goal is a lag of < 60 days between data collection at the sites and the interim analysis report; otherwise, the interim analysis may be obsolete by the time it is completed.
Inspection of adverse events and serious adverse events is the primary concern; labs, vital signs, etc. also need to be inspected. Unlike efficacy, there are often no formal stopping rules based on p-values or a parametric test. If a safety concern is perceived, the study may be stopped regardless of the significance of any between-treatment comparison.
The results of the interim analysis are also inspected by a Data and Safety Monitoring Board (DSMB), sometimes called the Data Monitoring Committee. A DSMB usually consists of:
- >= 1 Statistician with expertise in interim analyses
- 2-4 clinicians with experience in the relevant therapeutic area
- Maybe an ethicist (especially in government-sponsored trials)
- No member can be a study investigator
- DSMB is independent of the sponsor and all study activities
- DSMB can recommend to sponsor early stoppage when there is evidence of clear risk, harm or futility. But they cannot stop the trial themselves
The sponsor will often hire an outside group (a Contract Research Organization, or CRO) to perform the interim analyses. The CRO statistician holds the randomization schedule, and the analysis group cannot divulge ANY information to the sponsor or to any personnel involved in the study; results are presented only to the independent Data and Safety Monitoring Board.