Propensity Score Weighting Analysis

Unlike randomized clinical trials, observational studies must adjust for differences such as confounding to ensure patient characteristics are comparable across treatment groups. This is frequently addressed through propensity scores (PS), which summarizes differences in patient characteristics between treatment groups. Propensity Score is the probability that each individual will be assigned to receive the treatment of interest given their measured covariates. Matching or Weighting on the PS is used to adjust comparisons between the 2 groups, thus reducing the potential bias in estimated effects of observational studies.

The following use cases assume a binary treatment or exposure in order to infer causality. Given a treatment and control with one outcome observed per unit, can we estimate the treatment effect? Note we can only estimate the treatment effect, identification of causality is not possible through observational studies.

Estimation of Propensity Scores

Propensity scores are most commonly estimated using binomial regression models (logistic regression, probit, etc.). Other propensity score methods include:

Classification Trees
Bagging/Boosting
Neural Networks
Recursive Partitioning

Basically, any model that provides predictive probability and where all the covariates related to treatment and outcome that were measured before treatment are included in the propensity score estimation model. The SAS example below estimates propensity scores for the treatment variable group predicted from covariates var1, var2, and var3 using a logistic regression:

PROC LOGISTIC data=ps_est;
  title 'Propensity Score Estimation';
  model group = var1-var3 / lackfit outroc = ps_r;
  output out = ps_p pred = ps xbeta=logit_ps;
  /* Output the propensity score and the logit of the propensity score */
run;

Once we have the propensity scores estimated, we must make sure the measured covariates are balanced in order to reduce bias. There are several ways to achieve this:

Graphic of the propensity score distribution - The distribution of propensity score between the two groups should overlap. Non-overlapping distributions suggest that one or more covariates are strongly predictive, and variable selection or stratification should be reconsidered.
Standardized differences of each covariate between treatment groups - the magnitude of the difference between baseline characteristics of the groups can be calculated depending on the method of deriving propensity scores. One limitation of this method is the lack of consensus as to what the threshold should be, though researchers have suggested a standardized difference of .1 or more denotes meaningful imbalance in the baseline covariates.
Stratify by deciles or quintiles - By stratifying the propensity score by deciles or quintils, a boxplot can represent each quintile.

When the scores aren't balanced, the covariates in the model should be adjusted. This could mean adding or removing covariates, adding interactions, or substituting a non-linear term for a continuous one.

Estimating Treatment Effects

After we obtain the propensity score, the next step is to estimate the Average Treatment Effect in the population (ATE). Using multiple methods can help strengthen conclusions, while discrepancies can indicate confounding and/or sensitivity to the analysis approach.

Stratification

Stratification divides individuals into many groups on the basis of their propensity score values. The optimal number of strata depends on sample size and the amount of overlap between treatment and control group propensity scores, however researchers have suggested 5 subclasses is sufficient enough to remove 90% of bias in the majority of PS studies. The treatment effect for each subclass is the average effect across strata weighted by stratum-specific standard error:

$$ \hat\mu_{1i} - \hat\mu_{2i} = {{\sum w_i (\bar X_{1i} - \bar X_{2i}) } \over {\sum w_i}} $$

Where $ w_i = { 1 \over {SE}^2_i} $ (the square of the standard error of the difference between means). This can be coded in SAS as follows:

/* Stratify */
PROC RANK data=ps_p out=ps_strata groups=5;
	var ps_pred;
    ranks ps_pred_rank;
run;

/* Sort */
proc sort data = ps_strataranks;
	by ps_pred_rank;

/* Compute the difference between group means in each stratum,
	as well as the standard error of this within stratum difference */
proc ttest;
  by ps_pred_rank;
  class group;
  var outcome;
  ods output statistics = strata_out;

/* Find stratum specific weights and mutliply by mean difference */
data weights;
  set strata_out;
  if class = 'Diff (1-2)';
  wt_i = 1/(StdErr**2);
  wt_diff = wt_i*Mean;
 
/* Find the mean weighted difference and its standard error */
proc means noprint data = weights;
  var wt_i wt_diff;
  output out = total sum = sum_wt sum_diff;
  data total2;
  set total;
  Mean_diff = sum_diff/sum_wt;
  SE_Diff = SQRT(1/sum_wt);

proc print data = total2;

run;

Matching

The goal of matching is to obtain similar groups of treatment and control subjects by matching individual observations on their propensity scores. It is generally recommended to use 1:1 matching with no replacement to avoid bias. The Nearest Neighbor (or Greedy matching) selects a control unit for each treated unit based on the smallest distance from the treated unit. The big problem with this is it depends on the order in which the data is sorted, thus randomization is necessary.

Nearest Neighbor with calipers is a method which improves NN by allowing a maximum allowable difference between scores (or caliper width) in order for PS to be matched. Researchers have suggested a caliper width of .2 times the standard deviation, or a static value such as .1.

There are SAS macros available for matching propensity score. The general procedure:

Sort observations into random order with each group

Transpose data to obtain seperate datasets for treatment and control

Merge them into a single dataset, matching each observation in the treatment group with a single observation in the control group
- If no control observations are found in the range, no matched pair is created

After matched pairs have been identified, the difference in outcome means in each group can be tested ussing a t-test.
- Use a correlated-means t-test rather than independant-means (PAIRED keyword in SAS)

Inverse Probability of Treatment Weights (IPTW)

In IPTW, individuals are weighted by the inverse probability of receiving the treatment they actually received. So, control subjects are weighted by $1 / (1 - p_i) $ and treated subjects with $ 1/p_i $.

The IPTW method is inclusive of all subjects, so no data loss occurs. However, it is very sensitive to outliers.

References

Overlap Weighting: A Propensity Score Method That Mimics Attributes of a Randomized Clinical Trials

Propensity Score Analysis and Assessment of propensity Score Approaches Using SAS Procedures