Statistical Modeling

Statistical association analysis is not all about significance. There is much to consider when deciding which covariates to include and choosing a model to represent the relationship.

Regression Modeling

Troubleshooting Regression Modeling

Variable Selection Methods

When no interaction exists, we have three primary approaches to variable selection (a code sketch follows this list):

  1. Forward Selection
    • Pick the variable most highly associated with the outcome
    • With that variable in the model, pick the best of the remaining variables
    • Continue until no remaining variable is significant or all variables are in the model
  2. Stepwise Regression
    • The same as the forward procedure, except that at each step it re-checks all variables already in the model to see if they are still significant (if not, the variable is dropped)
    • Pros:
      • Speeds up the process of searching through models
      • More feasible than manual selection
    • Con:
      • Encourages blind, automated analysis without substantive thought
  3. Backward Elimination
    • Start with the fully adjusted model
    • Remove variables one at a time, starting with the one with the largest p-value
    • Check for confounding by comparing ORs before and after each removal
    • Stop when all remaining variables are significant or confound other variables
    • Pro:
      • Useful with small numbers of variables
    • Con:
      • In practice it is infeasible with a very large number of variables
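
As a rough illustration, here is a minimal sketch of p-value-driven forward selection in Python, assuming a pandas DataFrame with a binary outcome column and a list of candidate covariates (the names and the 0.05 threshold are illustrative assumptions, not from the source).

```python
# Minimal sketch of p-value-driven forward selection (illustrative only).
# Assumes a pandas DataFrame `df` with a 0/1 outcome column and a list of
# candidate covariate names; the 0.05 threshold is an assumption.
import statsmodels.api as sm

def forward_select(df, outcome, candidates, alpha=0.05):
    selected, remaining = [], list(candidates)
    while remaining:
        # Fit one model per remaining variable and record that variable's p-value
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            fit = sm.Logit(df[outcome], X).fit(disp=0)
            pvals[var] = fit.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:      # no remaining variable is significant: stop
            break
        selected.append(best)         # add the most significant variable
        remaining.remove(best)
    return selected

# Stepwise: after each addition, also drop any already-selected variable that is
# no longer significant. Backward: start from the full model and repeatedly
# remove the variable with the largest (non-significant) p-value.
```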

Modeling with Interaction

Interaction is regarded as a deviation from the no-interaction (multiplicative) model. When interaction terms are included, the order of the terms can make a difference: in a hierarchical model, lower-order terms always come before their higher-order interaction terms.

  1. Force all main effects into the model; perform a stepwise regression analysis on the 2-way interactions
    • It may be wise to consider only interactions involving the exposure of interest
    • Be sure the model remains hierarchical
  2. Retain whichever 2-way interactions remain after step 1, then perform a stepwise regression analysis on the main effects that are not part of the retained interactions (a formula-based sketch follows this list)
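
As a rough sketch of a hierarchically well-formulated model, the formula interface in statsmodels expands a * b into both main effects plus their interaction, so lower-order terms are always kept alongside the higher-order term (the column names outcome, exposure, and age below are illustrative assumptions).

```python
# Minimal sketch of a hierarchical (main effects + 2-way interaction) model.
# `outcome`, `exposure`, and `age` are assumed column names in DataFrame `df`.
import statsmodels.formula.api as smf

# "exposure * age" expands to exposure + age + exposure:age, so the interaction
# never appears without its lower-order main effects.
model = smf.logit("outcome ~ exposure * age", data=df).fit()
print(model.summary())
```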

Model Selection

One can look at changes in the deviance (the change in -2 log likelihood) between nested models.
image-1670256175900.png
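
For two nested models fit to the same data, the comparison can be written as

\[
\Delta D = D_{\text{reduced}} - D_{\text{full}} = -2\left(\ln L_{\text{reduced}} - \ln L_{\text{full}}\right) \sim \chi^2_{\Delta p},
\]

where \(\Delta p\) is the difference in the number of estimated parameters between the two models.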

AIC & BIC

Both are based on the likelihood. An advantage is that, unlike the -2 ln(LR) comparison, you do not need nested models to compare AIC or BIC between models; the disadvantage is that there is no test or p-value associated with the comparison. A model with smaller AIC or BIC values provides a better fit. When n > 7, the BIC is more conservative.

AIC and BIC can also be used to compare non-nested models (models where the set of independent variables in one is not a subset of the independent variables in the other), provided both models are fit to the same data.
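
The standard definitions, with \(\ln L\) the maximized log-likelihood, \(k\) the number of estimated parameters, and \(n\) the sample size, are

\[
\mathrm{AIC} = -2\ln L + 2k, \qquad \mathrm{BIC} = -2\ln L + k\ln n .
\]

The BIC's penalty per parameter is \(\ln n\), which exceeds the AIC's penalty of 2 once n > 7, which is why the BIC is the more conservative of the two.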

Common Issues with Model Building

Collinearity occurs when independent variables in a regression model are highly correlated, or a variable can be expressed as a linear combination of other variables.

Problem: the estimated standard errors may be inflated or deflated, so that significance testing of the parameters becomes unreliable.

One useful strategy is to examine the correlations among potential independent variables and run collinearity diagnostics. When two variables are collinear, drop one or combine them into a single new variable.
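
A rough sketch of such diagnostics, assuming a DataFrame X of candidate independent variables (the 5-10 rule of thumb for variance inflation factors is a common heuristic, not a rule from the source):

```python
# Minimal sketch of collinearity checks, assuming a DataFrame `X` whose columns
# are the candidate independent variables (names are illustrative).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

print(X.corr())  # pairwise correlations among candidate predictors

# Variance inflation factors; values well above roughly 5-10 are often taken
# as a sign of problematic collinearity.
Xc = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)
```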

Risk Prediction & Model Performance

Calibration quantifies how close the predictions are to the actual outcomes (goodness of fit)
Discrimination refers to the model's ability to correctly distinguish between the two outcome classes

These measures are generally used only for dichotomous outcomes.

A model that assigns a probability of 1 to all events and 0 to all nonevents would have perfect calibration and perfect discrimination. A model that assigns 0.51 to events and 0.49 to nonevents would have perfect discrimination but poor calibration.

Calibration Metrics

Calibration at large: how close the overall proportion of events is to the mean of the predicted probabilities.

Hosmer-Lemeshow Decile Approach (Logistic Regression)
  1. Divide the model-based predicted probabilities into deciles (groups of size n_j)
  2. In each decile, calculate the mean of the predicted probabilities, p_bar_j, and compare it to the observed proportion of events in that decile, r_j (the statistic is written out after this list):

    image-1670258135240.png

  3. Compare the result to a chi-square distribution with degrees of freedom equal to the number of groups minus 2
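
Using the notation above (G groups, group sizes n_j, observed event proportions r_j, and mean predicted probabilities p_bar_j), a standard form of the statistic is

\[
H = \sum_{j=1}^{G} \frac{n_j\,(r_j - \bar{p}_j)^2}{\bar{p}_j\,(1 - \bar{p}_j)} \;\sim\; \chi^2_{G-2}.
\]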

The statistic is sensitive to small event probabilities within categories; one construction to address this adds 1/n_j to p_bar_j in the denominator. It is also sensitive to large sample sizes.

This approach has serious problems: results can depend markedly on the number of groups, and there is no theory to guide the choice of that number. It also cannot be used to compare models.
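
A minimal computational sketch of the decile approach, assuming arrays y of 0/1 outcomes and p of model-based predicted probabilities (the names and the use of pandas/scipy are assumptions):

```python
# Minimal sketch of the decile-based Hosmer-Lemeshow test, assuming arrays
# `y` (0/1 outcomes) and `p` (model-based predicted probabilities).
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    df = pd.DataFrame({"y": y, "p": p})
    # Group observations into deciles of predicted probability
    df["decile"] = pd.qcut(df["p"], groups, labels=False, duplicates="drop")
    g = df.groupby("decile")
    n_j = g.size()                       # group sizes
    p_bar = g["p"].mean()                # mean predicted probability per group
    r_j = g["y"].mean()                  # observed event proportion per group
    stat = (n_j * (r_j - p_bar) ** 2 / (p_bar * (1 - p_bar))).sum()
    dof = len(n_j) - 2                   # number of groups minus 2
    return stat, chi2.sf(stat, dof)      # statistic and p-value
```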

ROC (Receiver Operating Characteristic) Curve

Plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate) across decision thresholds; look for the threshold with the best trade-off between sensitivity and specificity (the true negative rate).

image-1670258456991.png
image-1670258614039.png
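
A minimal sketch of an ROC curve and its area (AUC), assuming arrays y of 0/1 outcomes and p of predicted probabilities from a fitted model (names are assumptions):

```python
# Minimal sketch of an ROC curve and AUC, assuming arrays `y` (0/1 outcomes)
# and `p` (predicted probabilities) from a fitted model.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y, p)    # 1 - specificity, sensitivity
auc = roc_auc_score(y, p)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # reference line: no discrimination
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.legend()
plt.show()
```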

Prediction, Goodness of Fit

This value can be interpreted as the "true" predictive accuracy. Methods to validate:

With any of these methods, the analysis should be repeated many (over 100) times.

Harrell's c & Survival Analysis

Calibration at large - compare how close the mean of the model-based predicted probabilities at time t is to the Kaplan-Meier estimate at time t (see the sketch below)
Calibration by decile - replace the rates/proportions in the deciles with their Kaplan-Meier equivalents; the degrees of freedom change to 9
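
A minimal sketch of the calibration-at-large comparison at a fixed time t, using the lifelines package and assuming arrays durations, events (1 = event observed), and model-based predicted event probabilities p_t at time t (all names and the time horizon are assumptions):

```python
# Minimal sketch of calibration at large at a fixed time t for a survival model.
# `durations`, `events`, and `p_t` (predicted event probabilities at time t)
# are assumed inputs.
import numpy as np
from lifelines import KaplanMeierFitter

t = 5.0                                    # illustrative time horizon
km = KaplanMeierFitter().fit(durations, event_observed=events)
observed_event_prob = 1 - km.predict(t)    # 1 - S(t) from the Kaplan-Meier curve

print(observed_event_prob, np.mean(p_t))   # compare observed vs. mean predicted
```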

Multiple Comparisons

The significance level is the probability that, in a single test, a parameter is called significant when it is not (a Type I error)

The multiple comparisons problem asks what the overall significance level is when we test several hypotheses.
image-1670259244796.png
We can apply the Bonferroni correction (0.05/n) to account for multiple comparisons.
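
For n independent tests each performed at level \(\alpha\), the overall (family-wise) probability of at least one Type I error is

\[
P(\text{at least one false positive}) = 1 - (1 - \alpha)^n,
\]

which motivates testing each hypothesis at level \(\alpha/n\) (Bonferroni) so that the family-wise error rate stays near \(\alpha\).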

False discovery rate (FDR) - the expected proportion of false positives among the set of rejected hypotheses; often used in studies of gene expression.
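
The source does not name a specific procedure; as one common choice, the Benjamini-Hochberg FDR adjustment (and the Bonferroni correction) are available in statsmodels. A minimal sketch, assuming an array pvals of raw p-values:

```python
# Minimal sketch of multiple-comparison adjustments, assuming an array `pvals`
# of raw p-values from the individual tests.
from statsmodels.stats.multitest import multipletests

# Bonferroni control of the family-wise error rate
reject_bonf, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg control of the false discovery rate
reject_fdr, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```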

