Statistical Modeling
Statistical association analysis is not only about significance. There is much to consider when deciding which covariates to include and choosing a model to represent the relationship.
Regression Modeling
- If we have a small number of variables, we can manually assess confounding and collinearity by comparing models with and without each potential confounder.
- First we need to ask what counts as a "small" number of variables: a number that is manageable for manual assessment of variable significance and confounding.
- If we have > 10 variables and need to cut variables:
- Base the cut on prior assumptions about confounders
- Base the cut on a threshold p-value (maybe 0.2) from univariable analyses
- There is no statistical test for confounding. A 10% change in the estimate is not a hard rule; we might consider > 10% to be conservative
- When we have a large number of variables, manual assessment becomes impossible, so we choose a systematic method to cut variables and reduce the model
- Beware of dropping important confounders or relying solely on measures such as AIC
- If there are K variables (in an additive model without interaction):
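With K candidate variables and no interaction terms, each variable is either in or out of the model, which gives 2^K possible additive models, counting the intercept-only model. The sketch below enumerates them for a handful of variables and shows how fast the count grows. The variable names are hypothetical and only there for illustration.

```python
from itertools import combinations


def candidate_models(variables):
    """Yield every subset of `variables`; each subset is one additive model."""
    for size in range(len(variables) + 1):
        yield from combinations(variables, size)


# Hypothetical covariate names, not taken from any real dataset.
variables = ["age", "sex", "bmi", "smoking"]
models = list(candidate_models(variables))
print(len(models))  # 16, i.e. 2**4

# The count doubles with every extra variable.
for k in (5, 10, 20, 30):
    print(k, 2 ** k)
```

Even at K = 20 there are over a million candidate models, which is why manual comparison is only realistic for a small K.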
Troubleshooting Regression Modeling
- Errors or odd results for a categorical variable with multiple categories
- Likely due to a small number of samples in some categories
- Fix: Combine categories in a way that makes sense, so that the combined group has a larger sample size
- If the variable cannot be defined differently, consider excluding the variable initially and including it later on when we have fewer variables
- Sometimes we want to "force" a variable into a model, regardless of its significance or confounding
- This might occur when it's something the readers want to know (age, sex, etc.) or it's the main exposure of interest
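For the sparse-category problem described above, a quick count per category usually shows which levels are too small. The sketch below tabulates the counts, flags sparse levels, and applies an analyst-chosen recoding. The data, the category names, and the cutoff of 5 observations are all assumptions made for illustration; which categories it makes sense to merge remains a subject-matter decision.

```python
from collections import Counter


def sparse_levels(values, min_count=5):
    """Return {level: count} for levels with fewer than `min_count` observations."""
    counts = Counter(values)
    return {level: n for level, n in counts.items() if n < min_count}


def recode(values, mapping):
    """Apply an analyst-chosen mapping; levels not in the mapping are unchanged."""
    return [mapping.get(v, v) for v in values]


# Hypothetical education variable with two sparse levels.
education = (["high school"] * 40 + ["bachelor"] * 35
             + ["master"] * 4 + ["doctorate"] * 2)
print(sparse_levels(education))  # {'master': 4, 'doctorate': 2}

# Merge the two sparse levels into one group that is meaningful on its own.
merged = recode(education, {"master": "graduate", "doctorate": "graduate"})
print(Counter(merged)["graduate"])  # 6
print(sparse_levels(merged))        # {}
```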
Variable Selection Methods
When no interaction exists, we have 3 primary approaches to variable selection:
- Forward
- Pick the variable most highly associated with the outcome
- Given that this variable is in the model, pick the most highly associated of the remaining variables
- Continue until there are no more significant variables or all variables are in the model
- Stepwise Regression
- It is the same as the forward procedure, except that at each step it looks at all variables in the model to see if they are still statistically significant (if not, the variable is dropped)
- Pros:
- Speeds up the process of searching through models
- More feasible than manual selection
- Cons:
- Blind analysis: variables are selected without thinking about the subject matter
- Backward methods
- Start with fully adjusted model
- Remove variables one at a time, starting with the largest p-value
- Check for any confounding impact by comparing ORs before and after removal
- Stop when every remaining variable is either significant or confounds other variables
- Pro:
- Useful with small numbers of variables
- Cons:
- In practice it is impossible to use with a very large number of variables
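The backward procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the data are simulated, the variable names are hypothetical, the exposure is forced into every model, significance uses Wald p-values at 0.05, and "confounding" is taken to mean that removing a variable changes the exposure OR by 10% or more relative to the model that contains it. The logistic fit is a small hand-written Newton-Raphson routine so that the block needs no external packages.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_logistic(rows, y, iters=25):
    """Newton-Raphson logistic fit; returns (coefficients, Wald p-values)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = 1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1 - p) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    pvals = []
    for i in range(k):
        # Variance of coefficient i is the i-th diagonal entry of hess^-1
        # (hess comes from the last Newton step, i.e. at convergence).
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        z = beta[i] / math.sqrt(solve(hess, unit)[i])
        pvals.append(math.erfc(abs(z) / math.sqrt(2)))  # two-sided normal p
    return beta, pvals


def backward_eliminate(data, y, exposure, candidates, alpha=0.05, change=0.10):
    """Drop non-significant covariates unless they confound the exposure OR."""
    def fit(covars):
        cols = [exposure] + covars
        rows = [[1.0] + [data[c][i] for c in cols] for i in range(len(y))]
        beta, p = fit_logistic(rows, y)
        return math.exp(beta[1]), dict(zip(covars, p[2:]))

    kept = list(candidates)  # start with the fully adjusted model
    while kept:
        or_now, pvals = fit(kept)
        # Non-significant covariates, largest p-value first.
        order = sorted((v for v in kept if pvals[v] >= alpha),
                       key=pvals.get, reverse=True)
        for v in order:
            or_without, _ = fit([u for u in kept if u != v])
            if abs(or_without - or_now) / or_now < change:
                kept.remove(v)  # not significant and not a confounder
                break
        else:
            break  # every remaining covariate is significant or a confounder
    return kept


# Simulated data: `confounder` drives both exposure and outcome; `noise` is unrelated.
random.seed(0)
n = 400
conf = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
expo = [1.0 if random.random() < 1 / (1 + math.exp(-c)) else 0.0 for c in conf]
y = [1.0 if random.random() < 1 / (1 + math.exp(-(-0.5 + 0.7 * e + 1.2 * c)))
     else 0.0 for e, c in zip(expo, conf)]
data = {"exposure": expo, "confounder": conf, "noise": noise}

kept = backward_eliminate(data, y, "exposure", ["confounder", "noise"])
print(kept)
```

In real work the fitting step would normally come from a statistical package rather than hand-written code. The point of the sketch is the loop: refit, try the largest p-value first, and keep a variable whenever its removal shifts the exposure estimate too much.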