Missing Data
Missing data is common in epidemiology studies, and always observed in longitudinal studies. Inadequate handling of missing data may cause bias or lead to inefficient analyses. If a variable is heavily missing, it may be appropriate to remove the variable then remove all incomplete records.
There is a difference between missing and "unknown" data. Ex. someone refusing to report income on a survey would be considered missing, but an individual who has no preference between republican and democrat candidates may be considered a new category of "do not know".
Missing Data Mechanisms
The way to analyze incomplete data depends on the underlying missing data mechanisms. Ex. Is the missing data the same between males and females? Are values which are higher or lower more likely to be missing?
- Missing completely at random (MCAR): No systematic differences between the missing values and the observed values.
- Ex. Blood pressure measurements missing because of a broken tool
- Missing at random (MAR): Any systemic difference between the missing values and the observed values can be explained by differences in observed data.
- Ex. Missing blood pressure measurements may be lower than those measured because younger people are more likely to be missing measurements.
- Missing not at random (MNAR): Even after the observed data are taken into account, systematic differences remain between the missing values and the observed values.
- Ex. people with hypertension missing clinic appointment because of headaches
Missing Data Patterns
Suppose we have a sample of size n, with some missing data points where:
The pattern of missing data can be described via the distribution of the indicator variable R:
- If the probability that Ri = 1 is independent of the value yi, for any index i, the missing data pattern is ignorable. Equivalently. data are MAR
- If the probability that Ri = 1 is independent of the value yi, for any index i, the missing data pattern is NOT ignorable, i.e. MNAR, the data are informatively missing
Imputation
Imputation is the idea that we can fill in the missing data using either a deterministic or stochastic method or combination of both.
- Deterministic imputation
- Refers to the situation given specific values of other fields, when only one value of a field will cause the record to satisfy all of the edits, such as using the mean or median
- Stochastic (regression) imputation
- This method uses regression, it adds a random error term to the predicted value and is therefore able to reproduce the correlation of X and Y more appropriately.
Other procedure s include EM algorithm, Markov Chain Monte Carlo Batesian method, weighted methods (IPW), etc.