Missing Data

Missing data is common in epidemiology studies, and always observed in longitudinal studies. Inadequate handling of missing data may cause bias or lead to inefficient analyses.  If an estimate is incomplete, we can remove it without introduce bias. If a variable is heavily missing, it may be appropriate to remove the variable then remove all incomplete records.

There is a difference between missing and "unknown" data. Ex. someone refusing to report income on a survey would be considered  missing, but an individual who has no preference between republican and democrat candidates may be considered a new category of "do not know".

Missing Data Mechanisms

The way to analyze incomplete data depends on the underlying missing data mechanisms. Ex. Is the missing data the same between males and females? Are values which are higher or lower more likely to be missing?

MAR vs MCAR

Missing Data Patterns (Univariate)

Suppose we have a sample of size n, with some missing data points where:
image-1670859174228.png

The pattern of missing data can be described via the distribution of the indicator variable R:
image-1670859228747.png

Missing Data Patterns (Mutlivariate)

Assume we have a multivariate dataset with outcome Y and covariates X1-Xk

Suppose we only have missing data on Y. As in the univarite case, the patterns of missing data can be described via the distribution of the indicator variable R.
image-1670860184831.png
Where Ri = 1 has probability p*Ri, and Ri = 0 has probability 1-p*Ri

Imputation

Imputation is the idea that we can fill in the missing data using either a deterministic or stochastic method or combination of both.

Other procedures include EM algorithm, Markov Chain Monte Carlo Bayesian method, weighted methods (IPW), etc.

Multiple Imputation

To avoid the bias of the deterministic imputation, the following methods are applied:

There is a paper on this by Sterne et al which explains the pitfalls and reporting procedure, but I will not go into detail here.

Other methods to fill-in missing data

Note that to assess confounding we compute crude estimates against adjusted with the same imputed data. It is very important to use the same dataset to keep results consistent.

R Code

framdat2 <- read.table("C:/Users/liuc/Documents/BS852/BS852_Fall2021/class12/framdat2.txt",header=T, na.strings=c("."))
summary(framdat2)

# Proportion missing per variable
for(i in 1:ncol(framdat2)){
  print(names(framdat2)[i])
  print(sum(is.na(framdat2[,i]))/nrow(framdat2))
} # 6 of the variables have missing data

# Distribution of incomplete cases
num.miss <- apply(is.na(framdat2),1,sum)
table(num.miss) 
length(num.miss[which(num.miss %in% c(1:6))])

MRW <- is.na(framdat2$MRW)

# Missing MRW
table(MRW)
# Missing GLI by sex
table(framdat2$SEX, MRW)/apply(table(framdat2$SEX, MRW),1,sum)

# Is missing GLI associated with sex?
summary(glm(MRW~framdat2$SEX, family=binomial(link = "logit")))

cbind(framdat2$FVC, is.na(framdat2$FVC))[1:28,]

sim.data <- read.csv("C:/Users/liuc/Documents/BS852/BS852_Fall2021/class12/sim.data.csv",header=T, na.strings=c("."))

# crude and adjusted models
summary(lm(y ~ SMOK, data=sim.data))

summary(lm(y ~ SMOK + Age + BMI + SEX, data=sim.data))

# 10% MCAR
set.seed(1234)
R <- rbinom(nrow(sim.data), p=0.10, size=1) # randomly select data to keep

mod.10mcar.crude <- lm(y ~ SMOK, data=sim.data[R==0,])

mod.10mcar.adj <- lm(y ~ SMOK + Age + BMI + SEX, data=sim.data[R==0,])

summary(mod.10mcar.crude); summary(mod.10mcar.adj)

# Marginal Mean Imputation
# 10% MCAR
set.seed(1234)
R <- rbinom(nrow(sim.data), p=0.10, size=1)

mean.obs.y <- mean(sim.data[R==0,]$y) ## complete data
sim.data$y.imputed <- sim.data$y 
sim.data$y.imputed[R==1] <- mean.obs.y
boxplot(list(raw=sim.data$y, imputed=sim.data$y.imputed))
 
mod.imputed.crude <- lm(y.imputed ~ SMOK, data=sim.data)
summary(mod.imputed.crude)

mod.imputed.adj <- lm(y.imputed ~ SMOK+Age+BMI+SEX, data=sim.data)
summary(mod.imputed.adj)

# Conditional Mean Imputation
# 10% MCAR
set.seed(1234)
R <- rbinom(nrow(sim.data), p=0.10, size=1)

imp.model <- lm(y ~ SMOK+Age+BMI+SEX, data=sim.data[R==0,])
summary(imp.model)
imp.model$coeff
sim.data$y.imputed[R==1] <- imp.model$coeff[1]+
                                imp.model$coeff[2]*sim.data$SMOK[R==1]+
                                imp.model$coeff[3]*sim.data$Age[R==1]+
                                imp.model$coeff[4]*sim.data$BMI[R==1]+
                                imp.model$coeff[5]*sim.data$SEX[R==1]
boxplot(list(raw=sim.data$y,imputed=sim.data$y.imputed))
 
mod.imputed.crude <- lm(y.imputed ~ SMOK, data=sim.data);  summary(mod.imputed.crude)

mod.imputed.adj <- lm(y.imputed ~SMOK+Age+BMI+SEX, data=sim.data); summary(mod.imputed.adj)

# Distinguish MCAR from MAR
set.seed(1234)

p.miss <- c()
for (i in 1:nrow(sim.data)){
  p.miss[i] <- max(0, sim.data$Age[i]/400 + runif(n=1, min=0, max=0.02))
}

R <- c()
for (i in 1:nrow(sim.data)){
  R[i] <- rbinom(n=1, size=1, p=p.miss[i])
}

mod.mar <- glm(R ~ sim.data$Age+sim.data$BMI, family=binomial)
mod.step <- step( mod.mar, scope = R~.) 
summary(mod.step)


hdl.data <- read.csv("hdl_data.csv",header=T, na.strings=c("."))

hdl.data.2 <- hdl.data[, c("SEX","BMI5","AGE5","ALC5","CHOL5","TOTFAT_C")]
summary(hdl.data.2)

summary(lm(TOTFAT_C~SEX+BMI5+AGE5+ALC5+CHOL5, data=hdl.data.2))

# Create missing data indicator for TOTFAT_C
R <- rep(0, nrow(hdl.data.2))
R[which(is.na(hdl.data.2$TOTFAT_C)==T)] <- 1

mod.mar <- glm(R ~ AGE5 + BMI5 + SEX, family = binomial, data=hdl.data.2)
step(mod.mar, scope=R~.)

library(Amelia)

set.seed(1234)
imp.data <- amelia(hdl.data.2, m=10)  ## 10 imputed datasets
summary(imp.data[[1]][[1]])

beta.bmi <- c(); se.beta <- c()
for(i in 1:10){
  mod <- lm(TOTFAT_C~SEX+BMI5+AGE5+ALC5+CHOL5, data=imp.data[[1]][[i]])
  beta.bmi <- c(beta.bmi,   
  summary(mod)$coefficients[grep("BMI5", row.names(summary(mod)$coefficients)),1])
  se.beta <- c(se.beta, 
  summary(mod)$coefficients[grep("BMI5", row.names(summary(mod)$coefficients)),2])}
   
beta.mean <- mean(beta.bmi); beta.mean
beta.var <- mean(se.beta^2) + var(beta.bmi)
beta.mean/sqrt(beta.var)

# significance based on Wald test
2*(1-pnorm(beta.mean, 0, sqrt(beta.var)))

beta.bmi <- c(); se.beta <- c()
for(i in 1:10){
    mod <- lm(TOTFAT_C~ BMI5, data=imp.data[[1]][[i]])
    beta.bmi <- c(beta.bmi, 
        summary(mod)$coefficients[grep("BMI5", row.names(summary(mod)$coefficients)),1])
    se.beta <- c(se.beta, 
        summary(mod)$coefficients[grep("BMI5", row.names(summary(mod)$coefficients)),2])}

beta.mean <- mean(beta.bmi); beta.mean
beta.var <- mean(se.beta^2) + var(beta.bmi)
beta.mean/sqrt(beta.var)

# # significance based on Wald test
2*(1-pnorm(beta.mean, 0, sqrt(beta.var)))

Revision #4
Created 12 December 2022 15:04:20 by Elkip
Updated 12 December 2022 16:18:33 by Elkip