Regression Diagnostics

The estimation and inference from the regression model depend on several assumptions, which need to be checked using regression diagnostics.

We divide the potential problems into three categories:
    Errors: we assume ε ~ N(0, σ²I), i.e. independent, normal errors with constant variance
    Model: we assume the structural part of the model, E(Y|X) = Xβ, is correct
    Unusual observations: a few cases may not fit the model or may unduly influence the fit

Diagnostic Techniques

Model building is often an iterative and interactive process. It is quite common to repeat the diagnostics on a succession of models.

Unobservable Random Errors

Recall that a basic multiple linear regression model is given by:
    E(Y|X) = Xβ   and   Var(Y|X) = σ²I

The vector of errors is ε = Y - E(Y|X) = Y - Xβ, where ε is a vector of unobservable random variables with:
    E(ε | X) = 0
    Var(ε | X) = σ²I

We estimate β by least squares:
    beta_hat = (X'X)⁻¹X'Y
and the fitted values Y_hat corresponding to the observed values Y are:
    Y_hat = X beta_hat = X(X'X)⁻¹X'Y = HY
where H = X(X'X)⁻¹X' is the hat matrix.
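As a minimal sketch (assuming the faraway package, which provides the gala data used in the Code section below), these quantities can be computed directly and checked against lm():

library(faraway)
X <- model.matrix(~ Area + Elevation + Scruz + Nearest + Adjacent, gala)
Y <- gala$Species
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)   # (X'X)^{-1} X'Y
H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix
Y_hat <- H %*% Y                            # fitted values HY
g <- lm(Species ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)
all.equal(as.vector(Y_hat), as.vector(fitted(g)))   # TRUE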

The Residuals

The vector of residuals e_hat, which can be graphed, is defined as:
    e_hat = Y - Y_hat = (I - H)Y
with:
    E(e_hat | X) = 0
    Var(e_hat | X) = σ²(I - H)
    Var(e_hat_i | X) = σ²(1 - hii), where hii is the ith diagonal element of H.

Diagnostic procedures are based on the residuals, which we would like to assume behave as the unobservable errors would.

Cases with large values of hii will have small values of Var(e_hat_i | X).
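As a quick check of this relationship (a sketch continuing with the gala fit g above; rstandard() is the base-R function that performs exactly this standardization):

h <- hatvalues(g)                        # diagonal elements h_ii of H
r <- residuals(g) / (summary(g)$sigma * sqrt(1 - h))
all.equal(r, rstandard(g))               # TRUE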

The Hat Matrix

The hat matrix H is an n × n symmetric matrix:
    H = X(X'X)⁻¹X'
    H is idempotent: HH = H

hii is also called the leverage of the ith case. As hii approaches 1, y_hat_i gets close to y_i.

Error Assumptions

We wish to check the independence, constant variance, and normality of the errors ε. The errors are not observable, but we can examine the residuals e_hat.

They are NOT interchangeable with the errors: since (I - H)X = 0,
    e_hat = (I - H)Y = (I - H)ε

The errors may have equal variance and be uncorrelated while the residuals do not. The impact of this is usually small, and diagnostics are often applied to the residuals in order to check the assumptions on the errors.

Constant Variance

Check whether the variance of the residuals is related to some other quantity, such as the fitted values Y_hat or a regressor Xi: a plot of e_hat against Y_hat should show constant spread if the assumption holds, while a funnel shape indicates nonconstant variance.
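Beyond the plot, a formal check is available (a sketch assuming the car package; ncvTest() is its score test of the null hypothesis of constant error variance, applied here to the gala fit g from the earlier sketch):

library(car)
ncvTest(g)   # small p-value suggests nonconstant variance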

Normality Assumptions

The tests and confidence intervals we used are based on the assumption of normal errors. The residuals can be assessed for normality using a QQ plot. This compares residuals to "ideal" normal observations.

Suppose we have a sample of size n, x1, x2, ..., xn, and wish to examine whether the x's are a sample from a normal distribution: sort the sample and plot the order statistics against the expected values of the corresponding standard normal order statistics; an approximately straight line supports normality.

Many statistics have been proposed for testing a sample for normality. One that works well is the Shapiro-Wilk W statistic, which is the square of the correlation between the observed order statistics and the expected normal order statistics.
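In R this is a single call (a sketch, again using the gala fit g from above):

shapiro.test(residuals(g))   # W close to 1 and a large p-value are consistent with normality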

Testing for Curvature

One helpful test looks for curvature in the plot of the residuals e_hat against a quantity U, where U could be a regressor or a combination of regressors such as the fitted values.

A simple test for curvature is to add U² to the mean function, refit, and examine the t-test of its coefficient: a significant coefficient indicates curvature. (When U = Y_hat this is Tukey's test for nonadditivity; residualPlots() in the car package, loaded by alr4, reports these tests for each plot.)
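A sketch of the squared-term version of this test, applied to the UN11 model that is fit in the Code section below (I(pctUrban^2) adds the square of U = pctUrban):

library(alr4)
m2 <- lm(fertility ~ log(ppgdp) + pctUrban, UN11)
m2q <- update(m2, . ~ . + I(pctUrban^2))        # add U^2 to the mean function
summary(m2q)$coefficients["I(pctUrban^2)", ]    # t-test of the quadratic term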

Unusual Observations

Some observations do not fit the model well; these are called outliers. Some observations change the fit of the model in a substantive manner; these are called influential observations. If an observation has the potential to influence the fit, it is called a leverage point.

hii is called the leverage of the ith case and is a useful diagnostic.

Var(e_hat_i | X) = σ²(1 - hii)
    A large leverage hii will make Var(e_hat_i | X) small
    The fit will be forced close to yi

    Σ hii = trace(H) = p' = the number of parameters
    An average value for hii is p'/n
    A rule of thumb: cases with leverage hii > 2p'/n should be looked at more closely

hij = xi'(X'X)⁻¹xj
    Leverage depends only on X, not on Y
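hatvalues() extracts the hii in R; a sketch applying the 2p'/n rule to the gala fit g from the earlier sketches:

h <- hatvalues(g)
pprime <- sum(h)                   # trace(H) = p'
h[h > 2 * pprime / nrow(gala)]     # cases exceeding the 2p'/n rule of thumb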

Suppose that the ith case is suspected to be an outlier. Exclude case i, fit the model to the remaining n - 1 cases, and compute the externally studentized residual:
    t_i = e_hat_i / (sigma_hat_(i) √(1 - hii))
where sigma_hat_(i) is the estimate of σ with case i excluded. If the case is not an outlier, t_i follows a t distribution with n - p' - 1 degrees of freedom.

Alternative Method

An equivalent formula avoids refitting the model n times:
    t_i = r_i √((n - p' - 1)/(n - p' - r_i²))
where r_i = e_hat_i / (sigma_hat √(1 - hii)) is the standardized residual. This is what rstudent() computes in R.

Bonferroni Correction

Even though we might explicitly test only one or two large t_i by identifying them as large, we are implicitly testing all n cases, so a multiple-testing correction such as the Bonferroni correction should be applied. Suppose we want a level α test:

    P(all n tests accept) = 1 - P(at least one test rejects) ≥ 1 - Σᵢ P(test i rejects) = 1 - nα

This suggests that if an overall level α test is required, then level α/n should be used in each individual test (α/(2n) in each tail for a two-sided test).
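The car package automates this test (a sketch; outlierTest() reports the most extreme studentized residuals with unadjusted and Bonferroni-adjusted p-values):

library(car)
outlierTest(g)   # Bonferroni-adjusted test of the most extreme case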

Notes on Outliers:
    Two or more adjacent outliers can mask each other
    An outlier in one model may not be an outlier once the model is changed or the response is transformed

When handling outliers:
    First check for data-entry errors
    Examine the scientific context; an outlier may be interesting in its own right
    Exclude a point and refit only with good reason, never automatically

Influential Observations

An influential point is one whose removal from the dataset causes a large change in the fit. An influential point may or may not be an outlier or a leverage point.

Two measures for identifying influential observations:
    Cook's distance Di = (ri²/p') · hii/(1 - hii), which combines the standardized residual and the leverage; rules of thumb flag Di > 4/n or Di > 0.5, or compare Di to the median of an F(p', n - p') distribution
    Leave-one-out changes in the coefficients (dfbeta), which measure how much beta_hat changes when case i is deleted

These rules are guidelines only, not hard rules.
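As a sketch verifying the Cook's distance formula above against R's built-in function (again on the gala fit g):

r <- rstandard(g)                     # standardized residuals
h <- hatvalues(g)                     # leverages
pprime <- length(coef(g))             # p'
D <- (r^2 / pprime) * h / (1 - h)     # Cook's distance from the formula
all.equal(D, cooks.distance(g))       # TRUE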

Code

## Checking error assumptions (gala and savings data are in the faraway package)
library(faraway)
g  <- lm(Species ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)
gs <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)  # square-root response as a variance-stabilizing alternative

par(mfrow=c(2,2))
# constant variance: residuals vs fitted values
plot(fitted(g), residuals(g), xlab="Fitted", ylab="Residuals")

# normality: QQ plot of the residuals
qqnorm(residuals(g), ylab="Residuals")
qqline(residuals(g))

hist(residuals(g))   # histogram as a rough normality check

## Testing for curvature (UN11 data from alr4; residualPlots() is in car, which alr4 loads)
library(alr4)
m2 <- lm(fertility ~ log(ppgdp) + pctUrban, UN11)
residualPlots(m2)   # residual plots plus curvature tests (squared terms and Tukey's test)
summary(m2)$coefficients

## Testing for outliers
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
n <- nrow(savings)
pprime <- 5                  # number of parameters
jack <- rstudent(g)          # externally studentized residuals
jack[which.max(abs(jack))]   # largest studentized residual in absolute value

# Bonferroni critical value for a two-sided overall level-0.05 test
qt(0.05/(n*2), df = n - pprime - 1, lower.tail = TRUE)

#### Influential points
cook <- cooks.distance(g)
n <- nrow(savings)
pprime <- 5

check <- cook[cook > 4/n]                 # rule of thumb: D_i > 4/n
head(sort(check, decreasing=TRUE), 5)     # five largest Cook's distances
cook[cook > 0.5]                          # check D_i > 0.5
cook[pf(cook, pprime, n - pprime) > 0.5]  # compare D_i to the F(p', n - p') median

influenceIndexPlot(g)   # index plots of Cook's distance, residuals, and leverage (car)
