Regression Diagnostics

Estimation and inference from the regression model depend on several assumptions. These assumptions need to be checked using regression diagnostics.

We divide the potential problems into three categories:

Error: 𝜀 ~ N(0, 𝜎²I); i.e. the errors:
- Are independent
- Have equal variance
- Are normally distributed
Model: The structural part of the model, E[y] = Xβ, is correct
Unusual observations: Sometimes just a few observations do not fit the model, yet they might change the choice and fit of the model

Diagnostic Techniques

Graphical
- More flexible but harder to definitively interpret
Numerical
- Narrower in scope but require no intuition

Model building is often an iterative and interactive process. It is quite common to repeat the diagnostics on a succession of models.

Unobservable Random Errors

Recall a basic multiple linear regression model is given by:
E[Y|X] = Xβ and Var(Y|X) = 𝜎²I

The vector of errors is 𝜀 = Y - E(Y|X) = Y - Xβ, where 𝜀 is a vector of unobservable random variables with:
E(𝜀 | X) = 0
Var(𝜀 | X) = 𝜎²I

We estimate β with
β_hat = (XᵀX)⁻¹XᵀY

and the fitted values Y_hat corresponding to the observed values Y are:
Y_hat = Xβ_hat = HY

where H is the hat matrix, defined as:
H = X(XᵀX)⁻¹Xᵀ
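As a rough illustration of the estimate, the fitted values and the hat matrix, the sketch below uses the usual least-squares formulas β_hat = (XᵀX)⁻¹XᵀY and H = X(XᵀX)⁻¹Xᵀ. The simulated data, the sample size, the coefficient values and all variable names are assumptions made up for the example; they do not come from these notes.

```python
import numpy as np

# Simulated data (assumed purely for illustration).
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Least-squares estimate from the normal equations.
# Solving the linear system avoids forming an explicit inverse.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' and fitted values y_hat = H y.
H = X @ np.linalg.solve(XtX, X.T)
y_hat = H @ y

print("beta_hat:", np.round(beta_hat, 3))
print("H shape:", H.shape)
```

The name "hat matrix" comes from the fact that H puts the hat on Y: multiplying Y by H gives the same fitted values as Xβ_hat.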

The Residuals

The vector of residuals e_hat, which can be graphed, is defined as:
e_hat = Y - Y_hat = (I - H)Y

It satisfies:
E(e_hat | X) = 0
Var(e_hat | X) = 𝜎²(I - H)
Var(e_hat_i | X) = 𝜎²(1 - h_ii), where h_ii is the i^th diagonal element of H.

Diagnostic procedures are based on the residuals, which we would like to assume behave as the unobservable errors would.

Cases with large values of h_ii will have small values for Var(e_hat_i | X).
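The relation Var(e_hat_i | X) = 𝜎²(1 - h_ii) can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the design (nine evenly spaced x values plus one far-away point), the value of 𝜎, the number of replications and all names are invented for the example. It uses the fact that e_hat = (I - H)Y = (I - H)𝜀, because (I - H)X = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 10, 2.0

# Assumed design: one x value placed far from the rest to create high leverage.
x = np.linspace(0.0, 1.0, n)
x[-1] = 5.0
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)  # leverages h_ii

# Draw many independent error vectors with equal variance sigma^2.
reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))

# Each row of resid is (I - H) eps for one replication
# (I - H is symmetric, so right-multiplication works row by row).
resid = eps @ (np.eye(n) - H)

emp_var = resid.var(axis=0)          # simulated variance of each residual
theory_var = sigma**2 * (1.0 - h)    # sigma^2 (1 - h_ii)

for i in range(n):
    print(f"case {i}: h_ii={h[i]:.3f}  simulated={emp_var[i]:.3f}  theory={theory_var[i]:.3f}")
```

In this assumed design the far-away case has the largest h_ii and therefore the smallest residual variance, even though every error has the same variance.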

The Hat Matrix

The hat matrix H is an n x n symmetric matrix with the following properties:

HX = X

(I - H)X = 0

HH = H² = H

Cov(Y_hat, e_hat | X) = Cov(HY, (I - H)Y| X) = 𝜎²H(I - H) = 0

h_ii is also called the leverage of the i^th case. As h_ii approaches 1, y_hat_i gets close to y_i.
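The properties listed above can be verified numerically. The following is a minimal sketch, assuming X has full column rank; the helper name hat_matrix and the random design are hypothetical. It builds H from a reduced QR decomposition (H = QQᵀ), which is a numerically stable alternative and not necessarily how these notes intend H to be computed.

```python
import numpy as np

def hat_matrix(X):
    """Hat matrix via reduced QR, assuming X has full column rank (illustrative helper)."""
    Q, _ = np.linalg.qr(X)  # Q is n x p with orthonormal columns
    return Q @ Q.T

# Assumed random design, for illustration only.
rng = np.random.default_rng(2)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = hat_matrix(X)
I = np.eye(n)
leverage = np.diag(H)  # h_ii for each case

checks = {
    "H symmetric": np.allclose(H, H.T),
    "HX = X": np.allclose(H @ X, X),
    "(I - H)X = 0": np.allclose((I - H) @ X, 0.0),
    "HH = H": np.allclose(H @ H, H),
    "H(I - H) = 0": np.allclose(H @ (I - H), 0.0),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
print("leverages:", np.round(leverage, 3))
```

Because H is symmetric and idempotent, every leverage lies between 0 and 1, which is why a leverage near 1 forces y_hat_i toward y_i.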

Error Assumptions

We wish to check the independence, constant variance, and normality of the errors 𝜀. The errors are not observable, but we can examine the residuals e_hat.

The residuals are NOT interchangeable with the errors.

The errors may have equal variance and be uncorrelated while the residuals have unequal variances and are correlated, since Var(e_hat | X) = 𝜎²(I - H). The impact of this is usually small, so diagnostics are often applied to the residuals in order to check the assumptions on the errors.
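To get a feel for how small the impact typically is, one can look at the correlations between residuals implied by Var(e_hat | X) = 𝜎²(I - H). The sketch below is an illustration only: the simple-regression design on evenly spaced x values, the two sample sizes and the function names are assumptions chosen for the example.

```python
import numpy as np

def residual_correlation(X):
    """Correlation matrix of the residuals implied by Var(e_hat | X) = sigma^2 (I - H)."""
    n = X.shape[0]
    H = X @ np.linalg.solve(X.T @ X, X.T)
    cov = np.eye(n) - H            # sigma^2 cancels when forming correlations
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

def max_offdiag(C):
    """Largest absolute off-diagonal entry of a square matrix."""
    off = C - np.diag(np.diag(C))
    return np.abs(off).max()

def design(n):
    """Assumed design: intercept plus n evenly spaced x values on [-1, 1]."""
    return np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])

small = max_offdiag(residual_correlation(design(10)))
large = max_offdiag(residual_correlation(design(200)))
print(f"n = 10:  largest |correlation| between residuals = {small:.3f}")
print(f"n = 200: largest |correlation| between residuals = {large:.3f}")
```

With independent, equal-variance errors the residuals are still correlated, but in this assumed design the correlations shrink as n grows relative to the number of parameters.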