# Multivariable Analysis

# Simple Linear Regression

One of the first known uses of regression was to study inheritance of traits from generation to generation in a study in the UK from 1893-1898. E.S. Pearson organized the collection of heights of mothers and one of their adult daughters over 18. The mother's height was the predictor variable (x) and the daughter's height was the response variable (y).

The goal of a linear regression is to quantify the relationship between one independent variable and a single dependent variable. A simple linear regression can be represented by the following:

- y\_hat = β<sub>0</sub> + β<sub>1</sub>x + Error; E(Error) = 0; V(Error) = σ<sup>2</sup>
- E(y) = β<sub>0</sub> + β<sub>1</sub>x ; V(y) = σ<sup>2</sup>

As with correlation, a strong association in a regression analysis does **NOT** imply causality

Additionally, prediction values outside the range of values observed for x is not reliable, and is called **extrapolation.**

#### Least Squares Estimation  


OLS/LS - Ordinary Least Squares is a method that looks for the line that minimizes the "residual sum of squares"

 Residual = Observed - Predicted = y - ŷ

So we can set up an equation for sum of squared residuals: ∑(y - ŷ)<sup>2</sup>

Then substitute the linear regression equation, take the derivative, set to zero and solve.The solution comes out to:

[![image-1662677372848.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662677372848.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662677372848.png)

The fitted equation will pass through x bar, y bar (the center of data)

[![image-1662678589985.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662678589985.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662678589985.png)

#### Estimating Variances of LSE

The square root of an estimated variance is called **standard error** represented by s or se().

[![image-1662678839105.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662678839105.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662678839105.png)

E(Y |X = x) = a function that depends on the value of x

[![image-1662678847860.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662678847860.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662678847860.png)

#### LSE Assuming Normally Distributed Data

The distributions of estimates are used to make predictions and hypothesis testing. Since variance of a fixed variable is 0, we can estimate the error of a normal distribution as follows:

[![image-1662680044931.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662680044931.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662680044931.png)

[![image-1662680250142.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662680250142.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662680250142.png)

The estimates are correlated with covariance -x\_bar/S<sub>xx</sub>

[![image-1662680297717.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662680297717.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662680297717.png)

It can be shown that the Least Squares Estimate is also the Maximum Likelihood Estimate.

#### Confidence Intervals

The distribution of β<sub>1</sub> is normal if the variance is known.

When we use estimated variance, we use a Student's t distribution on n-2 degrees of freedom to estimate parameters:

[![image-1662680596587.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662680596587.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662680596587.png)

By finding the standard error of y\_hat we can also calculate a confidence interval for a fitted value, or a given of x

[![image-1662913412443.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662913412443.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662913412443.png)

The interval width is:

[![image-1662680664338.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662680664338.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662680664338.png)

And width will increase as the distance between observed and expected increases.

#### ANOVA

Though ANOVA is usually used in determining if there is a difference in variance with data containing 3 or more categories, linear regression under standard conditions is a special case of ANOVA. The name **analysis of variance** is derived from a partitioning of total variability into its component parts.

[![image-1662915269966.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662915269966.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662915269966.png)

This formula uses an F-distribution with 1 and n − 2 df in place of the t-distribution to correct for the simultaneous inference from our estimate of Beta\_1.

### Prediction of New Observations

When using points not in the dataset (extrapolation) use the following adjusted formulas for variance of V(y):

[![image-1663276470416.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663276470416.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663276470416.png)

Thus, the confidence interval for a new observation would be significantly wider

##### Other notations commonly seen in our textbook

[![image-1662677002671.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1662677002671.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1662677002671.png)

# Mutiple Linear Regression and Estimation

Multiple Linear Regression analysis can be looked upon as an extension of simple linear regression analysis to the situation in which more than one independent variables must be considered.

The general model with response Y and regressors X<sub>1</sub>, X<sub>2</sub>,... X<sub>p</sub>:

[![image-1663279551985.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663279551985.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663279551985.png)

Suppose we observe data for *n* subjects with *p* variables. The data might be presented in a matrix or table like:

[![image-1663279646413.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663279646413.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663279646413.png)

We could then write the model as:

[![image-1663279874599.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663279874599.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663279874599.png)

We can think of Y and E as (n x 1) vectors when transposed, and β as a ((p + 1) x 1) vector. p +1 is number of predictors + the intercept. Thus X would be:

[![image-1663279974357.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663279974357.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663279974357.png)

The general linear regression may be written as:

[![image-1663280050486.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663280050486.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663280050486.png)

Or **y<sub>i</sub> = x<sub>i</sub><sup>|</sup>\*β + ∈<sub>i</sub>**

The model is represented as the systematic structure plus the random variation with n dimensions = (p + 1) + { n - (p + 1 ) }

#### Ordinal Least Squares Estimators

The least squares estimate β\_hat of β is chosen by minimizing the residual sum of squares function:

[![image-1663281298522.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663281298522.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663281298522.png)

By differentiating with respect to β<sub>i</sub> and solve by setting equal to 0:

[![image-1663281366843.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663281366843.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663281366843.png)

The least squares estimate of β\_hat of β is given by:

[![image-1663281452357.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663281452357.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663281452357.png)

and if the inverse exists:

[![image-1663281474240.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663281474240.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663281474240.png)

#### Fitted Values and Residuals  


The fitted values are represetned by Y\_hat = X\*β\_hat

[![image-1663282777062.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663282777062.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663282777062.png)

where the **hat matrix**  is defined as H = X(X<sup>|</sup>X)<sup>-1</sup>X<sup>|</sup>

The residual sum of squares (RSS):

[![image-1663282850648.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663282850648.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663282850648.png)

#### Gauss-Markov Conditions

In order for estimates of β to have some desirable statistical properties, we need a set of assumptions referred to as the Gauss-Markov conditions; for all i, j = 1... n

1. E\[∈<sub>i</sub>\] = 0
2. E\[∈<sub>i</sub><sup>2</sup>\] = 𝜎<sup>2</sup>
3. E\[∈<sub>i</sub>∈<sub>j</sub>\] = 0, where i != j

Or we can write these in matrix notation as: E\[∈\] = 0, E\[∈∈<sup>|</sup>\] = 𝜎<sup>2</sup>\*I

The GM conditions imply that E\[Y\] = Xβ and cov(Y) = E\[(Y-Xβ)(Y-Xβ)<sup>|</sup>\] = E\[∈∈<sup>|</sup>\] = ∈

Under the GM assumptions, the LSE are the Best Linear Unbiased Estimators (BLUE). In this expression, "best" means <span style="text-decoration: underline;">minimum variance</span> and linear indicates that the estimators are <span style="text-decoration: underline;">linear functions of y.</span>

The LSE is a good choice but it does require the errors are uncorrelated and have equal variance. Even if the errors behave, then nonlinear or biased estimates may work better.

#### Estimating Variance

By definition:

[![image-1663284680441.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663284680441.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663284680441.png)

We can estimate variance by an average from the sample:

[![image-1663284717348.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663284717348.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663284717348.png)

Under GM conditions, s<sup>2</sup> is an <span style="text-decoration: underline;">**unbiased estimate** </span>of variance.

#### Total Sum of Squares

Total sum of squares is Syy = SSreg + RSS

The corrected total sum of squares with n-1 degrees of freedom:

[![image-1663284931783.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663284931783.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663284931783.png)

where J is an n x n matrix of 1s and H = X(X<sup>|</sup>X)<sup>-1</sup>X<sup>|</sup>

#### Regression and Residual Sum of Sqaures

**Regression** sum of squares represent the number of X variables:

[![image-1663285018389.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663285018389.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663285018389.png)

**Residual** Sum of Squares with n - (p + 1) degrees of freedom:

[![image-1663285075431.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663285075431.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663285075431.png)

#### F Test for Regression Relation

To test whether there is a regression relation between Y and a set of variables X, use the hypothesis test:

𝐻0 : 𝛽1 = 𝛽2 = 𝛽3 = ⋯ = 𝛽𝑝 = 0 v.s. 𝐻1 : not all 𝛽𝑘 = 0, 𝑘 = 1, … , 𝑝

We use the test statistic:

[![image-1663285544018.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663285544018.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663285544018.png)

The chances of a Type I error is alpha, our degrees of freedom is n - p -1

#### The Coefficient of Determination

Recall this measures the model of fit by the proportionate reduction of total variation in Y associated with the use of the set of X variables.

R<sup>2</sup> = SSreg / Syy = 1 - RSS / Syy

In a multi-linear regression we must adjust the coefficient by the associated degrees of freedom:

[![image-1663285802285.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663285802285.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663285802285.png)

Add more independent variables to the model can only increase R<sup>2</sup>, but R<sup>2</sup><sub>alpha</sub> may become smaller when more independent variables are introduced, as the decrease in RSS may be offset by the loss of degrees of freedom.

#### T Tests and Intervals  


Tests for β<sub>k</sub> are pretty standard:

[![image-1663286153070.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663286153070.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663286153070.png)

With the rejection rule of 𝑡 &gt;= t(1 − 𝛼/2; 𝑛 − 𝑝 − 1)

And likewise confidence limits for β<sub>k</sub> and 1 - alpha confidence

[![image-1663286260560.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663286260560.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663286260560.png)

# Model Fitting: Inference

Given several predictors and a response, we need to figure out whether all are needed.

Consider a large model, Ω, and a smaller model, 𝜔, which consist of a subset of predictors in Ω.

- If there is not much difference in fit, we perfer the smaller model.
- If the larger model has a much better fit, we perfer the larger model

Suppose we have a response Y and a vector p<sup>I</sup> regressors X<sup>I</sup> = (X<sub>1</sub>, X<sub>2</sub>) that we partition into two parts so that:

- X<sub>2</sub> has q regressors
- X<sub>1</sub> has the remaining p<sup>I</sup> - q

The general hypothesis test we consider is:

[![image-1663885960711.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663885960711.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663885960711.png)

The null hypothesis is obtained by setting beta\_2 = 0

The reasoning is that if RSS<sub>𝜔</sub> - RSS<sub>Ω</sub> is small, the fit of the smaller model is almost as good as the larger model. On the other hand, if the difference is large the superior fit of the larger model would be perferred.

This suggests (RSS<sub>𝜔</sub> - RSS<sub>Ω</sub>)/RSS<sub>Ω </sub>would be a potentially good test statistic where the denominator is used for scaling purpose

### F Tests

Suppose the dimensions of Ω is p and that of 𝜔 is q. The general formula for the test is:

[![image-1663886557966.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663886557966.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663886557966.png)

where df<sub>Ω</sub> = n - p, and df<sub>𝜔</sub> = n - q

Thus, we would reject the null hypothesis if F &gt; F<sup>α</sup><sub> p - q, n - p</sub>

#### Simple Regression  


[![image-1663887396158.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887396158.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887396158.png)

Recall the ANOVA table for a simple regression:

[![image-1663887176176.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887176176.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887176176.png)

Under the null hypothesis:

[![image-1663887330435.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887330435.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887330435.png)

The df is the number of n observations minus the number of estimated parameters: (n-1)

Under the alternative hypothesis:

[![image-1663887289267.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887289267.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887289267.png)

### Test of All the Predictors

[![image-1663887421754.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887421754.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887421754.png)

[![image-1663887440224.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887440224.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887440224.png)

Where p is the number of regressors and n is the sample size.

[![image-1663887487824.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887487824.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887487824.png)

#### One Predictor

Can a particular predictor be dropped from the model?

[![image-1663887532157.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887532157.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887532157.png)

A t test can be used with (n - p - 1) degrees of freedom

[![image-1663887559097.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1663887559097.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1663887559097.png)

The F test may be used as introduced earlier with a df of 1, n-p-1. t<sub>i</sub><sup>2</sup> here is exactly the F-statistic.

# Dummy Variables and Analysis of Covariance

So far we have mostly seen quantitative variables in regression models, but many variables of interest are qualitative (sex, status, etc). To add such information to a model, we can set up a **[indicator/dummy variable](https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_multivariable/bs704_multivariable7.html "BU Link").**

For example, we could set up a variable X<sub>i2 </sub>representing sex as 0 for type A and 1 for other for the i<sup>th</sup> individual. The resulting model would look something like:

[![image-1664492524062.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664492524062.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664492524062.png)  
Where X<sub>i2</sub> is 0 when the individual is type A. So for type A:

[![image-1664492608189.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664492608189.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664492608189.png)  
But for any other type:

[![image-1664492625910.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664492625910.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664492625910.png)

#### Models with Interaction  


If we have a first-order regression model with an interaction we can represent it with an interaction term:

[![image-1664490098266.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664490098266.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664490098266.png)

We can illustrate interaction as follows:

[![image-1664490347740.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664490347740.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664490347740.png)  
Beta2 indicates how much greater (smaller) the Y intercept for the class coded 1 than that of the class coded 0  
Beta3 indicates how much greater (smaller) the slope for the class coded 1 then that of the class coded 0

#### Raw Mean

A raw mean is simply an average of the observations without considering other covariates. **Least square means** (sometimes called adjusted mean) are adjusted for other covariates, since it is estimated from a linear regression.

#### Qualitative Variable with 2+ Classes  


If there are more than 2 classes to a qualitative variable, we require additional indicator variables in the regression model.

[![image-1664492179261.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664492179261.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664492179261.png)

Resulting in a model like:

[![image-1664492245310.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664492245310.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664492245310.png)

And the X matrix would look like:

[![image-1664492264229.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664492264229.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664492264229.png)

[![image-1664492322594.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664492322594.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664492322594.png)

# Broken Stick Regression, Polynomial Regression, Splines

Transformations of the response and predictors can improve the fit and correct violations of model assumptions such as non-constant error variance. This chapter focuses on the transformation of predictors.

The idea behind **Broken Stick Regression/Segmented Regression**  is that different linear regression models may apply in different regions of the data. We define two basis functions:

[![image-1664493768291.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664493768291.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664493768291.png)  
where c marks the division between groups. Bl(x) and Br(x) form a first-order spline basis with knot point c. Sometimes Bl(x) and Br(x) are called hockey-stick functions.

[![image-1664493964181.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664493964181.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664493964181.png)

#### Polynomial Regression

Polynomial Regression models are frequently used curvilinear response models in practice. They make it easy to handle special cases of the linear regression model, however in reality this is not an exact representation.

[![image-1664494051482.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/scaled-1680-/image-1664494051482.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-09/image-1664494051482.png)

We choose d by adding terms until the added term is not statistically significant. Start with a large d and eliminate non-statistically significant terms starting with the highest order term.

#### Regression Splines

- Polynomials regression has: 
    - the advantage of smoothness
    - the disadvantage that each data point affects the fit globally
- Broken stick regression method 
    - Localized the influence of each data point to its particular segment
    - Do not have the same smoothness as with polynomials
- Use B-spline basis functions 
    - Have both smoothness and local influence
    - Can overfit the model

# Regression Diagnostics

The estimation and inference from the regression model depends on several assumptions. These assumptions need to be checked using regression diagnostics.

We divide the potential problems into three categories:

- Error: 𝜀 ~ N(0, 𝜎<sup>2</sup>I); i.e. the errors are: 
    - Independent
    - Have equal variance
    - Are normally distributed
- Model: The structure part of model E\[y\] = Xβ is correct
- Unusual observations: Sometimes just a few observations do not fit the model but might change the choice and fit of the model

##### Diagnostic Techniques

- Graphical 
    - More flexible but harder to definitively interpret
- Numerical 
    - Narrower in scope but require no intution

Model building is often an interactive and interactive process. It is quite common to repeat the diagnostics on a succession of models.

#### Unobservable Random Errors  


Recall a basic multiple linear regression model is given by:  
 E\[Y|X\] = Xβ and Var(Y|X) = 𝜎<sup>2</sup>I

The vectors of errors is 𝜀 = Y - E(Y|X) = Y - Xβ; where 𝜀 is unobservable random variables with:  
 E(𝜀 | X) = 0  
 Var(𝜀 | X) = 𝜎<sup>2</sup>I

We estimate beta with  
[![image-1665094586198.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665094586198.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665094586198.png)  
and the fitted values Y\_hat corresponding to the observed value Y are:[![image-1665094623407.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665094623407.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665094623407.png)  
Where H is the **hat matrix**. Defined as:  
[![image-1665094651063.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665094651063.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665094651063.png)

### The Residuals

The vector of residuals e\_hat, which can be graphed, is defined as:[![image-1665094692646.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665094692646.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665094692646.png)  
 E(e\_hat | X) = 0  
 Var(e\_hat | X) = 𝜎<sup>2</sup>(I - H)  
 Var(e\_hat\_i | X) = 𝜎<sup>2</sup>(I - h<sub>ii</sub>); where h<sub>ii</sub> is the i<sup>th</sup> diagonal element of H.

Diagnostic procedures are based on the residuals which we would like to assume behave as the unobservable errors would.

Cases with large values of h<sub>ii </sub>will have small values for Var(e\_hat\_i | X)

#### The Hat Matrix

The hat matrix H is n x n symmetric matrix

- HX = X
- (I - H)X = 0
- HH = H<sup>2</sup> = H
- Cov(Y\_hat, e\_hat | X) = Cov(HY, (I - H)Y| X) = 𝜎<sup>2</sup>H(I - H) = 0

[![image-1665095127938.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665095127938.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665095127938.png)

h<sub>ii </sub>is also called the leverage of the i<sup>th</sup> case. As h<sub>ii</sub> approaches 1, y\_hat\_i gets close to y\_i.

#### Error Assumptions

We wish to check the independence, constant variance, and normality of the errors 𝜀. The errors are not observable, but we can examine the residuals e\_hat.

They are NOT interchangeable with the error.[![image-1665095273422.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665095273422.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665095273422.png)

The errors may have equal variance and be uncorrelated while the residuals do not. The impact of this is usually small and diagnostics are often applied to the residuals in order to check the assumptions on the error.

### Constant Variance

Check whether the variance in the residuals is related to some other quantity Y\_hat and Xi

- e\_hat against Y\_hat: If all is well there should be constant variance in the vertical direction (e\_hat) and the scatter should be symmetric vertically about 0 (linearity)
- e\_hat against Xi (for predictors that are both in and out of the model): Look for the same things except in the case of plots against predictors not in the model, look for any relationship that might indicate that this predictor should be included.
- Although the interpretation of the plot may be ambiguous, one can be at least sure that nothing is seriously wrong with the assumptions.

### Normality Assumptions

The tests and confidence intervals we used are based on the assumption of normal errors. The residuals can be assessed for normality using a QQ plot. This compares residuals to "ideal" normal observations.

Suppose we have a sample of n: x1, x2... xn. and wish to examine whether the x's are a sample from normal distribution:

- Order the x's to get x(1) &lt;= x(2).... &lt;= x(n)
- Consider a standard normal sample of size n. Let z(1) &lt;= z(2).... &lt;= z(n)
- If x's are normal then E\[x(i)\] = mean + sd\*z(i); so the regression of x(i) on z(i) will be a straight line

Many statistics have been proposed for testing a sample for normality. One of these that works well is the Shapiro and Wilk W statistic, which is the square of the correlation between the observed order statistics and the expected order statistics.

- **H0 is that the residuals are normal**
- Only recommend this in conjunction with a QQ plot
- For a small sample size, formal tests lack power
- For a large dataset, even mild deviations from non-normality may be detected. But there would be little reason to abandon least squares because the effects of non-normality are mitigated by large sample sizes.

### Testing for Curvature

One helpful test looks for curvature in the plot. Suppose we have residual e\_hat vs a quantity U where U could be a regressor or a combination of regressors.

A simple test for curvature is:

- To refit the model with an additional regressor for U^2 added
- The test is based on test testing the coefficient for U^2 to be 0
- If U does not depend on estimated coefficients, then a usual t-test of this hypothesis can be used
- If U is equal to the fitted values (which depends on the estimated coefficients) then the test statistic is approximately the standard normal distribution

### Unusual Observations

Some observations do not fit the model well, called **outliers**. Some observations change the fit of the model in a substantive manner, called **influential** **observations**. If an observation has the potential to influence the fit, it is called a **leverage point**.

h<sub>ii </sub>is called **leverage** and is useful diagnostics.

Var(e\_hat\_i | X) = 𝜎<sup>2</sup>(1 - h<sub>ii</sub>)  
 A large leverage will make Var(e\_hat\_i | X) small  
 The fit will be forced close to y<sub>i</sub>

![image-1665097665011.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665097665011.png) = number of parameters  
 An average value for h<sub>ii </sub>is p'/n  
 A "rule of thumb": leverage &gt; 2p'/n should be looked at more closely

h<sub>ij</sub> = x<sub>i</sub>'(X'X)<sup>-1</sup>x<sub>j</sub>  
 Leverage only depends on X, not Y

Suppose that the i-th case is suspected to be an outlier:

- We exclude point i-th case is suspected to be an outlier
- Recompute the estimates to get estimated coefficents and variance
- If y\_hat\_i - y\_i is large, then case i is an outlier. To judge the size of potential outlier, we need to an appropriate scale:  
    [![image-1665098740745.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665098740745.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665098740745.png)  
    The variance is estimated by replaced variance with estimated variance. at i
- Assuming normal errors, the hypothesis E\[y\_hat\_i - y\_i\] = 0 is given by  
    [![image-1665098840677.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665098840677.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665098840677.png)

##### Alternative Method

- Define standardized residual (internal studentized residual) as   
    [![image-1665098970225.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665098970225.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665098970225.png)
- Then studentized residual (external studentized, jackknife, or cross-validated residual, Rstudent) can be calculated as:  
    [![image-1665099058229.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665099058229.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665099058229.png)

#### Bonferroni Correction

Even though we might explicitly test only one or two large t<sub>i </sub>by identifying them as large, we are implicity testing all cases. So, multiple testing correction such as Bonferroni correction should be implemented. Suppose we want a level alpha test:

[![image-1665099162424.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665099162424.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665099162424.png)

So it suggests that if an overall level alpha test is required, then a level should be alpha/n in each of the tests.

#### Notes on Outliers:

- An outlier in one model may not be an outlier in another when the variables have been changed or transformed
- The error distribution may not be normal and so larger residuals may be expected
- Individual outliers are usually much less of a problem in larger datasets. A single point will not have the leverage to affect the fit very much. It is still worth identifying outliers if these types of observations are worth knowing in the context.
- For large datasets, we only need to worry about clusters of outliers. Such clusters are less likely to occur by chance and more likely too represent actual structure.

When handling outliers:

- Check for a data-entry error first
- Examine the physical context, ex. the outliers in the analysis of credit card transactions may indicate fraud
- Exclude the point from the analysis but try re-including it later if the model is changed.
- Always report the existence of outliers even if they are not included in the final model!

#### Influential Observations

An influential point is one whose removal from the dataset causes a large change in fit. An influentual point may or may not be an outier or a leverage point.

Two meausures for identifying the infuential observations:

- Difference in Fits (DFFITS)  
    [![image-1665099952078.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665099952078.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665099952078.png)
    
    An observation is influential if the absolute value is greater than:  
    [![image-1665099983707.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665099983707.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665099983707.png), where p is the number of parameters (predictors + 1)
- Cook's Distance  
    [![image-1665100022233.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665100022233.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665100022233.png)  
    where p is the number of parameters (predictors + 1)
    
    
    - Di summarized how much alll of the fitted values change with the i-th observation is deleted
    - A "rule of thumb" is Cook Distance &gt; 4/n should be looked at more closely
    - Di = 1 will potentially have important change in estimate 
        - Di &gt; .5 may be influential
        - Di &gt;= 1 quite likely to be influential
        - If Di sticks out from others it is almost certainly influential
- Potential Outlier's percentile value using the F-distribution ~ F(p', n-p') 
    - If &lt; 10 or 20 percentile, little apparent influence
    - If &gt; 50 percentile, highly influential
    - If in between, ambiguous

These rules are guidelines only, not a hard rule.

### Code

```
## Test for normallity
gs <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)
g <- lm(Species ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)

par(mfrow=c(2,2))
plot(fitted(g), residuals(g), xlab="fitted", ylab="Residuals")

qqnorm(residuals(g), ylab="Residuals")
qqline(residuals(g))

hist(g$residuals)

## Testing for curvature
library(alr4)
m2 <- lm(fertility ~ log(ppgdp) + pctUrban, UN11)
residualPlots(m2)
summary(m2)$coeff

## Testing for Outliers
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
n <- nrow(savings)
pprime <- 5   # number of parameters
jack <- rstudent(g)   # studentized residual
jack[which.max(abs(jack))]  # maximum studentized residual

# threshold for lower tail
qt(0.05/(50*2), df = n-pprime-1 , lower.tail=TRUE) 

#### influential points
cook <- cooks.distance(g)
n <- nrow(savings)
pprime <- 5

check <- cook[cook > 4/n]  # rule of thumb
sort(check, decreasing=TRUE) [1:5]  # list first five max

cook[cook>0.5]    # check Di>0.5
cook[(pf(cook, pprime, n-pprime)>0.5)] # use F-dist

influenceIndexPlot(g)
```

# Variable Selection

Variable selection is intended to select the "best subset" of predictors. Variable selection shouldn't be separated from the rest of the model; Outliers and influential points can change the model we select. Transformations of the variables can have an impact on the model selection. **Some iteration and experimentation is often necessary to find some better model.**

**The aim of variable selection is to construct a model that predicts or explains the relationships in the data. Automatic variable selections are not guaranteed to be consistent with these goals, so use these methods as a guide only.**

The "best" subset of predictors:

- Explains the data in the simplest way
- Doesn't waste degrees of freedom with unnecessary predictors, which add noise
- Can save time or money by not measuring redundant predictors
- Doesn't include collinearity caused by too many variables trying to do the same job

There are two types of variable selections we will cover today:

- Stepwise testing approach - compares successive models
- The criterion approach - finds the model that optimizes some measure of goodness of fit

#### Model Hierarchy

Some models have a natural hierarchy, ex polynomial regression models (x<sup>2</sup> is a higher order term than x). When selecting variables it is important to **respect the hierarchy.** Lower order terms **should not** be removed from the model before higher order terms in the same variable.

Consider the model:  
[![image-1665699110928.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665699110928.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665699110928.png)  
Suppose the summary shows that the term in x is not significant but x<sup>2</sup> is. If we removed x our model would become:  
[![image-1665699207039.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665699207039.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665699207039.png)  
Then let's change the scale and change x to (x + a). Then the model would become:[![image-1665699275489.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665699275489.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665699275489.png)  
The first order x reappears! The interpretation should not depend on the scale.

Scale changes should not make any important changes to the model.

### Testing-Based Procedures

#### Backward Elimination

1. Start with all the predictors in the model
2. Remove the predictor with highest p-value greater than alpha
3. Refit the model
4. Remove the remaining least significant predictor provided its p-value is greater than alpha
5. Repeat 3 and 4 until all "non-significant" predictors are removed

Alpha is sometimes called the "p-to-remove" and does not have to be 5%. For prediction purpose, 15-20% cutoff may be best.

#### Forward Selection

Reverses the backwards method

1. Start with no variables
2. For predictors not in the model, check the p-value if they are added to the model. We choose the one with lowest p-value less than alpha
3. Continue until no new predictors can be added

#### Stepwise Regression

Combination of backwards elimination and forward selection.

- At each stage, a variable may be added or removed and there are several variations on how this is done.
- The stepwise regression can be done top-down (alternate drop step with add step) or bottom-up (alternate add step with drop step)

#### Notes on Testing-Based Procedures

- Possible to miss the "optimal" model due to "one-at-a-time" nature of adding/dropping variables
- The p-values used should not be treated too literally as there is so much multiple testing occurring
- The procedures are not directly linked to final objectives of prediction or explanation
- Variables that are dropped can still be correlated with the response. It is just that they provide no additional explanatory effect beyond those variables already included in the model
- Stepwise selection tends to pick models smaller than desirable for prediction purpose

### Criterion-Based Procedures

Criteria for model selection are based on lack of fit of a model and its complexity.

Some possible criteria:

- Akaike Information Criterion (AIC): 
    - -2 max log-likelihood + 2p'
    - n\*log(RSS/n) + 2p'
- Bayes Information Criterion (BIC): 
    - -2 max log-likelihood + p' log(n)
    - n\*log(RSS/n) + log(n) \* p'

Note p' is the number of parameters including the intercept.

**Small values of AIC and BIC are preferred.** So better candidate sets will have smaller RSS and a smaller number of terms p. Larger models fit better and have smaller RSS but use more parameters. The goal is to find a balance between RSS and p.

BIC penalizes larger models more heavily. Smaller models are perferred in BIC as compared to AIC.

#### Adjusted R<sup>2</sup>

R<sup>2</sup> = 1 - RSS/SSY

- Adding variables can only decrease RSS and increase R<sup>2</sup>
- Not a good criterion as it always perfers the largest model
- Important to pay attention for significant changes of RSS!

Another commonly used criterion is adjusted R<sup>2</sup>, written R<sub>a</sub><sup>2</sup>[![image-1665703334997.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665703334997.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665703334997.png)

#### Mallow's C<sub>p</sub> Statistics

A good model should predict well, so the average mean square error of prediction might be a good criterion:[![image-1665703426676.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665703426676.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665703426676.png)  
which can be estimated by the C<sub>p</sub> statistic:  
[![image-1665703466451.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665703466451.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665703466451.png)  
where sigma squared is from the model with ALL predictors and RSS<sub>p</sub> indicates that RSS from a model with p parameters.

- For the full model C<sub>p</sub> = p
- If a p-predictor model fits then:  
    [![image-1665703552755.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1665703552755.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1665703552755.png)
- If a model has a bad fit, C<sub>p</sub> will be much larger than p
- **We desire models with small p and C<sub>p</sub> around or less than p**

## R Code

```
install.packages("faraway")
library(faraway)
data(state)
names(state)
statedata <- data.frame(state.x77, row.names=state.abb)

names(statedata)
write.csv(statedata, file="statedata.csv", quote=FALSE, row.names=F)

############# from here
## the data were collected from U.S. Bureau of the Census. 
## use life expectancy as the response
## the remaining variables as predictors

statedata <- read.csv("statedata.csv")
g <- lm(Life.Exp ~ ., data = statedata)
summary(g)

### backward elimination
g <- lm(Life.Exp ~ ., data = statedata)
summary(g)$coefficients

g <- update(g, .~ . - Area)
summary(g)$coefficients

g <- update(g, .~ . - Illiteracy)
summary(g)$coefficients

g <- update(g, .~ . - Income)
summary(g)$coefficients

g <- update(g, .~ . - Population)
summary(g)$coefficients
summary(g)

## step(lm(Life.Exp ~ ., data = statedata), 
#	scope=list(lower=as.formula(.~Illiteracy)), direction="backward")

### the variables omitted from the model may still be related to the response
summary(lm(Life.Exp ~ Illiteracy+Murder+Frost, statedata))$coeff
 
## forward
f <- ~Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
m0 <- lm(Life.Exp ~ 1, data = statedata)
m.forward <- step(m0, scope = f, direction ="forward", k=2)

## hand calculate the AIC value
aov(lm(Life.Exp ~ Murder, data=statedata))
# AIC = n*log(RSS/n) + 2p

n <- nrow(statedata)
n*log(34.46/n)+2*2

extractAIC(m.forward, k=2)		# by default k=2, AIC
extractAIC(m.forward,k=log(50))		# k=log(n), BIC

## final model using AIC
summary(m.forward)$coefficients

## use BIC
n <- nrow(statedata)
m.forward.BIC <- step(m0, scope = f, direction ="forward", 
			k=log(n), trace=FALSE)
summary(m.forward.BIC)$coefficients


### backward
m1 <- update(m0, f)
m.backward <- step(m1, scope = c(lower= ~ 1), 
			direction = "backward", trace=FALSE)
summary(m.backward)$coefficients

### stepwise
m.stepup <- step(m0, scope=f, direction="both", trace=FALSE)
summary(m.stepup)$coefficients

########## about Cp
install.packages("leaps")
library(leaps)
leaps <- regsubsets(Life.Exp ~ ., data = statedata)
rs <- summary(leaps)
par(mfrow=c(1,2))
plot(2:8, rs$cp, xlab="No. of parameters",
		ylab="Cp Statistic")
abline(0,1)

plot(2:8, rs$adjr2, xlab="No. of parameters",
		ylab="Adjusted R-Squared")
abline(0,1)
names(rs)

rs
```

# Midterm Cheat Sheet

<table border="1" id="bkmrk-linear-regression-mu" style="border-collapse: collapse; width: 100%;"><tbody><tr><td style="width: 49.9413%;">**Linear Regression**

[![image-1666104229251.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104229251.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104229251.png)

[![image-1666104239088.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104239088.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104239088.png)

[![image-1666104245811.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104245811.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104245811.png)

[![image-1666104252727.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104252727.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104252727.png)

[![image-1666104260540.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104260540.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104260540.png)

[![image-1666104271056.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104271056.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104271056.png)

[![image-1666104276418.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104276418.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104276418.png)

[![image-1666104282605.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104282605.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104282605.png)

[![image-1666104287023.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104287023.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104287023.png)

<span style="text-decoration: none;">Predicting a CI new obs adds a 1 to se(y):</span>

<span style="text-decoration: none;">𝛽<sub><span style="font-weight: normal;">0</span></sub><span style="font-weight: normal;"> </span><span style="font-weight: normal;">+</span><span style="font-weight: normal;"> 𝛽</span><sub><span style="font-weight: normal;">2</span></sub><span style="font-weight: normal;">x +/- t\*</span></span>[![image-1666104297248.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104297248.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104297248.png)

</td><td style="width: 49.9413%;">**Multiple Linear Regression and Estimation**

[![image-1666104323618.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104323618.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104323618.png)

[![image-1666104327905.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104327905.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104327905.png)

[![image-1666104332246.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104332246.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104332246.png)

[![image-1666104336362.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104336362.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104336362.png)

[![image-1666104345652.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104345652.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104345652.png)

[![image-1666104350789.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104350789.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104350789.png)

[![image-1666104355280.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104355280.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104355280.png)

[![image-1666104359234.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104359234.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104359234.png)

[![image-1666104363437.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104363437.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104363437.png)

[![image-1666290673808.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666290673808.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666290673808.png)

[![image-1666290691336.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666290691336.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666290691336.png)

𝐻0 : 𝛽1 = 𝛽2 = 𝛽3 = ⋯ = 𝛽𝑝 = 0   
v.s. 𝐻1 : not all 𝛽𝑘 = 0, 𝑘 = 1, … , 𝑝

[![image-1666104389032.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104389032.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104389032.png)[![image-1666104395117.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104395117.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104395117.png)

rejection rule of 𝑡 &gt;= t(1 − alpha/2; 𝑛 − 𝑝 − 1)

[![image-1666104407108.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104407108.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104407108.png)

[![image-1666104410593.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104410593.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104410593.png)

[![image-1666104415116.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104415116.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104415116.png)

[![image-1666104423017.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104423017.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104423017.png)

</td></tr><tr><td style="width: 49.9413%;">**Model Fitting: Inference**

[![image-1666104576887.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104576887.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104576887.png)

df<sub>Ω</sub> = n - p, and df<sub>𝜔</sub> = n – q

[![image-1666138531298.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666138531298.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666138531298.png)

Reject the null hypothesis if F &gt; Fα p - q, n – p

[![image-1666104643517.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104643517.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104643517.png)

[![image-1666138340542.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666138340542.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666138340542.png)

</td><td style="width: 49.9413%;">**Dummy Variables and Analysis of Covariance**  
Consider a Xi2 for which is 0 for – and 1 for +:

[![image-1666104607845.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104607845.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104607845.png)

An interaction between Xi1 and Xi2:

[![image-1666104615922.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104615922.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104615922.png)

A model with multiple categorical variables:

[![image-1666104625980.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104625980.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104625980.png)

[![image-1666104633030.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104633030.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104633030.png)

</td></tr><tr><td style="width: 49.9413%;">**Regression Diagnostics**  
Assumptions:  
 • Error: ~ N(0, SD2I);   
 ◦ Independent  
 ◦ Equal Variance  
 ◦ Normally Distributed  
 • Model: E\[y\] = Xβ is correct  
 • Unusual observations  
   
Leverage Points: data point with unusual x-value

[![image-1666104773344.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104773344.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104773344.png)

 The Hat Matrix – n\*n matrix  
h<sub>ii </sub>is the leverage of the i<sup>th</sup> case  
leverage &gt; 2p’/n should be looked at closely

  
Outliers: Unusual observation on x or y axis

[![image-1666104790374.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104790374.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104790374.png)

Calculate the t-test and compare abs with limit:  
abs(qt(.05/(n\*2), df = n - pprime - 1, lower.tail = T))

</td><td style="width: 49.9413%;">Influential Points: causes changes to regression  
 Difference in Fits:

[![image-1666104815733.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104815733.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104815733.png)

with a threshold of

[![image-1666104825747.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104825747.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104825747.png)

Where p’ is the number of parameters

 Cook's Distance:

[![image-1666104834542.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666104834542.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666104834542.png)

with a threshold of  
Di &gt; 4/n should be looked at  
Di &gt; .5 possible influence  
Di &gt;= 1 very influential

Error: a plot of e\_hat should  
 • have constant variance  
 • have no clear pattern  
 • H0: residuals are normal

<span style="text-decoration: none;">Shapiro-Wilk normality test</span>

<span style="text-decoration: none;">H0: Residuals are normally distributed</span>

<span style="text-decoration: none;">Bonferroni Correction: Divide alpha by n</span>

</td></tr><tr><td style="width: 49.9413%;">**Variable Selection**

<span style="font-weight: normal;">Backwards Elimination:</span>

1. <span style="font-weight: normal;">Start model with all the predictors</span>
2. <span style="font-weight: normal;">Remove the predictor with highest p-value greater than alpha</span>
3. <span style="font-weight: normal;">Refit the model</span>
4. <span style="font-weight: normal;">Remove the remaining least significant predictor provided its p-value is greater than alpha</span>
5. <span style="font-weight: normal;">Repeat 3 and 4 until all "non-significant" predictors are removed</span>

<span style="font-weight: normal;">Cutoff p significance can be 15-20% for testing</span>

Forward Selection:

1. <span style="font-weight: normal;">Start model with no predictors</span>
2. <span style="font-weight: normal;">For predictors not in the model, check the p-value if they are added to the model. We choose the one with lowest p-value less than alpha</span>
3. <span style="font-weight: normal;">Continue until no new predictors can be added</span>

<span style="font-weight: normal;">Stepwise regression: A combination of the tw</span>o

</td><td style="width: 49.9413%;">Selection Criteria:  
Akaike Information Criterion (AIC):   
 • -2 max log-likelihood + 2p'   
 • n\*log(RSS/n) + 2p'  
Bayes Information Criterion (BIC):   
 • -2 max log-likelihood + p' log(n)   
 • n\*log(RSS/n) + log(n) \* p'   
Adjusted R2:  
R<sup>2</sup> = 1 – RSS/SSY

[![image-1666202623504.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666202623504.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666202623504.png)

Mallow’s C<sub>p</sub> Statistic: Avg MSE of prediction

[![image-1666202639730.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666202639730.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666202639730.png)If a p-predictor fits then:

[![image-1666202661358.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666202661358.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666202661358.png)

We desire models with small p and Cp around or less than p

</td></tr></tbody></table>

**R Code Snippets**

<table border="1" id="bkmrk-%23-model-with-only-be" style="border-collapse: collapse; width: 100%;"><tbody><tr><td style="width: 49.9383%;"><span style="font-family: Courier, monospace;"><span style="font-size: small;">\# Model with only beta\_0</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span style="font-weight: normal;">sr\_lm0 &lt;- lm(</span><span style="font-weight: normal;">y</span><span style="font-weight: normal;"> ~ 1, data=s</span><span style="font-weight: normal;">r</span><span style="font-weight: normal;">)</span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">\# Full model </span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span style="font-weight: normal;">sr\_lm1 &lt;- lm(</span><span style="font-weight: normal;">y </span><span style="font-weight: normal;">~ ., data=s</span><span style="font-weight: normal;">r</span><span style="font-weight: normal;">)</span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">sr\_syy &lt;- sum((savings$sr - mean(savings$sr))^2)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">sr\_rss &lt;- deviance(sr\_lm1)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">\# F = ((SYY -RSS)/((n-1) - (n-2))) / (RSS / (n - 1))</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">sr\_num &lt;- (sr\_syy - sr\_rss)/(df.residual(sr\_lm0) - df.residual(sr\_lm1))</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">sr\_den &lt;- sr\_rss / df.residual(sr\_lm1)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">sr\_f &lt;- sr\_num / sr\_den</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">\# dfΩ = n - p, and df𝜔 = n - q</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">pf(sr\_f, df.residual(sr\_lm0) - df.residual(sr\_lm1), df.residual(sr\_lm1), lower.tail = F)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span style="font-weight: normal;">\# β=(X</span><sup><span style="font-weight: normal;">I</span></sup><span style="font-weight: normal;"> </span><span style="font-weight: normal;">X)</span><sup><span style="font-weight: normal;">−1</span></sup><span style="font-weight: normal;"> X</span><sup><span style="font-weight: normal;">I</span></sup><span style="font-weight: normal;">Y</span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">beta &lt;- solve(t(x)%\*%x)%\*%(t(x)%\*%y)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">\# Pearson's</span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">cor(lin\_reg$fitted.values, lin\_reg$residuals, method="pearson")</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Stratify variables by a factor</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">by(depress, depress$publicassist, summary)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Welsh's Two Sample T-test </span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# </span></span></span><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">For difference in means</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">t.test(assist$cesd, noassist$cesd) </span></span></span><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# or t.test(data.y ~ factor)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# </span></span></span><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">CI of LS means based on covariates</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">library(lsmeans)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">lsmeans(reg, ~Type)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Apply a mean function to an array </span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# split on a factor</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">tapply(assist$cesd, assist$assist, mean)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# When a regression factor has </span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# </span></span></span><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">more than two categories</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">reg &lt;- lm(Pulse1 ~ Height + Sex + Smokes + as.factor(Exercise))</span></span></span></span>

</td><td style="width: 49.9383%;"><span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Cook's Distance </span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">cook &lt;- cooks.distance(reg)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">cook\[cook &gt; 4/n\]</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Shapiro Test for normallity</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">shapiro.test(reg$residuals)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Studentized residuals</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">stud &lt;- rstudent(reg)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Threshold for lower tail of </span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# studentized resids with correction</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">lim = abs(qt(.05/(n\*2), df = n - pprime - 1, lower.tail = T))</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">stud\[which(abs(stud) &gt; lim)\]</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">\# Hat values</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">hat &lt;- hatvalues(reg)</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">lev &lt;- 2 \* pprime / n</span></span></span></span>

<span style="font-size: small;"><span style="font-family: Courier, monospace;"><span lang="zxx"><span style="font-weight: normal;">hat\[hat &gt; lev\]</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">\# </span></span><span lang="zxx"><span style="font-weight: normal;">Forward </span></span><span lang="zxx"><span style="font-weight: normal;">selection</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">forward &lt;- ~ year + unemployed + femlab + marriage + birth + military</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">m0 &lt;- lm(divorce ~ 1, data = usa)</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">reg.forward.AIC &lt;- step(m0, scope = forward, direction = "forward", k = 2)</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">n &lt;- nrow(usa)</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">\# AIC = n\*log(RSS/n) + 2p'</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">n\*log(162.1228/n)+2\*6 </span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">extractAIC(reg.forward.AIC, k=2)</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">\# BIC</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">reg.forward.BIC &lt;- step(m0, scope = forward, direction = "forward", k = log(n))</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">extractAIC(reg.forward,k=log(n)) </span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">\# BIC = n\*log(RSS/n) + p'\*log\*n)</span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;"><span lang="zxx"><span style="font-weight: normal;">n\*log(162.1228/n)+6\*log(n) </span></span></span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">library(leaps)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">leaps &lt;- regsubsets(divorce ~ .)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">rs &lt;- summary(leaps)</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">par(mfrow=c(1,2))</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">plot(2:7, rs$cp, xlab="No. of parameters", ylab="Cp Statistic")</span></span>

<span style="font-family: Courier, monospace;"><span style="font-size: small;">abline(0,1)</span></span>

</td></tr></tbody></table>

[![image-1666105013557.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666105013557.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666105013557.png)

[![image-1666105036225.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666105036225.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666105036225.png)[![image-1666105020734.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666105020734.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666105020734.png)

# Tree Based Methods

**Classification and regression trees** can be generated from multivariable data sets using recursive partitioning with:

- Gini index or entropy for a categorical outcome
- Residual sum of squares for a continuous outcome

We can use **cross-validation** to assess the predictive accuracy of tree-based models.

#### Non-Linear Regression

**Polynomial Regression** adds extra predictors by raising original predictors to a power. Provides a simple way to provide a nonlinear fit to data.

**Step functions/Broken stick regression** cuts the range of a variable into K distinct regions. Fits a piecewise constant function/regression line.

**Regression splines and smoothing lines** is an extension of the two above. The range of X is divided into K distinct regions. Within each region a polynomial function is fit to the data. It is constrained in a way which smoothly joins at the region boundaries, or knots. It can produce an extremely flexible fit if there are enough divisions.

## Classification and Regression Trees

Classification and Regression Trees (CART) are a computer intensive alternative to fitting classical regression models. CART works very well for highly non-linear problems.

- These involve stratifying or segmenting the predictor space into a number of simple regions.
- In order to make a prediction for a given observation we typically use the mean or mode of the training observations in the region which it belongs to.
- **Classification trees** apply when the response it categorical
- **Regression trees** apply when the response is continuous

A regression tree is *basically* just a multiple linear regression model with indicator variables that represent the branches of the tree.

### Regression Tree

[![image-1666304260933.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666304260933.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666304260933.png)

Note: Number of leaves = complexity

1. We divide the predictor space into J distinct and non-overlapping regions R<sub>1</sub>, R<sub>2</sub>, R<sub>j</sub>.
2. The fitted value in a region R<sub>j</sub> is simply the mean of the response values in this region :  
    [![image-1666304470132.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666304470132.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666304470132.png)

Goal: The regions are identified to minimize:  
[![image-1666304497649.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666304497649.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666304497649.png)  
Where Rj is the rectangle corresponding to the jth terminal node and y\_bar<sub>j</sub> is the mean of the training observations in R<sub>j</sub>

A complete search is not feasible and we take a top-down, greedy approach known as **recursive binary splitting**.  
Top-down: Start from the top of the tree  
Greedy: Best solution at each node of the tree (local rather than global)

#### Recursive Binary Splitting

Input: "explanatory" variables and outcome Y

For each variable X<sub>g</sub>,  
<span style="text-decoration: underline;">without splitting </span>X<sub>g,</sub> the variable Y has a certain variability  
[![image-1666304720370.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666304720370.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666304720370.png)  
<span style="text-decoration: underline;">Splitting the data</span> by X<sub>g</sub> &lt; s, then the new RSS is:[![image-1666304774871.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666304774871.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666304774871.png)

- Identify the split of the variable to minimize the new RSS
- Identify the variable whose split minimizes the new RSS
- Repeat in each region in a recursive way

**Residual Mean Deviance**

[![image-1666304961814.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666304961814.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666304961814.png)

T<sub>0</sub> = tree; | T<sub>0</sub> | = number of leaves; y\_bar = mean in region R<sub>j</sub>

### Pruning  


Build a large tree and fold terminal branches back. This can be based on a maximum number of leaves, or use cost-complexity pruning.

Typically pruning **overfits** the data (too many splits). A smaller tree with fewer splits might lead to better interpretation at the cost of higher residual deviance.

Possible alternatives: Build the tree only so long as the decrease in the RSS due to each split exceeds some threshold. This strategy may terminate too early.  
The idea here is to minimize RSS  
For a given alpha search for the tree T nesting into T<sub>0</sub> that minimizes: [![image-1666305701607.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666305701607.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666305701607.png)where alpha=cost; |T| number of leaves  
Use a heuristic search to find alpha

##### Cart for Prediction

To access the accuracy of a regression tree to predict Y in new patients, you can evaluate the predictive accuracy by using a training and testing set.

### Classification Tree

[![image-1666306498770.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666306498770.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666306498770.png)

Input: Explanatory variables and a class label for different samples

The objective is to choose the tree which maximizes the classification accuracy

The procedure (same as regression tree with 2 differences):

- The fitted value at each terminal node is the most frequent label
- The deviance is replaced by a "measure of information" or classification error

#### Shannon Entropy

Shannon entropy is the most common measure of (Lack of) information content of a binary variable:  
[![image-1666306823973.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666306823973.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666306823973.png)

#### Gini's Impurity Index

Gini's Impurity Index is the information content of a binary variable.  
[![image-1666306946461.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666306946461.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666306946461.png)

[![image-1666307471032.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666307471032.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666307471032.png)

### Best Split

For each variable:

- Compute the information gain for each x<sub>gs</sub>
- Identify the split that gives the largest information gain

Do this linear search for each variable and identify the optimal split and maximum gain of information.

Split the node by using the variable that gives the largest information gain with a binary split.

Typically Gini index is used for growing a tree, and **misclassification error** is a good metric for pruning a tree.

### Pros and Cons of Trees

- Very easy to explain, even more so than linear regression
- Some people think decision trees more closely mirror human decision-making than previously seen classification and regression.
- Trees can be displayed graphically and are easily interpreted by a non-expert
- Trees can easily handle qualitative predictors without the need to create dummy variables.
- HOWEVER, trees are not precise for prediction. By using methods like bagging, random forests, and boosting the predictive performance can be substantially improved.

### Bagging, Boosting, and Random Forest

- **Bagging**  is a useful procedure when the goal of analysis is prediction 
    - Generate many subsets of data by randomly sampling subsets of the data (with replacement so the subject can appear more than once in the same set)
    - In each subset generate a tree without pruning
    - Using the average of the prediction from all the trees to compute the final prediciton
- Sampling with replacement to estimate statistical quanitities is known as <span style="text-decoration: underline;">bootstrap</span>
- **Random Forest** is similar to bootstrap 
    - The difference is in each bootstrap sample the set of covariates to build the tree is randomly selected selected rather than using all of them
- **Boosting** grows a family of trees sequentially

Bagging and Random Forest improve prediction at the cost of interpretability.

One can generate an importance value of each covariate as the average improvement of RSS (or entropy or Gini index) due to the covariate in each tree and rank the covariates by this importance value.

The larger the average change in deviance, the more important the variable.

### Classification Rules

The data is split into testing and training sets. We use the training set to build the classification rule; and the test set to evaluate how the classification rule labels new cases with known label.

Accuracy is the rate of correctly classified labels in the test set.  
[![image-1666308799698.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666308799698.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666308799698.png)

Sensitivity is a metric of true positive detection and shows whether the rule is sensitive to identify positive outcomes:   
[![image-1666308861537.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666308861537.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666308861537.png)

Specificity is the measure of true negative detection and showos whether the rule is specific in detection of negative outcomes:

[![image-1666308909287.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/scaled-1680-/image-1666308909287.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-10/image-1666308909287.png)

# Principal Component Analysis

The goal of supervised learning methods (regression and classification) is to predict outcome/response variable Y using a set of p features (X1, X2... Xp) measured on n observations. We train the machine on 'labeled' data to predict outcomes for unforeseen data.

Unsupervised learning is a set of tools (principle component analysis and clustering) intended to explore only a set of features (X1, X2... Xp) and to discover interesting things about these features. This is often performed as part of an exploratory data analysis.

The challenge of unsupervised learning is that is is more subjective than supervised learning, as there is no simple goal for the analysis such as prediction of a response.

### Principle Component Analysis (PCA)

Visualize n observations with measurements on a set of p features as part of an exploratory data analysis. Do this by examining 2-dimensional scatterplots. PCA produces a low-dimensional representation of a dataset that contains as much variation as possible.

Input of PCA is a data matrix in which the columns are centered to have mean 0. Typically the columns represent variables (age, weight, etc) and rows represent different subjects.

Output of PCA is a data matrix Y in which the columns are linear transformations of the columns of X, and they are uncorrelated.

In general the columns has n rows and p columns (variables).

[![image-1667513791946.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1667513791946.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1667513791946.png)

### Method

PCA uses orthogonal transformation to convert the columns of X into new variables called **principal components** that are:

- Linear combinations of the original variables
- Uncorrelated
- Sorted so that the first PC has the largest variance and the rest are descending
- We can have at most as many PCs as columns, assuming p &lt; n

The first principal component is the <span style="text-decoration: underline;">normalized</span> linear combination of the vectors x1, ..., xp that has the largest variance. By normalized we mean:  
[![image-1667514024494.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1667514024494.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1667514024494.png)  
Where the theta elements of the above are *loadings* of the first principal component. We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

#### Calculating the First Principal Component

Assuming that the variables Xi are centered, we search for the loadings that maximize the sample variance of teh first PC, subject to the constraint above that sum of theta\_j1^2 = 1.

We refer y11, y21, ..., yn1 as the scores or realized values of the first principal component where:[![image-1667514310937.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1667514310937.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1667514310937.png)

The average of yi1 (the scores of the first PC) is 0 since it is centered. The sample variance of the values of the n values of yi1 is:  
[![image-1667514598638.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1667514598638.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1667514598638.png)

The loading vector theta\_1 with elements theta\_11, theta\_21, ..., theta\_p1 defines a direction in feature space along which the data vary the most.

If we project the n data points x1, x2, ..., xn onto this direction, the projected values are the principal component scores y11, y21, ..., yn1 themselves.

#### Calculation of the Second Principal Component

After we find [![image-1667514758386.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1667514758386.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1667514758386.png) we can calculate the second PC:  
[![image-1667514787262.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1667514787262.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1667514787262.png)

And so on until all PCs are found. These calculations can be done using the "singular value decomposition".

<p class="callout info">In R the 'prcomp()' function computes principal components by using a **singular value decomposition.**</p>

The advantage of using PCA is that we hope to end up with a <span style="text-decoration: underline;">number of components that is smaller than the number of variables p.</span>

We can use PCA for data reduction techniques. If 2 or 3 PCs explain a large portion of the total variance, then we can use these 2 or 3 variables for analysis rather than the whole set.

## Spectral Decomposition

The spectral decomposition recasts a matrixx in terms of it eignenvalues and eigenvectors. The representation turns out to be very useful.

<p class="callout success">Let M be a real **symmetric** d x d matrix with eigen values lambda\_1, lambda\_2, ..., lambda\_d and corresoinding orthonormal eigenvectors u1, u2, ..., ud then:  
</p>

[![image-1667516527816.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1667516527816.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1667516527816.png)

## R Code  


```
### Example 1
head(USArrests)
dim(USArrests)
sqrt(apply(USArrests,2,var))

plot(USArrests)
# compute principal components
pca1 <- prcomp(USArrests, scale=T)
pca1
(13.2 -mean(USArrests$Murder))/sqrt(var(USArrests$Murder))*( -0.5358995) +
  (236-mean(USArrests$Assault))/sqrt(var(USArrests$Assault))*(-0.5831836) +
  (58-mean(USArrests$UrbanPop))/sqrt(var(USArrests$UrbanPop))*(-0.2781909) +
  (21.2-mean(USArrests$Rape))/sqrt(var(USArrests$Rape))*(-0.5434321)

sum(((USArrests[1,]-pca1$center)/pca1$scale)*pca1$rotation[,1])

## generate summary of loadings
summary(pca1)
plot(pca1)

# extract principal components
pca1$x[1:5,]

# plot PCs
plot(pca1$x[,1:2])
biplot(pca1)
```

# Intro to Cluster Analysis

**Clutsering** refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations, we partition the profiles into distinct groups so that the profiles are similar within the groups but different from other groups. To do this, we must define what makes observations similar or different.

##### PCA vs Clustering

Both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms are different:

- PCA searches for a low-dimensional representation of the observations that explains a good fraction of the variance
- Clustering looks to find homogeneous subgroup among the observations

##### Notation

Input is data with p variables and n subjects  
[![image-1668562361706.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668562361706.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668562361706.png)

A distance between two vectors i and j must obey several rules:

- The distance must be positive definite, d<sub>ij</sub> &gt;= 0
- The distance must be symmetric, d<sub>ij</sub> = d<sub>ij</sub>, so that the distance from j to i is the same as the distance from i to j
- An object is zero distance from itself, d<sub>ii</sub> = 0
- The triangle rule - When considering three objects i, j and k the distance from i to k is always less than or equal to the sum of the distance from i to j and the distance from j to k  
    d<sub>ik</sub> &lt;= d<sub>ij</sub> + d<sub>jk</sub>

#### Clustering Procedures

- **Hierarchical clustering**: Iteratively merges profiles into clusters using a simple search. Start with each profile/cluster and end with one 1. The clustering procedure is represented by a *dendrogram*.
- **K-mean clustering:** Start with a per-specified number of clusters and random allocation of profiles to clusters. Iteratively move profiles from one cluster to the other to optimize some criterion. End up with the same number of clusters.

### K-Means Clustering

We partition the K clusters so that we maximize the similarity within clusters and minimize the similarity between clusters.

We can represent the data with a vector of means, which is the overall profile or cluster:  
[![image-1668606369350.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668606369350.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668606369350.png)

Supposed we split the data into 2 clusters, C1 with 𝑛1 observations of the p variables, and C2 with 𝑛2 observations of the p variables. We would have two different vectors of means representing the centroids of the clusters. The total sum of squares within clusters would be:  
[![image-1668606548035.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668606548035.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668606548035.png)  
And we seek to keep WSS small, but it is NOT guaranteed to give the minimum WSS so ideally one should start from different initial values.

- Start with K "random" clusters by assigning each of the n profiles to one of the K clusters at random
- Iterate until no more changes are possible: 
    - For each of the K clusters, compute the cluster profile (centroid)
    - Assign each observation to the cluster whose centroid is closest (using Euclidean distance)  
        [![image-1668606130150.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668606130150.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668606130150.png)

Example with K = 3, p = 2:  
[![image-1668607255893.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668607255893.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668607255893.png)

#### Standardization

When the variables are measured on different scales the measurement units may bias the cluster analysis. The Euclidean distance is not scale invariant.

### Hierarchical Clustering

This is an alternative approach that does not require a fixed number of clusters. The algorithm essentially rearranges profiles so that similar profiles are displayed next to each other in a tree (dendrogram) and the dissimilar profiles are displayed in different branches.

[![image-1668611245622.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668611245622.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668611245622.png)

[![image-1668611376451.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668611376451.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668611376451.png)

We do this we define the similarity (distance) between:

- Two profiles: s(xi, xi)
- A profile and a cluster: s(xc, xi)
- Two clusters: s(xc, xc)

#### Similarity Between Clusters

##### Complete-Linkage Clustering

- Also known as the maximum or furthest-neighbor method
- The distance between two clusters is calculated as the greatest distance between members of the relevant clusters
- This method tends to produce very compact clusters of elements and the clusters are often similar in size

##### Single-Linkage Clustering

- Referred to as the minimum or nearest neighbor method
- The distance between two clusters is calculated as the minimum distance between members of the relevant clusters
- This method tends to produce clusters that are 'loose' because clusters can be joined if any two members are close together. In particular, this method often results in 'chaining' or sequential addition of single samples to an existing cluster and produces trees with many long, single-addition branches.

##### Average-Linkage Clustering

- The distance between clusters is calculated using average values. There are various methods used for calculating this average. The most common is the unweighted pair-group method average (UPGMA), where each average is calculated from the distance between each point in a cluster and all other points in another cluster and the 2 clusters with the lowest average distance are joined together into a new cluster.

##### Centroid Clustering

- Related methods substitute the centroid for the average.

**Complete and average linkage are similar, but complete linkage is faster because it does not require recalculation of the similarity matrix at each step.**

[![image-1668612158830.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668612158830.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668612158830.png)

### Detection of Clusters

Inspection of the QQ-plot would inform about the existence of clusters in the data.

[![image-1668612379283.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668612379283.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668612379283.png)

- Observed and “expected” distances are statistically indistinguishable would suggest that there are no clusters in the data
- Departure of the QQ-plot from the diagonal line would suggest that there are clusters

We can also use a 95%-tile to detect the number of clusters using extreme percentiles of the reference distribution. The idea is to color the entry of the data set, so that the colors represent the standardized difference of the cell intensity from a baseline. Typically, columns are samples and rows are variables.

[![image-1668612715262.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668612715262.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668612715262.png)

[![image-1668612745091.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668612745091.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668612745091.png)

# Classification

**Classification** is often used to describe modeling of a categorical outcome.

In binary classification the outcome is two possible values/classes and the goal is the predict the correct class using covariates.

**Classification rule:** A mathematical function to predict the outcome of a new sample unit when the values of the covariates are known.

##### Common Classification Problems

- Molecular diagnostic: Using values of gene products (such as biomarkers) develops a diagnostic tool for early detection of diseases.
- Classification of cancer type: Different sub-types of cancer are characterized by a combination of specific markers that are on/off. Gene based classification rules can be used to increase the specificity of the diagnosis.
- Classification of drug response
- Risk prediction

#### Types of Classifiers

- Regression-based: 
    - Logistic regression, CART
- Example-based: 
    - KNN
- Based on Bayes theorem: 
    - Discriminant analysis
- Ensemble of classifiers

To evaluate a classifier, split the data into a training and test set. Use the training set to build the classification rule and the test set to evaluate how the classification rule labels new cases with known label.

### Logistic Regression

[![image-1668726889867.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668726889867.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668726889867.png)  
b<sub>i</sub>, i = 0,1... k can be estimated using Maximum Likelihood

To estimate the probability of a binary outcome as a function of covariates with logistic regression:[![image-1668727009638.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668727009638.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668727009638.png)

##### How to Pick Classification Rule

We can use the above formula for x in \[0,1\] to generate thresholds based on the predicted value to decide how to classify.

Accuracy is the rate of correctly classified labels in the test set:  
[![image-1668727368514.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668727368514.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668727368514.png)

Misclassification error:  
False positive: prediction 1 and true is 0; n01 / n0+  
False negative: predict 0 and true is 1; Fn10 / n1+  
Accuracy: (n00 + n11) / (n00 + n01 + n10 + n11)

Sensitivity (recall) the metric of true positive detection and shows whether the rule is sensitive to identify positive outcomes: 1 - n10 / n1+ = n11/n1+ = 1 - FNR

Specificity is the measure of true negative detection and shows whether the rule is specific in detection of negative outcomes: 1 - n01 / n0+ = n00 / n0+ = 1 - FPR

You CANNOT maximize sensitivity and specificity simultaneously. Maximum sensitivity test always says 1, maximum sensitivity always says 0.

Positive/Negative predicted values: The number of correct all predicted positive/negative values

#### ROC (Receiver Operating Characteristics) Analysis

The ROC curve is a population graphic for simultaneously displaying the two types of errors for all possible thresholds. They are useful for comparing different classifiers since they take into account all possible thresholds.

The overall performance of a classifier, summarized over all possible thresholds is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top-left corner, so the larger the AUC the better the classifier.

We expect a classifier that performs no better than chance to have an AUC of .5

#### Steps to Build and Evaluate Classification Rule:

1. Generate training and test set
2. Generate the classification rule using the training set;
3. Generate the predicted rules in the test set;
4. Use ROC analysis to decide the threshold that gives a good balance between sensitivity and specificity.
5. The next examples will show a variety of methods to generate classification rules (so step 2)

The previous chapter describes methods to create classification trees and K-nearest neighbor.

#### Discriminant Analysis

Logistic regression involves directly modeling P(Y = k | X = x) using logistic function. An alternative approach will model the distribution of the predictors X separately in each of the response classes, then use Bayes theorem to flip these around into estimates for P(Y = k | X =x)

Assuming the covariate x is normally distributed:  
[![image-1668730683857.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668730683857.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668730683857.png)

Binary outcome: Classify as 1 if p(C = 1 | x) &gt; p(C = 0 | x)  
[![image-1668730751004.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668730751004.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668730751004.png)

A special case when the variances of X in the groups are the same:  
[![image-1668730805144.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/scaled-1680-/image-1668730805144.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-11/image-1668730805144.png)

#### More than One Feature

To extend this approach to multiple covariates, one needs to decide:

1. How to model the correlation of the covariates within each group defined by the outcome
2. If the correlation of the covariates changes in different groups

LDA - assumes equal variance-covariance structure between groups  
QDA - assumes group specific variance covariance matrices  
LDA/QDA models all covariates as normal distributions

### Summary

For each method:

1. Fit the classification model using training data
2. Evaluate the classification accuracy in test data using ROC analysis
3. Can also choose a “best” threshold to optimize sensitivity/specificity
4. Compare different classifiers by their AUC

The final classifier should be trained on all data to be used for future applications

There is no single "Best Classification Method". There is clear evidence that different methods work better in some data and worse in others. Thus, we often use the prediction probabilies of various classifiers to build an ensemble, using prediction probabilities from various rules. This is limited as there is no description of mechanism, useful only for prediction.

# Two-Way ANOVA

We've learned about One-Way Analysis of Variance (ANOVA) previously, it is a regression model for one continuous outcome and one categorical variable. It allows us to compare the means of the groups to detect significant differences.

A one-way fixed-effects ANOVA is an extension of the two-sample t-test. There are **factors**, categorical variables, and **levels**, individual groups of the factor which represent different populations. Balanced design contain the same number of individuals in each level. **This is a special case of linear regression.**

Assumptions of one-way ANOVA:

- The data are random samples from k independent populations
- Within each population the dependent variable is normally distributed
- The observations are independent
- The population variance of the dependent variable is equal in all groups (homoscedasticity)

<p class="callout info">H<sub>0</sub>: The mean level is independent of the factor</p>

Ex. The mean testosterone level is independent of smoking history. A conclusion might be "there is evidence that testosterone level is significantly different in at least two of the smoking history groups.

We test the hypothesis by decomposing the overall variance into "explained" and "residual" variances and comparing them.

[![image-1669936903897.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669936903897.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669936903897.png)  
Where N is the total sample size and k is the number of groups  
SSB represents the variability due to the treatment effect. SSW represents the residual variability.

[![image-1669937073680.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669937073680.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669937073680.png)  
We compare the F statistic to the critical value of F with \[(k - 1), (N - k)\] degrees of freedom to test the null hypothesis of equal means.

#### Multiple Comparisons Procedures

To find which means are different from each other we should make all pairwise comparisons (k\*(k-1)/2). This introduces the multiple tests issues and increases likeliness of type I and II errors. Multiple comparison procedures take into account the total number of comparisons being made by lowering the significance level for each test so that the overall significance level is still fixed at some level alpha.

[![image-1669937488081.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669937488081.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669937488081.png)

All of these adjustments are appropriate when comparing pairs of means. Tukey's is the most powerful and provides exact p-values when the group sizes are equal. Scheffe's procedure is more powerful than Bonferroni's procedure, in general.

We can also write ANOVA as a linear regression model where **the parameter alpha\_i are the group effects** (difference of effect between group i and group 1)

### Two Way ANOVA

Two way ANOVA is a regression model for one continuous outcome and **two** categorical variables. We analyze two-way ANOVA experiments using additive and interaction models.

Critical assumptions:

- Observations at different factor levels are samples from normal distributions
- The variances of the different populations are the same

An experiment may also contain replication, for example consider the following tables:  
[![image-1669938856486.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669938856486.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669938856486.png)[![image-1669939027651.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669939027651.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669939027651.png)  
Both are the same data where the two factors are drug dosage and mutation, and we have 3 replications.

We can create summaries of the means:  
[![image-1669938936434.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669938936434.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669938936434.png)  
[![image-1669938962004.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669938962004.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669938962004.png)

##### Possible Models

**Interaction model**: The effect of each factor changes as the other factor levels vary  
[![image-1669938523361.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669938523361.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669938523361.png)

**Additive** **model**: The effect of each factor doesn't change as the other factor levels vary  
[![image-1669938550315.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669938550315.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669938550315.png)

We test different models using ANOVA to look at the global effect of the factors

#### Sum of Squares

The traditional one-way ANOVA uses a decomposition of the sum of squares for analysis.   
[![image-1669939467503.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669939467503.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669939467503.png)

Each sum of squares uses a number of degrees of freedom, given by number of different levels - 1  
[![image-1669939431751.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669939431751.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669939431751.png)

[![image-1669939548755.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669939548755.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669939548755.png)

SS(y) = model SS = SS(a) + SS(b) + SS(g)

These sum of squares are computed sequentially, so **the order of terms in the model matters**! If the design is balances the order does not matter.

#### Steps

1. Goodness of fit/Global null hypothesis  
    H<sub>0</sub>: All parameters = 0  
    H<sub>a</sub>: At least one differs from 0
2. Test interaction 
    - F test is MSI/MSE ~ F<sub>(r-1)(c-1), rc(m-1)</sub>
    - The P-value is p(F<sub>(r-1)(c-1), rc(m-1) </sub>&gt; F | H0)
    - If the p-value is smaller than our alpha then we reject the null
    - If we do not reject the null, there is no evidence that all interaction terms are not 0. Fit an additive model and repeat
3. Test main effects 
    - If we accept the null hypothesis of no interaction, it makes sense to test the significance of the main effects, to do this <span style="text-decoration: underline;">we pool the RSS with SS(y)</span>
    - SS(a) and SS(b) are Type III sum of sqaures

Once a model is accepted one can do mean comparisions to identify the factor levels with different effects (Tukey's).

The balance design means the the design is orthogonal and so there is no need to recompute sum of squares or estimates for different models. If not, analyze the data as a traditional linear model.

#### Problems

When the design is orthogonal/balanced the decomposition SS(a) + SS(b) + SS(g) is unique, when it is not the decomposition depends on the order.  
[![image-1669941015789.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/scaled-1680-/image-1669941015789.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-12/image-1669941015789.png)

## R Code

```
####### Two-way ANOVA
##### drug dataset
drug <- read.csv("anova.csv", header=T)
drug

hbf <- c(t(drug[,3:5]))
SNP <- c(rep("No",12), rep("Yes", 12))
Drug <- c(rep(0,3), rep(10,3), rep(20,3), rep(30,3),
               rep(0,3), rep(10,3), rep(20,3), rep(30,3))
data.drug <- data.frame(hbf,SNP, Drug)
head(data.drug)

### summaries
overall.mean <- mean(hbf); overall.mean
drug.means <- tapply(hbf, Drug, mean); drug.means
snps.means <- tapply(hbf, SNP, mean); snps.means
cell.means <- tapply(hbf, interaction(Drug,SNP), mean); cell.means

dim(cell.means) <- c(4,2); cell.means
cell.means <- data.frame(cell.means)
row.names(cell.means) <- levels(factor(Drug))
names(cell.means) <- levels(factor(SNP))
cell.means 

## Visualization
#install.packages("gplots")
library(gplots)
plotmeans(hbf~Drug, data=data.drug,xlab="Drug", 
		ylab="HbF",  main="Mean Plot\n with 95% CI")
plotmeans(hbf~SNP, data=data.drug,xlab="SNP", 
		ylab="HbF",  main="Mean Plot\n with 95% CI")
plotmeans(hbf~interaction(Drug,SNP), data=data.drug,
		xlab="SNP",  connect=list(1:4,5:8),ylab="HbF", 
          	main="Interaction Plot\nwith 95% CI")

interaction.plot(factor(Drug), factor(SNP), hbf, type="b",  
		xlab="Drug", ylab="hbf", main="Interaction Plot")


### two-way anova with balanced design
mod <- aov(hbf~as.factor(Drug)*SNP, data=data.drug)
summary(mod)
table(mod$fitted.values)

# TukeyHSD(mod)
mod <- lm(hbf~as.factor(Drug)*SNP, data=data.drug)
summary(mod)
anova(mod)
table(mod$fitted.values)

##### Exercise with balance design

mod.a <- aov(hbf~as.factor(Drug)*SNP, data=data.drug)
summary(mod.a)		# anova(mod.a)
mod.lma <- lm(hbf~as.factor(Drug)*SNP, data=data.drug)
anova(mod.lma)

mod.b <- aov(hbf~SNP*as.factor(Drug), data=data.drug)
summary(mod.b)		# anova(mod.b)
mod.lmb <- lm(hbf~SNP*as.factor(Drug), data=data.drug)
anova(mod.lmb)


##### Exercise with unbalance design
data.drug.1 <- read.csv("data.drug.1.csv", header=T)
mod.1a <- aov(hbf~as.factor(Drug)*SNP, data=data.drug.1)
summary(mod.1a)
mod.1lma <- lm(hbf~as.factor(Drug)*SNP, data=data.drug.1)
anova(mod.1lma)


mod.1b <- aov(hbf~SNP*as.factor(Drug), data=data.drug.1)
summary(mod.1b)
mod.1lmb <- lm(hbf~SNP*as.factor(Drug), data=data.drug.1)
anova(mod.1lmb)
```