Multivariable Analysis

BS806: Multivariable analysis for biostatisticians

Simple Linear Regression

One of the first known uses of regression was to study inheritance of traits from generation to generation in a study in the UK from 1893-1898. Karl Pearson organized the collection of heights of mothers and one of their adult daughters over age 18. The mother's height was the predictor variable (x) and the daughter's height was the response variable (y).

The goal of a linear regression is to quantify the relationship between a single independent variable and a single dependent variable. A simple linear regression can be represented by y = β0 + β1x + ε, where β0 is the intercept, β1 is the slope, and ε is the random error.

As with correlation, a strong association in a regression analysis does NOT imply causality.

Additionally, predicting values outside the range of observed x values is not reliable; this is called extrapolation.

Least Squares Estimation

OLS/LS - Ordinary Least Squares is a method that looks for the line that minimizes the "residual sum of squares"

    Residual = Observed - Predicted = y - ŷ

So we can set up an equation for the sum of squared residuals:    ∑(y - ŷ)²

Then substitute the linear regression equation, take the derivative, set it to zero, and solve. The solution comes out to:

image-1662677372848.png

The fitted equation will pass through (x̄, ȳ), the center of the data.

image-1662678589985.png
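
A quick sketch of these formulas in R, using simulated mother/daughter heights (the numbers are made up for illustration):

set.seed(1)
x <- rnorm(100, mean = 62.5, sd = 2.4)   # simulated mothers' heights (inches)
y <- 30 + 0.5*x + rnorm(100, sd = 2)     # simulated daughters' heights

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # Sxy/Sxx
b0 <- mean(y) - b1 * mean(x)             # line passes through (x_bar, y_bar)
c(b0, b1)
coef(lm(y ~ x))                          # matches lm()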

Estimating Variances of LSE

The square root of an estimated variance is called standard error represented by s or se().

image-1662678839105.png

E(Y |X = x) = a function that depends on the value of x

image-1662678847860.png

LSE Assuming Normally Distributed Data

The distributions of the estimates are used for prediction and hypothesis testing. Since the variance of a fixed variable is 0, we can estimate the error of a normal distribution as follows:

image-1662680044931.png

image-1662680250142.png

The estimates are correlated, with covariance −σ²x̄/Sxx

image-1662680297717.png

It can be shown that the Least Squares Estimate is also the Maximum Likelihood Estimate.

Confidence Intervals

The distribution of β̂1 is normal if the variance is known.

When we use estimated variance, we use a Student's t distribution on n-2 degrees of freedom to estimate parameters:

image-1662680596587.png
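
In R, confint() gives these t-based intervals directly; a minimal sketch on the built-in cars data:

fit <- lm(dist ~ speed, data = cars)
confint(fit, level = 0.95)      # t-based CIs on n - 2 df
summary(fit)$coefficients       # estimates, se(), t statistics and p-values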

By finding the standard error of ŷ we can also calculate a confidence interval for a fitted value at a given x.

image-1662913412443.png

The interval width is:

image-1662680664338.png

The width increases as the distance between x and x̄ increases.

ANOVA

Though ANOVA is usually used to determine whether there is a difference in means among data containing 3 or more categories, linear regression under standard conditions is a special case of ANOVA. The name analysis of variance is derived from a partitioning of total variability into its component parts.

image-1662915269966.png

This formula uses an F-distribution with 1 and n − 2 df in place of the t-distribution to correct for the simultaneous inference from our estimate of β1.
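
A minimal illustration in R with the built-in cars data; anova() prints the F statistic on 1 and n − 2 df:

fit <- lm(dist ~ speed, data = cars)
anova(fit)                 # F on 1 and n - 2 df
summary(fit)$fstatistic    # the same F statistic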

Prediction of New Observations

When predicting a new observation (including extrapolation to points outside the data), use the following adjusted formula for the variance V(y):

image-1663276470416.png

Thus, the prediction interval for a new observation will be noticeably wider than the confidence interval for a fitted value.
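
In R, predict() returns both kinds of intervals; the prediction interval for a new observation is the wider one (a sketch on the cars data):

fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 21)
predict(fit, new, interval = "confidence")   # CI for the mean response
predict(fit, new, interval = "prediction")   # wider interval for a new observation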

Other notations commonly seen in our textbook

image-1662677002671.png


Multiple Linear Regression and Estimation

Multiple linear regression analysis can be viewed as an extension of simple linear regression analysis to the situation in which more than one independent variable must be considered.

The general model with response Y and regressors X1, X2,... Xp:

image-1663279551985.png

Suppose we observe data for n subjects with p variables. The data might be presented in a matrix or table like:

image-1663279646413.png

We could then write the model as:

image-1663279874599.png

We can think of Y and ε as (n x 1) vectors and β as a ((p + 1) x 1) vector; p + 1 is the number of predictors plus the intercept. Thus X would be:

image-1663279974357.png

The general linear regression may be written as:

image-1663280050486.png

Or     yi = xi'β + εi

The model is represented as the systematic structure plus the random variation with n dimensions = (p + 1) + { n - (p + 1 ) }

Ordinary Least Squares Estimators

The least squares estimate β_hat of β is chosen by minimizing the residual sum of squares function:

image-1663281298522.png

Differentiating with respect to βi and setting the result equal to 0, we solve:

image-1663281366843.png

The least squares estimate β̂ of β is given by:

image-1663281452357.png

and if the inverse exists:

image-1663281474240.png
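
A sketch of the matrix computation in R, checked against lm() (cars data; the leading column of 1s is the intercept):

X <- cbind(1, cars$speed)                      # design matrix with intercept column
y <- cars$dist
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^(-1) X'y
beta_hat
coef(lm(dist ~ speed, data = cars))            # same estimates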

Fitted Values and Residuals

The fitted values are represented by Ŷ = Xβ̂

image-1663282777062.png

where the hat matrix is defined as H = X(X'X)⁻¹X'

The residual sum of squares (RSS): 

image-1663282850648.png
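
Continuing the matrix sketch above, the hat matrix, fitted values, and RSS can be computed directly:

H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix H = X(X'X)^(-1)X'
y_hat <- H %*% y                         # fitted values Y_hat = HY
rss <- sum((y - y_hat)^2)                # residual sum of squares
c(rss, deviance(lm(dist ~ speed, data = cars)))   # agree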

Gauss-Markov Conditions

In order for estimates of β to have some desirable statistical properties, we need a set of assumptions referred to as the Gauss-Markov conditions; for all i, j = 1... n

  1. E[εi] = 0
  2. E[εi²] = σ²
  3. E[εiεj] = 0, where i ≠ j

Or we can write these in matrix notation as:  E[ε] = 0, E[εε'] = σ²I

The GM conditions imply that E[Y] = Xβ and cov(Y) = E[(Y−Xβ)(Y−Xβ)'] = E[εε'] = σ²I

Under the GM assumptions, the LSE are the Best Linear Unbiased Estimators (BLUE). In this expression, "best" means minimum variance and linear indicates that the estimators are linear functions of y.

The LSE is a good choice, but it does require that the errors are uncorrelated and have equal variance. Even if the errors behave, nonlinear or biased estimates may work better.

Estimating Variance

By definition:

image-1663284680441.png

We can estimate variance by an average from the sample:

image-1663284717348.png

Under the GM conditions, s² is an unbiased estimate of the variance.
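
A sketch of this estimate in R; deviance() returns the RSS, and the result matches the squared residual standard error from summary():

fit <- lm(dist ~ speed, data = cars)
n <- nrow(cars)
pprime <- length(coef(fit))          # number of parameters, p + 1
s2 <- deviance(fit) / (n - pprime)   # RSS / (n - (p + 1))
c(s2, summary(fit)$sigma^2)          # agree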

Total Sum of Squares

Total sum of squares is Syy = SSreg + RSS

The corrected total sum of squares with n-1 degrees of freedom:

image-1663284931783.png

where J is an n x n matrix of 1s and H = X(X'X)⁻¹X'

Regression and Residual Sum of Squares

Regression sum of squares, with p degrees of freedom (p = the number of X variables):

image-1663285018389.png

Residual Sum of Squares with n - (p + 1) degrees of freedom:

image-1663285075431.png

F Test for Regression Relation

To test whether there is a regression relation between Y and a set of variables X, use the hypothesis test:

𝐻0 : 𝛽1 = 𝛽2 = 𝛽3 = ⋯ = 𝛽𝑝 = 0  v.s.  𝐻1 : not all 𝛽𝑘 = 0, 𝑘 = 1, … , 𝑝

We use the test statistic:

image-1663285544018.png

The chance of a Type I error is alpha; the degrees of freedom are p and n − p − 1.

The Coefficient of Determination

Recall this measures the model fit by the proportionate reduction of total variation in Y associated with the use of the set of X variables.

R² = SSreg / Syy = 1 − RSS / Syy

In a multiple linear regression we must adjust the coefficient by the associated degrees of freedom:

image-1663285802285.png

Adding more independent variables to the model can only increase R², but the adjusted Ra² may become smaller when more independent variables are introduced, as the decrease in RSS may be offset by the loss of degrees of freedom.
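
A sketch of both quantities in R on the built-in mtcars data, checked against summary():

fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- deviance(fit)
syy <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
n <- nrow(mtcars); pprime <- length(coef(fit))
r2     <- 1 - rss/syy                               # R^2
r2.adj <- 1 - (rss/(n - pprime)) / (syy/(n - 1))    # adjusted R^2
c(r2, summary(fit)$r.squared, r2.adj, summary(fit)$adj.r.squared)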

T Tests and Intervals

Tests for βk are pretty standard:

image-1663286153070.png

With the rejection rule |𝑡| >= t(1 − 𝛼/2; 𝑛 − 𝑝 − 1)

Likewise, the 1 − α confidence limits for βk are:

image-1663286260560.png

Model Fitting: Inference

Given several predictors and a response, we need to figure out whether all are needed.

Consider a large model, Ω, and a smaller model, 𝜔, which consists of a subset of the predictors in Ω.

Suppose we have a response Y and a vector of p regressors X' = (X1, X2) that we partition into two parts so that:

The general hypothesis test we consider is:

image-1663885960711.png

The null hypothesis is obtained by setting beta_2 = 0

The reasoning is that if RSS𝜔 − RSSΩ is small, the fit of the smaller model is almost as good as that of the larger model. On the other hand, if the difference is large, the superior fit of the larger model would be preferred.

This suggests (RSS𝜔 − RSSΩ)/RSSΩ would be a potentially good test statistic, where the denominator is used for scaling purposes.

F Tests

Suppose the dimension of Ω is p and that of 𝜔 is q. The general formula for the test is:

image-1663886557966.png

where dfΩ = n - p, and df𝜔 = n - q

Thus, we would reject the null hypothesis if F > F(α; p − q, n − p)
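
In R, anova() on two nested fits carries out exactly this F test; a sketch with mtcars (the variable choices are arbitrary):

big   <- lm(mpg ~ wt + hp + disp, data = mtcars)   # Ω, dimension p
small <- lm(mpg ~ wt, data = mtcars)               # 𝜔, dimension q
anova(small, big)   # F = ((RSS𝜔 - RSSΩ)/(p - q)) / (RSSΩ/(n - p))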

Simple Regression

image-1663887396158.png

Recall the ANOVA table for a simple regression:

image-1663887176176.png

Under the null hypothesis:

image-1663887330435.png

The df is the number of observations n minus the number of estimated parameters: n − 1

Under the alternative hypothesis:

image-1663887289267.png

Test of All the Predictors

image-1663887421754.png

image-1663887440224.png

Where p is the number of regressors and n is the sample size.

image-1663887487824.png

One Predictor

Can a particular predictor be dropped from the model?

image-1663887532157.png

A t test can be used with (n - p - 1) degrees of freedom

image-1663887559097.png

The F test may be used as introduced earlier with df of 1 and n − p − 1; ti² here is exactly the F-statistic.

Dummy Variables and Analysis of Covariance

So far we have mostly seen quantitative variables in regression models, but many variables of interest are qualitative (sex, status, etc.). To add such information to a model, we can set up an indicator/dummy variable.

For example, we could set up a variable Xi2 representing sex, coded 0 for type A and 1 otherwise, for the ith individual. The resulting model would look something like:

image-1664492524062.png
Where Xi2 is 0 when the individual is type A. So for type A:

image-1664492608189.png
But for any other type:

image-1664492625910.png
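
In R, declaring a variable as a factor creates the 0/1 indicator automatically; a sketch with made-up data:

dat <- data.frame(y   = c(10.1, 11.3, 9.8, 13.0, 12.4, 13.8),
                  x1  = c(1.2, 2.1, 0.8, 1.5, 2.3, 1.9),
                  sex = factor(c("A", "A", "A", "B", "B", "B")))
fit <- lm(y ~ x1 + sex, data = dat)
model.matrix(fit)   # the sexB column is the 0/1 dummy variable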

Models with Interaction

If we have a first-order regression model with an interaction we can represent it with an interaction term:

image-1664490098266.png

We can illustrate interaction as follows:

image-1664490347740.png
β2 indicates how much greater (or smaller) the Y intercept is for the class coded 1 than for the class coded 0.
β3 indicates how much greater (or smaller) the slope is for the class coded 1 than for the class coded 0.
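
Continuing the sketch above, the * operator in an R formula adds the interaction term:

fit.int <- lm(y ~ x1 * sex, data = dat)   # main effects plus x1:sex interaction
coef(fit.int)   # sexB shifts the intercept (beta2); x1:sexB shifts the slope (beta3)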

Raw Mean

A raw mean is simply an average of the observations without considering other covariates. Least squares means (sometimes called adjusted means) are adjusted for other covariates, since they are estimated from a linear regression.

Qualitative Variable with 2+ Classes

If there are more than 2 classes to a qualitative variable, we require additional indicator variables in the regression model.

image-1664492179261.png

Resulting in a model like:

image-1664492245310.png

And the X matrix would look like:

image-1664492264229.png

image-1664492322594.png

Broken Stick Regression, Polynomial Regression, Splines

Transformations of the response and predictors can improve the fit and correct violations of model assumptions such as non-constant error variance. This chapter focuses on the transformation of predictors.

The idea behind Broken Stick Regression/Segmented Regression is that different linear regression models may apply in different regions of the data. We define two basis functions:

image-1664493768291.png
where c marks the division between groups. Bl(x) and Br(x) form a first-order spline basis with knot point c. Sometimes Bl(x) and Br(x) are called hockey-stick functions.

image-1664493964181.png
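
A sketch of the two hockey-stick basis functions in R, fit with lm() on simulated data (the knot c = 5 is arbitrary):

Bl <- function(x, c) ifelse(x < c, c - x, 0)    # left basis function
Br <- function(x, c) ifelse(x >= c, x - c, 0)   # right basis function

set.seed(4)
x <- seq(0, 10, by = 0.1)
y <- 2 + 1.5 * Br(x, 5) + rnorm(length(x), sd = 0.5)   # slope changes at x = 5
fit <- lm(y ~ Bl(x, 5) + Br(x, 5))   # separate slopes on each side of the knot
coef(fit)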

Polynomial Regression

Polynomial regression models are frequently used curvilinear response models in practice. They are easy to handle as special cases of the linear regression model; however, in reality the polynomial is an approximation rather than an exact representation.

image-1664494051482.png

We choose d either by adding terms until the added term is not statistically significant, or by starting with a large d and eliminating non-significant terms beginning with the highest-order term.
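
In R, poly() adds the polynomial terms; a sketch of the backward strategy on the cars data:

fit4 <- lm(dist ~ poly(speed, 4), data = cars)   # start with a large d
summary(fit4)$coefficients   # drop the highest-order term if not significant
fit2 <- lm(dist ~ poly(speed, 2), data = cars)   # refit with a smaller d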

Regression Splines

 

Regression Diagnostics

The estimation and inference from the regression model depends on several assumptions. These assumptions need to be checked using regression diagnostics.

We divide the potential problems into three categories:

Diagnostic Techniques

Model building is often an iterative and interactive process. It is quite common to repeat the diagnostics on a succession of models.

Unobservable Random Errors

Recall a basic multiple linear regression model is given by:
    E[Y|X] = Xβ   and   Var(Y|X) = σ²I

The vector of errors is ε = Y − E(Y|X) = Y − Xβ, where ε is a vector of unobservable random variables with:
    E(ε | X) = 0
    Var(ε | X) = σ²I

We estimate beta with
image-1665094586198.png
and the fitted values Ŷ corresponding to the observed values Y are:
image-1665094623407.png
where H is the hat matrix, defined as:
image-1665094651063.png

The Residuals

The vector of residuals ê, which can be graphed, is defined as:
image-1665094692646.png
    E(ê | X) = 0
    Var(ê | X) = σ²(I − H)
    Var(ê_i | X) = σ²(1 − hii), where hii is the ith diagonal element of H.

Diagnostic procedures are based on the residuals which we would like to assume behave as the unobservable errors would.

Cases with large values of hii will have small values for Var(e_hat_i | X)

The Hat Matrix

The hat matrix H is an n x n symmetric matrix

image-1665095127938.png

hii is also called the leverage of the ith case. As hii approaches 1, y_hat_i gets close to y_i.

Error Assumptions

We wish to check the independence, constant variance, and normality of the errors  𝜀. The errors are not observable, but we can examine the residuals e_hat.

They are NOT interchangeable with the errors.
image-1665095273422.png

The errors may have equal variance and be uncorrelated while the residuals do not. The impact of this is usually small and diagnostics are often applied to the residuals in order to check the assumptions on the error.

Constant Variance

Check whether the variance in the residuals is related to some other quantity, such as Ŷ or Xi.

Normality Assumptions

The tests and confidence intervals we used are based on the assumption of normal errors. The residuals can be assessed for normality using a QQ plot. This compares residuals to "ideal" normal observations.

Suppose we have a sample of size n: x1, x2, ..., xn, and wish to examine whether the x's are a sample from a normal distribution.

Many statistics have been proposed for testing a sample for normality. One of these that works well is the Shapiro and Wilk W statistic, which is the square of the correlation between the observed order statistics and the expected order statistics.

Testing for Curvature

One helpful test looks for curvature in the plot. Suppose we have residual e_hat vs a quantity U where U could be a regressor or a combination of regressors.

A simple test for curvature is to add U² to the model and test whether its coefficient is zero (this is the curvature test reported by residualPlots() in the code below).

Unusual Observations

Some observations do not fit the model well; these are called outliers. Some observations change the fit of the model in a substantive manner; these are called influential observations. If an observation has the potential to influence the fit, it is called a leverage point.

hii is called the leverage and is a useful diagnostic.

Var(ê_i | X) = σ²(1 − hii)
    A large leverage will make Var(ê_i | X) small
    The fit will be forced close to yi

image-1665097665011.png = number of parameters
    An average value for hii is p'/n
    A "rule of thumb": leverage > 2p'/n should be looked at more closely

hij = xi'(X'X)⁻¹xj
    Leverage only depends on X, not Y

Suppose that the i-th case is suspected to be an outlier:

Alternative Method

Bonferroni Correction

Even though we might explicitly test only one or two large ti by identifying them as large, we are implicitly testing all cases. So, a multiple-testing correction such as the Bonferroni correction should be implemented. Suppose we want a level alpha test:

image-1665099162424.png

This suggests that if an overall level alpha test is required, then a level of alpha/n should be used in each of the tests.

Notes on Outliers:

When handling outliers:

Influential Observations

An influential point is one whose removal from the dataset causes a large change in the fit. An influential point may or may not be an outlier or a leverage point.

Two measures for identifying influential observations (difference in fits and Cook's distance):

These rules are guidelines only, not a hard rule.

Code

## Test for normality
library(faraway)   # provides the gala and savings datasets
gs <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)
g <- lm(Species ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)

par(mfrow=c(2,2))
plot(fitted(g), residuals(g), xlab="fitted", ylab="Residuals")

qqnorm(residuals(g), ylab="Residuals")
qqline(residuals(g))

hist(g$residuals)

## Testing for curvature
library(alr4)
m2 <- lm(fertility ~ log(ppgdp) + pctUrban, UN11)
residualPlots(m2)
summary(m2)$coeff

## Testing for Outliers
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
n <- nrow(savings)
pprime <- 5   # number of parameters
jack <- rstudent(g)   # studentized residual
jack[which.max(abs(jack))]  # maximum studentized residual

# threshold for lower tail
qt(0.05/(50*2), df = n-pprime-1 , lower.tail=TRUE) 

#### influential points
cook <- cooks.distance(g)
n <- nrow(savings)
pprime <- 5

check <- cook[cook > 4/n]  # rule of thumb
sort(check, decreasing=TRUE)[1:5]  # list the five largest

cook[cook>0.5]    # check Di>0.5
cook[(pf(cook, pprime, n-pprime)>0.5)] # use F-dist

influenceIndexPlot(g)

Variable Selection

Variable selection is intended to select the "best subset" of predictors. Variable selection shouldn't be separated from the rest of the analysis: outliers and influential points can change the model we select, and transformations of the variables can have an impact on the model selection. Some iteration and experimentation is often necessary to find a better model.

The aim of variable selection is to construct a model that predicts or explains the relationships in the data. Automatic variable selection is not guaranteed to be consistent with these goals, so use these methods as a guide only.

The "best" subset of predictors:

There are two types of variable selection we will cover today: testing-based and criterion-based procedures.

Model Hierarchy

Some models have a natural hierarchy, e.g., polynomial regression models (x² is a higher-order term than x). When selecting variables it is important to respect the hierarchy: lower-order terms should not be removed from the model before higher-order terms in the same variable.

Consider the model:
image-1665699110928.png
Suppose the summary shows that the term in x is not significant but x² is. If we removed x our model would become:
image-1665699207039.png
Then let's change the scale and replace x with (x + a). The model would become:
image-1665699275489.png
The first-order x reappears! The interpretation should not depend on the scale.

Scale changes should not make any important changes to the model.

Testing-Based Procedures

Backward Elimination

  1. Start with all the predictors in the model
  2. Remove the predictor with highest p-value greater than alpha
  3. Refit the model
  4. Remove the remaining least significant predictor provided its p-value is greater than alpha
  5. Repeat 3 and 4 until all "non-significant" predictors are removed

Alpha is sometimes called the "p-to-remove" and does not have to be 5%. For prediction purposes, a 15-20% cutoff may be best.

Forward Selection

This reverses the backward method:

  1. Start with no variables
  2. For each predictor not in the model, check the p-value it would have if added to the model. Choose the one with the lowest p-value less than alpha
  3. Continue until no new predictors can be added

Stepwise Regression

Combination of backwards elimination and forward selection.

Notes on Testing-Based Procedures

Criterion-Based Procedures

Criteria for model selection are based on lack of fit of a model and its complexity.

Some possible criteria:

Note p' is the number of parameters including the intercept.

Small values of AIC and BIC are preferred. So better candidate sets will have smaller RSS and a smaller number of terms p. Larger models fit better and have smaller RSS but use more parameters. The goal is to find a balance between RSS and p.

BIC penalizes larger models more heavily, so smaller models are preferred under BIC as compared to AIC.

Adjusted R2

R² = 1 − RSS/SSY

Another commonly used criterion is adjusted R², written Ra²:
image-1665703334997.png

Mallow's Cp Statistics

A good model should predict well, so the average mean square error of prediction might be a good criterion:
image-1665703426676.png
which can be estimated by the Cp statistic:
image-1665703466451.png
where sigma squared comes from the model with ALL predictors and RSSp is the RSS from a model with p parameters.
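
A sketch of Cp by hand in R, where the fit with all predictors supplies sigma squared (the mtcars model choices are arbitrary):

full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)   # model with ALL predictors
sigma2 <- summary(full)$sigma^2
sub <- lm(mpg ~ wt + hp, data = mtcars)                  # candidate submodel
pprime <- length(coef(sub)); n <- nrow(mtcars)
cp <- deviance(sub)/sigma2 + 2*pprime - n   # Cp = RSSp/sigma^2 + 2p' - n
cp   # a good submodel has Cp near (or below) p'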

R Code

install.packages("faraway")
library(faraway)
data(state)
names(state)
statedata <- data.frame(state.x77, row.names=state.abb)

names(statedata)
write.csv(statedata, file="statedata.csv", quote=FALSE, row.names=F)

############# from here
## the data were collected from U.S. Bureau of the Census. 
## use life expectancy as the response
## the remaining variables as predictors

statedata <- read.csv("statedata.csv")
g <- lm(Life.Exp ~ ., data = statedata)
summary(g)

### backward elimination
g <- lm(Life.Exp ~ ., data = statedata)
summary(g)$coefficients

g <- update(g, .~ . - Area)
summary(g)$coefficients

g <- update(g, .~ . - Illiteracy)
summary(g)$coefficients

g <- update(g, .~ . - Income)
summary(g)$coefficients

g <- update(g, .~ . - Population)
summary(g)$coefficients
summary(g)

## step(lm(Life.Exp ~ ., data = statedata), 
#	scope=list(lower=as.formula(.~Illiteracy)), direction="backward")

### the variables omitted from the model may still be related to the response
summary(lm(Life.Exp ~ Illiteracy+Murder+Frost, statedata))$coeff
 
## forward
f <- ~Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
m0 <- lm(Life.Exp ~ 1, data = statedata)
m.forward <- step(m0, scope = f, direction ="forward", k=2)

## hand calculate the AIC value
aov(lm(Life.Exp ~ Murder, data=statedata))
# AIC = n*log(RSS/n) + 2p

n <- nrow(statedata)
n*log(34.46/n)+2*2

extractAIC(m.forward, k=2)		# by default k=2, AIC
extractAIC(m.forward,k=log(50))		# k=log(n), BIC

## final model using AIC
summary(m.forward)$coefficients

## use BIC
n <- nrow(statedata)
m.forward.BIC <- step(m0, scope = f, direction ="forward", 
			k=log(n), trace=FALSE)
summary(m.forward.BIC)$coefficients


### backward
m1 <- update(m0, f)
m.backward <- step(m1, scope = c(lower= ~ 1), 
			direction = "backward", trace=FALSE)
summary(m.backward)$coefficients

### stepwise
m.stepup <- step(m0, scope=f, direction="both", trace=FALSE)
summary(m.stepup)$coefficients

########## about Cp
install.packages("leaps")
library(leaps)
leaps <- regsubsets(Life.Exp ~ ., data = statedata)
rs <- summary(leaps)
par(mfrow=c(1,2))
plot(2:8, rs$cp, xlab="No. of parameters",
		ylab="Cp Statistic")
abline(0,1)

plot(2:8, rs$adjr2, xlab="No. of parameters",
		ylab="Adjusted R-Squared")
abline(0,1)
names(rs)

rs

Midterm Cheat Sheet

Linear Regression

image-1666104229251.png

image-1666104239088.png

image-1666104245811.png

image-1666104252727.png

image-1666104260540.png

image-1666104271056.png

image-1666104276418.png

image-1666104282605.png

image-1666104287023.png

Predicting a CI for a new observation adds a 1 under the square root in se(ŷ):

β̂0 + β̂1x +/- t*image-1666104297248.png

Multiple Linear Regression and Estimation

image-1666104323618.png

image-1666104327905.png

image-1666104332246.png

image-1666104336362.png

image-1666104345652.png

image-1666104350789.png

image-1666104355280.png

image-1666104359234.png

image-1666104363437.png

image-1666290673808.png

image-1666290691336.png

𝐻0 : 𝛽1 = 𝛽2 = 𝛽3 = ⋯ = 𝛽𝑝 = 0  
v.s.  𝐻1 : not all 𝛽𝑘 = 0, 𝑘 = 1, … , 𝑝

image-1666104389032.pngimage-1666104395117.png

rejection rule of |𝑡| >= t(1 − alpha/2; 𝑛 − 𝑝 − 1)

image-1666104407108.png

image-1666104410593.png

image-1666104415116.png

image-1666104423017.png

Model Fitting: Inference

image-1666104576887.png

dfΩ = n - p, and df𝜔 = n – q

image-1666138531298.png

Reject the null hypothesis if F > F(α; p − q, n − p)

image-1666104643517.png

image-1666138340542.png


Dummy Variables and Analysis of Covariance
Consider an Xi2 which is 0 for one class (–) and 1 for the other (+):

image-1666104607845.png

An interaction between Xi1 and Xi2:

image-1666104615922.png

A model with multiple categorical variables:

image-1666104625980.png

image-1666104633030.png

Regression Diagnostics
Assumptions:
    • Error: ε ~ N(0, σ²I); 
        ◦ Independent
        ◦ Equal Variance
        ◦ Normally Distributed
    • Model: E[y] = Xβ is correct
    • Unusual observations
      
Leverage Points: data point with unusual x-value

image-1666104773344.png

      The Hat Matrix – an n x n matrix
hii is the leverage of the ith case
leverage > 2p’/n should be looked at closely


Outliers: Unusual observation on x or y axis

image-1666104790374.png

Calculate the t-test and compare abs with limit:
abs(qt(.05/(n*2), df = n - pprime - 1, lower.tail = T))

 

Influential Points: causes changes to regression
    Difference in Fits:

image-1666104815733.png

with a threshold of

image-1666104825747.png

Where p’ is the number of parameters 

    Cook's Distance:

image-1666104834542.png

with a threshold of
Di > 4/n should be looked at
Di > .5 possible influence
Di >= 1 very influential

Error: a plot of e_hat should
    • have constant variance
    • have no clear pattern
    • H0: residuals are normal

Shapiro-Wilk normality test

H0: Residuals are normally distributed

Bonferroni Correction: Divide alpha by n

Variable Selection

Backwards Elimination:

  1. Start model with all the predictors

  2. Remove the predictor with highest p-value greater than alpha

  3. Refit the model

  4. Remove the remaining least significant predictor provided its p-value is greater than alpha

  5. Repeat 3 and 4 until all "non-significant" predictors are removed

Cutoff p significance can be 15-20% for testing

Forward Selection:

  1. Start model with no predictors

  2. For predictors not in the model, check the p-value if they are added to the model. We choose the one with lowest p-value less than alpha

  3. Continue until no new predictors can be added

Stepwise regression: A combination of the two


Selection Criteria:
Akaike Information Criterion (AIC): 
    • -2 max log-likelihood + 2p' 
    • n*log(RSS/n) + 2p'
Bayes Information Criterion (BIC): 
    • -2 max log-likelihood + p' log(n) 
    • n*log(RSS/n) + log(n) * p' 
Adjusted R2:
R2 = 1 – RSS/SSY

image-1666202623504.png

Mallow’s Cp Statistic: Avg MSE of prediction

image-1666202639730.png
If a model with p parameters fits well, then:

image-1666202661358.png

We desire models with small p and Cp around or less than p

R Code Snippets

# Model with only beta_0
library(faraway)   # provides the savings data (response: sr)
sr_lm0 <- lm(sr ~ 1, data = savings)

# Full model
sr_lm1 <- lm(sr ~ ., data = savings)

sr_syy <- sum((savings$sr - mean(savings$sr))^2)
sr_rss <- deviance(sr_lm1)

# F = ((SYY - RSS)/(df0 - df1)) / (RSS / df1)
sr_num <- (sr_syy - sr_rss)/(df.residual(sr_lm0) - df.residual(sr_lm1))
sr_den <- sr_rss / df.residual(sr_lm1)
sr_f <- sr_num / sr_den

# dfΩ = n - p, and df𝜔 = n - q
pf(sr_f, df.residual(sr_lm0) - df.residual(sr_lm1), df.residual(sr_lm1), lower.tail = F)


# beta_hat = (X'X)^(-1) X'Y
beta <- solve(t(x)%*%x)%*%(t(x)%*%y)

# Pearson's correlation of fitted values and residuals
cor(lin_reg$fitted.values, lin_reg$residuals, method="pearson")


# Stratify variables by a factor
by(depress, depress$publicassist, summary)

# Welch's two-sample t-test for difference in means
t.test(assist$cesd, noassist$cesd) # or t.test(data.y ~ factor)

# CI of LS means based on covariates
library(lsmeans)
lsmeans(reg, ~Type)

# Apply a mean function to an array, split on a factor
tapply(assist$cesd, assist$assist, mean)

# When a regression factor has more than two categories
reg <- lm(Pulse1 ~ Height + Sex + Smokes + as.factor(Exercise))


# Cook's Distance
cook <- cooks.distance(reg)
cook[cook > 4/n]

# Shapiro test for normality
shapiro.test(reg$residuals)

# Studentized residuals
stud <- rstudent(reg)

# Threshold for lower tail of studentized resids with correction
lim <- abs(qt(.05/(n*2), df = n - pprime - 1, lower.tail = T))
stud[which(abs(stud) > lim)]

# Hat values
hat <- hatvalues(reg)
lev <- 2 * pprime / n
hat[hat > lev]


# Forward selection
forward <- ~ year + unemployed + femlab + marriage + birth + military
m0 <- lm(divorce ~ 1, data = usa)
reg.forward.AIC <- step(m0, scope = forward, direction = "forward", k = 2)
n <- nrow(usa)

# AIC = n*log(RSS/n) + 2p'
n*log(162.1228/n)+2*6
extractAIC(reg.forward.AIC, k=2)

# BIC
reg.forward.BIC <- step(m0, scope = forward, direction = "forward", k = log(n))
extractAIC(reg.forward.BIC, k=log(n))

# BIC = n*log(RSS/n) + p'*log(n)
n*log(162.1228/n)+6*log(n)

library(leaps)
leaps <- regsubsets(divorce ~ ., data = usa)
rs <- summary(leaps)
par(mfrow=c(1,2))
plot(2:7, rs$cp, xlab="No. of parameters", ylab="Cp Statistic")
abline(0,1)

image-1666105013557.png

image-1666105036225.pngimage-1666105020734.png

Tree Based Methods

Classification and regression trees can be generated from multivariable data sets using recursive partitioning with:

We can use cross-validation to assess the predictive accuracy of tree-based models.

Non-Linear Regression

Polynomial regression adds extra predictors by raising the original predictors to a power. It provides a simple way to obtain a nonlinear fit to the data.

Step functions/broken stick regression cut the range of a variable into K distinct regions and fit a piecewise constant function or regression line.

Regression splines and smoothing splines are an extension of the two above. The range of X is divided into K distinct regions, and within each region a polynomial function is fit to the data. It is constrained so that the pieces join smoothly at the region boundaries, or knots. This can produce an extremely flexible fit if there are enough divisions.

Classification and Regression Trees

Classification and Regression Trees (CART) are a computer intensive alternative to fitting classical regression models. CART works very well for highly non-linear problems.

A regression tree is basically just a multiple linear regression model with indicator variables that represent the branches of the tree.

Regression Tree

image-1666304260933.png

Note: Number of leaves = complexity

  1. We divide the predictor space into J distinct and non-overlapping regions R1, R2, ..., RJ.
  2. The fitted value in a region Rj is simply the mean of the response values in this region:

    image-1666304470132.png

Goal: The regions are identified to minimize:
image-1666304497649.png
Where Rj is the rectangle corresponding to the jth terminal node and y_barj is the mean of the training observations in Rj

A complete search is not feasible and we take a top-down, greedy approach known as recursive binary splitting.
Top-down: Start from the top of the tree
Greedy: Best solution at each node of the tree (local rather than global)
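
Before the details, a minimal regression-tree sketch in R using the rpart package (one common implementation; the mtcars variables are arbitrary):

library(rpart)
fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")
fit                    # printed splits from recursive binary splitting
plot(fit); text(fit)   # each leaf shows the mean response within its region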

Recursive Binary Splitting

Input: "explanatory" variables and outcome Y

For each variable Xg,
without splitting Xg, the variable Y has a certain variability
image-1666304720370.png
Splitting the data by Xg < s, the new RSS is:
image-1666304774871.png

Residual Mean Deviance

image-1666304961814.png

T0 = tree; | T0 | = number of leaves; y_bar = mean in region Rj

Pruning

Build a large tree and fold terminal branches back. This can be based on a maximum number of leaves, or use cost-complexity pruning.

Typically the full, unpruned tree overfits the data (too many splits). A smaller tree with fewer splits might lead to better interpretation at the cost of higher residual deviance.

Possible alternative: build the tree only so long as the decrease in the RSS due to each split exceeds some threshold. This strategy may terminate too early.
In cost-complexity pruning the idea is still to minimize RSS:
For a given alpha, search for the tree T nested within T0 that minimizes: image-1666305701607.png where alpha = cost and |T| = number of leaves
Use a heuristic search to find alpha

CART for Prediction

To assess the accuracy of a regression tree for predicting Y in new patients, you can evaluate the predictive accuracy using a training and a testing set.

Classification Tree

image-1666306498770.png

Input: Explanatory variables and a class label for different samples

The objective is to choose the tree which maximizes the classification accuracy

The procedure (same as regression tree with 2 differences):

Shannon Entropy

Shannon entropy is the most common measure of the (lack of) information content of a binary variable:
image-1666306823973.png

Gini's Impurity Index

Gini's impurity index is another measure of the (lack of) information content of a binary variable.
image-1666306946461.png

image-1666307471032.png
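
A sketch of both impurity measures for a node with class-1 proportion p; both are largest at p = 0.5 (most impure) and 0 for a pure node:

entropy <- function(p) ifelse(p %in% c(0, 1), 0, -p*log2(p) - (1 - p)*log2(1 - p))
gini    <- function(p) 2 * p * (1 - p)   # binary case: 1 - p^2 - (1 - p)^2

entropy(c(0.5, 0.9, 1))   # 1.000, 0.469, 0
gini(c(0.5, 0.9, 1))      # 0.50, 0.18, 0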

Best Split

For each variable:

Do this linear search for each variable and identify the optimal split and maximum gain of information.

Split the node by using the variable that gives the largest information gain with a binary split.

Typically Gini index is used for growing a tree, and misclassification error is a good metric for pruning a tree.

Pros and Cons of Trees

Bagging, Boosting, and Random Forest

Bagging and Random Forest improve prediction at the cost of interpretability.

One can generate an importance value of each covariate as the average improvement of RSS (or entropy or Gini index) due to the covariate in each tree and rank the covariates by this importance value.

The larger the average change in deviance, the more important the variable.

Classification Rules

The data is split into testing and training sets. We use the training set to build the classification rule; and the test set to evaluate how the classification rule labels new cases with known label.

Accuracy is the rate of correctly classified labels in the test set.
image-1666308799698.png

Sensitivity is a metric of true positive detection and shows whether the rule is sensitive in identifying positive outcomes:
image-1666308861537.png

Specificity is the measure of true negative detection and shows whether the rule is specific in detecting negative outcomes:

image-1666308909287.png
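
A sketch computing these metrics from a 2x2 confusion matrix in R (the labels below are made up):

pred <- factor(c(1,1,0,0,1,0,1,0), levels = c(0,1))   # predicted labels
true <- factor(c(1,0,0,0,1,0,1,1), levels = c(0,1))   # true labels
tab <- table(true, pred); tab

accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["1","1"] / sum(tab["1",])   # true positives / all true 1s
specificity <- tab["0","0"] / sum(tab["0",])   # true negatives / all true 0s
c(accuracy, sensitivity, specificity)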

Principal Component Analysis

The goal of supervised learning methods (regression and classification) is to predict an outcome/response variable Y using a set of p features (X1, X2... Xp) measured on n observations. We train the machine on 'labeled' data to predict outcomes for unseen data.

Unsupervised learning is a set of tools (principal component analysis and clustering) intended to explore only a set of features (X1, X2... Xp) and to discover interesting things about these features. This is often performed as part of an exploratory data analysis.

The challenge of unsupervised learning is that it is more subjective than supervised learning, as there is no simple goal for the analysis such as prediction of a response.

Principal Component Analysis (PCA)

Visualize n observations with measurements on a set of p features as part of an exploratory data analysis. Do this by examining 2-dimensional scatterplots. PCA produces a low-dimensional representation of a dataset that contains as much variation as possible.

Input of PCA is a data matrix in which the columns are centered to have mean 0. Typically the columns represent variables (age, weight, etc) and rows represent different subjects.

Output of PCA is a data matrix Y in which the columns are linear transformations of the columns of X, and they are uncorrelated.

In general the data matrix has n rows (subjects) and p columns (variables).

image-1667513791946.png

Method

PCA uses orthogonal transformation to convert the columns of X into new variables called principal components that are:

The first principal component is the normalized linear combination of the vectors x1, ..., xp that has the largest variance. By normalized we mean:
image-1667514024494.png
where the θ elements above are the loadings of the first principal component. We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

Calculating the First Principal Component

Assuming that the variables Xi are centered, we search for the loadings that maximize the sample variance of the first PC, subject to the constraint above that the sum of θ_j1² equals 1.

We refer to y11, y21, ..., yn1 as the scores or realized values of the first principal component, where:
image-1667514310937.png

The average of the yi1 (the scores of the first PC) is 0 since the data are centered. The sample variance of the n values yi1 is:
image-1667514598638.png

The loading vector theta_1 with elements theta_11, theta_21, ..., theta_p1 defines a direction in feature space along which the data vary the most.

If we project the n data points x1, x2, ..., xn onto this direction, the projected values are the principal component scores y11, y21, ..., yn1 themselves.

Calculation of the Second Principal Component

After we find image-1667514758386.png we can calculate the second PC:
image-1667514787262.png

And so on until all PCs are found. These calculations can be done using the "singular value decomposition".

In R the 'prcomp()' function computes principal components by using a singular value decomposition.

The advantage of using PCA is that we hope to end up with a number of components that is smaller than the number of variables p.

We can use PCA as a data reduction technique. If 2 or 3 PCs explain a large portion of the total variance, then we can use these 2 or 3 variables for analysis rather than the whole set.

Spectral Decomposition

The spectral decomposition recasts a matrix in terms of its eigenvalues and eigenvectors. This representation turns out to be very useful.

Let M be a real symmetric d x d matrix with eigenvalues lambda_1, lambda_2, ..., lambda_d and corresponding orthonormal eigenvectors u1, u2, ..., ud; then:

image-1667516527816.png
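
A quick check of this identity in R with eigen() (the matrix below is arbitrary):

M <- matrix(c(2, 1, 1, 3), 2, 2)   # real symmetric matrix
e <- eigen(M)
M.rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)
all.equal(M, M.rebuilt)            # TRUE up to rounding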

R Code

### Example 1
head(USArrests)
dim(USArrests)
sqrt(apply(USArrests,2,var))

plot(USArrests)
# compute principal components
pca1 <- prcomp(USArrests, scale=T)
pca1
(13.2 -mean(USArrests$Murder))/sqrt(var(USArrests$Murder))*( -0.5358995) +
  (236-mean(USArrests$Assault))/sqrt(var(USArrests$Assault))*(-0.5831836) +
  (58-mean(USArrests$UrbanPop))/sqrt(var(USArrests$UrbanPop))*(-0.2781909) +
  (21.2-mean(USArrests$Rape))/sqrt(var(USArrests$Rape))*(-0.5434321)

sum(((USArrests[1,]-pca1$center)/pca1$scale)*pca1$rotation[,1])

## generate summary of loadings
summary(pca1)
plot(pca1)

# extract principal components
pca1$x[1:5,]

# plot PCs
plot(pca1$x[,1:2])
biplot(pca1)

Intro to Cluster Analysis

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations, we partition the profiles into distinct groups so that the profiles are similar within groups but different between groups. To do this, we must define what makes observations similar or different.

PCA vs Clustering

Both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms are different:

Notation

Input is data with p variables and n subjects
image-1668562361706.png

A distance between two vectors i and j must obey several rules:

Clustering Procedures

K-Means Clustering

We partition the data into K clusters so as to maximize the similarity within clusters and minimize the similarity between clusters.

We can represent the data with a vector of means, which is the overall profile or cluster:
image-1668606369350.png

Suppose we split the data into 2 clusters, C1 with n1 observations of the p variables and C2 with n2 observations of the p variables. We would have two different vectors of means representing the centroids of the clusters. The total sum of squares within clusters would be:
image-1668606548035.png
We seek to keep WSS small, but the algorithm is NOT guaranteed to find the minimum WSS, so ideally one should start from several different initial values.

Example with K = 3, p = 2:
image-1668607255893.png

Standardization

When the variables are measured on different scales the measurement units may bias the cluster analysis. The Euclidean distance is not scale invariant.
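
A K-means sketch in R on the built-in USArrests data, standardizing first and using several random starts (since the algorithm can get stuck in local minima):

x <- scale(USArrests[, c("Murder", "UrbanPop")])   # standardize the variables
set.seed(2)
km <- kmeans(x, centers = 3, nstart = 25)   # several starts guard against local minima
km$tot.withinss                             # total within-cluster SS (WSS)
table(km$cluster)                           # cluster sizes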

Hierarchical Clustering

This is an alternative approach that does not require a fixed number of clusters. The algorithm essentially rearranges profiles so that similar profiles are displayed next to each other in a tree (dendrogram) and the dissimilar profiles are displayed in different branches.

image-1668611245622.png

image-1668611376451.png

To do this we define the similarity (distance) between:

Similarity Between Clusters

Complete-Linkage Clustering
Single-Linkage Clustering
Average-Linkage Clustering
Centroid Clustering

Complete and average linkage are similar, but complete linkage is faster because it does not require recalculation of the similarity matrix at each step.

image-1668612158830.png
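
A hierarchical-clustering sketch in R with complete linkage on the standardized USArrests data:

d <- dist(scale(USArrests))            # pairwise Euclidean distances
hc <- hclust(d, method = "complete")   # also "single", "average", "centroid"
plot(hc)                               # dendrogram
cutree(hc, k = 3)                      # cut the tree into 3 clusters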

Detection of Clusters

Inspection of the QQ-plot would inform about the existence of clusters in the data.

image-1668612379283.png

We can also use extreme percentiles (e.g., the 95th) of the reference distribution to detect the number of clusters. The heatmap idea is to color the entries of the data set so that the colors represent the standardized difference of the cell intensity from a baseline. Typically, columns are samples and rows are variables.

image-1668612715262.png

image-1668612745091.png

 

Classification

Classification is often used to describe modeling of a categorical outcome.

In binary classification the outcome has two possible values/classes and the goal is to predict the correct class using covariates.

Classification rule: A mathematical function to predict the outcome of a new sample unit when the values of the covariates are known.

Common Classification Problems

Types of Classifiers

To evaluate a classifier, split the data into a training and test set. Use the training set to build the classification rule and the test set to evaluate how the classification rule labels new cases with known label.

Logistic Regression

image-1668726889867.png
The bi, i = 0, 1, ..., k, can be estimated using maximum likelihood.

To estimate the probability of a binary outcome as a function of covariates with logistic regression:
image-1668727009638.png
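
A sketch of a logistic-regression classifier on simulated data with a training/test split (everything below is made up for illustration):

set.seed(3)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- rbinom(100, 1, plogis(-0.5 + 1.2*dat$x1 - 0.8*dat$x2))

train <- sample(1:100, 70)   # training/test split
fit <- glm(y ~ x1 + x2, data = dat[train, ], family = binomial)
p.hat <- predict(fit, newdata = dat[-train, ], type = "response")   # probabilities
pred  <- as.numeric(p.hat > 0.5)   # classify with a 0.5 threshold
table(true = dat$y[-train], pred)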

How to Pick Classification Rule

We can use the predicted probability from the formula above (a value in [0,1]) to generate thresholds for deciding how to classify.

Accuracy is the rate of correctly classified labels in the test set:
image-1668727368514.png

Misclassification error:
False positive: predict 1 when the truth is 0; n01 / n0+
False negative: predict 0 when the truth is 1; n10 / n1+
Accuracy: (n00 + n11) / (n00 + n01 + n10 + n11)

Sensitivity (recall) is the metric of true positive detection and shows whether the rule is sensitive in identifying positive outcomes:  1 − n10/n1+ = n11/n1+ = 1 − FNR

Specificity is the measure of true negative detection and shows whether the rule is specific in detection of negative outcomes:  1 - n01 / n0+  =  n00 / n0+ = 1 - FPR

You CANNOT maximize sensitivity and specificity simultaneously: a maximum-sensitivity test always says 1, and a maximum-specificity test always says 0.

Positive/negative predictive values: the proportion of correct predictions among all predicted positive/negative values.

ROC (Receiver Operating Characteristics) Analysis

The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. ROC curves are useful for comparing different classifiers since they take into account all possible thresholds.

The overall performance of a classifier, summarized over all possible thresholds is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top-left corner, so the larger the AUC the better the classifier.

We expect a classifier that performs no better than chance to have an AUC of .5
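
One way to do this in R is the pROC package (an assumption; other packages such as ROCR also work), continuing the logistic-regression sketch above:

library(pROC)                     # assumes pROC is installed
r <- roc(dat$y[-train], p.hat)    # response labels, predicted probabilities
auc(r)                            # area under the ROC curve
plot(r)                           # the ROC curve itself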

Steps to Build and Evaluate Classification Rule:

  1. Generate training and test set
  2. Generate the classification rule using the training set;
  3. Generate the predicted rules in the test set;
  4. Use ROC analysis to decide the threshold that gives a good balance between sensitivity and specificity.
  5. The next examples will show a variety of methods to generate classification rules (so step 2)

The previous chapter describes methods to create classification trees and K-nearest neighbor.

Discriminant Analysis

Logistic regression involves directly modeling P(Y = k | X = x) using the logistic function. An alternative approach models the distribution of the predictors X separately in each of the response classes, then uses Bayes' theorem to flip these around into estimates for P(Y = k | X = x).

Assuming the covariate x is normally distributed:
image-1668730683857.png

Binary outcome: Classify as 1 if p(C = 1 | x) > p(C = 0 | x)
image-1668730751004.png

A special case when the variances of X in the groups are the same:
image-1668730805144.png

More than One Feature

To extend this approach to multiple covariates, one needs to decide:

  1. How to model the correlation of the covariates within each group defined by the outcome
  2. If the correlation of the covariates changes in different groups

LDA - assumes equal variance-covariance structure between groups
QDA - assumes group specific variance covariance matrices
LDA/QDA models all covariates as normal distributions
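
LDA and QDA are available in the MASS package; a sketch reusing the simulated data and split from the logistic-regression example above:

library(MASS)
lfit <- lda(y ~ x1 + x2, data = dat[train, ])   # common covariance matrix
qfit <- qda(y ~ x1 + x2, data = dat[train, ])   # group-specific covariances
predict(lfit, dat[-train, ])$posterior[1:3, ]   # estimated P(Y = k | x)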

Summary

For each method:

  1. Fit the classification model using training data
  2. Evaluate the classification accuracy in test data using ROC analysis
  3. Can also choose a “best” threshold to optimize sensitivity/specificity
  4. Compare different classifiers by their AUC

The final classifier should be trained on all data to be used for future applications

There is no single "best classification method". There is clear evidence that different methods work better in some data and worse in others. Thus, we often build an ensemble using the prediction probabilities from various rules. This is limited in that there is no description of mechanism; it is useful only for prediction.

Two-Way ANOVA

We've learned about One-Way Analysis of Variance (ANOVA) previously; it is a regression model for one continuous outcome and one categorical variable. It allows us to compare the means of the groups to detect significant differences.

A one-way fixed-effects ANOVA is an extension of the two-sample t-test. There are factors (categorical variables) and levels (individual groups of the factor, which represent different populations). A balanced design contains the same number of individuals in each level. This is a special case of linear regression.

Assumptions of one-way ANOVA:

H0: The mean level is independent of the factor

Ex. The mean testosterone level is independent of smoking history. A conclusion might be "there is evidence that testosterone level is significantly different in at least two of the smoking history groups."

We test the hypothesis by decomposing the overall variance into "explained" and "residual" variances and comparing them.

image-1669936903897.png
Where N is the total sample size and k is the number of groups
SSB represents the variability due to the treatment effect. SSW represents the residual variability.

image-1669937073680.png
We compare the F statistic to the critical value of F with [(k - 1), (N - k)] degrees of freedom to test the null hypothesis of equal means.

Multiple Comparisons Procedures

To find which means are different from each other we would make all pairwise comparisons (k(k−1)/2 of them). This introduces multiple-testing issues and increases the likelihood of type I errors. Multiple comparison procedures take into account the total number of comparisons being made by lowering the significance level for each test so that the overall significance level is still fixed at some level alpha.

image-1669937488081.png

All of these adjustments are appropriate when comparing pairs of means. Tukey's is the most powerful and provides exact p-values when the group sizes are equal. Scheffe's procedure is more powerful than Bonferroni's procedure, in general.
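
A one-way ANOVA sketch in R with Tukey's HSD on the built-in PlantGrowth data:

fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)    # overall F test of equal means
TukeyHSD(fit)   # all pairwise comparisons at a fixed family-wise alpha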

We can also write ANOVA as a linear regression model where the parameter alpha_i are the group effects (difference of effect between group i and group 1)

Two Way ANOVA

Two way ANOVA is a regression model for one continuous outcome and two categorical variables. We analyze two-way ANOVA experiments using additive and interaction models.

Critical assumptions:

An experiment may also contain replication, for example consider the following tables:
image-1669938856486.png
image-1669939027651.png
Both show the same data, where the two factors are drug dosage and mutation, and we have 3 replications.

We can create summaries of the means:
image-1669938936434.png
image-1669938962004.png

Possible Models

Interaction model: The effect of each factor changes as the other factor levels vary
image-1669938523361.png

Additive model: The effect of each factor doesn't change as the other factor levels vary
image-1669938550315.png

We test different models using ANOVA to look at the global effect of the factors

Sum of Squares

The traditional one-way ANOVA uses a decomposition of the sum of squares for analysis.
image-1669939467503.png

Each sum of squares uses a number of degrees of freedom, given by the number of different levels minus 1.
image-1669939431751.png

image-1669939548755.png

SS(y) = model SS = SS(a) + SS(b) + SS(g)

These sums of squares are computed sequentially, so the order of terms in the model matters! If the design is balanced, the order does not matter.

Steps

  1. Goodness of fit/Global null hypothesis
    H0: All parameters = 0
    Ha: At least one differs from 0
  2. Test interaction
    • The F test is MSI/MSE ~ F(r-1)(c-1), rc(m-1)
    • The p-value is P(F(r-1)(c-1), rc(m-1) > F | H0)
    • If the p-value is smaller than our alpha, we reject the null
    • If we do not reject the null, there is no evidence that the interaction terms differ from 0; fit an additive model and repeat
  3. Test main effects
    • If we accept the null hypothesis of no interaction, it makes sense to test the significance of the main effects; to do this we pool the interaction sum of squares into the RSS
    • SS(a) and SS(b) are Type III sums of squares

Once a model is accepted, one can do mean comparisons to identify the factor levels with different effects (e.g., Tukey's).

A balanced design means the design is orthogonal, so there is no need to recompute sums of squares or estimates for different models. If the design is not balanced, analyze the data as a traditional linear model.

Problems

When the design is orthogonal/balanced, the decomposition SS(a) + SS(b) + SS(g) is unique; when it is not, the decomposition depends on the order.
image-1669941015789.png

R Code

####### Two-way ANOVA
##### drug dataset
drug <- read.csv("anova.csv", header=T)
drug

hbf <- c(t(drug[,3:5]))
SNP <- c(rep("No",12), rep("Yes", 12))
Drug <- c(rep(0,3), rep(10,3), rep(20,3), rep(30,3),
               rep(0,3), rep(10,3), rep(20,3), rep(30,3))
data.drug <- data.frame(hbf,SNP, Drug)
head(data.drug)

### summaries
overall.mean <- mean(hbf); overall.mean
drug.means <- tapply(hbf, Drug, mean); drug.means
snps.means <- tapply(hbf, SNP, mean); snps.means
cell.means <- tapply(hbf, interaction(Drug,SNP), mean); cell.means

dim(cell.means) <- c(4,2); cell.means
cell.means <- data.frame(cell.means)
row.names(cell.means) <- levels(factor(Drug))
names(cell.means) <- levels(factor(SNP))
cell.means 

## Visualization
#install.packages("gplots")
library(gplots)
plotmeans(hbf~Drug, data=data.drug,xlab="Drug", 
		ylab="HbF",  main="Mean Plot\n with 95% CI")
plotmeans(hbf~SNP, data=data.drug,xlab="SNP", 
		ylab="HbF",  main="Mean Plot\n with 95% CI")
plotmeans(hbf~interaction(Drug,SNP), data=data.drug,
		xlab="SNP",  connect=list(1:4,5:8),ylab="HbF", 
          	main="Interaction Plot\nwith 95% CI")

interaction.plot(factor(Drug), factor(SNP), hbf, type="b",  
		xlab="Drug", ylab="hbf", main="Interaction Plot")


### two-way anova with balanced design
mod <- aov(hbf~as.factor(Drug)*SNP, data=data.drug)
summary(mod)
table(mod$fitted.values)

# TukeyHSD(mod)
mod <- lm(hbf~as.factor(Drug)*SNP, data=data.drug)
summary(mod)
anova(mod)
table(mod$fitted.values)

##### Exercise with balanced design

mod.a <- aov(hbf~as.factor(Drug)*SNP, data=data.drug)
summary(mod.a)		# anova(mod.a)
mod.lma <- lm(hbf~as.factor(Drug)*SNP, data=data.drug)
anova(mod.lma)

mod.b <- aov(hbf~SNP*as.factor(Drug), data=data.drug)
summary(mod.b)		# anova(mod.b)
mod.lmb <- lm(hbf~SNP*as.factor(Drug), data=data.drug)
anova(mod.lmb)


##### Exercise with unbalanced design
data.drug.1 <- read.csv("data.drug.1.csv", header=T)
mod.1a <- aov(hbf~as.factor(Drug)*SNP, data=data.drug.1)
summary(mod.1a)
mod.1lma <- lm(hbf~as.factor(Drug)*SNP, data=data.drug.1)
anova(mod.1lma)


mod.1b <- aov(hbf~SNP*as.factor(Drug), data=data.drug.1)
summary(mod.1b)
mod.1lmb <- lm(hbf~SNP*as.factor(Drug), data=data.drug.1)
anova(mod.1lmb)