Midterm Cheat Sheet

Linear Regression

Prediction interval for a new observation: adds a 1 under the square root in se(ŷ):

𝛽̂₀ + 𝛽̂₁x* ± t(1 − α/2; n − 2) · σ̂ · sqrt(1 + 1/n + (x* − x̄)² / Sxx)
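A minimal sketch of the difference between the two intervals, using the built-in cars data as a stand-in (an assumption; it is not the course data). The manual calculation assumes the usual simple-regression formula with the extra 1 under the square root.

```r
# Simple regression on a built-in data set (illustrative only)
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)

# Confidence interval for the mean response at speed = 15
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)
# Prediction interval for one new observation at speed = 15
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

# Manual prediction interval: note the leading 1 under the square root
n <- nrow(cars)
sigma_hat <- summary(fit)$sigma
sxx <- sum((cars$speed - mean(cars$speed))^2)
se_pred <- sigma_hat * sqrt(1 + 1 / n + (15 - mean(cars$speed))^2 / sxx)
t_star <- qt(0.975, df = n - 2)
manual_pi <- conf_int[, "fit"] + c(-1, 1) * t_star * se_pred
```

The prediction interval is always wider than the confidence interval at the same x because of that extra term.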

Multiple Linear Regression and Estimation

𝐻0 : 𝛽1 = 𝛽2 = 𝛽3 = ⋯ = 𝛽𝑝 = 0
vs. 𝐻1 : not all 𝛽𝑘 = 0, 𝑘 = 1, … , 𝑝

For a single coefficient (𝐻0 : 𝛽𝑘 = 0), reject if |𝑡| >= t(1 − alpha/2; 𝑛 − 𝑝 − 1)
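As a sketch of the single-coefficient rule, the following compares |t| with the critical value by hand. The mtcars data and the two predictors are stand-ins chosen for illustration, not part of the course material.

```r
# Fit a model with p = 2 predictors (illustrative data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

n <- nrow(mtcars)
p <- 2                                   # number of predictors
alpha <- 0.05

# Critical value t(1 - alpha/2; n - p - 1)
t_crit <- qt(1 - alpha / 2, df = n - p - 1)

# Reject H0: beta_k = 0 when |t| >= critical value
reject <- abs(coefs[, "t value"]) >= t_crit
reject
```

This gives the same decisions as comparing the p-values in summary(fit) with alpha.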

Model Fitting: Inference

df_Ω = n - p and df_𝜔 = n - q, where p and q count the parameters in the full (Ω) and reduced (𝜔) models

Reject the null hypothesis if F > F(α; p - q, n - p)
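A hedged sketch of the nested-model F test, assuming the usual statistic F = ((RSS_𝜔 - RSS_Ω)/(p - q)) / (RSS_Ω/(n - p)). The models and the mtcars data are illustrative choices only.

```r
# Full model (Omega) and a nested reduced model (omega)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)   # p = 4 parameters
reduced <- lm(mpg ~ wt, data = mtcars)            # q = 2 parameters

rss_full <- deviance(full)
rss_red <- deviance(reduced)
df_full <- df.residual(full)     # n - p
df_red <- df.residual(reduced)   # n - q

# F statistic and its upper-tail p-value
f_stat <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_value <- pf(f_stat, df_red - df_full, df_full, lower.tail = FALSE)

# anova() reports the same comparison
f_table <- anova(reduced, full)
```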

Dummy Variables and Analysis of Covariance
Consider an indicator Xi2 that is 0 for – and 1 for +:

Yi = 𝛽0 + 𝛽1 Xi1 + 𝛽2 Xi2 + εi

An interaction between Xi1 and Xi2:

Yi = 𝛽0 + 𝛽1 Xi1 + 𝛽2 Xi2 + 𝛽3 Xi1 Xi2 + εi

A model with multiple categorical variables:
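One way the models in this section might be fit in R, sketched with mtcars as a stand-in: am plays the role of the 0/1 indicator and cyl is a second categorical variable (both are illustrative assumptions).

```r
# Treat am (0/1) and cyl (4/6/8) as categorical
dat <- transform(mtcars, am = factor(am), cyl = factor(cyl))

# Dummy variable only: parallel lines with different intercepts
fit_dummy <- lm(mpg ~ wt + am, data = dat)

# Interaction: separate intercepts and separate slopes
fit_inter <- lm(mpg ~ wt * am, data = dat)

# Multiple categorical variables: one dummy per non-reference level
fit_multi <- lm(mpg ~ wt + am + cyl, data = dat)

names(coef(fit_inter))
names(coef(fit_multi))
```

A factor with k levels contributes k - 1 dummy coefficients, with the first level as the reference.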

Regression Diagnostics
Assumptions:
• Error: ε ~ N(0, σ²I);
◦ Independent
◦ Equal Variance
◦ Normally Distributed
• Model: E[y] = Xβ is correct
• Unusual observations

Leverage Points: data point with unusual x-value

The Hat Matrix: H = X(XᵀX)⁻¹Xᵀ, an n × n matrix
h_ii is the leverage of the i-th case
Leverage > 2p’/n should be looked at closely
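A short sketch that builds the hat matrix directly and checks it against hatvalues(). The model and data are illustrative stand-ins.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)                    # n x p' design matrix

# H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii
H <- X %*% solve(t(X) %*% X) %*% t(X)
lev <- diag(H)

n <- nrow(X)
pprime <- ncol(X)                         # number of parameters

# Cases whose leverage exceeds 2p'/n
high_lev <- lev[lev > 2 * pprime / n]
high_lev
```

The leverages sum to p’, so 2p’/n is twice the average leverage.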

Outliers: observations with an unusual y-value given x (large residual)

Compute the studentized residuals and compare their absolute values with the Bonferroni limit:
abs(qt(0.05 / (2 * n), df = n - pprime - 1, lower.tail = TRUE))

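Influential Points: removing them noticeably changes the fitted regression
Difference in Fits (DFFITS):

DFFITS_i = t_i · sqrt(h_ii / (1 − h_ii)), where t_i is the studentized residual

with a threshold of

|DFFITS_i| > 2 · sqrt(p’/n)

where p’ is the number of parameters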
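A sketch of the DFFITS check, assuming the common definition DFFITS_i = t_i · sqrt(h_ii / (1 − h_ii)) and the rule-of-thumb cutoff 2 · sqrt(p’/n). The fitted model is an illustrative stand-in.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)        # leverages h_ii
t_i <- rstudent(fit)       # studentized (deleted) residuals

# DFFITS by hand
dffits_manual <- t_i * sqrt(h / (1 - h))

n <- nrow(mtcars)
pprime <- length(coef(fit))           # number of parameters
thresh <- 2 * sqrt(pprime / n)

# Cases exceeding the threshold in absolute value
flagged <- dffits_manual[abs(dffits_manual) > thresh]
flagged
```

The built-in dffits(fit) returns the same values.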

Cook's Distance:

D_i = (r_i² / p’) · h_ii / (1 − h_ii), where r_i is the standardized residual

with thresholds of
Di > 4/n should be looked at
Di > .5 possible influence
Di >= 1 very influential
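A sketch that computes Cook's distance from standardized residuals and leverages, assuming the form D_i = (r_i² / p’) · h_ii / (1 − h_ii), and then applies the three cutoffs listed above. Model and data are illustrative.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

r_i <- rstandard(fit)                 # standardized residuals
h <- hatvalues(fit)                   # leverages
pprime <- length(coef(fit))           # number of parameters
n <- nrow(mtcars)

# Cook's distance by hand
cook_manual <- (r_i^2 / pprime) * h / (1 - h)

# Apply the cutoffs from the list above
look_closely <- cook_manual[cook_manual > 4 / n]
possible <- cook_manual[cook_manual > 0.5]
very_influential <- cook_manual[cook_manual >= 1]
```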

Error: a plot of e_hat against the fitted values should
• have constant variance
• have no clear pattern
• be consistent with normal errors

Shapiro-Wilk normality test

H0: Residuals are normally distributed

Bonferroni Correction: divide alpha by the number of tests n

R Code Snippets

# Model with only beta_0
# (assumes a savings data frame with response sr, as used below)

sr_lm0 <- lm(sr ~ 1, data = savings)

# Full model

sr_lm1 <- lm(sr ~ ., data = savings)

sr_syy <- sum((savings$sr - mean(savings$sr))^2)

sr_rss <- deviance(sr_lm1)

# F = ((SYY - RSS) / ((n - 1) - (n - p))) / (RSS / (n - p))

sr_num <- (sr_syy - sr_rss)/(df.residual(sr_lm0) - df.residual(sr_lm1))

sr_den <- sr_rss / df.residual(sr_lm1)

sr_f <- sr_num / sr_den

# dfΩ = n - p, and df𝜔 = n - q

pf(sr_f, df.residual(sr_lm0) - df.residual(sr_lm1), df.residual(sr_lm1), lower.tail = FALSE)

# β̂ = (XᵀX)⁻¹ Xᵀy

beta <- solve(t(x)%*%x)%*%(t(x)%*%y)

# Pearson's correlation between fitted values and residuals

cor(lin_reg$fitted.values, lin_reg$residuals, method="pearson")

# Stratify variables by a factor

by(depress, depress$publicassist, summary)

# Welch's two-sample t-test

# for a difference in means

t.test(assist$cesd, noassist$cesd) # or t.test(data.y ~ factor)

# CI of LS means based on covariates

library(lsmeans)

lsmeans(reg, ~Type)

# Apply a mean function to an array split on a factor

tapply(assist$cesd, assist$assist, mean)

# When a regression factor has

# more than two categories

reg <- lm(Pulse1 ~ Height + Sex + Smokes + as.factor(Exercise))

# Cook's Distance

cook <- cooks.distance(reg)

cook[cook > 4/n]

# Shapiro-Wilk test for normality

shapiro.test(reg$residuals)

# Studentized residuals

stud <- rstudent(reg)

# Bonferroni-corrected two-sided threshold

# for studentized residuals

lim <- abs(qt(0.05 / (2 * n), df = n - pprime - 1, lower.tail = TRUE))

stud[which(abs(stud) > lim)]

# Hat values

hat <- hatvalues(reg)

lev <- 2 * pprime / n

hat[hat > lev]