Module 13: Linear Regression

Correlation and regression attempt to describe the strength and direction of the association between two (or more) continuous variables.

Pearson Correlation

Recall that r is an estimate of the population correlation coefficient:

image-1661955678193.png

It is always between -1 and 1, with 0 indicating no linear relationship (positive or negative) between the variables.

A strong correlation does not imply causality.

It indicates the strength and direction of a linear relationship between two random variables. The square of r, r^2 (also written R^2), is called the coefficient of determination; it measures the proportion of the variance in one variable that is explained by its linear relationship with the other.

r can also be expressed as the average product of the two variables in standard units, where each variable is standardized using its sample mean and sample standard deviation:

image-1661955971008.png
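As a small illustration of the two expressions for r, the sketch below computes it once with the built-in cor() function and once as the product of standard units. It uses R's built-in mtcars data (wt and mpg) purely as example input, which is an assumption of this sketch and not part of the module. Because sd() is the sample standard deviation, the "average" here divides by n - 1.

```r
# Example data: car weight and fuel economy from the built-in mtcars dataset
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Built-in Pearson correlation
r_builtin <- cor(x, y, method = "pearson")

# The same quantity via standard units (z-scores based on sample SDs)
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r_manual <- sum(zx * zy) / (n - 1)

# Coefficient of determination
r_squared <- r_builtin^2

# Both routes should agree up to floating-point error
all.equal(r_builtin, r_manual)
```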

Assumptions for Pearson's Correlation:

  • Observations are independent
  • The association is linear
  • Variables are approximately normally distributed

We can test whether the population correlation is 0 using a test statistic that follows a t-distribution:

t = r / se(r), where se(r) = sqrt((1 - r^2) / (n - 2))

Note that se(r) decreases as n increases, so a larger sample size results in a smaller se(r). The test has n - 2 degrees of freedom.
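The test statistic above can be sketched in R as follows. The example computes t and a two-sided p-value by hand and compares them with the output of cor.test(). The mtcars data are used only as example input, and the two-sided alternative is an assumption of this sketch.

```r
# Example data from the built-in mtcars dataset
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Test statistic by hand: t = r / se(r), with n - 2 degrees of freedom
r      <- cor(x, y)
se_r   <- sqrt((1 - r^2) / (n - 2))
t_stat <- r / se_r
df     <- n - 2

# Two-sided p-value from the t-distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# The same test using the built-in function
ct <- cor.test(x, y, method = "pearson")

all.equal(unname(ct$statistic), t_stat)
all.equal(ct$p.value, p_value)
```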

Simple Linear Regression

Linear regression is used to quantify the relationship between one or more independent variables (X) and a single dependent variable (Y). Simple linear regression is the case where we have one continuous independent variable and one continuous dependent variable.

image-1661958084862.png

The line of best fit is the line that minimizes the least squares (LS) criterion:

image-1661958510054.png

That quantity is the sum of the squared differences between the observed and predicted values. For a regression with only one independent variable, X, minimizing it yields the following equation:

image-1661958589368.png 

Which simplifies to:

image-1661958600629.png
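To make the least squares solution concrete, here is a minimal sketch that computes the slope and intercept by hand and checks them against lm(). It assumes the standard textbook closed form (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)), which may be written differently in the figures above. The names b0, b1, sxx and sxy are chosen for this example, and mtcars is used only as sample data.

```r
# Example data from the built-in mtcars dataset
x <- mtcars$wt
y <- mtcars$mpg

# Sums of squares and cross-products around the means
sxx <- sum((x - mean(x))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))

# Closed-form least squares estimates
b1 <- sxy / sxx                 # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# The slope can equivalently be written in terms of r
b1_from_r <- cor(x, y) * sd(y) / sd(x)

# Compare with the fit returned by lm()
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit)), c(b0, b1))
all.equal(b1, b1_from_r)
```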

Even after we find our best fit line, we should not use it to predict Y for X values outside the range observed in our sample, because we have no evidence that the linear relationship holds there.
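One way to guard against predicting outside the sample range is sketched below: predictions are kept only for new X values that fall within the observed range. The new values in new_wt are hypothetical, and mtcars is used only as sample data. Blanking out-of-range predictions with NA is a design choice of this sketch, not something predict() does on its own.

```r
# Fit a simple linear regression on example data
fit <- lm(mpg ~ wt, data = mtcars)

# Observed range of the independent variable
x_range <- range(mtcars$wt)

# Hypothetical new values; the last one lies beyond the observed weights
new_wt <- c(2.5, 3.5, 7.0)
inside <- new_wt >= x_range[1] & new_wt <= x_range[2]

# predict() will happily extrapolate, so out-of-range values are blanked here
pred <- predict(fit, newdata = data.frame(wt = new_wt))
pred[!inside] <- NA

data.frame(wt = new_wt, inside_range = inside, predicted_mpg = pred)
```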

Estimated Variances of Estimates

image-1661959107275.png

image-1661959120502.png

image-1661959136704.png

The square root of an estimated variance is the standard error (SE) of the corresponding estimate.
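The sketch below computes the standard errors of the intercept and slope by hand and compares them with the values reported by summary(). It assumes the usual simple linear regression formulas: residual variance s^2 = SSE / (n - 2), Var(b1) = s^2 / Sxx, and Var(b0) = s^2 * (1/n + mean(x)^2 / Sxx). These may be written differently in the figures above. The variable names are chosen for this example, and mtcars is used only as sample data.

```r
# Fit a simple linear regression on example data
fit <- lm(mpg ~ wt, data = mtcars)
x   <- mtcars$wt
n   <- length(x)
sxx <- sum((x - mean(x))^2)

# Estimated residual variance: sum of squared residuals over n - 2
s2 <- sum(resid(fit)^2) / (n - 2)

# Estimated variances of the slope and intercept
var_b1 <- s2 / sxx
var_b0 <- s2 * (1 / n + mean(x)^2 / sxx)

# Standard errors are the square roots of the estimated variances
se_b1 <- sqrt(var_b1)
se_b0 <- sqrt(var_b0)

# Compare with the standard errors reported by summary()
se_lm <- coef(summary(fit))[, "Std. Error"]
all.equal(unname(se_lm), c(se_b0, se_b1))
```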

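In R, the lm() function can be used to fit a simple linear regression:

# Fit the model; var_y, var_x and mydata are placeholders for your own data
res <- lm(var_y ~ var_x, data = mydata)

# Coefficient estimates, standard errors, t statistics and p-values
summary(res)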
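As a runnable version of the call above, the following sketch fits the model on the built-in mtcars data (an assumption made only so the example can be run as-is) and pulls individual pieces out of the fitted object.

```r
# Fit mpg as a linear function of weight
res <- lm(mpg ~ wt, data = mtcars)
s   <- summary(res)

# Intercept and slope
coef(res)

# Table of estimates, standard errors, t values and p-values
coef(s)

# R-squared; in simple linear regression this equals the squared Pearson r
s$r.squared
all.equal(s$r.squared, cor(mtcars$wt, mtcars$mpg)^2)

# 95% confidence intervals for the intercept and slope
ci <- confint(res, level = 0.95)
ci
```

In simple linear regression the t test for the slope reported by summary() is equivalent to the t test for r, so both give the same p-value.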