Module 11: ANOVA - Analysis of Variance

ANOVA can be used to compare the means of several populations simultaneously when the dependent variable is continuous. The population variance of the dependent variable must be equal in all groups.

Recall that the two-sample t statistic is

t = (x̄₁ - x̄₂) / SE(x̄₁ - x̄₂)

which is the difference in two means over the standard error of that difference. When comparing multiple independent samples it is easier to use a pooled variance, but to do so the population variances must be equal.

Equality of Variances

The equation for the pooled variance:

s_p² = [(n₁ - 1)s₁² + (n₂ - 1)s₂²] / (n₁ + n₂ - 2)
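As a rough illustration, the pooled variance (the df-weighted average of the two sample variances) and the resulting pooled t statistic can be sketched in Python using only the standard library. The function names and the data are made up for this example and are not part of the module.

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance(x1, x2):
    """Average of the two sample variances, weighted by their df."""
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = variance(x1), variance(x2)  # n - 1 denominators
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled SE; returns (t, df)."""
    n1, n2 = len(x1), len(x2)
    se = sqrt(pooled_variance(x1, x2) * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / se, n1 + n2 - 2


# Hypothetical data, chosen only so the arithmetic is easy to follow
group_a = [4, 6, 8]      # mean 6, sample variance 4
group_b = [1, 3, 5, 7]   # mean 4, sample variance 20/3
sp_sq = pooled_variance(group_a, group_b)   # (2*4 + 3*(20/3)) / 5 = 5.6
t_stat, df = pooled_t(group_a, group_b)
```

The sketch assumes each sample has at least two observations, since a sample variance is undefined otherwise.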

The assumption for pooled variance is that the variances in the two groups are equal. We can test this with H0: σ₁² = σ₂², using the F distribution, which is indexed by the numerator df and the denominator df. Choose the larger estimated variance to be the numerator and the smaller estimated variance to be the denominator.

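Test statistic:

F = s₁² / s₂², where s₁² is the larger sample variance, with n₁ - 1 numerator df and n₂ - 1 denominator df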

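If F exceeds the critical value for the given significance level, the null hypothesis is rejected and we can conclude there is evidence that the two population variances are not equal. Because the larger variance is in the numerator, F is at least 1, so only the upper critical value is needed; for a two-sided test at level α it is the upper α/2 point.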
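A minimal Python sketch of the variance-ratio statistic follows, with the larger sample variance placed in the numerator as described above. The helper name and data are hypothetical. The sketch stops at the statistic and its degrees of freedom, because the standard library has no F distribution; the critical value or p-value would still come from a table or statistical software.

```python
from statistics import variance


def variance_ratio(x1, x2):
    """Return (F, numerator df, denominator df), larger variance on top."""
    v1, v2 = variance(x1), variance(x2)
    if v1 >= v2:
        return v1 / v2, len(x1) - 1, len(x2) - 1
    return v2 / v1, len(x2) - 1, len(x1) - 1


# Hypothetical data for illustration only
group_a = [4, 6, 8]      # sample variance 4
group_b = [1, 3, 5, 7]   # sample variance 20/3 (the larger one)
f_stat, df_num, df_den = variance_ratio(group_a, group_b)
```

Note that the degrees of freedom swap along with the variances: here the numerator df comes from group_b. The function would fail with a division by zero if the smaller sample variance were exactly 0.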

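The F distribution is not symmetric, which makes critical values hard to look up in tables. It can be done in R: qf(1 - alpha/2, df1, df2) gives the upper critical value for a two-sided test at level alpha, and pf(F, df1, df2, lower.tail = FALSE) gives the upper-tail probability of the observed statistic. Spell out lower.tail = FALSE rather than abbreviating it to F, because F stops meaning FALSE once the test statistic is stored in a variable named F.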

The F test is not always appropriate, as it is sensitive to departures from normality. Examine the variability in the two groups by comparing the sample variances and using boxplots to help decide which standard error is appropriate. In the case where the variances are unequal we use the same procedure, but the SE is estimated as:

SE = sqrt(s₁²/n₁ + s₂²/n₂)

using the n - 1 degrees of freedom from whichever sample is smaller as a conservative approximation (SAS or R would compute a more precise value).
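The unequal-variance case can be sketched in Python as below. The sketch assumes the standard error takes the usual unpooled form sqrt(s₁²/n₁ + s₂²/n₂) and uses the conservative degrees of freedom from the smaller sample. The function name and data are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def unequal_variance_t(x1, x2):
    """t statistic with an unpooled SE; returns (t, SE, conservative df)."""
    n1, n2 = len(x1), len(x2)
    # Each sample contributes its own variance; nothing is pooled
    se = sqrt(variance(x1) / n1 + variance(x2) / n2)
    t_stat = (mean(x1) - mean(x2)) / se
    df_conservative = min(n1, n2) - 1  # n - 1 from the smaller sample
    return t_stat, se, df_conservative


# Hypothetical data for illustration only
group_a = [4, 6, 8]
group_b = [1, 3, 5, 7]
t_stat, se, df = unequal_variance_t(group_a, group_b)
# se = sqrt(4/3 + (20/3)/4) = sqrt(3)
```

Using the smaller sample's n - 1 generally gives fewer degrees of freedom than the value software reports, so the resulting test errs on the cautious side.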

ANOVA

Terminology:

Factor - category/grouping variable
Level - individual group of the factor
Balanced design - same number of individuals in each level

The general data configuration is k population groups, where group i has n_i observations (i = 1, ..., k); the group sizes can be the same or different.

Assumptions:

Observations are independent
Data are random samples from k independent populations
Within each population the dependent variable is normally distributed
The population variance of the dependent variable is equal in all groups.

H0: The k population means are equal

Ha: The k population means are not all equal (at least one differs from the others)

Recall that the sample variance of Y is:

s² = Σ(yᵢ - ȳ)² / (n - 1)

The numerator is the "Total variability" or the "Total sum of squares" (SST)

In ANOVA we split the SST into two components:

Variability due to differences between the groups (SS Between Groups)
Variability due to differences between individual y values within the groups (SS Within Group)

Which can also be expressed as:

SS Total = SS Within Groups + SS Between Groups
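The decomposition can be checked numerically. The Python sketch below assumes the usual definitions: SS Total sums squared deviations of every observation from the grand mean, SS Within sums squared deviations from each observation's own group mean, and SS Between weights each squared difference between a group mean and the grand mean by the group size. The function name and data are made up for illustration.

```python
from statistics import mean


def sums_of_squares(groups):
    """Return (SS Total, SS Within, SS Between) for a list of groups."""
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)
    # Every observation against the grand mean
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    # Every observation against its own group mean
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    # Each group mean against the grand mean, weighted by group size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_total, ss_within, ss_between


# Hypothetical data: group means 4, 7, 10 and grand mean 7
groups = [[2, 4, 6], [5, 7, 9], [8, 10, 12]]
ss_total, ss_within, ss_between = sums_of_squares(groups)
# ss_total = 78, ss_within = 24, ss_between = 54, and 24 + 54 = 78
```

The example happens to be balanced, but nothing in the function requires equal group sizes.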

R² is the proportion of variability explained by the difference between groups:

R² = SS Between / SS Total
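As a small worked example, R² can be computed directly from grouped data. This Python sketch uses hypothetical data and a hypothetical helper name, and assumes SS Between weights each squared group-mean deviation by the group size.

```python
from statistics import mean


def r_squared(groups):
    """Proportion of total variability explained by group differences."""
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    return ss_between / ss_total


# Hypothetical data: SS Between = 54 and SS Total = 78
groups = [[2, 4, 6], [5, 7, 9], [8, 10, 12]]
r2 = r_squared(groups)   # 54 / 78, roughly 0.69
```

An R² near 1 means most of the variability lies between the group means, while a value near 0 means most of it lies within the groups. The function is undefined when every observation is identical, since SS Total is then 0.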