GLM for Count Data
Generalized linear models (GLMs) for count data are regression techniques for modeling discrete outcomes that count how often an event occurs, particularly when the event is relatively rare. A common distribution for such a random variable is the Poisson distribution.
The probability that a variable Y with a Poisson distribution is equal to a count value y is
$$ P(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \ldots $$
where λ is the average count, called the rate.
The Poisson distribution has a mean which is equal to the variance:
E(Y) = Var(Y) = \(\lambda\)
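As a quick numerical check of the formula and of the mean-equals-variance property, the sketch below evaluates the probability mass function directly. The rate value and the truncation point of the support are illustrative assumptions, not values from these notes.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for Y ~ Poisson(lam), computed from the formula above."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

lam = 3.0  # hypothetical rate, chosen only for illustration

# The support is infinite; truncating at 100 is an assumption that is
# harmless here because the tail mass beyond 100 is negligible for lam = 3.
support = range(101)
probs = [poisson_pmf(y, lam) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.6f}, variance: {variance:.6f}")
```

The total should be numerically indistinguishable from 1, and the mean and variance should both match the chosen rate.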
The mean number of events is the rate of incidence multiplied by the time elapsed.
Because of this property, the Poisson distribution also has the following additive properties:
- If Y_1 ~ Poisson(λ_1) and Y_2 ~ Poisson(λ_2) are independent, where Y_1 and Y_2 are the number of subjects in groups 1 and 2, then Y_1 + Y_2 ~ Poisson(λ_1 + λ_2).
- This generalizes to n groups: if n independent counts Y_i are measured and each Y_i has the same expected number of events λ, then Y_1 + Y_2 + ... + Y_n ~ Poisson(nλ). In this situation λ is interpretable as the average number of events per group.
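The additive property for two groups can be checked numerically. The sketch below computes the distribution of Y_1 + Y_2 by convolving the two Poisson probability mass functions, which is valid under the independence assumption, and compares it with Poisson(λ_1 + λ_2). The two rates are hypothetical.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for Y ~ Poisson(lam)."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def sum_pmf(s: int, lam1: float, lam2: float) -> float:
    """P(Y1 + Y2 = s) for independent Poisson counts, by direct convolution."""
    return sum(poisson_pmf(k, lam1) * poisson_pmf(s - k, lam2) for k in range(s + 1))

lam1, lam2 = 2.0, 3.5  # hypothetical group rates

# Largest pointwise gap between the convolution and Poisson(lam1 + lam2)
max_diff = max(
    abs(sum_pmf(s, lam1, lam2) - poisson_pmf(s, lam1 + lam2))
    for s in range(30)
)
print(f"largest difference over s = 0..29: {max_diff:.2e}")
```

Any difference that appears should be at the level of floating-point rounding.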
Poisson Regression as a GLM
To specify a generalized linear model we need (1) the distribution of the outcome, (2) the link function, and (3) the variance function. For Poisson regression these are (1) Poisson, (2) log, and (3) identity, that is, the variance equals the mean.
Let's consider a binomial example from a previous class where:
$$ Y_{ij} \sim \text{Binomial}(N_{ij}, \lambda_{ij}) $$
Given a variable Y ~ Binomial(n, p), where n is the number of experiments and p the probability of success, this binomial distribution can be approximated well by the Poisson distribution with mean n*p when n is large and p is small.
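To see how the quality of this approximation depends on n and p, the sketch below compares the two probability mass functions for two hypothetical settings that share the same mean n*p = 5. The specific values of n and p, and the range of counts compared, are assumptions made for illustration.

```python
import math

def binomial_pmf(y: int, n: int, p: float) -> float:
    """P(Y = y) for Y ~ Binomial(n, p); zero outside 0..n."""
    if y < 0 or y > n:
        return 0.0
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for Y ~ Poisson(lam)."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def max_abs_diff(n: int, p: float, y_max: int = 50) -> float:
    """Largest pointwise gap between Binomial(n, p) and Poisson(n * p) on 0..y_max."""
    lam = n * p
    return max(abs(binomial_pmf(y, n, p) - poisson_pmf(y, lam)) for y in range(y_max + 1))

# Two hypothetical settings with the same mean n * p = 5
coarse = max_abs_diff(n=10, p=0.5)      # small n, large p
fine = max_abs_diff(n=1000, p=0.005)    # large n, small p

print(f"n=10,   p=0.5:   max difference = {coarse:.5f}")
print(f"n=1000, p=0.005: max difference = {fine:.5f}")
```

The gap should be much smaller in the second setting, which is the regime where the approximation is normally used.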
Applying this approximation to the binomial example, we can assume that the distribution of Y approximately follows:
$$ Y_{ij} \sim \text{Poisson}(N_{ij} \lambda_{ij}) $$
and thus
$$ \log(E(Y_{ij})) = \log(N_{ij}) + \log(\lambda_{ij}) $$
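In this decomposition log(N_ij) is a known quantity, so in a regression it enters as a fixed term with coefficient 1 (usually called an offset) and only log(λ_ij) is modeled. The sketch below illustrates the idea for the simplest possible case, an intercept-only model with a single common rate, fitted by Newton's method on the Poisson log-likelihood. The data, the variable names, and the intercept-only simplification are all assumptions for illustration and are not taken from these notes.

```python
import math

# Hypothetical data: event counts and the corresponding group sizes N
counts = [2, 5, 1, 8]
group_sizes = [100, 250, 80, 400]

# log(N) enters the linear predictor as a fixed offset (coefficient 1)
offsets = [math.log(n) for n in group_sizes]

# Model: log(E(Y_i)) = offset_i + beta0, so exp(beta0) is the common rate
beta0 = 0.0
for _ in range(50):
    mu = [math.exp(o + beta0) for o in offsets]
    score = sum(y - m for y, m in zip(counts, mu))  # first derivative of log-likelihood
    information = sum(mu)                           # negative second derivative
    step = score / information
    beta0 += step
    if abs(step) < 1e-12:
        break

rate_hat = math.exp(beta0)
fitted = [math.exp(o + beta0) for o in offsets]

print(f"estimated rate: {rate_hat:.6f}")
print(f"total events / total group size: {sum(counts) / sum(group_sizes):.6f}")
```

For this intercept-only case the estimated rate should agree with the total number of events divided by the total group size, which is a useful sanity check before adding covariates to the linear predictor.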