# Module 5: Multivariate Normal Distribution

A variable X follows a discrete probability distribution if the possible values of X are either:

- A finite set
- A countably infinite sequence

p<sub>x</sub>(x<sub>i</sub>) = P(X=x<sub>i</sub>) is called the probability mass function (PMF)

- p<sub>x</sub>(x<sub>i</sub>) &gt;= 0 as it is a probability
- The PMF sums to 1 over all possible values of X
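As a quick illustration, the two PMF conditions above can be checked mechanically. The sketch below uses a hypothetical fair six-sided die (not part of the original notes) and exact fractions so that the sum is not affected by floating-point rounding.

```python
from fractions import Fraction

def is_valid_pmf(pmf):
    """Return True if every probability is non-negative and they sum to 1."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

# Hypothetical example: a fair six-sided die, each face with probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

print(is_valid_pmf(die_pmf))
```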

Recall that in a Discrete Probability Distribution:

[![image-1660918683481.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660918683481.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660918683481.png)

In a Continuous Probability Distribution:

[![image-1660918718518.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660918718518.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660918718518.png)

The two cases differ because a discrete random variable takes no values in between the points of its domain.

### Moment Generating Function

Moments are expected values of powers of X, such as E(X), E(X<sup>2</sup>) (which gives the variance through V(X) = E(X<sup>2</sup>) - (E(X))<sup>2</sup>), E(X<sup>3</sup>), etc. They can also be calculated using the Moment Generating Function (MGF):

[![image-1660918098650.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660918098650.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660918098650.png)

The rth moment of X, E(X<sup>r</sup>) can be obtained by differentiating M<sub>x</sub>(t) r times with respect to t and setting t=0

- M<sub>x</sub>(0) = 1
- M<sup>I</sup><sub>x</sub>(0) = E(X)
- M<sup>II</sup><sub>x</sub>(0) = E(X<sup>2</sup>) -&gt; V(X) = M<sup>II</sup><sub>x</sub>(0) - (M<sup>I</sup><sub>x</sub>(0))<sup>2</sup>
- In general, M<sub>x</sub><sup>(r)</sup>(0) = E(X<sup>r</sup>)

In short, the nth moment is the nth derivative of the MGF evaluated at t = 0.

Uniqueness: if X and Y are two random variables and M<sub>x</sub>(t) = M<sub>y</sub>(t) when |t| &lt; h for some positive number h, then X and Y have the same distribution

Note: The MGF does not exist for all distributions (E(e<sup>tX</sup>) may be infinite)
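The differentiation rule can be checked numerically. The sketch below is only an illustration: it takes the Binomial MGF listed later in these notes, picks arbitrary values n = 10 and p = 0.3, and approximates the first two derivatives at t = 0 with central differences. The results should come out close to np = 3 and np(1 - p) = 2.1.

```python
import math

def binomial_mgf(t, n, p):
    """MGF of Binomial(n, p): (p*e^t + (1 - p))^n."""
    return (p * math.exp(t) + (1 - p)) ** n

def first_derivative_at_zero(f, h=1e-5):
    """Central-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2 * h)

def second_derivative_at_zero(f, h=1e-4):
    """Central-difference estimate of f''(0)."""
    return (f(h) - 2 * f(0.0) + f(-h)) / h**2

n, p = 10, 0.3  # illustrative values only

def mgf(t):
    return binomial_mgf(t, n, p)

mean = first_derivative_at_zero(mgf)            # approximates E(X) = np
second_moment = second_derivative_at_zero(mgf)  # approximates E(X^2)
variance = second_moment - mean**2              # approximates np(1 - p)

print(mean, variance)
```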

## Important Distributions

#### Normal Distribution

[![image-1661008157874.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008157874.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008157874.png)

X ~ N(μ, σ<sup>2</sup>) -infinity &lt; μ &lt; infinity , σ &gt; 0

- PDF:

[![image-1661008301108.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008301108.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008301108.png)

- E(X) = μ
- V(X) = σ<sup>2</sup>
- MGF:

 [![image-1661008355636.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008355636.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008355636.png)
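For a concrete feel, the sketch below writes out the standard normal density by hand and compares it with `statistics.NormalDist` from the Python standard library. The values μ = 10 and σ = 2 are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 10.0, 2.0  # illustrative parameters
dist = NormalDist(mu=mu, sigma=sigma)

# The hand-written density and the library density should agree.
print(normal_pdf(11.0, mu, sigma), dist.pdf(11.0))

# E(X) = mu and V(X) = sigma^2.
print(dist.mean, dist.variance)
```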

#### Binomial Distribution

[![image-1661007837488.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661007837488.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661007837488.png)

X ~ Binomial(n, p) 𝑝 ∈ \[0, 1\]

X = the number of successes in n trials when the probability of success in each trial is p.

We can think of X as the sum of n independent Bernoulli(p) random variables, with the same p for every X<sub>i</sub>

[![image-1660920861915.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660920861915.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660920861915.png)

- PMF:

 [![image-1660918544947.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660918544947.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660918544947.png)

- Expected value = E(X) = np
- Variance = V(X) = np(1-p)
- MGF = M<sub>x</sub>(t) = (pe<sup>t</sup> + (1-p))<sup>n</sup>
- Two discrete random variables are independent if: P(X = x &amp; Y = y) = P(X = x)\*P(Y=y)

Ex. A study which analyzed the prevalence of a disease in a population.
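The binomial formulas can be verified directly from the PMF. In this sketch the numbers are hypothetical: 20 people are sampled and each has the disease independently with probability 0.1. The mean and variance computed from the PMF should match np and np(1 - p).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 20, 0.1  # hypothetical sample size and prevalence

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(pmf)
mean = sum(x * q for x, q in enumerate(pmf))
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

print(total, mean, variance)  # close to 1, np = 2, np(1 - p) = 1.8
```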

#### Poisson Distribution

[![image-1661008056947.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008056947.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008056947.png)

X ~ Poisson(λ) λ &gt; 0

X = The number of occurrences of an event of interest.

- PMF:

 [![image-1660920679945.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660920679945.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660920679945.png)

- Expected value = E(X) = λ
- Variance = V(X) = λ
- MGF = M<sub>x</sub>(t) = e<sup>λ(e<sup>t</sup> - 1)</sup>

Poisson as an approximation of the Binomial Distribution

- If X ~ Binomial(n, p) and n -&gt; infinity, p -&gt; 0 such that np stays constant, then X is approximately Poisson(np)
- This assumes each event is independent
- Often used when analyzing rare diseases

Ex. Analyzing lung cancer in 1000 smokers and non-smokers. This is binomial but can be approximated by a Poisson distribution.
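The quality of the approximation can be inspected by comparing the two PMFs side by side. The sketch below assumes n = 1000 and a made-up event probability p = 0.003, so that np = 3. These numbers are for illustration only and do not come from any study.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

n, p = 1000, 0.003  # hypothetical: large n, small p
lam = n * p         # Poisson rate used for the approximation

# Largest pointwise difference between the two PMFs over x = 0..20.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(21))

for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
print("max gap:", max_gap)
```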

#### Geometric Distribution

[![image-1661008075865.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008075865.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008075865.png)

X ~ Geometric(p) 𝑝 ∈ (0, 1\]

If Y<sub>1</sub>, Y<sub>2</sub>, Y<sub>3</sub> ... are a sequence of independent Bernoulli(p) random variables then the number of failures before the first success, X, follows a Geometric distribution.

- PMF = P(X = x) = p(1-p)<sup>x</sup>
- Expected value = E(X) = (1-p)/p
- Variance = V(X) = (1-p)/p<sup>2</sup>
- MGF = M<sub>x</sub>(t) = p / (1 - (1 - p)e<sup>t</sup>)

Ex. We want to know the number of flips that land on tails before a coin first lands on heads.
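For a fair coin (p = 0.5, an assumed value), the formulas above give E(X) = 1 and V(X) = 2. The sketch below confirms this by summing the PMF over a long but finite range of x, which is an approximation to the infinite sum.

```python
def geometric_pmf(x, p):
    """P(X = x): x failures before the first success."""
    return p * (1 - p) ** x

p = 0.5               # fair coin, heads counts as success
support = range(200)  # truncation of the infinite support; the tail is negligible here

total = sum(geometric_pmf(x, p) for x in support)
mean = sum(x * geometric_pmf(x, p) for x in support)
variance = sum((x - mean) ** 2 * geometric_pmf(x, p) for x in support)

print(total, mean, variance)  # close to 1, (1 - p)/p = 1, (1 - p)/p^2 = 2
```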

#### Hyper-Geometric Distribution

[![image-1661008089863.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008089863.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008089863.png)

X ~ Hypergeometric(N, K, n)

Suppose a finite population of size N contains two mutually exclusive types of items: K successes and N-K failures. If n items are randomly chosen *without* replacement, X is the number of successes chosen.

- PMF:

 [![image-1660921389776.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660921389776.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660921389776.png)

- Expected value = E(X) = nK / N
- Variance = V(X) = ((nK) / N) \* ((N-K) / N) \* ((N - n) / (N - 1))

Ex. A bag has 7 red beads and 13 white beads. If 5 are drawn *without* replacement what is the probability exactly 4 are red?
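The bead example can be worked out with the standard hypergeometric PMF, P(X = x) = C(K, x) C(N-K, n-x) / C(N, n), taking N = 20, K = 7 and n = 5. The sketch below uses exact fractions, and also checks the expected value nK / N.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(X = x) when n items are drawn without replacement from N items, K of them successes."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

N, K, n = 20, 7, 5  # 20 beads in total, 7 red, 5 drawn

prob_four_red = hypergeom_pmf(4, N, K, n)
mean = sum(x * hypergeom_pmf(x, N, K, n) for x in range(n + 1))

print(prob_four_red, float(prob_four_red))  # 455/15504, roughly 0.029
print(mean)                                 # 7/4, which equals nK / N
```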

#### Uniform Distribution

[![image-1661008106329.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008106329.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008106329.png)

All outcomes are equally likely. The distribution can be discrete or continuous.

X ~ Uniform(a, b) a &lt; b

- PDF:

 [![image-1660922154972.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660922154972.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660922154972.png)

- E(X) = (a + b)/2
- V(X) = (b - a)<sup>2</sup> / 12
- CDF = F(x) = (x - a) / (b - a), a&lt;=x&lt;=b
- MGF:

 [![image-1660922288816.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660922288816.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660922288816.png)

We use this distribution when we have no idea how the data is distributed.
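A small simulation can confirm the mean and variance formulas for the continuous case. The endpoints a = 2 and b = 8 and the seed are arbitrary choices; the sample statistics should land near (a + b)/2 = 5 and (b - a)<sup>2</sup>/12 = 3.

```python
import random

def uniform_cdf(x, a, b):
    """F(x) for X ~ Uniform(a, b)."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2.0, 8.0         # illustrative endpoints
rng = random.Random(0)  # fixed seed so the run is repeatable

samples = [rng.uniform(a, b) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

print(sample_mean, sample_var)
print(uniform_cdf(5.0, a, b))  # halfway through the interval
```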

#### Log-Normal Distribution

[![image-1661008395275.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008395275.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008395275.png)

X ~ Lognormal(μ , σ<sup>2</sup>) -infinity &lt; μ &lt; infinity, σ &gt; 0

- PDF:

[![image-1661008422821.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008422821.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008422821.png)

- E(X) = exp(μ + σ<sup>2</sup>/2)
- Median = e<sup>μ</sup>
- V(X) = (e<sup>σ^2</sup> - 1) \* e<sup>2μ + σ^2</sup>, which equals (E(X))<sup>2</sup> \* (e<sup>σ^2</sup> - 1)
- log(X) ~ N(μ, σ<sup>2</sup>) - the log is normal
- These distributions are often skewed to the right

Ex. Amount of rainfall, production of milk by cows, or stock market fluctuations often follow log-normal distributions.
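The relationship between a log-normal variable and its logarithm can be seen in a simulation. The sketch below assumes μ = 0 and σ = 0.5 (arbitrary values) and uses `random.lognormvariate` from the Python standard library. The sample mean should be near exp(μ + σ<sup>2</sup>/2), the sample median near e<sup>μ</sup>, and the logs of the samples should have mean near μ and standard deviation near σ.

```python
import math
import random
import statistics

mu, sigma = 0.0, 0.5    # illustrative parameters of log(X)
rng = random.Random(1)  # fixed seed so the run is repeatable

samples = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]

sample_mean = statistics.fmean(samples)
sample_median = statistics.median(samples)

# Taking logs should recover a normal sample with mean mu and sd sigma.
logs = [math.log(s) for s in samples]
log_mean = statistics.fmean(logs)
log_sd = math.sqrt(sum((v - log_mean) ** 2 for v in logs) / len(logs))

print(sample_mean, math.exp(mu + sigma**2 / 2))
print(sample_median, math.exp(mu))
print(log_mean, log_sd)
```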

#### Gamma Distribution

[![image-1661008814686.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008814686.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008814686.png)

X ~ Gamma(α, λ) α &gt; 0 , λ &gt; 0

Used to model the waiting time until the α-th event occurs (when α is a positive integer).

- PDF:

[![image-1661008839904.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008839904.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008839904.png)

An alternate parameterization with α &gt; 0, 𝜃 = 1 / λ &gt; 0 is also supported by R:

[![image-1661009165290.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661009165290.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661009165290.png)

- E(X) = α / λ
- V(X) = α / λ<sup>2</sup>
- MGF:

 [![image-1661008903685.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661008903685.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661008903685.png)

Ex. Used to model time to failure or time to death.
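The rate and scale parameterizations are easy to mix up in software. As an illustration, Python's `random.gammavariate(alpha, beta)` takes a shape and a scale, so a rate λ has to be converted to 𝜃 = 1 / λ first. The values α = 3 and λ = 2 below are arbitrary; the sample mean and variance should land near α / λ = 1.5 and α / λ<sup>2</sup> = 0.75.

```python
import random

alpha, lam = 3.0, 2.0   # illustrative shape and rate
theta = 1.0 / lam       # scale parameter expected by random.gammavariate
rng = random.Random(3)  # fixed seed so the run is repeatable

samples = [rng.gammavariate(alpha, theta) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

print(sample_mean, alpha / lam)
print(sample_var, alpha / lam**2)
```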

#### Exponential Distribution

[![image-1661009476997.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661009476997.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661009476997.png)

A special case of the Gamma Distribution (α = 1)

X ~ Exponential(λ) λ &gt; 0

- PDF = f<sub>x</sub>(x) = λ e<sup>-λ x</sup> for x &gt; 0
- E(X) = 1 / λ
- V(X) = 1 / λ <sup>2</sup>
- CDF = F<sub>x</sub>(x) = 1 - e<sup>-λ x</sup>
- MGF = M<sub>x</sub>(t) = λ / (λ - t), t &lt; λ

Ex. The time between geyser eruptions.
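To make the geyser example concrete, suppose (hypothetically) that eruptions occur at a rate of λ = 0.5 per hour, so the mean wait is 1 / λ = 2 hours. The sketch below evaluates the CDF to get the probability of an eruption within one hour, and checks numerically that the slope of the CDF matches the PDF.

```python
import math

def exponential_pdf(x, lam):
    """f(x) = lam * exp(-lam * x) for x > 0."""
    return lam * math.exp(-lam * x) if x > 0 else 0.0

def exponential_cdf(x, lam):
    """F(x) = 1 - exp(-lam * x) for x > 0."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

lam = 0.5  # hypothetical rate: 0.5 eruptions per hour

p_within_1h = exponential_cdf(1.0, lam)

# The PDF is the derivative of the CDF; compare with a central difference at x = 1.
h = 1e-6
cdf_slope = (exponential_cdf(1.0 + h, lam) - exponential_cdf(1.0 - h, lam)) / (2 * h)

print(p_within_1h)  # 1 - exp(-0.5), about 0.39
print(cdf_slope, exponential_pdf(1.0, lam))
```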

#### Chi-Square Distribution

[![image-1661009672800.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661009672800.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661009672800.png)

Special case of the Gamma Distribution (α = k/2, λ = 1/2)

X ~ χ<sup>2</sup>(k), where k is a positive integer (degrees of freedom, "df")

- PDF:

[![image-1661009613379.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661009613379.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661009613379.png)

- E(X) = k
- V(X) = 2k
- MGF = (1 - 2t)<sup>-k / 2</sup>, t &lt; 1/2

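If you squared a standard normal variable (a Z score), the result would follow a chi-squared distribution with k = 1. More generally, if Z<sub>1</sub>, Z<sub>2</sub>... Z<sub>m</sub> are independent standard normal random variables, then the sum of their squares follows a chi-squared distribution with m degrees of freedom: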

[![image-1661009823858.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661009823858.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661009823858.png)

Very few real-world quantities follow a chi-square distribution. It is mainly used in hypothesis testing.
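The link between squared standard normals and the chi-square distribution can be illustrated by simulation. The sketch below sums k = 4 squared draws from N(0, 1) many times (k and the seed are arbitrary choices) and compares the sample mean and variance with k and 2k.

```python
import random

k = 4                   # illustrative degrees of freedom
rng = random.Random(7)  # fixed seed so the run is repeatable

def chi_square_draw(df):
    """One draw built as a sum of df squared standard normal variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

draws = [chi_square_draw(k) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)

print(sample_mean, sample_var)  # should be near k = 4 and 2k = 8
```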

#### Bivariate Normal Distribution  


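A bivariate normal distribution describes two random variables that are jointly normal. They are not necessarily independent: their dependence is captured by the covariance. Each variable is normally distributed on its own, and any linear combination of the two (for example, their sum) is also normally distributed.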

[![image-1660924679433.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660924679433.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660924679433.png)

σ<sub>12</sub> = Cov(X<sub>1</sub>, X<sub>2</sub>)

PDF:

[![image-1661012030416.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661012030416.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661012030416.png)
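One common way to generate a bivariate normal pair is to start from two independent standard normals and mix them using the correlation ρ = σ<sub>12</sub> / (σ<sub>1</sub>σ<sub>2</sub>). The sketch below uses that construction with arbitrary parameter values; it is an illustration, not something taken from the formulas pictured above. The sample covariance should land near σ<sub>12</sub>, and the variance of X<sub>1</sub> + X<sub>2</sub> near σ<sub>1</sub><sup>2</sup> + σ<sub>2</sub><sup>2</sup> + 2σ<sub>12</sub>.

```python
import math
import random

mu1, mu2 = 0.0, 1.0                  # illustrative means
sigma1, sigma2, rho = 1.0, 2.0, 0.6  # illustrative standard deviations and correlation
sigma12 = rho * sigma1 * sigma2      # covariance Cov(X1, X2)

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_pair():
    """Draw (X1, X2) by mixing two independent standard normals."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = mu1 + sigma1 * z1
    x2 = mu2 + sigma2 * (rho * z1 + math.sqrt(1 - rho**2) * z2)
    return x1, x2

pairs = [sample_pair() for _ in range(100_000)]
m1 = sum(x1 for x1, _ in pairs) / len(pairs)
m2 = sum(x2 for _, x2 in pairs) / len(pairs)
sample_cov = sum((x1 - m1) * (x2 - m2) for x1, x2 in pairs) / len(pairs)

# Variance of the sum X1 + X2.
sums = [x1 + x2 for x1, x2 in pairs]
mean_sum = sum(sums) / len(sums)
var_sum = sum((s - mean_sum) ** 2 for s in sums) / len(sums)

print(sample_cov, sigma12)
print(var_sum, sigma1**2 + sigma2**2 + 2 * sigma12)
```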

#### Function of a Discrete Random Variable

Suppose X is a discrete random variable and Y is a function of X. Y = g(X)

Then Y is also a random variable: P(Y = y) = P(g(X) = y)

[![image-1660923222954.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660923222954.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660923222954.png)
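In practice, the PMF of Y = g(X) is found by adding up the probabilities of every x that maps to the same y. The sketch below uses a made-up example, X uniform on {-2, -1, 0, 1, 2} with g(x) = x<sup>2</sup>, so that several values of x share the same y.

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_function(pmf_x, g):
    """PMF of Y = g(X): P(Y = y) is the sum of P(X = x) over all x with g(x) = y."""
    pmf_y = defaultdict(Fraction)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p
    return dict(pmf_y)

# Hypothetical example: X is uniform on {-2, -1, 0, 1, 2} and Y = X^2.
pmf_x = {x: Fraction(1, 5) for x in range(-2, 3)}
pmf_y = pmf_of_function(pmf_x, lambda x: x * x)

print(pmf_y)  # y = 0 has probability 1/5; y = 1 and y = 4 have 2/5 each
```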

#### Function of a Continuous Random Variable

Using the same setup as above, but assuming X and Y are continuous random variables:

The PDF = [![image-1660923084277.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660923084277.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660923084277.png)

The CDF = [![image-1660923112070.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660923112070.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660923112070.png)

If g is one-to-one (strictly increasing or decreasing), then g has an inverse g<sup>-1</sup>. In that case:

[![image-1660923301389.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660923301389.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660923301389.png)
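As a worked illustration, take X ~ N(0, 1) and Y = e<sup>X</sup>, so g<sup>-1</sup>(y) = ln y. The sketch assumes the usual change-of-variables rule, f<sub>Y</sub>(y) = f<sub>X</sub>(g<sup>-1</sup>(y)) |d/dy g<sup>-1</sup>(y)|, and checks it against a numerical derivative of the CDF F<sub>Y</sub>(y) = F<sub>X</sub>(ln y). The choice of g and the evaluation point y = 1.5 are arbitrary.

```python
import math
from statistics import NormalDist

x_dist = NormalDist(mu=0.0, sigma=1.0)  # distribution of X

def y_cdf(y):
    """F_Y(y) = P(exp(X) <= y) = F_X(ln y) for y > 0, since exp is increasing."""
    return x_dist.cdf(math.log(y))

def y_pdf(y):
    """f_Y(y) = f_X(ln y) * |d/dy ln y| = f_X(ln y) / y for y > 0."""
    return x_dist.pdf(math.log(y)) / y

# The density should equal the slope of the CDF; compare at one point.
y, h = 1.5, 1e-5
numeric_pdf = (y_cdf(y + h) - y_cdf(y - h)) / (2 * h)

print(y_pdf(y), numeric_pdf)
```

The resulting Y is log-normal, which matches the earlier remark that log(X) is normal for a log-normal X.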

### Properties of Expectation and Variance  


[![image-1661010254301.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661010254301.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661010254301.png)

### Discrete Multivariate Distributions

[![image-1661010303330.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661010303330.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661010303330.png)

### Continuous Multivariate Distributions

[![image-1661010356599.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1661010356599.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1661010356599.png)

### Covariance and Correlation

**Correlation** indicates how strong the linear relationship between two variables is:

[![image-1660924208472.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660924208472.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660924208472.png)

A positive correlation has ρ &gt; 0 and a negative correlation has ρ &lt; 0, where ρ = corr(X, Y). The covariance has the same sign as the correlation.

**Covariance**  provides information about how the variables vary together:

 cov(X, Y) = E\[(X - E(X))(Y - E(Y))\]

This is also equivalent to:

 cov(X, Y) = E(XY) - E(X)\*E(Y)

Thus if X and Y are independent:

 cov(X, Y) = corr(X, Y) = 0

However, cov(X, Y) = 0 does not imply independence unless X and Y are jointly normally distributed.
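A standard counterexample shows zero covariance without independence. In the sketch below (a constructed example, not from the notes), X is uniform on {-1, 0, 1} and Y = X<sup>2</sup>. Y is completely determined by X, yet cov(X, Y) = E(XY) - E(X)E(Y) comes out to exactly 0.

```python
from fractions import Fraction

# Joint PMF of (X, Y) with X uniform on {-1, 0, 1} and Y = X^2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expectation(f):
    """E[f(X, Y)] under the joint PMF."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)
e_xy = expectation(lambda x, y: x * y)
cov = e_xy - e_x * e_y

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), Fraction(0))

print(cov)                  # 0
print(p_x0_y0, p_x0 * p_y0)  # 1/3 versus 1/9, so X and Y are dependent
```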

**Conditional Expectation**  of X given Y = y, denoted E(X | Y = y):

[![image-1660924494462.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660924494462.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660924494462.png)

**Conditional variance**  can be defined similarly (use the conditional PMF or PDF)
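For discrete variables, both quantities follow from the conditional PMF P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y). The sketch below uses a small made-up joint PMF for two binary variables and computes E(X | Y = y) and V(X | Y = y) with exact fractions.

```python
from fractions import Fraction

# Hypothetical joint PMF P(X = x, Y = y) for two binary variables, keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def conditional_pmf_of_x(y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

def conditional_expectation(y):
    """E(X | Y = y) as the mean of the conditional PMF."""
    return sum(x * p for x, p in conditional_pmf_of_x(y).items())

def conditional_variance(y):
    """V(X | Y = y) as the variance of the conditional PMF."""
    mean = conditional_expectation(y)
    return sum((x - mean) ** 2 * p for x, p in conditional_pmf_of_x(y).items())

print(conditional_expectation(0), conditional_expectation(1))  # 1/2 and 3/4
print(conditional_variance(1))                                 # 3/16
```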