# Module 3: Random Variables and Normal Distributions

A **variable** is a measurement or characteristic on which individual observations are made. A **random variable** is a variable whose possible values are outcomes of a random phenomenon. The **domain** of a variable is the set of all possible values it can take.

A discrete random variable takes its values from a finite set or a countably infinite sequence.

For a discrete random variable X, p<sub>x</sub>(x<sub>i</sub>) = P(X = x<sub>i</sub>) is called the **Probability Mass Function** (PMF).

- 0 &lt;= p<sub>x</sub>(x<sub>i</sub>) &lt;= 1, as it is a probability
- The PMF summed over all possible values of X equals 1
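The two PMF properties can be checked numerically. The sketch below assumes a fair six-sided die as an illustrative example (it is not from the notes); `x` and `p` are hypothetical names.

```r
# Hypothetical example: PMF of a fair six-sided die
x <- 1:6            # possible values of X
p <- rep(1 / 6, 6)  # p(x_i) = P(X = x_i)

# Property 1: every mass is a valid probability
all_valid <- all(p >= 0 & p <= 1)

# Property 2: the masses sum to 1 (compared with a tolerance
# because of floating-point arithmetic)
sums_to_one <- isTRUE(all.equal(sum(p), 1))

all_valid
sums_to_one
```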

A continuous random variable can take any value on a numerical scale, such as all real numbers in the interval (0, +infinity). If we plotted the data as a histogram, we would see its outline smooth into a curve as the number of data points approaches infinity and the bins become narrower.

[![image-1660746198461.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660746198461.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660746198461.png)

This density curve, f<sub>x</sub>(x), is called the **Probability Density Function** (PDF).

- f<sub>x</sub>(x) &gt;= 0 but can be greater than 1
- The integral of f<sub>x</sub>(x) over the domain of X is 1
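To see that a density can exceed 1 while its total area is still 1, here is a small sketch. It assumes a normal density with mean 0 and SD 0.1, parameters chosen only for illustration.

```r
# Hypothetical parameters: normal density with mean 0 and SD 0.1
f <- function(x) dnorm(x, mean = 0, sd = 0.1)

# Height of the density at its peak. This is a density value,
# not a probability, so it is allowed to exceed 1 (about 3.99 here).
peak <- f(0)

# Area under the curve over +/- 10 SD, which covers essentially
# all of the mass; the result is 1 up to numerical error.
area <- integrate(f, lower = -1, upper = 1)$value

peak
area
```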

The **Cumulative Distribution Function** (CDF) is defined as F<sub>x</sub>(x) = P(X &lt;= x). It is:

- Non-decreasing
- The limit toward -infinity is 0, toward +infinity is 1
- For discrete random variables:

[![image-1660746730973.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660746730973.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660746730973.png)

- For continuous random variables:

[![image-1660746747252.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660746747252.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660746747252.png)
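As a rough illustration of both cases, the CDF can be built as a running sum of the PMF (discrete) or as an integral of the PDF up to x (continuous). The fair die and the standard normal used here are assumptions chosen for the example.

```r
# Discrete case: the CDF is the running sum of the PMF
# (hypothetical example: fair six-sided die)
x <- 1:6
p <- rep(1 / 6, 6)
cdf_discrete <- cumsum(p)
p_at_most_4 <- cdf_discrete[x == 4]  # P(X <= 4) = 4/6

# Continuous case: the CDF is the integral of the PDF up to x
# (standard normal, evaluated at x = 1)
cdf_by_integration <- integrate(dnorm, lower = -Inf, upper = 1)$value
cdf_builtin <- pnorm(1)  # same quantity from the built-in CDF

p_at_most_4
cdf_by_integration
cdf_builtin
```

Both versions are non-decreasing and approach 1 at the upper end of the domain, matching the properties listed above.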

The **Expected Value** of a random variable is the average of its possible values weighted by their probabilities. It is also called the <span style="text-decoration: underline;">mean</span> and is denoted by μ.

- For discrete random variables

[![image-1660746888477.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660746888477.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660746888477.png)

- For continuous random variables

[![image-1660746926437.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660746926437.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660746926437.png)
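A minimal sketch of both definitions, assuming a fair die for the discrete case and a Normal(mean = 10, SD = 2) variable for the continuous case (both are illustrative choices, not from the notes):

```r
# Discrete: sum of each value times its probability (fair die)
x <- 1:6
p <- rep(1 / 6, 6)
mu_discrete <- sum(x * p)  # 3.5

# Continuous: integral of x * f(x) over the domain.
# Hypothetical X ~ Normal(mean = 10, sd = 2); the limits span
# +/- 10 SD around the mean, which holds essentially all the mass.
integrand <- function(x) x * dnorm(x, mean = 10, sd = 2)
mu_continuous <- integrate(integrand, lower = -10, upper = 30)$value

mu_discrete
mu_continuous  # close to 10
```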

A generalization of the expected value is the **r<sup>th</sup> moment of a random variable**, E(X<sup>r</sup>).

- For discrete random variables:

[![image-1660748084163.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660748084163.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660748084163.png)

- For continuous random variables:

[![image-1660748094182.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660748094182.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660748094182.png)

The first moment of a random variable is the expected value (mean). The r<sup>th</sup> moment of a random variable about the mean, also called the r<sup>th</sup> central moment, is defined as: E\[(X - μ)<sup>r</sup>\]

- The first central moment = 0
- The second central moment is the <span style="text-decoration: underline;">variance</span> denoted as 𝜎<sup>2</sup>

The **variance** measures the spread around the mean of a random variable: Var(X) = E\[(X - μ)<sup>2</sup>\]

- Also equivalent to Var(X) = E(X<sup>2</sup>) - \[E(X)\]<sup>2</sup>
- The <span style="text-decoration: underline;">standard deviation</span> is the square root of the variance
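The moment definitions and the variance shortcut can be checked numerically. This sketch assumes a fair six-sided die; `raw_moment` and `central_moment` are hypothetical helper names introduced only for this example.

```r
# Hypothetical example: fair six-sided die
x <- 1:6
p <- rep(1 / 6, 6)

# r-th moment E(X^r) and r-th central moment E[(X - mu)^r]
raw_moment <- function(r) sum(x^r * p)
mu <- raw_moment(1)  # the first moment is the mean
central_moment <- function(r) sum((x - mu)^r * p)

first_central <- central_moment(1)      # 0, up to rounding error
var_definition <- central_moment(2)     # E[(X - mu)^2]
var_shortcut <- raw_moment(2) - mu^2    # E(X^2) - [E(X)]^2
sd_x <- sqrt(var_definition)            # standard deviation

c(var_definition, var_shortcut, sd_x)
```

The two variance calculations agree (35/12 for the fair die), which is the equivalence stated in the list above.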

The **normal distribution:**

- is a continuous distribution
- can be expressed by a formula
- also called Gaussian distribution
- is a theoretical model for a population distribution that approximates the distribution of a number of measurement variables
- is appropriate for many continuous measurements, but not all of them
- is symmetric about the mean (i.e., P(X &gt; μ) = P(X &lt; μ) = 0.5)
- is completely characterized by its mean and variance

The 68/95/99.7 rule for a normal distribution:

- 68.27% of the data falls within 1 SD of the mean
- 95.45% of the data falls within 2 SD of the mean
- 99.73% of the data falls within 3 SD of the mean
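These percentages can be reproduced from the standard normal CDF. The sketch below computes P(-k &lt; Z &lt; k) for k = 1, 2, 3 with `pnorm`; the variable names are illustrative.

```r
# Probability that a normal variable lies within k SD of its mean,
# computed on the standard normal scale
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)

round(100 * within_k_sd, 2)  # 68.27 95.45 99.73
```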

The standard normal random variable, denoted Z, is measured in SD units from the mean.

Z has μ = 0 and SD = 1. Any normal random variable X with mean μ and SD 𝜎 can be standardized with:

[![image-1660749009432.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660749009432.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660749009432.png)

By converting to Z-scores we can easily compare the probabilities of events in two different normal distributions.
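As an illustration, the sketch below standardizes two scores with Z = (X - μ) / 𝜎 and compares them. The exams, scores, means, and SDs are made-up values for the example.

```r
# Hypothetical scores from two exams with different normal distributions
z_a <- (85 - 70) / 10     # exam A: score 85, mean 70, SD 10  -> z = 1.5
z_b <- (620 - 500) / 100  # exam B: score 620, mean 500, SD 100 -> z = 1.2

# Probability of scoring higher, read off the common Z scale
p_above_a <- pnorm(z_a, lower.tail = FALSE)
p_above_b <- pnorm(z_b, lower.tail = FALSE)

# The same probability computed on the original scale of exam A
p_above_a_direct <- pnorm(85, mean = 70, sd = 10, lower.tail = FALSE)

c(z_a = z_a, z_b = z_b)
c(p_above_a, p_above_b)
```

Because z_a is larger than z_b, the score on exam A is the more unusual of the two, even though the raw numbers are on different scales.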

The k<sup>th</sup> percentile is defined as the score that has k percent of the scores below it. For example, the 90th percentile is the score that has 90% of the scores below it. For a normal variable, we can compute percentiles with X = μ + Z𝜎, where Z is the corresponding percentile of the standard normal distribution.
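A short sketch of the percentile formula X = μ + Z𝜎, assuming a normal variable with mean 100 and SD 15 (hypothetical parameters):

```r
# Hypothetical parameters
mu <- 100
sigma <- 15

# 90th percentile of the standard normal (about 1.28)
z90 <- qnorm(0.90)

# Convert back to the original scale with X = mu + Z * sigma
x90_manual <- mu + z90 * sigma

# The same percentile computed directly
x90_direct <- qnorm(0.90, mean = mu, sd = sigma)

# Check: 90% of the distribution lies below this score
below <- pnorm(x90_direct, mean = mu, sd = sigma)

c(x90_manual, x90_direct, below)
```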

The probability density function for normal curves:

[![image-1660750388639.png](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/scaled-1680-/image-1660750388639.png)](https://bookstack.mitchellhenschel.com/uploads/images/gallery/2022-08/image-1660750388639.png)

For the standard normal curve, where the mean is 0 and the SD is 1, this simplifies to f(z) = (1 / √(2π)) e<sup>-z<sup>2</sup>/2</sup>.
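The normal density can be written out by hand and compared with R's built-in `dnorm`. In this sketch `normal_pdf` is a hypothetical helper, and the test points and parameters are arbitrary.

```r
# Normal density written out from its formula:
# f(x) = exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
normal_pdf <- function(x, mu = 0, sigma = 1) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

# Compare with dnorm at a few arbitrary points
x <- c(-2, 0, 1.5, 4)
matches_dnorm <- isTRUE(all.equal(normal_pdf(x, mu = 1, sigma = 2),
                                  dnorm(x, mean = 1, sd = 2)))

# Standard normal case: the peak height is 1 / sqrt(2 * pi), about 0.3989
standard_peak <- normal_pdf(0)

matches_dnorm
standard_peak
```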

#### Relevant R Functions

**qnorm**(p, μ, 𝜎) returns the percentile (quantile) of a normal distribution with mean μ and SD 𝜎 at cumulative probability p, where p is between 0 and 1 (e.g. 0.90 for the 90th percentile)

**dnorm**(x, μ, 𝜎) returns the height of the normal density function with mean μ and SD 𝜎 at the point x

**pnorm**(x, μ, 𝜎) returns the cumulative distribution function, P(X &lt;= x), of a normal distribution with mean μ and SD 𝜎 evaluated at x
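A brief usage sketch of the three functions together, assuming a normal variable with mean 50 and SD 8 (hypothetical values). It also shows that `qnorm` and `pnorm` are inverses of each other.

```r
# Hypothetical parameters
mu <- 50
sigma <- 8

# Density height at the mean (the peak of the curve)
height_at_mean <- dnorm(50, mean = mu, sd = sigma)

# P(X <= 58): 58 is one SD above the mean, so this is about 0.8413
p_below_58 <- pnorm(58, mean = mu, sd = sigma)

# P(42 < X <= 58) as a difference of two CDF values
p_between <- pnorm(58, mean = mu, sd = sigma) - pnorm(42, mean = mu, sd = sigma)

# qnorm undoes pnorm: feeding the probability back returns the score
score_back <- qnorm(p_below_58, mean = mu, sd = sigma)

c(height_at_mean, p_below_58, p_between, score_back)
```

By default `pnorm` returns the lower tail; passing `lower.tail = FALSE` gives P(X &gt; x) instead.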