Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer falls under section 107 of the Copyright Act of 1976, under which allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
Courtesy of Doxiadis, Apostolos, and Christos Papadimitriou. Logicomix: An epic search for truth. Bloomsbury Publishing USA, 2015.
Random trials and event spaces
A random trial (or experiment) is a procedure that yields one of the distinct outcomes that altogether form the sample or event space, denoted \( \Omega \). All possible outcomes constitute the universal event \( \Omega \), and subsets of \( \Omega \) define all events in the sample space.
A combination of multiple coin tosses leads to more possible results – with two coin flips, we define \[ \Omega = \{(H, H), (H, T), (T, H), (T, T)\}. \]
The simple probability model
The probability of attaining some outcome in a collection of outcomes \( A \subset \Omega \), assuming all outcomes in \( \Omega \) are equally likely, can be written as,
\[ \mathcal{P}(A) = \mathcal{P}(\omega \in A) = \frac{\text{The number of outcomes for which }\omega\in A}{\text{The total number of outcomes }\omega\in \Omega} \]
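As a concrete illustration (a minimal Python sketch; the names `omega` and `A` are ours, chosen only for this example), the two-coin-flip model can be enumerated and the counting formula applied directly:
```python
from itertools import product
from fractions import Fraction

# Sample space for two coin flips: all ordered pairs of H/T
omega = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

# Event A: at least one head appears
A = [outcome for outcome in omega if "H" in outcome]

# Simple (classical) probability model: favourable outcomes / total outcomes
prob_A = Fraction(len(A), len(omega))
print(prob_A)                           # 3/4
```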
Courtesy of Bin im Garten CC via Wikimedia Commons
Independence
\( A \) and \( B \) are independent by definition if and only if both of the following hold, \[ \begin{matrix} \mathcal{P}(A\vert B) = \mathcal{P}(A) & \text{and} & \mathcal{P}(B\vert A) = \mathcal{P}(B). \end{matrix} \]
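A small Python check of this definition in the two-coin-flip model (our own illustration, using the standard definition of conditional probability), with \( A \) the event that the first flip is heads and \( B \) the event that the second flip is heads:
```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))

def prob(event):
    """Classical probability: proportion of equally likely outcomes in the event."""
    return Fraction(len(event), len(omega))

A = [w for w in omega if w[0] == "H"]            # first flip is heads
B = [w for w in omega if w[1] == "H"]            # second flip is heads
A_and_B = [w for w in omega if w in A and w in B]

# Conditional probability P(A | B) = P(A and B) / P(B)
p_A_given_B = prob(A_and_B) / prob(B)
print(p_A_given_B == prob(A))                    # True: the two flips are independent
```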
Bayes' law
Let \( A,B\subset \Omega \) such that \( \mathcal{P}(A), \mathcal{P}(B) > 0 \). Then the probability of \( A\vert B \) can be “inverted” via Bayes' law as \[ \mathcal{P}(A \vert B) = \frac{\mathcal{P}(B \vert A) \mathcal{P}(A)}{\mathcal{P}(B)}. \]
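A minimal numerical sketch of Bayes' law (the probabilities below are hypothetical, chosen only for illustration):
```python
# Hypothetical numbers, chosen only to illustrate Bayes' law
p_A = 0.01              # prior probability of event A
p_B_given_A = 0.95      # probability of B when A holds
p_B_given_Ac = 0.05     # probability of B when A does not hold

# Law of total probability over A and its complement A^c
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes' law: "invert" the conditional probability
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)      # approximately 0.161
```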
Complementary probability
Let \( A \subset \Omega \) with complement \( A^\mathrm{c} := \Omega \setminus A \), then their probabilities are complementary \[ \mathcal{P}(A) = 1 - \mathcal{P}\left(A^\mathrm{c}\right). \]
Random variables
A random variable (RV) \( X \) is a map that assigns to each outcome \( \omega\in\Omega \) a real value \( X(\omega) \in \mathbb{R} \).
Prototypically, \( X \) may be considered an observable of the trial, though \( X \) in general may not be observable.
Such a model is needed because there may not be any intrinsic numerical meaning to the outcomes in the sample space \( \Omega \).
For example, \( X \) can be the number of heads in the outcome of two coin flips, where
\[ \Omega = \{ (H,H), (H,T), (T, H), (T,T)\}. \]
With a finite number \( k \) of possible outcomes, we can enumerate the possible values \( x_i \) of \( X \) for \( i \in \{1, \cdots , k\} \).
More generally, we might consider when a countably-infinite index set can enumerate all possible outcomes.
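As a small Python sketch (our own illustration, not part of the slides), the RV \( X \) is literally a function on \( \Omega \):
```python
from itertools import product

# Sample space for two coin flips
omega = list(product("HT", repeat=2))

# The RV X maps each outcome to the number of heads it contains
def X(outcome):
    return sum(1 for flip in outcome if flip == "H")

for w in omega:
    print(w, "->", X(w))    # e.g. ('H', 'T') -> 1
```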
Discrete random variables
A RV \( X \) is discrete if the values that \( X \) can attain, \( x_i \), can be enumerated either as a finite collection \( \{x_i\}_{i=1}^k \) or as a countably infinite collection \( \{x_i\}_{i=1}^\infty \).
A probability distribution is a complete description of the outcomes and associated probabilities of a random trial.
The distribution of a discrete RV is typically described by its probability mass function \( p(x_i) \) and the cumulative distribution function \( P(x) \).
Probability mass function
The probability mass function \( p \) of a discrete RV \( X \) is a function that returns the probability that \( X \) is exactly equal to some value. For an indexed possible value \( x_i \), \[ \begin{align} p(x_i) = \mathcal{P}(X=x_i). \end{align} \]
Cumulative distribution function
Let \( X \) be a discrete RV, then the mapping \[ P:\mathbb{R} \rightarrow [0,1] \] defined by \[ \begin{align} P(x) :&= \mathcal{P}(X \leq x) \\ &= \sum_{i \in \mathcal{I}, x_i \leq x} p(x_i) \end{align} \] is called the cumulative distribution function (cdf) of the RV \( X \).
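As a small Python sketch (not from the original slides; names such as `pmf` are ours), the mass function and cumulative distribution of the number of heads in two fair coin flips can be tabulated directly:
```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))
values = [sum(flip == "H" for flip in w) for w in omega]    # X evaluated on each outcome

# Probability mass function: p(x_i) = P(X = x_i)
counts = Counter(values)
pmf = {x: Fraction(c, len(omega)) for x, c in counts.items()}

# Cumulative distribution function: P(x) = sum of p(x_i) over x_i <= x
def cdf(x):
    return sum(p for xi, p in pmf.items() if xi <= x)

print(pmf)          # {0: 1/4, 1: 1/2, 2: 1/4} up to ordering
print(cdf(1))       # 3/4
```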
Expected value of a discrete RV
Let \( X \) be a discrete RV with a range of possible values \( \{x_i\}_{i \in \mathcal{I}} \) where \( \mathcal{I}\subset\mathbb{Z} \). The expected value is defined \[ \begin{align} \overline{x} := \mathbb{E}\left[ X\right] &:= \sum_{i\in \mathcal{I}} x_i \mathcal{P}(X=x_i)\\ &= \sum_{i\in \mathcal{I}} x_i p(x_i) \end{align} \]
Using the property that events associated to distinct observable values (\( X=x_i \)) are disjoint,
\[ \sum_{i\in \mathcal{I}} \mathcal{P}(X = x_i) =1 \] as seen by taking the probability of all events as a union.
\( \mathbb{E}\left[X\right] \) in the above represents a weighted average of all possible values that \( X \) might attain.
Suppose we were able to take infinitely many replications of the random trial and record the observed value of \( X \) in each one; the average of these recorded values would converge to \( \mathbb{E}\left[X\right] \) as the number of replications grows.
The expected value describes how the probability mass for the observable of the trial is centered.
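A short Python sketch of this weighted average for the two-coin-flip RV (our own illustration):
```python
from fractions import Fraction

# PMF of X = number of heads in two fair coin flips
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Expected value: probability-weighted average of the possible values
x_bar = sum(x * p for x, p in pmf.items())
print(x_bar)                        # 1
print(sum(pmf.values()) == 1)       # True: the probabilities sum to one
```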
However, a critical question is, “how much spread or dispersion will there be in the data around this center of mass?”
Variance of a discrete RV
Let \( X \) be a discrete RV with distinct values \( \{x_i \}_{i\in \mathcal{I}} \) and probability mass function \( p(x_i)=\mathcal{P}(X = x_i ) \in [0, 1] \) for \( i \in \mathcal{I} \). Then the variance of \( X \) is defined to be, \[ \begin{align} \sigma^2 := \mathrm{var}(X) &:= \mathbb{E}\left[\left\{X - \mathbb{E}\left(X\right)\right\}^2\right]\\ &=\sum_{i\in \mathcal{I}} \left\{x_i - \overline{x}\right\}^2 p(x_i) \end{align} \]
The two forms above can be shown equivalent by defining
\[ Y := X - \overline{x}, \] and applying the definition of expectation to \( Y^2 \).
Notice that the variance is a weighted average of the squared deviations of each possible observable value from the center of mass.
This means that the variance gives a non-negative measure of the average dispersion of the observable outcomes around the center of probability mass.
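Continuing the two-coin-flip sketch (our own illustration), the variance follows the same weighted-average pattern:
```python
from fractions import Fraction

# PMF of X = number of heads in two fair coin flips
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

x_bar = sum(x * p for x, p in pmf.items())               # expected value, equal to 1

# Variance: probability-weighted average of the squared deviations from x_bar
variance = sum((x - x_bar) ** 2 * p for x, p in pmf.items())
print(variance)                                           # 1/2
```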
Continuous random variables
Unlike discrete RVs, continuous RVs can take on an uncountably infinite number of possible values.
A good example to think of is if \( X \) is the daily high temperature in degrees Celsius.
This type of measurement differs fundamentally from counting, e.g., heads of coin flips.
Cumulative distribution function
Let \( X \) be a continuous RV, then the mapping \[ P:\mathbb{R} \rightarrow [0,1] \] defined by \( P(x) = \mathcal{P}(X \leq x) \), is called the cumulative distribution function (cdf) of the RV \( X \).
Probability density function
A mapping \( p: \mathbb{R} \rightarrow \mathbb{R}^+ \) is called the probability density function (PDF) of an RV \( X \) if:
- \( p(x) = \frac{\mathrm{d} P}{\mathrm{d}x} \) exists for all \( x\in \mathbb{R} \); and
- the density is integrable, i.e., \[ \int_{-\infty}^\infty p (x) \mathrm{d}x = 1. \]
Question: How can you use the definition of the PDF and the fundamental theorem of calculus \[ \begin{align} p(x) = \frac{\mathrm{d} P}{\mathrm{d}x} & & \text{ and }& & \int_{a}^b \frac{\mathrm{d}f}{\mathrm{d}x} \mathrm{d}x = f(b) - f(a) \end{align} \] to find another form for the CDF?
\[ \begin{align} \int_{s}^t p(x) \mathrm{d}x &= \int_{s}^t \frac{\mathrm{d} P}{\mathrm{d}x} \mathrm{d}x\\ &= P(t) - P(s) \\ & = \mathcal{P}(X \leq t) - \mathcal{P}(X \leq s) = \mathcal{P}(s < X \leq t) \end{align} \]
\[ \begin{align} \lim_{s\rightarrow - \infty} \int_{s}^t p(x) \mathrm{d} x & = \lim_{s \rightarrow -\infty} \mathcal{P}(s < X \leq t) \\ & = \mathcal{P}(X\leq t) = P(t) \end{align} \]
This is analogous to the property for discrete RVs, where we write,
\[ \begin{align} P(t) = \mathcal{P}(X \leq t) = \sum_{j \in \mathcal{I},\, x_j \leq t} \mathcal{P}(X = x_j) = \sum_{j \in \mathcal{I},\, x_j \leq t} p(x_j) \end{align} \]
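To make the relationship between the density and the CDF concrete, here is a small Python sketch (using the exponential density \( p(x) = e^{-x} \) for \( x \geq 0 \) purely as an example) comparing a Riemann-sum approximation of \( \int_s^t p(x)\mathrm{d}x \) with \( P(t) - P(s) \):
```python
import math

# Exponential density with rate 1 (chosen only as a concrete example)
def pdf(x):
    return math.exp(-x) if x >= 0 else 0.0

def cdf(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

# Approximate the integral of the density over (s, t] with a midpoint Riemann sum
def integrate(f, s, t, n=100_000):
    h = (t - s) / n
    return sum(f(s + (i + 0.5) * h) for i in range(n)) * h

s, t = 0.5, 2.0
print(integrate(pdf, s, t))     # approximately 0.4712
print(cdf(t) - cdf(s))          # matches: P(s < X <= t)
```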
Probability ranges for continuous RVs
For a continuous RV \( X \), for any \( x_1< x_2 \) \[ \mathcal{P}(x_1 \leq X \leq x_2 ) = \mathcal{P}(x_1 < X \leq x_2) = \mathcal{P}(x_1 \leq X < x_2) = \mathcal{P}(x_1 < X < x_2). \]
Expected value of a continuous RV
Let \( X \) be a continuous RV with a density function \( p(x) \), then the expected value of \( X \) is defined as \[ \overline{x} := \mathbb{E}\left[X\right] := \int_{-\infty}^{+\infty} x p(x)\mathrm{d}x. \]
Variance of a continuous RV
If the expectation of \( X \) exists, the variance is defined as \[ \begin{align} \sigma^2:= \mathrm{var} \left(X\right)& := \mathbb{E}\left[\left(X - \mathbb{E}\left[X\right]\right)^2\right] \\ &=\int_{-\infty}^\infty \left(x - \overline{x} \right)^2 p(x)\mathrm{d}x \end{align} \]
Once again, this is a measure of dispersion, averaging the squared deviation of each possible value from the mean, weighted by the probability density.
This quantity is also known as the second centered moment, as part of a general family of moments for the distribution.
General moments formula
Let \( X \) be a random variable with PDF \( p \), then the \( n \)-th moment is defined as \[ \begin{align} \mathbb{E}\left[X^n\right]=\int_{-\infty}^\infty x^n p(x) \mathrm{d}x \end{align} \]
In particular, we call the second moment \( \mathbb{E}\left[X^2\right] \) the mean-square.
In this way, just as the first moment is analogous to the center of mass, the second centered moment (the variance) is analogous to the moment of inertia about the center of mass.
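As a numerical sketch (again using the exponential density with rate 1 as a stand-in example), the first two moments can be approximated by quadrature:
```python
import math

# Exponential density with rate 1 (chosen only as a concrete example)
def pdf(x):
    return math.exp(-x) if x >= 0 else 0.0

def moment(n, upper=50.0, steps=200_000):
    """Approximate the n-th moment, the integral of x^n p(x), by a midpoint rule."""
    h = upper / steps
    return sum(((i + 0.5) * h) ** n * pdf((i + 0.5) * h) for i in range(steps)) * h

print(moment(1))    # first moment (mean), approximately 1
print(moment(2))    # second moment (mean-square), approximately 2
```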
It can be shown that the formula for variance can be reduced as follows:
\[ \begin{align} \mathrm{var}(X) = \mathbb{E}\left[X^2\right] - \mathbb{E}\left[X\right]^2, \end{align} \] i.e., as the difference between the mean-square and the mean squared.
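For completeness, the reduction follows from expanding the square and using linearity of the expectation:
\[ \begin{align} \mathrm{var}(X) &= \mathbb{E}\left[X^2 - 2 X\, \mathbb{E}\left[X\right] + \mathbb{E}\left[X\right]^2\right] \\ &= \mathbb{E}\left[X^2\right] - 2\, \mathbb{E}\left[X\right]^2 + \mathbb{E}\left[X\right]^2 \\ &= \mathbb{E}\left[X^2\right] - \mathbb{E}\left[X\right]^2 \end{align} \]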
Although this is a useful formula for simple calculations, it can be numerically unstable in floating point arithmetic: when the mean is large relative to the spread, the subtraction suffers from catastrophic cancellation.
More generally with multivariate distributions, we will consider a numerically stable approach to compute the second centered moment.
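The following Python sketch (with made-up data chosen to exaggerate the effect) illustrates the issue: the one-pass formula loses all accuracy when the mean is much larger than the spread, while the two-pass (centered) computation does not.
```python
import statistics

# Values with a huge mean and a tiny spread around it (hypothetical data)
data = [1.0e8 + d for d in (0.0, 0.1, 0.2, 0.3, 0.4)]

n = len(data)
mean = sum(data) / n
mean_square = sum(x * x for x in data) / n

one_pass = mean_square - mean ** 2                      # suffers catastrophic cancellation
two_pass = sum((x - mean) ** 2 for x in data) / n       # subtract the mean first, then square

print(one_pass)                     # visibly wrong (possibly zero or even negative)
print(two_pass)                     # close to the exact value 0.02
print(statistics.pvariance(data))   # library reference value
```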
Standard deviation of a RV
Let \( X \) be a RV (continuous or discrete) with variance \( \sigma^2 \). We define the standard deviation of \( X \) as \[ \mathrm{std}(X):=\sqrt{\mathrm{var}\left(X\right)} = \sigma. \]
Chebyshev’s Theorem
Let \( X \) be an integrable RV with finite expected value \( \overline{x} \) and finite non-zero variance \( \sigma^2 \). Then for any real number \( k > 0 \), \[ \begin{align} \mathcal{P}(\vert X -\overline{x}\vert > k \sigma ) \leq \frac{1}{k^2} \end{align} \]
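A quick Monte Carlo sanity check in Python (using a uniform RV purely as an example) confirms that the empirical exceedance frequency respects the bound:
```python
import random

# Monte Carlo check of Chebyshev's inequality for a uniform RV on [0, 1]
random.seed(0)
n = 100_000
samples = [random.random() for _ in range(n)]

mean = 0.5                      # expected value of Uniform(0, 1)
sigma = (1 / 12) ** 0.5         # standard deviation of Uniform(0, 1)

for k in (1.5, 2.0, 3.0):
    freq = sum(abs(x - mean) > k * sigma for x in samples) / n
    print(k, freq, "<=", 1 / k ** 2)    # the empirical frequency respects the bound
```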
While together the mean \( \overline{x} \) and the standard deviation \( \sigma \) give a picture of the center and dispersion of a probability distribution, we can analyze this in a different way.
For example, while the mean is the notion of the “center of mass”, we may also be interested in where the upper and lower \( 50\% \) of values are separated as a different notion of “center”.
More generally, for any univariate cumulative distribution function \( P \), and for \( 0 < q < 1 \), we can identify \( q \) with the fraction of the probability mass (the area under the density curve) lying at or below a given value.
Percentile of \( P \)
Let \( X \) be a random variable with CDF \( P \). The quantity \[ \begin{align} P^{-1}(q)=\inf \left\{x \vert P(x) \geq q \right\} \end{align} \] is called the theoretical \( q \)-th quantile or percentile of \( P \).
The “\( \inf \)” in the above refers to the infimum (greatest lower bound) of the set on the right-hand side.
We will usually refer to the \( q \)-th quantile as \( x_q \).
\( P^{-1} \) is called the quantile function.
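As an illustration (assuming the exponential CDF \( P(x) = 1 - e^{-x} \) as a concrete example), the quantile function can be computed either in closed form or directly as the infimum in the definition:
```python
import math

# CDF of the exponential distribution with rate 1 (a concrete example)
def cdf(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

# Closed-form quantile function: invert P(x) = 1 - exp(-x)
def quantile(q):
    return -math.log(1.0 - q)

# Numerical version of inf{x : P(x) >= q}, scanning a fine grid from the left
def quantile_numeric(q, grid_step=1e-4, x_max=50.0):
    x = 0.0
    while x < x_max:
        if cdf(x) >= q:
            return x
        x += grid_step
    return float("inf")

q = 0.5
print(quantile(q))              # median: ln(2), approximately 0.6931
print(quantile_numeric(q))      # agrees to within the grid spacing
```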
The mode of \( p \)
Let \( X \) be a RV with PDF \( p \). We say \( x^\ast \) is a (local) mode of \( p \) provided that there is a neighborhood \( \mathcal{N} \) of \( x^\ast \) such that for all \( x\in \mathcal{N} \) with \( x \neq x^\ast \), \[ \begin{align} p\left(x^\ast \right) > p(x). \end{align} \]
Courtesy of Diva Jain CC via Wikimedia Commons.