FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.
The outcomes of a probabilistic experiment can be described by a random variable \( X \).
If \( \Omega \) is the sample space, a random variable takes an outcome \( \omega\in\Omega \) to a real value in \( \mathbb{R} \); for example, for two coin flips the sample space is
\[ \Omega = \{ (H,H), (H,T), (T,H), (T,T) \} \]
With a finite number \( k \) of possible outcomes, e.g., the outcome of two fair coin flips, we can index the possible values \( x_j \) of \( X \) by \( j \in \{1, \cdots , k\} \).
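As a minimal sketch of this idea, we can enumerate the two-flip sample space in R and apply a random variable to it. The choice of \( X \) as the number of heads is an assumption made for illustration only; it is not fixed by the text above.

```r
# Enumerate the sample space of two coin flips as ordered pairs.
omega <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     stringsAsFactors = FALSE)

# Hypothetical random variable (our assumption): X = number of heads.
X <- rowSums(omega == "H")

# Possible values x_j and their probabilities for a fair coin,
# where each of the four outcomes has probability 1/4.
pmf <- table(X) / nrow(omega)
print(pmf)
```

Here the possible values are \( x_1 = 0 \), \( x_2 = 1 \), \( x_3 = 2 \), so \( k = 3 \) even though the sample space has four outcomes.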
More generally, we will say that a random variable \( X \) is discrete if the values \( x_j \) that \( X \) can attain can be enumerated, either as a finite collection \( \{x_j\}_{j=1}^k \) or as an infinite collection \( \{x_j\}_{j=1}^\infty \).
This differs from continuous random variables, for which it is impossible to index the entire collection of possible values as above.
Intuitively, the difference between a discrete random variable and a continuous random variable has to do with the units in which they are measured.
A continuous random variable has a unit of measurement that can be arbitrarily sub-divided.
Discrete random variables, however, usually take the form of a counting unit.
The distribution of a discrete rv is described by its probability mass function \( f(x_j) \) and the cumulative distribution function \( F(x_j) \).
The probability mass function \( f \) of a discrete rv \( X \) is a function that returns the probability that \( X \) is exactly equal to some value, i.e., for some indexed possible outcome \( x_j \), \[ \begin{align} f(x_j) = P(X=x_j). \end{align} \]
The cumulative distribution function \( F \) (cdf) is defined for variables with the ordering \( x_j < x_{j+1} \), and returns the probability that \( X \) is smaller than or equal to some value: \[ F(x_j) = P( X \leq x_j). \]
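To make the relation between the two functions concrete, here is a small sketch in R. The values and probabilities are an illustrative assumption (the number of heads in two fair coin flips), not data from the slides.

```r
# Illustrative discrete rv (an assumption): number of heads in two fair flips.
x <- c(0, 1, 2)            # ordered possible values x_1 < x_2 < x_3
f <- c(0.25, 0.50, 0.25)   # pmf values f(x_j) = P(X = x_j)

# The cdf at x_j accumulates the pmf over all values up to and including x_j.
cdf <- cumsum(f)

print(data.frame(x = x, pmf = f, cdf = cdf))
```

The last entry of the cdf is \( 1 \), since every possible value is smaller than or equal to the largest one.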
Using the notion of the probability mass function, we can define the expected value of a discrete random variable as \[ \begin{align} \mathbb{E}\left[ X\right] &= \sum_{j=1}^k x_j P(X=x_j) \end{align} \]
Note that \( \sum_{j=1}^k P(X = x_j) =1 \); this can be shown by taking the probability of all events as a union and using the property that these simple events are disjoint.
Therefore, \( \mathbb{E}\left[X\right] \) in the above represents a weighted average of all possible values that \( X \) might attain, where higher weight is given to higher-probability outcomes.
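The weighted-average reading can be sketched directly in R. The values `x` and probabilities `f` below are an illustrative assumption (the number of heads in two fair coin flips).

```r
# Illustrative values and probabilities (an assumption, not from the slides).
x <- c(0, 1, 2)
f <- c(0.25, 0.50, 0.25)

# The probabilities must sum to one for f to be a valid pmf.
stopifnot(isTRUE(all.equal(sum(f), 1)))

# Expected value: each value weighted by its probability.
expected <- sum(x * f)
print(expected)

# Base R's weighted.mean() computes the same weighted average.
print(weighted.mean(x, w = f))
```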
The expected value can also be understood as the mean in a special sense:
The expected value gives an important interpretation of a random variable in terms of how the mass is centered, or equivalently, the region around which we believe most observations will lie.
However, another critical notion is how much spread or dispersion there will be in the data around this center of mass.
Let \( X \) be a discrete rv with distinct values \( \{x_1 , \cdots , x_k \} \) and probability mass function \( P(X = x_j ) \in [0, 1] \) for \( j \in \{1, \cdots , k\} \).
Then the variance of \( X \) is defined to be, \[ \begin{align} \mathrm{var}(X) &= \mathbb{E}\left[\left\{X - \mathbb{E}\left(X\right)\right\}^2\right]\\ &=\sum_{j=1}^k \left\{x_j - \mathbb{E}\left(X\right)\right\}^2 P(X=x_j) \end{align} \]
These same ideas generalize to when \( k \rightarrow \infty \) provided that the sums remain finite (i.e., they are convergent in the same sense as Riemann sums).
In fact, the same approach of understanding this in terms of Riemann sums will be the basis for the expected value of a continuous random variable.
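For a finite \( k \), the variance definition can be evaluated term by term. The following sketch uses illustrative values (again the number of heads in two fair coin flips, which is our assumption) and also checks the result against the standard identity \( \mathrm{var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \), which is not stated on the slides.

```r
# Illustrative values and probabilities (an assumption, not from the slides).
x <- c(0, 1, 2)
f <- c(0.25, 0.50, 0.25)

# Center of mass.
mu <- sum(x * f)

# Variance: squared deviations from the mean, weighted by probability.
variance <- sum((x - mu)^2 * f)
print(variance)

# Cross-check with the identity var(X) = E[X^2] - E[X]^2.
print(sum(x^2 * f) - mu^2)
```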
A Bernoulli experiment is a random experiment with two outcomes: success or failure.
Let \( p \) denote the probability of the success of each trial and let the rv \( X \) be equal to \( 1 \) if the outcome is a success and \( 0 \) if a failure.
Then the probability mass function of \( X \) is \[ \begin{align} P(X=x) = \begin{cases} p & \text{if } x = 1, \\ 1-p & \text{if } x = 0. \end{cases} \end{align} \]
Q: Using the definition of the expected value,
\[ \begin{align} \mathbb{E}\left[ X\right] &= \sum_{j=1}^k x_j P(X=x_j) \end{align} \]
can you find the expected value of the Bernoulli rv \( X \)?
A: Using the definition, we have
\[ \begin{align} \mathbb{E}\left[X\right] &= 1\times P(X=1) + 0 \times P(X=0) \\ & = p \end{align} \]
Q: Using the definition of the variance,
\[ \begin{align} \mathrm{var}\left(X\right) &=\sum_{j=1}^k \left\{x_j - \mathbb{E}\left(X\right)\right\}^2 P(X=x_j) \end{align} \]
can you find the variance of the Bernoulli rv \( X \)?
A: Using the definition, we have
\[ \begin{align} \mathrm{var}(X) &= \left\{1 - p\right\}^2 P(X=1) + \left\{0 - p \right\}^2 P(X=0) \\ & = \left\{1 - 2p + p^2 \right\}p + p^2(1-p)\\ & = (1-p)p \end{align} \]
While we can prototypically think of a Bernoulli variable as representing a (possibly) unfair coin flip, it can be used to represent any success/failure trial based on some proportion.
Consider a box containing two red marbles and eight blue marbles.
Let \( X = 1 \) if the drawn marble is red and \( 0 \) otherwise.
The probability of randomly selecting a red marble in a single draw, which equals the expectation of \( X \), is
\[ \begin{align} \mathbb{E}\left[ X \right]= P(X = 1) = 1/5 = 0.2. \end{align} \]
The variance of \( X \) is
\[ \begin{align} \mathrm{var}\left(X\right) = \frac{1}{5}\left(1 - \frac{1}{5}\right) = \frac{4}{25}. \end{align} \]
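The marble numbers can be reproduced from the general definitions of expectation and variance. This is only a sketch; the variable names are ours.

```r
# Two red and eight blue marbles: probability of drawing a red one.
p <- 2 / (2 + 8)

# Bernoulli rv: X = 1 for red (success), X = 0 for blue (failure).
x <- c(1, 0)
f <- c(p, 1 - p)

# Expectation and variance from their definitions as weighted sums.
mean_x <- sum(x * f)
var_x <- sum((x - mean_x)^2 * f)

print(c(mean = mean_x, var = var_x))
```

The results agree with the closed forms \( p = 0.2 \) and \( p(1-p) = 4/25 = 0.16 \).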
Courtesy of Härdle, W.K. et al. Basic Elements of Computational Statistics. Springer International Publishing, 2017.
Returning to the marbles example, suppose we randomly draw ten marbles one at a time from an urn, putting each marble back before drawing again.
Putting the marbles back in the urn keeps the trials independent with the same probability on each trial.
Suppose that we have two red marbles in the urn and eight blue marbles in the urn.
Recall the definition of the binomial probability mass function,
\[ \begin{align} P(X=j) = {n \choose j} p^j \left( 1 - p\right)^{n - j} & & j\in\{0, 1, \cdots, n\}. \end{align} \]
Q: what is the probability of drawing exactly two red marbles? Can you fill in the appropriate substitutions?
A: here, the number of draws \( n = 10 \) and we define getting a red marble as a success with \( p = 0.2 \) and \( j = 2 \).
Hence \( X \) is binomially distributed, and the probability of \( j=2 \) successes is
\[ \begin{align} P(X=2) &= {10 \choose 2} (0.2)^2 \left(0.8\right)^{8} \approx 0.302 \end{align} \]
dbinom(x=2, size=10, prob=0.2)
[1] 0.3019899
The function dbinom() calculates the probability mass function for a particular outcome X=x given a probability of success prob and a number of independent trials size.
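As a quick sanity check, the same probability can be computed by hand from the binomial formula, using base R's `choose()` for the binomial coefficient. The variable names below are ours.

```r
# Parameters of the marble example: ten draws, success probability 0.2.
n <- 10
p <- 0.2
j <- 2

# Binomial pmf written out from its definition.
by_hand <- choose(n, j) * p^j * (1 - p)^(n - j)

# The built-in pmf for comparison.
builtin <- dbinom(x = j, size = n, prob = p)

print(c(by_hand = by_hand, builtin = builtin))
print(all.equal(by_hand, builtin))
```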
We can plot the entire probability distribution as a histogram as in the following:
par(cex = 2.0, mar = c(5, 4, 4, 2) + 0.3)
f = dbinom(x=0:10, size=10, prob=0.2)
plot(x=0:10, f, type = "s", main = "n=10 p=0.2", xlab = "x", ylab = "Prob.")
par(cex = 2.0, mar = c(5, 4, 4, 2) + 0.3)
f = dbinom(x=0:10, size=10, prob=0.5)
plot(x=0:10, f, type = "s", main = "n=10 p=0.5", xlab = "x", ylab = "Prob.")
The cumulative distribution function is given as,
\[ \begin{align} F_X(j) = P(X\leq j) = \sum_{i=0}^j {n \choose i} p^i \left(1-p\right)^{n-i} \end{align} \]
Notice that this is written as the sum of the probability mass functions for each number of successes less than or equal to the max value \( j \), starting from zero successes.
This corresponds, again, to the notion that we can take the sum of probabilities to find the probability of a union of disjoint events – pictorially, this corresponds identically to the total area under the density curve for \( j\in[0,2] \):
par(cex = 2.0, mar = c(5, 4, 4, 2) + 0.3)
f = dbinom(x=0:10, size=10, prob=0.2)
plot(x=0:10, f, type = "s", main = "n=10 p=0.2", xlab = "x", ylab = "Prob.")
pbinom(q=2, size=10, prob=0.2)
[1] 0.6777995
sum(dbinom(x=0:2, size=10, prob=0.2))
[1] 0.6777995
In our recommended book, there is an example of how one can visualize the CDF of the binomial by creating a user-defined plotting function:
create.binomial.cdf = function(N, p, colour = "black", pch = 16) {
  # recycle all arguments to a common length, one entry per plotted series
  n = max(length(N), length(p), length(colour), length(pch))
  N = rep(N, length = n)
  p = rep(p, length = n)
  colour = rep(colour, length = n)
  pch = rep(pch, length = n)
  # draw the step-function CDF of a single B(N, p) series on the current plot
  add.one.series = function(N, p, colour, pch, maxN) {
    cdf = pbinom(0:N, N, p)
    # lines(0:N, cdf, type='s', col=colour)
    for (i in 1:N) lines(c(i - 1, i), c(cdf[i], cdf[i]), type = "s", col = colour, lwd = 3)
    lines(c(N, maxN), c(1, 1), type = "b", col = colour, lwd = 3)
    points(0:N, cdf, col = colour, pch = pch)
  }
  # par(lwd=1.5, cex=1, mar=c(3.1,2.5,0.5,0.5))
  # set up an empty canvas, then add each series in turn
  plot(1, xlim = c(0, max(N)), ylim = c(0, 1), type = "n", xlab = "x", ylab = "Probability")
  for (i in 1:n) add.one.series(N[i], p[i], colour[i], pch[i], max(N))
}
par(cex = 2, mai = c(b = 1.2, l = 1.2, t = 0.7, r = 0.5))
create.binomial.cdf(c(10, 10), c(0.2, 0.6), c("red", "black"))
Notice in the arguments above, we have supplied two CDF functions to be plotted for \( 10 \) trials each, but for probabilities of success as \( p=0.2 \) in red and \( p=0.6 \) in black.
We can clearly see the faster rate at which the CDF with probability of success \( p=0.2 \) approaches \( 1 \)…
This can again be understood in terms of the area under the density curve for \( n=10 \) for either \( p=0.2 \) or \( p=0.6 \):
Recall that for a Bernoulli rv with probability \( p \) of success, the expected value is \( \mathbb{E}\left[X\right] = p \).
Q: using the fact that a Binomial rv \( Y \) in \( n \) trials can be written as a sum of \( n \) independent Bernoulli variables \( \{X_j\}_{j=1}^n \) with the same probability of success \( p \), what is the expected value of \( Y \)?
Q: can you use a similar property to find the variance of the Binomial rv \( Y \)?
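One way to check an answer to the two questions above is numerical. Because the expectation of a sum is the sum of the expectations, and the variance of a sum of independent variables is the sum of the variances, we expect \( \mathbb{E}[Y] = np \) and \( \mathrm{var}(Y) = np(1-p) \). The sketch below compares these with the definitions evaluated on the pmf; the parameter values are the ones from the marble example.

```r
# Binomial parameters from the marble example.
n <- 10
p <- 0.2

# Full pmf over all possible numbers of successes.
j <- 0:n
f <- dbinom(x = j, size = n, prob = p)

# Mean and variance from their definitions as weighted sums.
mean_y <- sum(j * f)
var_y <- sum((j - mean_y)^2 * f)

# Compare with n times the Bernoulli mean and variance.
print(c(mean_y = mean_y, n_times_p = n * p))
print(c(var_y = var_y, n_p_1mp = n * p * (1 - p)))
```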
Given two independent binomial rvs \( X \sim B(n, p) \) and \( Y \sim B(m, p) \), the sum of the two rvs also follows a binomial distribution \( X + Y \sim B(n + m, p) \) with expectation \( (m + n) p \).
Intuitively, \( n \) independent Bernoulli experiments and another \( m \) independent Bernoulli experiments again follow a binomial distribution: that of \( (n + m) \) independent Bernoulli experiments.
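This additivity can be checked numerically by convolving the two pmfs, since for independent rvs \( P(X+Y=s) = \sum_i P(X=i)P(Y=s-i) \). The values of \( n \), \( m \) and \( p \) below are arbitrary choices for illustration.

```r
# Illustrative parameters (our choice).
n <- 4
m <- 6
p <- 0.2

# pmfs of X ~ B(n, p) and Y ~ B(m, p); index k + 1 holds P(. = k).
fx <- dbinom(0:n, size = n, prob = p)
fy <- dbinom(0:m, size = m, prob = p)

# Convolution: sum over all ways to split s successes between X and Y,
# which relies on X and Y being independent.
f_sum <- sapply(0:(n + m), function(s) {
  i <- max(0, s - m):min(n, s)
  sum(fx[i + 1] * fy[s - i + 1])
})

# Compare with the pmf of B(n + m, p).
print(all.equal(f_sum, dbinom(0:(n + m), size = n + m, prob = p)))
```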
The Poisson distribution is closely related to the binomial distribution.
The example of tossing a die has already been introduced for the binomial distribution.
The Poisson distribution is an example of a distribution for the arrival of an event, which itself has a low probability, over a large number of trials.
A good practical example is how we model the number of customers arriving in a line during a fixed time interval.
We will “derive” the Poisson distribution by taking an approximate limit of the binomial distribution.
Suppose that \( X \) is a \( B(n, p) \) variable, then the probability mass function is \[ \begin{align} P(X=j) = {n \choose j} p^j \left(1 - p\right)^{n-j} \end{align} \]
We will define the product \( \lambda = n p \), so that we have the following substitutions: \( p = \frac{\lambda}{n} \) and \( 1 - p = 1 - \frac{\lambda}{n} \).
Using the previous two substitutions we have \[ \begin{align} P(X=j) &= \frac{n!}{j!\left(n-j\right)!} \left(\frac{\lambda}{n}\right)^j \left(1 - \frac{\lambda}{n}\right)^{n-j} \\ \\ &= \frac{n!}{j!\left(n-j\right)!} \frac{\lambda^j}{n^j} \frac{\left(1 - \frac{\lambda}{n}\right)^{n}}{\left(1 - \frac{\lambda}{n}\right)^{j}} \\ \\ &= \left(\frac{n!}{n^j\left(n-j\right)!}\right)\left(\frac{\lambda^j}{j!}\right)\left(\frac{\left(1 - \frac{\lambda}{n}\right)^{n}}{\left(1 - \frac{\lambda}{n}\right)^{j}}\right) \end{align} \]
The reason to write the probability mass function in this way is to take advantage now of some special properties when \( n \) is large and \( p \) is small.
Particularly, \[ \begin{align} \left(\frac{n!}{n^j\left(n-j\right)!}\right) \approx 1, & & \left(1 - \frac{\lambda}{n}\right)^{n} \approx \exp\{-\lambda\}, & & \text{ and } & & \left(1 - \frac{\lambda}{n}\right)^{j}\approx 1 \end{align} \]
Putting together the relationship in the last slide, we have for \( n \) large and \( p \) small,
\[ P(X=j) \approx \frac{\lambda^j}{j!} \exp\{-\lambda\} \]
In the above, the right-hand-side is the probability mass function for a Poisson variable giving the probability of an event occurring \( j \) times in a fixed interval, with the mean number of times this will occur in an interval equal to \( \lambda \).
As the number of trials \( n \) goes to infinity and the probability of success \( p \) goes to zero, with \( \lambda = np \) held fixed, the approximation becomes an equality in the limit.
As a rule of thumb, if \( p \leq 0.1 \), \( n \geq 50 \) and \( np \leq 5 \), the above approximation will work well.
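The quality of the approximation can be explored numerically. In the sketch below we hold \( \lambda = np \) fixed and let \( n \) grow, recording the largest pointwise gap between the binomial and Poisson pmfs. The value \( \lambda = 2 \), the range of \( j \) and the list of \( n \) values are illustrative assumptions.

```r
# Fixed mean lambda = n * p (illustrative choice).
lambda <- 2
j <- 0:10

# Increasing numbers of trials, so that p = lambda / n shrinks.
n_values <- c(10, 50, 1000)

# Largest absolute difference between the two pmfs over j, for each n.
max_err <- sapply(n_values, function(n) {
  max(abs(dbinom(j, size = n, prob = lambda / n) - dpois(j, lambda = lambda)))
})

print(data.frame(n = n_values, p = lambda / n_values, max_err = max_err))
```

The gap shrinks as \( n \) increases and \( p \) decreases, in line with the rule of thumb.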
Let's recall our probability mass function,
\[ \begin{align} P(X=j) = \frac{\lambda^j}{j!} \exp\{-\lambda\} \end{align} \]
As an example, suppose a typist makes on average one typographical error per page under normal working conditions.
Q: Can we model the number of typographical errors per page with a Poisson distribution? What needs to be satisfied?
Q: what is the probability of exactly two errors in a one-page text?
dpois(x=2, lambda=1)
[1] 0.1839397
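The same value follows from evaluating the Poisson probability mass function by hand with \( \lambda = 1 \) and \( j = 2 \). A minimal sketch, with variable names of our own choosing:

```r
# One error per page on average, and we ask for exactly two errors.
lambda <- 1
j <- 2

# Poisson pmf written out from its definition.
by_hand <- lambda^j / factorial(j) * exp(-lambda)

print(by_hand)
print(all.equal(by_hand, dpois(x = j, lambda = lambda)))
```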
Using the same kind of plotting code from the previous examples, and using the dpois probability mass function, we can visualize the distribution as
If instead the mean number of typos per page is \( \lambda=10 \), we can see how the distribution will change its shape,
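The plotting code for these two figures is not reproduced here. A minimal sketch in the style of the earlier dbinom plots might look as follows; the range `0:25` and the side-by-side layout are our assumptions, not the settings of the original figures.

```r
# Two panels side by side, one per value of lambda.
par(mfrow = c(1, 2))

# Range of counts to display (an assumption; the Poisson support is unbounded).
x <- 0:25

# pmf for lambda = 1 and lambda = 10, stored for later inspection.
lambdas <- c(1, 10)
pmfs <- lapply(lambdas, function(l) dpois(x = x, lambda = l))

# Step plots, mirroring the earlier binomial examples.
for (i in seq_along(lambdas)) {
  plot(x = x, y = pmfs[[i]], type = "s",
       main = paste0("lambda=", lambdas[i]), xlab = "x", ylab = "Prob.")
}
```

With \( \lambda = 1 \) the mass piles up near zero, while with \( \lambda = 10 \) it spreads out around ten.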
Poisson random variables have some very nice properties that we will just mention in the following.
Let \( X_j \sim Pois(\lambda_j) \) be independent and Poisson distributed rvs with parameters \( \lambda_1, \lambda_2, \cdots, \lambda_n \).
Then their sum is an rv also following the Poisson distribution with \( \lambda = \sum_{j=1}^n\lambda_j \), i.e.,
\[ \left(\sum_{j=1}^n X_j \right) \sim Pois(\lambda) \]
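For two independent Poisson rvs, this property can be verified numerically by convolving their pmfs. The rates `l1` and `l2` below are arbitrary illustrative values.

```r
# Illustrative rates (our choice).
l1 <- 1
l2 <- 2.5

# Totals at which to evaluate the pmf of the sum.
s <- 0:20

# P(X1 + X2 = k) = sum over i of P(X1 = i) * P(X2 = k - i),
# which relies on X1 and X2 being independent.
f_sum <- sapply(s, function(k) sum(dpois(0:k, l1) * dpois(k:0, l2)))

# Compare with the Poisson pmf with rate l1 + l2.
print(all.equal(f_sum, dpois(s, lambda = l1 + l2)))
```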
Moreover, the Poisson distribution belongs to the exponential family of distributions.
Generally, a rv \( X \) follows a distribution belonging to the exponential family if its probability mass function with a single parameter \( \theta \) has the form \[ \begin{align} P(X=x) = h(x) g(\theta)\exp\{\eta(\theta)t(x)\} \end{align} \]
For the Poisson distribution with \( \theta = \lambda \), writing \( \lambda^x = \exp\{x \log \lambda\} \) gives \( h(x) = \frac{1}{x!} \), \( g(\theta) = \exp\{-\lambda\} \), \( \eta(\theta ) = \log \lambda \) and \( t(x) = x \).
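As a numerical check, one factorization of the Poisson pmf into a part depending only on \( x \), a part depending only on the parameter, and an exponential coupling term is \( \frac{1}{x!} \cdot \exp\{-\lambda\} \cdot \exp\{x \log\lambda\} \). The sketch below multiplies these pieces and compares the product with `dpois`; the value of \( \lambda \) and the variable names are illustrative assumptions.

```r
# Illustrative rate and range of counts (our choice).
lambda <- 3
x <- 0:15

# Pieces of the factorization.
h <- 1 / factorial(x)   # depends on x only
g <- exp(-lambda)       # depends on the parameter only
eta <- log(lambda)      # natural parameter
t_x <- x                # statistic multiplying the natural parameter

# Reassemble the pmf and compare with the built-in function.
pmf <- h * g * exp(eta * t_x)
print(all.equal(pmf, dpois(x, lambda = lambda)))
```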
Other popular distributions, such as the normal / Gaussian, exponential, \( \Gamma \), \( \chi^2 \) and Bernoulli, belong to the exponential family and will be discussed further on in the course.