Bayes' Theorem and Probability Distributions

02/10/2021

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward, respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:

    • Bayes' theorem
    • Random variables
    • Probability distributions
    • Probability Mass Functions

Bayes’ Theorem

  • Let us suppose that \( A \) and \( B \) are events for which \( P(A)\neq 0 \) and \( P(B)\neq 0 \).
  • Consider the statement of the multiplication rule, \[ P(A \cap B) = P(A\vert B) P(B); \]
  • yet it is also true that, \[ P(B \cap A) = P(B \vert A) P(A); \]
  • and \( P( A \cap B) = P(B \cap A) \) by definition.
  • Putting these statements together, we obtain, \[ \begin{align} &P(A\vert B) P(B) = P(B \vert A ) P(A)\\ \Leftrightarrow & P(A \vert B) = \frac{P(B\vert A) P(A)}{ P(B)} \end{align} \]
  • The statement that \[ P(A \vert B) = \frac{P(B\vert A) P(A)}{ P(B)} \] is known as Bayes' theorem for \( P(B)>0 \).
  • This is nothing more than rewriting the multiplication rule as discussed above, but the result is extremely powerful, as the short numerical check below illustrates.
  • Bayes' theorem was not widely used in statistics for hundreds of years, until advances in digital computers made its application practical.
  • Once digital computers became available, many statistical tools were built using Bayes' theorem as their basis.
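  • As a sanity check of the derivation, the Python sketch below builds a small joint distribution over two events and confirms numerically that \( P(A\vert B) = P(B\vert A)P(A)/P(B) \); the joint probabilities are invented for illustration, not taken from the lecture.

```python
# A small, made-up joint distribution over two events A and B,
# used only to check the algebra numerically.
p_joint = {
    (True, True): 0.12,    # P(A and B)
    (True, False): 0.18,   # P(A and not B)
    (False, True): 0.28,   # P(not A and B)
    (False, False): 0.42,  # P(not A and not B)
}

p_A = sum(p for (a, _), p in p_joint.items() if a)  # P(A) = 0.30
p_B = sum(p for (_, b), p in p_joint.items() if b)  # P(B) = 0.40
p_A_and_B = p_joint[(True, True)]                   # P(A and B) = 0.12

p_A_given_B = p_A_and_B / p_B  # conditional probability P(A|B)
p_B_given_A = p_A_and_B / p_A  # conditional probability P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
assert abs(p_A_given_B - p_B_given_A * p_A / p_B) < 1e-12
print(f"P(A|B) = {p_A_given_B:.4f}")  # 0.3000
```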

Bayes' theorem continued

  • Often, Bayes' theorem \[ P(A \vert B) = \frac{P(B\vert A) P(A)}{ P(B)} \] is used as a way to update the probability of \( A \) when you have new information \( B \).
    • For example, let the events \( A= \)"it snows in the Sierra" and \( B= \)"it rains in my garden".
    • I might assign a prior probability \( P(A) \) to snow, before knowing any other information.
    • \( P(A\vert B) \) is the posterior probability of snow in the Sierra given rain in my garden.
    • If I found out later in the day that there was rain in my garden, I could update \( P(A) \) to \( P(A\vert B) \) by multiplying \[ P(A\vert B) = P(A) \times \left(\frac{P(B\vert A)}{P(B)}\right) \] directly.
    • Although this is a simplistic example, this logic is the basis of many weather prediction techniques.
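    • A minimal sketch of this update in Python, with invented values for the prior \( P(A) \), the likelihood \( P(B\vert A) \), and \( P(B) \) (the lecture does not give numbers for this example):

```python
# Prior-to-posterior update for the snow/rain example.
# All numbers are illustrative assumptions, not from the lecture.
p_snow = 0.30             # prior P(A): snow in the Sierra
p_rain = 0.25             # P(B): rain in my garden
p_rain_given_snow = 0.60  # likelihood P(B|A)

# Posterior: P(A|B) = P(A) * (P(B|A) / P(B))
p_snow_given_rain = p_snow * (p_rain_given_snow / p_rain)
print(f"posterior P(snow | rain) = {p_snow_given_rain:.2f}")  # 0.72
```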

Bayes' theorem example 1

  • EXAMPLE: suppose that 20% of email messages are spam, that the word “free” occurs in 60% of the spam messages, and that 13% of all messages contain the word “free”.

  • Question: How can we use Bayes' theorem,

    \[ P(A\vert B) = \frac{P(B\vert A) P(A)}{P(B)} \] to compute the probability of a message being spam, given that it includes the word “free”?

    • Let the events be
    • \( S= \) “message is spam” \[ P(S)=0.2 \]
    • \( F= \) “message contains the word free” \[ P(F)=0.13 \]
    • We are looking for \( P(S|F) \).
    • The probability that a message contains the word “free”, given that it is spam, is \[ P(F|S)=0.6 \]
    • From Bayes' theorem \[ P(S|F)=\frac{P(F|S)P(S)}{P(F)} \]
    • \[ P(S|F)=\frac{0.6(0.2)}{0.13}\approx 0.923 \]
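    • As a quick check, plugging the slide’s numbers into Bayes’ theorem in Python:

```python
# Spam example: P(S|F) = P(F|S) P(S) / P(F), numbers from the slide.
p_spam = 0.20             # P(S)
p_free = 0.13             # P(F)
p_free_given_spam = 0.60  # P(F|S)

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(S|F) = {p_spam_given_free:.3f}")  # 0.923
```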

Bayes' theorem example 2

Table of high contamination levels during chip manufacturing

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

  • EXAMPLE: recall the chips subject to high levels of contamination. The information is summarized in the table on the left.
  • Question: How can we use Bayes' theorem, \[ P(A\vert B) = \frac{P(B\vert A) P(A)}{P(B)} \] to find the conditional probability of a high level of contamination present, given that a failure occurred?
    • Let the events be
      • \( H= \)"chip is exposed to high levels of contamination" \[ P(H)=0.20 \]
      • \( F= \)"product fails"
      • Earlier we computed \( P(F) \) using the total probability rule as \[ P(F)=P(F|H)P(H)+P(F|H')P(H')=0.024 \] with \[ P(F|H)=0.10 \text{ and } P(F\vert H') = 0.005 \]
    • The probability \( P(H | F) \) is determined from Bayes' theorem \[ \begin{align} P(H|F)&=\frac{P(F|H)P(H)}{P(F)} =\frac{0.10(0.20)}{0.024}\approx 0.83\end{align} \]
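    • The same arithmetic in Python, combining the total probability rule with Bayes’ theorem (numbers from the slide):

```python
# Contamination example: total probability rule, then Bayes' theorem.
p_H = 0.20               # P(H): high contamination
p_F_given_H = 0.10       # P(F|H)
p_F_given_not_H = 0.005  # P(F|H')

# Total probability rule: P(F) = P(F|H)P(H) + P(F|H')P(H')
p_F = p_F_given_H * p_H + p_F_given_not_H * (1 - p_H)
print(f"P(F)   = {p_F:.3f}")  # 0.024

# Bayes' theorem: P(H|F) = P(F|H)P(H) / P(F)
p_H_given_F = p_F_given_H * p_H / p_F
print(f"P(H|F) = {p_H_given_F:.2f}")  # 0.83
```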

Random Variables

Probability distribution for two coin flips with x number of heads.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • The first concept that we will need to develop is the random variable.
  • Prototypically, we can consider the coin flipping example from the motivation:
    • \( x \) is the number of heads in two coin flips.
  • Every time we repeat two coin flips \( x \) can take a different value due to many possible factors:
    • how much force we apply in the flip;
    • air pressure;
    • wind speed;
    • etc…
  • The result is so sensitive to these factors, which are beyond our ability to control, that we consider the result to be determined by chance.
  • Before we flip the coin twice, the value of \( x \) is yet to be determined.
  • After we flip the coin twice, the value of \( x \) is fixed and possibly known.
  • Formally we will define:
  • Random Variable
    A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.
  • Notation
    A random variable is denoted by an uppercase letter such as \( X \). After an experiment is conducted, the measured value of the random variable is denoted by a lowercase letter such as \( x \).
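  • To make the definition concrete, here is a minimal Python sketch of the two-coin-flip experiment: the random variable \( X \) is literally a function that assigns a real number (the count of heads) to each outcome.

```python
import random

def flip_twice():
    """Run the random experiment: two fair coin flips."""
    return (random.choice("HT"), random.choice("HT"))

def X(outcome):
    """The random variable: assign to each outcome the number of heads."""
    return sum(1 for flip in outcome if flip == "H")

outcome = flip_twice()  # e.g. ('H', 'T')
x = X(outcome)          # the measured value, e.g. x = 1
print(outcome, "->", x)
```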

Random variables continued

Random variables are the numerical measure of the outcome of a random process.

Courtesy of Ania Panorska CC

  • Suppose we are considering our sample space \( \mathbf{S} \) of all possible outcomes of a random process.
  • Then for any particular outcome of the process,
    • e.g., for the coin flips one outcome is \( \{H,H\} \),
  • mathematically, the random variable \( X \) maps this outcome to the numerical value \( x=2 \) in the range \( \mathbf{R} \).
  • Note: \( x \) must always take a numerical value.
  • Because a random variable takes a numerical value (not categorical), we must consider the units that \( x \) takes:
    • Discrete random variable is a random variable with a finite (or countably infinite) range.
      • In particular, the unit of \( x \) cannot be arbitrarily sub-divided.
        • We can think of the number of heads in two coin flips as being measured in counting units, because \( 1.45 \) heads does not make sense.
      • However, the values \( x \) takes don’t strictly need to be whole numbers;
        • the units just cannot be arbitrarily sub-divided.
      • The scale of units for \( x \) can be finite or infinite depending on the problem.

Random variables continued

Random variables are the numerical measure of the outcome of a random process.

Courtesy of Ania Panorska CC

    • Continuous random variable is a random variable with an interval (either finite or infinite) of real numbers for its range.
      • The units of \( x \) can be arbitrarily sub-divided and \( x \) can take any value in the sub-divided units.
      • Necessarily, \( x \) can take infinitely many values when it is continuous.
        • A good example to think of is if \( x \) is the daily high temperature in Reno in degrees Celsius.
        • If we had a sufficiently accurate thermometer, we could measure \( x \) to an arbitrary decimal place and it would make sense.
        • \( x \) thus maps today’s weather outcome to a number in a continuous unit of measurement.
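      • A minimal sketch of this in Python, modeling the daily high temperature as a uniform draw over an interval (the interval and the uniform model are illustrative assumptions, not from the lecture):

```python
import random

# A continuous random variable: any value in the interval is possible,
# and the measurement can be subdivided to arbitrary precision.
temperature = random.uniform(-5.0, 35.0)  # assumed range of daily highs, in Celsius
print(f"today's high: {temperature:.6f} C")
```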

Probability distributions

  • Given a random variable, our method for analyzing its behavior is typically through a probability “distribution”.
  • In this chapter, we present the analysis of several random experiments and discrete random variables that frequently arise in applications.
  • The probability distribution of a random variable \( X \) is a description of the probabilities associated with the possible values of \( X \).
    • A probability distribution can thus be considered a complete description of the random variable.
      • For any possible value that \( x \) might attain given any possible outcome, we know with what probability this will occur.
    • It is often expressed in the format of a table, formula, or graph.

Probability distribution for a discrete random variable – example 1

  • EXAMPLE: the time to recharge the flash is tested in three cell-phone cameras.
  • The probability that a camera meets the recharge specification is 0.8, and the cameras perform independently.
  • Because the cameras are independent, the probability that the first and second cameras pass the test and the third one fails, denoted as \( ppf \), is \[ P(ppf) = (0.8)(0.8)(0.2) = 0.128 \]
Camera Flash Tests Distribution Table

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

  • The table on the right describes the sample space for the experiment and associated probabilities.
  • The random variable \( X \) denotes the number of cameras that pass the test.
  • The last column of the table shows the values of \( X \) assigned to each outcome of the experiment.
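  • The table can be reconstructed by enumeration; the sketch below lists all eight pass/fail outcomes, multiplies the per-camera probabilities using independence, and tallies \( P(X=x) \).

```python
from itertools import product

p_pass = 0.8  # probability a camera meets the recharge specification

dist = {}  # maps each value x of X to P(X = x)
for outcome in product("pf", repeat=3):  # e.g. ('p', 'p', 'f')
    # Independence: multiply the per-camera probabilities.
    prob = 1.0
    for result in outcome:
        prob *= p_pass if result == "p" else 1 - p_pass
    x = outcome.count("p")  # X = number of cameras that pass
    dist[x] = dist.get(x, 0.0) + prob
    print("".join(outcome), f"{prob:.3f}", " X =", x)

print(dist)  # approximately {3: 0.512, 2: 0.384, 1: 0.096, 0: 0.008}
```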

Probability distribution for a discrete random variable – example 2

  • EXAMPLE: there is a chance that a bit transmitted through a digital transmission channel is received in error.
    • Let \( X \) equal the number of bits in error in the next four bits transmitted. The possible values for \( X \) are \( \{0, 1, 2, 3, 4\} \).
    • Suppose that the probabilities are \[ \begin{align} P(X=0)=0.6561 &\;\; P(X=1)=0.2916\\ P(X=2)=0.0486 &\;\; P(X=3)=0.0036\\ P(X=4)=0.0001 & \end{align} \]
    • The probability distribution of \( X \) is specified by the possible values along with the probability of each.
Probability distribution for bits in error.

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

    • A graphical description of the probability distribution of \( X \) is shown in the figure on the left.
    • Practical Interpretation: A random experiment can often be summarized with a random variable and its distribution.
    • The details of the sample space can often be omitted.
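    • As an aside, the five probabilities listed above are exactly what results if each of the four bits is in error independently with probability \( 0.1 \); the slide gives only the numbers themselves, so this connection is an observation, not part of the stated example. A short check in Python:

```python
from math import comb

p_err = 0.1  # assumed per-bit error probability (see the note above)

# P(X = x) = C(4, x) * p^x * (1 - p)^(4 - x) for x errors in 4 bits
for x in range(5):
    p = comb(4, x) * p_err**x * (1 - p_err) ** (4 - x)
    print(f"P(X={x}) = {p:.4f}")
# Output matches the slide:
# P(X=0) = 0.6561, P(X=1) = 0.2916, P(X=2) = 0.0486,
# P(X=3) = 0.0036, P(X=4) = 0.0001
```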

Probability Mass Function

  • For a discrete random variable \( X \), its distribution can be described by a function that specifies the probability at each of the possible discrete values for \( X \).
  • Probability Mass Function
    For a discrete random variable \( X \) with possible values \( x_1, x_2,\dots, x_n \), a probability mass function is a function such that
    1. \( f(x_i)\geq 0 \)
    2. \( \sum_{i=1}^n f(x_i)=1 \)
    3. \( f(x_i)=P(X=x_i) \)
Probability distribution for bits in error.

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

  • As with the previous example, we see that the probability mass function describes the probability distribution.
  • Particularly, we see \[ \begin{align} f(x) = \begin{cases} 0.6561 & \text{when }x=0\\ 0.2916 & \text{when }x=1\\ 0.0486 & \text{when }x=2\\ 0.0036 & \text{when }x=3\\ 0.0001 & \text{when }x=4 \end{cases} \end{align} \] so that \( f(x)=P(X=x) \) for each possible value.
  • The input of the probability mass function is a possible value of the random variable, and the output is its associated probability.
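  • Since a probability mass function is just a table of value–probability pairs, it can be represented directly; the sketch below stores the bits-in-error distribution as a Python dict and checks the three defining properties.

```python
# The bits-in-error PMF as a table: f[x] = P(X = x).
f = {0: 0.6561, 1: 0.2916, 2: 0.0486, 3: 0.0036, 4: 0.0001}

assert all(p >= 0 for p in f.values())     # 1. f(x_i) >= 0
assert abs(sum(f.values()) - 1.0) < 1e-12  # 2. the f(x_i) sum to 1
print(f[2])                                # 3. f(2) = P(X = 2) = 0.0486
```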