Parameters of probability distributions and the binomial distribution

03/03/2021

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

  • The following topics will be covered in this lecture:

    • Review of the mean, standard deviation and variance of a probability distribution
    • Binomial distribution
    • Parameters of the binomial distribution

Motivation

  • Our goal in this course is to use statistics from a small, representative sample to say something general about the larger, unobservable population or phenomenon.

  • Recall that measures of the population are what we refer to as parameters.

  • Parameters are generally unknown and unknowable.

    • If we have a representative sample we can compute the sample mean.
    • The sample mean will almost surely not equal the population mean, due to the natural variation (sampling error) that occurs in any given sample.
  • Random variables and probability distributions give us the model for estimating population parameters.

  • Generally, we will have to be satisfied with estimates of the parameters that are uncertain, but also include measures of “how uncertain”.

Characteristics of data

Diagram of the percent of outcomes contained within each standard deviation of the mean
for a standard normal distribution.

Courtesy of M. W. Toews CC via Wikimedia Commons.

  • In statistics, we try to characterize data and populations by a number of features that they exhibit.
  • For a single variable, the most common measures are:
    1. Center: A representative value that indicates where the middle of the data set is located.
    2. Spread: A measure of the amount that the data values vary around the center.
  • We saw last time how the mean and standard deviation are related quantities describing these features of sample data or a population.
  • The above figure represents the theoretical description of a normal population.
  • In particular:
    • \( \approx 68\% \) of the population lies within one standard deviation of the mean, \( [\mu - \sigma, \mu+\sigma] \);
    • \( \approx 95\% \) of the population lies within two standard deviations of the mean, \( [\mu - 2 \sigma , \mu + 2 \sigma] \);
    • \( \approx 99.7\% \) of the population lies within three standard deviations of the mean, \( [\mu - 3 \sigma, \mu + 3 \sigma] \).
  • This is known as the empirical rule, which holds for all normal populations.
  • Sample data will tend to follow this, but not exactly, if the measurements come from a normal population.
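  • As a quick sanity check of the empirical rule, we can simulate a large normal sample in R and measure the fraction of values within one, two and three standard deviations of the mean (a sketch only; the observed fractions will vary slightly from run to run):

# simulate a large sample from a standard normal population
x <- rnorm(100000, mean=0, sd=1)

# observed fraction of the sample within K standard deviations of the mean
for (K in 1:3) {
  cat("K =", K, ":", mean(abs(x - mean(x)) <= K * sd(x)), "\n")
}

# typical output is close to 0.683, 0.954 and 0.997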

Chebyshev's Theorem

  • A very similar rule is known as Chebyshev’s theorem:
    The proportion (or fraction) of any set of data lying within \( K \) standard deviations of the mean is always at least \( 1-\frac{1}{K^2} \), where \( K>1 \).
  • Q: suppose \( K=2 \), what does this statement tell us?
    • For \( K=2 \), we say at least \[ 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} \] of data lies within \( K=2 \) standard deviations of the mean.
    • Note, this holds for any distribution whereas the empirical rule only holds for normal data.
    • If we know the population is in fact normal, then \( 95\% > 75\% =1 - \frac{1}{2^2} \) of the population lies within \( K=2 \) standard deviations of the mean.
  • There are thus two major differences between Chebyshev’s theorem and the empirical rule:
    1. The empirical rule only holds for normal data, while Chebyshev’s theorem holds for any type of data.
    2. However, Chebyshev’s theorem only gives a lower bound on how much data lies within \( K \) standard deviations of the mean, and is therefore a much weaker statement.
  • In either case, these parameters tell us a lot about the center and spread of data, with some qualifications.
  • A quick numerical check of Chebyshev’s theorem is sketched below; we will then recall how we compute the mean, variance and standard deviation from a probability distribution.
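  • The following R sketch compares the bound \( 1 - \frac{1}{K^2} \) with the observed fraction for a deliberately non-normal sample; the exponential distribution is an arbitrary choice here, purely for illustration:

# Chebyshev's bound versus the observed fraction for non-normal data
x <- rexp(100000, rate=1)

for (K in c(2, 3)) {
  bound    <- 1 - 1 / K^2
  observed <- mean(abs(x - mean(x)) <= K * sd(x))
  cat("K =", K, "bound =", bound, "observed =", observed, "\n")
}

# the observed fraction always lies at or above the bound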

The mean of probability distributions

  • Let \( X \) be a random variable that can attain the possible-to-observe values \( \{x_\alpha \in\mathbf{R}\} \).
  • We define the probability mass function \( f(x_\alpha) = P(X=x_\alpha) \).
  • The mean of the probability distribution \[ \mu = \sum_{x_\alpha \in \mathbf{R}} x_\alpha P(X=x_\alpha) = \sum_{x_\alpha \in \mathbf{R}} x_\alpha f(x_\alpha) \] is computed like the formula for the mean of a frequency distribution.
  • However, because of the difference in the interpretation, the mean of a probability distribution has a special name:
    • For a random variable \( X \) with probability distribution defined by the pairs of values \( \{x_\alpha\} \) and \( P(X=x_\alpha) \), the expected value of \( X \) is defined, \[ \mu = \mathbb{E}\left[X\right] = \sum_{x_\alpha \in \mathbf{R}} x_\alpha P(X=x_\alpha). \]
    • We call the mean of the probability distribution the expected value, because it can be thought of as the theoretical mean if we repeated an experiment infinitely many times or sampled the entire population;
    • we would expect this value on average over infinitely many repetitions of the experiment.
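  • As a minimal sketch, an expected value can be computed in R as a probability-weighted sum; here we use the two coin flipping distribution that we revisit below:

# values the random variable X can attain and their probabilities
x_alpha <- c(0, 1, 2)        # number of heads in two coin flips
p       <- c(1/4, 2/4, 1/4)  # P(X = x_alpha)

# the expected value is the probability-weighted sum of the values
mu <- sum(x_alpha * p)
mu                           # returns 1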

Standard deviation of a sample

  • Suppose we are considering a finite population of size \( N \).
  • Usually, we will not have access to the entire population \( x_1, \cdots, x_N \).
  • Instead, we will only have some smaller subset of values in a sample \( x_1, \cdots, x_n \) for \( n< N \).
  • Our sample standard deviation can be computed as \[ s = \sqrt{\frac{\sum_{i=1}^n\left(x_i - \overline{x}\right)^2}{n-1}} \]
  • or \[ s = \sqrt{\frac{n \left(\sum_{i=1}^n x_i^2 \right) - \left(\sum_{i=1}^n x_i\right)^2}{n\left(n-1\right)}}. \]
  • However, this differs from computing the population standard deviation as \[ \sigma = \sqrt{\frac{\sum_{i=1}^N \left(x_i - \mu\right)^2}{N}}. \]
  • One key difference to remember is that the sample standard deviation uses a denominator of \( n-1 \), while the population standard deviation uses \( N \).
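  • In R, the built-in sd() uses the sample formula with the \( n-1 \) denominator; the population formula must be computed by hand, as in this sketch with an arbitrary small data set:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # arbitrary illustrative data
n <- length(x)

s     <- sd(x)                           # sample sd, denominator n - 1
sigma <- sqrt(sum((x - mean(x))^2) / n)  # population sd, denominator N = n

c(s, sigma)  # the sample version is slightly larger: about 2.138 versus 2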

Variance

  • The word variance also has a specific meaning in statistics and is another tool for describing the variation / dispersion / spread of the data.

  • Suppose that the data has a population standard deviation of \( \sigma \) and a sample standard deviation of \( s \).

  • Then, the data has a population variance of \( \sigma^2 \).

  • Likewise, the data has a sample variance of \( s^2 \).

  • Therefore, for either a population parameter or a sample statistic, the variance is the square of the standard deviation.

    • Because of this, the variance has units which are the square of the original units.
  • For example, if we measure the heights of students in inches, the standard deviation is also in inches.

    • However, the variance is in units of \( \text{inches}^2 \).

The standard deviation and variance of probability distributions

  • Let’s recall the formula now for the standard deviation of a population with members \( \{x_i\}_{i=1}^N \) \[ \sigma = \sqrt{\frac{\sum_{i=1}^N \left(x_i - \mu\right)^2}{N}}, \] where we take a denominator of \( N \) for the population instead of \( N-1 \) as in samples.
  • Let’s suppose that the population members \( x_i \) equal values \( x_\alpha \) in the range \( \mathbf{R} \) with frequencies \( n_\alpha \).
  • If we re-write the above formula in terms of \( x_\alpha \) and \( n_\alpha \) we can say, \[ \begin{align} \sigma = \sqrt{\frac{\sum_{x_\alpha\in \mathbf{R}} n_\alpha\left(x_\alpha - \mu\right)^2}{N}} = \sqrt{\sum_{x_\alpha \in \mathbf{R}} \frac{n_\alpha}{N} \left(x_\alpha - \mu\right)^2} = \sqrt{\sum_{x_\alpha\in \mathbf{R}} P(X=x_\alpha) \left(x_\alpha - \mu\right)^2 }. \end{align} \]
  • We will denote, \[ \sigma = \sqrt{\sum_{x_\alpha\in \mathbf{R}} P(X=x_\alpha) \left(x_\alpha - \mu\right)^2 } \] the standard deviation of the probability distribution associated to the random variable \( X \).
  • That is to say that the population standard deviation \( \sigma \) is exactly the standard deviation of the probability distribution.
  • For infinite populations and ranges, we can use the same argument (with calculus) to show this holds in general.

The standard deviation and variance of probability distributions continued

  • Using the derivation from the last slide, \[ \sigma = \sqrt{\sum_{x_\alpha\in \mathbf{R}} P(X=x_\alpha) \left(x_\alpha - \mu\right)^2 }, \]
  • we can show directly that the variance of a probability distribution is given as, \[ \sigma^2 = \sum_{x_\alpha \in \mathbf{R}} P(X=x_\alpha) \left(x_\alpha - \mu\right)^2 . \]
  • We can also derive the alternative forms for the population standard deviation and variance in terms of the probability distribution as \[ \begin{align} \sigma &= \sqrt{\sum_{x_\alpha \in \mathbf{R}} x_\alpha^2 P(X=x_\alpha) - \mu^2 }, \\ \\ \sigma^2 &= \sum_{x_\alpha\in\mathbf{R}} x_\alpha^2 P(X=x_\alpha) - \mu^2 . \end{align} \]
  • This again amounts to some algebraic manipulation; these forms are entirely equivalent to the ones above.

Example of the variance and standard deviation

Outcome                | Observed value for \( X=x \) | Probability
\( \{H,H\} \)          | \( x=2 \)                    | \( f(x)=\frac{1}{4} \)
\( \{H,T\}, \{T,H\} \) | \( x=1 \)                    | \( f(x)=\frac{2}{4} \)
\( \{T,T\} \)          | \( x=0 \)                    | \( f(x)=\frac{1}{4} \)
  • Let’s recall the probability distribution for the two coin flipping experiment.
  • We showed already that \( \mu = \mathbb{E}\left[X\right] =1 \).
  • Thus we say that the expected value of two coin flips is to observe one head.
  • Recall that our variance is computed as \[ \sigma^2 = \sum_{x_\alpha \in \mathbf{R}} P(X=x_\alpha) \left(x_\alpha - \mu\right)^2 \]
  • Using the above, we can show that \[ \begin{align} \sigma^2 &= \sum_{x_\alpha \in \mathbf{R}} P(X=x_\alpha) \left(x_\alpha - \mu\right)^2\\ &= \frac{1}{4}\left(2 - 1 \right)^2 + \frac{2}{4}\left(1 - 1\right)^2 + \frac{1}{4}\left(0 - 1\right)^2 \\ &= \frac{1}{4} + \frac{1}{4} = \frac{1}{2} \end{align} \]
  • Therefore, the standard deviation is given by \[ \begin{align} \sigma = \sqrt{\frac{1}{2}} \approx 0.707. \end{align} \]
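  • We can verify this calculation in R, using both equivalent forms of the variance formula as a quick check:

x_alpha <- c(2, 1, 0)
p       <- c(1/4, 2/4, 1/4)

mu         <- sum(x_alpha * p)           # expected value, equals 1
sigma2     <- sum(p * (x_alpha - mu)^2)  # definition form, equals 0.5
sigma2_alt <- sum(x_alpha^2 * p) - mu^2  # alternative form, also 0.5

sqrt(sigma2)                             # standard deviation, about 0.707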

Binomial distribution

  • A coin flipping experiment is actually a simple example of a broad category of experiments.
    • For example, we can consider an experiment in which we describe two possible outcomes:
      1. (success / \( S \) / \( 1 \) / \( H \)); or
      2. (failure / \( F \) / \( 0 \) / \( T \)).
    • We can encode the outcomes any way we like;
      • it is common to encode the outcomes as \( S \) or \( F \), where the choice of “success” is arbitrary.
    • More generally than coin flipping, we might consider the case where the probabilities, \[ \begin{align} P(\text{success}) \neq P(\text{failure}) \end{align} \]
    • Recall, if the experiment has only two possible outcomes and \( A= \)"success", then \( \overline{A}= \)"failure".
    • Therefore, \[ \begin{align} P(\text{success}) + P(\text{failure}) = 1. \end{align} \]
    • Suppose we run the experiment a total of \( n \) trials.
      • Because there are only two possible outcomes for each trial, and a finite number of trials, we can create a list of all possible outcomes for \( n \) trials.
        • For example, if there are \( x_\alpha \) total successes, there must be exactly \( n - x_\alpha \) failures in \( n \) trials.
      • More importantly, we can also make a list of all possible ways we can have \( x_\alpha \) successes and \( n-x_\alpha \) failures.
      • Let \( X \) be the random variable equal to the number of successful trials – we can therefore calculate the probability, \[ P(X = x_\alpha); \] however, the classical model for probability (equal probability of all outcomes) will no longer apply.

Binomial distribution continued

  • Recall our random variable \( X \) equal to the number of successful trials in \( n \) total trials.
    • Unlike with coin flipping, we suppose that it is possible for \[ \begin{align} P(\text{success}) \neq P(\text{failure}) \end{align} \]
  • However, there are finitely many trials, finitely many possible outcomes and, for each possible number of successes \( x_\alpha \), finitely many ways that \( X=x_\alpha \) can occur.
    • Provided all trials are independent (like coin flipping) and the probability of success is constant, we can still make a counting argument using
      1. The rule of complementary probability \[ \begin{align} P(\text{success}) + P(\text{failure}) = 1; \end{align} \]
      2. independence;
      3. the list of all possible ways we can make \( x_\alpha \) successes;
      4. the list of all possible ways we can make \( n-x_\alpha \) failures; and
      5. a total of \( n \) trials exactly;
    • to compute the probability exactly for each \( x_\alpha \) where \( x_\alpha \) ranges from \( 0, 1, \cdots, n \).
    • The list of all possible number of successes \( x_\alpha = 0, 1, \cdots, n \) and the associated probabilities \( P(X= x_\alpha) \) for \( x_\alpha = 0, 1, \cdots, n \) is called the binomial distribution.
  • The argument itself is somewhat long, but it really only uses tools we already know.
    • Therefore, if you can understand the principles of the points 1 - 5 above, we don’t need to belabor the details.

Binomial distribution continued

  • Formally, we will now describe the binomial distribution.
    • Suppose we run an experiment with two possible outcomes \( S= \)"success" and \( F= \)"failure", where \[ \begin{align} P(S) = p && P(F) = 1 - P(S) = q. \end{align} \]
    • Suppose we run exactly \( n \) total trials of the above experiment and suppose that:
      1. each trial is independent; and
      2. \( P(S)=p \) for every trial.
    • Let \( X \) be the random variable equal to the total number of successful trials.
    • Let \( x_\alpha \) be one of the possible number of successful trials in the range \( 0, 1 ,\cdots , n \).
      • Then the probability of exactly \( x_\alpha \) successful trials (the event \( X= x_\alpha \)) is given by \[ \begin{align} P(X=x_\alpha) = \frac{n!}{\left( n - x_\alpha\right)! x_\alpha !} p^{x_\alpha} q^{(n - x_\alpha)}, \end{align} \]
      • where:
        1. The total number of ways that we can have exactly \( x_\alpha \) successes in \( n \) trials is given by \[ \frac{n!}{\left( n - x_\alpha\right)! x_\alpha !}. \]
        2. The probability of \( x_\alpha \) independent successes (or \( n-x_\alpha \) independent failures) is \( p^{x_\alpha} \) \( \big( \) or \( q^{(n-x_\alpha)}\big) \) respectively.
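  • As a sketch, this formula can be evaluated directly in R with factorials and checked against the built-in functions choose and dbinom; the values of \( n \), \( p \) and \( x_\alpha \) below are arbitrary illustrations:

n <- 10; p <- 0.3; q <- 1 - p; x <- 4

# the binomial probability written out with factorials
factorial(n) / (factorial(n - x) * factorial(x)) * p^x * q^(n - x)

# the same value via the built-in binomial coefficient ...
choose(n, x) * p^x * q^(n - x)

# ... and via R's binomial probability mass function
dbinom(x, size=n, prob=p)  # all three agree, about 0.2001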

Binomial distribution example

  • Recall our notation:
    1. \( n \) - the number of trials;
    2. \( X \) - the random variable;
    3. \( x_\alpha \) - a specific number of successes that \( X \) could possibly attain;
    4. \( P(S)= p \) - the probability of an independent trial’s success;
    5. \( P(F)=q \) - the probability of an independent trial’s failure.
  • Suppose that when an adult is randomly selected with replacement, there is a \( 0.85 \) probability that this person knows what Twitter is (based on results from a Pew Research Center survey from several years ago…).
  • Suppose that we want to find the probability that exactly three of five random adults know what Twitter is.
  • Q: can you identify what \( n \), \( X \), \( x_\alpha \), \( p \) and \( q \) are in the above word problem?
    • Here we consider the random selection to be a “trial” so that the number of trials is \( n=5 \).
    • If we consider a “successful” trial to be “select an adult who knows what Twitter is”, then \( X \) is “number of adults who know what Twitter is out of five”.
    • \( x_\alpha \) is the specific number of successful trials we are interested in, i.e., \( x_\alpha = 3 \).
    • \( p \) is the probability of an independent trial’s success, i.e., \( p=0.85 \).
    • \( q \) is the probability of an independent trial’s failure, i.e., \( q=1-p = 0.15 \).

Binomial distribution example continued

  • Let’s recall our values from the last slide,
    • Here we consider the random selection to be a “trial” so that the number of trials is \( n=5 \).
    • If we consider a “successful” trial to be “select an adult who knows what Twitter is”, then \( X \) is “number of adults who know what Twitter is out of five”.
    • \( x_\alpha \) is the specific number of successful trials we are interested in, i.e., \( x_\alpha = 3 \).
    • \( p \) is the probability of an independent trial’s success, i.e., \( p=0.85 \).
    • \( q \) is the probability of an independent trial’s failure, i.e., \( q=1-p = 0.15 \).
  • Suppose we wanted to compute the probability of one particular outcome,
    • say, \( S_i = \)"the \( i \)-th participant knows what Twitter is" and \( F_i= \)"the \( i \)-th participant does not know what Twitter is", where \[ A = S_1 \text{ and } S_2 \text{ and } S_3 \text{ and } F_4 \text{ and } F_5. \]
    • We can use independence and the multiplication rule to show \[ \begin{align} P(A) &= P(S_1)\times P(S_2)\times P(S_3)\times P(F_4)\times P(F_5) \\ &= 0.85 \times 0.85 \times 0.85 \times 0.15 \times 0.15 \\ &= 0.85^3 \times 0.15^2. \end{align} \]
  • This shows how we get one part of the binomial distribution formula.
    • However, there are many combinations of \( S_i \) and \( F_i \) that arise in \( X=3 \).
  • Using a counting argument, we can show that the total number of ways \( X=3 \) is \[ \begin{align} \frac{n!}{(n- x_\alpha)! x_\alpha!} = \frac{5!}{(5 - 3)! 3!} = \frac{5!}{(2)!3!} = 10 \end{align} \]

Binomial distribution example continued

  • Let’s recall our values from the last slide,
    • Here we consider the random selection to be a “trial” so that the number of trials is \( n=5 \).
    • If we consider a “successful” trial to be “select an adult who knows what Twitter is”, then \( X \) is “number of adults who know what Twitter is out of five”.
    • \( x_\alpha \) is the specific number of successful trials we are interested in, i.e., \( x_\alpha = 3 \).
    • \( p \) is the probability of an independent trial’s success, i.e., \( p=0.85 \).
    • \( q \) is the probability of an independent trial’s failure, i.e., \( q=1-p = 0.15 \).
    • The total number of ways \( X=3 \) is \[ \begin{align} \frac{n!}{(n- x_\alpha)! x_\alpha!} = \frac{5!}{(5 - 3)! 3!} = \frac{5!}{(2)!3!} = 10 \end{align} \]
  • The binomial distribution formula can then be read as,
    • The probability of finding exactly \( 3 \) out of \( 5 \) independently, randomly selected adults who know what Twitter is, is equal to \[ \begin{align} \frac{n!}{(n- x_\alpha)! x_\alpha!} p^{x_\alpha} q^{n- x_\alpha} = 10 \times 0.85^3 \times 0.15^2 \approx 0.138, \end{align} \]
    • or in plain English,
      the probability of three independent successful trials, times the probability of two independent failure trials, times all possible ways we can have exactly \( 3 \) successful trials out of five.
    • Again, the counting argument follows directly from the material we studied before the midterm.
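  • We can check this arithmetic in R in a few lines:

choose(5, 3)                      # 10 ways to place 3 successes in 5 trials
choose(5, 3) * 0.85^3 * 0.15^2    # about 0.138
dbinom(3, size=5, prob=0.85)      # the same value from the built-in function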

Binomial distribution example continued

  • Let's now take a graphical look at the last problem in R.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(0:5, dbinom(0:5, size=5, prob=0.85),
     ylab="Probability", xlab="Number of Successes", main="Probability mass function",
     cex=3, cex.lab=3, cex.axis=2.5, cex.main=3)

[Figure: probability mass function of the binomial distribution with \( n=5 \), \( p=0.85 \).]

  • In the above, dbinom is R’s probability mass function for the binomial distribution.

  • We set the range of values to plot with 0:5, the number of trials with size=5 and the probability of success with prob=0.85.

Binomial distribution example continued

  • We should make the following remarks about the last calculation.
    • Technically, we could only make use of the binomial distribution because we sampled with replacement to enforce independent trials.
    • If we sampled our population without replacement, we know
      1. that the trials are dependent; and
      2. that the probability of success changes at each trial.
    • These conditions mean that the binomial distribution does not apply to the random variable \( X \) when we do not replace samples.
    • However, it is common to approximate sampling without replacement as independent when the sample size is less than \( 5\% \) of the population.
    • In practice for polls of, e.g., all US adults, this approximation will often be used.
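  • As a sketch of how good this approximation can be, we can compare the exact without-replacement (hypergeometric) probability to the binomial value in R; the population size of 10,000 below is an assumed illustrative number, not from the survey:

# exact probability without replacement: a population of 10,000 adults,
# of whom 8,500 (85%) know what Twitter is, sampling 5
dhyper(3, m=8500, n=1500, k=5)

# binomial approximation assuming independent trials
dbinom(3, size=5, prob=0.85)

# the two values are nearly identical since 5 is far below 5% of 10,000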

Parameters of the binomial distribution

Random variables are the numerical measure of the outcome of a random process.

Courtesy of Ania Panorska CC

  • We saw earlier the following definitions for the mean and the standard deviation of a probability distribution:
    • Suppose we have a random variable \( X \) which assigns a numerical value to each outcome in the sample space \( \mathbf{S} \).
    • Suppose all values that \( X \) can attain are given by a collection \( \{x_\alpha\} \) in the range \( \mathbf{R} \) of \( X \).

  • Then the mean (or expected value) of the probability distribution is given, \[ \mu = \sum_{x_\alpha \in \mathbf{R}} x_\alpha P(X=x_\alpha) \]
  • The standard deviation of the probability distribution is given \[ \sigma = \sqrt{\sum_{x_\alpha\in \mathbf{R}} P(X=x_\alpha) \left(x_\alpha - \mu\right)^2 } \]
  • These formulas hold for all probability distributions (with a slight modification when the variable is continuous by using calculus).
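  • As a sketch, we can apply these general formulas numerically to a binomial distribution and compare with the closed forms on the next slide:

n <- 5; p <- 0.85
x_alpha <- 0:n
probs   <- dbinom(x_alpha, size=n, prob=p)    # P(X = x_alpha)

mu    <- sum(x_alpha * probs)                 # general mean formula
sigma <- sqrt(sum(probs * (x_alpha - mu)^2))  # general sd formula

c(mu, sigma)                     # 4.25 and about 0.798
c(n * p, sqrt(n * p * (1 - p)))  # identical to n*p and sqrt(n*p*q)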

Parameters of the binomial distribution continued

Random variables are the numerical measure of the outcome of a random process. Public domain via Wikimedia Commons

  • The binomial distribution has a special structure so that its parameters take a simple form.
  • For the binomial distribution the mean is given as, \[ \mu = n \times p . \]
  • For the binomial distribution the variance is given as, \[ \sigma^2 = n \times p \times q. \]
  • For the binomial distribution the standard deviation is given as, \[ \sigma = \sqrt{ n \times p \times q}. \]
  • Q: what are \( \mu \) and \( \sigma \) for the binomial distribution with \( 20 \) trials and probability of success \( p=0.5 \)?
    • Notice that these are given as, \[ \begin{align} \mu = 20 \times 0.5 = 10 & & \sigma = \sqrt{20 \times 0.5 \times 0.5} = \sqrt{ 5}\end{align} \]
  • Q: what are \( \mu \) and \( \sigma \) for the binomial distribution with \( 40 \) trials and probability of success \( p=0.5 \)?
    • Notice that these are given as, \[ \begin{align} \mu = 40 \times 0.5 = 20 & & \sigma = \sqrt{40 \times 0.5 \times 0.5} = \sqrt{ 10}\end{align} \]
  • Q: what are \( \mu \) and \( \sigma \) for the binomial distribution with \( 20 \) trials and probability of success \( p=0.7 \)?
    • Notice that these are given as, \[ \begin{align} \mu = 20 \times 0.7 = 14 & & \sigma = \sqrt{20 \times 0.7 \times 0.3} = \sqrt{ 4.2}\end{align} \]
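  • These three answers can each be checked in one line of R:

c(20 * 0.5, sqrt(20 * 0.5 * 0.5))  # mu = 10, sigma = sqrt(5),   about 2.236
c(40 * 0.5, sqrt(40 * 0.5 * 0.5))  # mu = 20, sigma = sqrt(10),  about 3.162
c(20 * 0.7, sqrt(20 * 0.7 * 0.3))  # mu = 14, sigma = sqrt(4.2), about 2.049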

Review of the binomial distribution

  • The binomial distribution is a key distribution that gives us a way to model a wide range of experiments probabilistically.
  • This applies when we run an experiment with two possible outcomes \( S= \)"success" and \( F= \)"failure", where \[ \begin{align} P(S) = p && P(F) = 1 - P(S) = q. \end{align} \]
  • When we run exactly \( n \) total trials of the above experiment, assuming that:
    1. each trial is independent; and
    2. \( P(S)=p \) for every trial.
  • We can model the probability of a particular number of successes \( x_\alpha \) like a (possibly) non-fair coin flipping experiment.
  • We model the probability of exactly \( x_\alpha \) successful trials as \[ \begin{align} P(X=x_\alpha) = \underbrace{\frac{n!}{\left( n - x_\alpha\right)! x_\alpha !}}_{(1) } \times \underbrace{ p^{x_\alpha}}_{(2)} \times \underbrace{q^{(n - x_\alpha)}}_{(3)} \end{align} \] where:
    1. Total number of ways to find exactly \( x_\alpha \) successful trials out of \( n \) total trials;
    2. Probability of \( x_\alpha \) independent successful trials;
    3. Probability of \( n-x_\alpha \) independent failure trials;
  • The special structure of this distribution also allows us to compute the mean and standard deviation directly as \[ \begin{align} \mu = n \times p & & \sigma = \sqrt{n \times p\times q} \end{align} \]