03/01/2021
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
The following topics will be covered in this lecture:
Our goal in this course is to use statistics from a small, representative sample to say something general about the larger, unobservable population or phenomenon.
Recall that measures of the population are what we refer to as parameters.
Parameters are generally unknown and unknowable.
Random variables and probability distributions give us the model for estimating population parameters.
Note: we can only “find” the parameters exactly in very simple examples like games of chance.
Generally, we will have to be satisfied with estimates of the parameters that are uncertain, but also include measures of “how uncertain”.
Courtesy of M. W. Toews CC via Wikimedia Commons.
The (arithmetic sample) mean is usually the most important measure of center.
Suppose we have a sample of \( n \) total measurements of some random variable \( X \).
Then, the (arithmetic sample) mean is defined
\[ \text{Sample mean} = \overline{x} = \frac{x_1 +x_2 +\cdots + x_n}{n}= \frac{\sum_{i=1}^n x_i}{n} \]
Q: is the sample mean a statistic or a parameter?
An important property of the sample mean is that it tends to vary less over re-sampling than other statistics.
However, the sample mean is very sensitive to outliers.
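To make the outlier sensitivity concrete, here is a minimal Python sketch of the sample mean formula above (the sample values are hypothetical):

```python
def sample_mean(xs):
    """Arithmetic sample mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

data = [10, 12, 11, 13, 9]        # hypothetical sample
print(sample_mean(data))          # 11.0
print(sample_mean(data + [100]))  # a single outlier pulls the mean up sharply
```

Appending the single outlier 100 moves the mean from 11.0 to roughly 25.8, illustrating why the mean is not a resistant statistic.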
A statistic is called resistant if it doesn't change very much with respect to outlier data.
Courtesy of Diva Jain CC via Wikimedia Commons.
Let us suppose that we have a sample \( x_1, x_2, \cdots, x_n \).
Suppose each measurement is given a corresponding weight \( w_i \) so that there are pairs of the form \[ \begin{matrix} x_1 & w_1\\ x_2 & w_2 \\ \vdots & \vdots \\ x_n & w_n \end{matrix} \]
We compute a weighted mean using the following formula,
\[ \frac{\sum_{i=1}^n x_i \times w_i}{\sum_{i=1}^n w_i} = \frac{x_1 \times w_1 + x_2 \times w_2 + \cdots + x_n \times w_n}{w_1 + w_2 + \cdots + w_n} \]
Let's suppose that we want to compute the grade point average (GPA) for some student.
We will suppose that the student gets letter grades as follows: \( A, B, C, A, B \).
The letters are given point values as \( A=4.0, B=3.0, C=2.0, D=1.0 \)
The GPA is computed as a weighted mean of the grade points, weighted by the number of credits for the class.
\[ \begin{matrix} A & 3 \text{ credits} \\ B & 2 \text{ credits} \\ C & 2 \text{ credits} \\ A & 1 \text{ credit} \\ B & 3 \text{ credits} \end{matrix} \]
Q: how do we compute the weighted mean in this case? What is the GPA?
\[ \frac{4.0 \times 3 + 3.0 \times 2 + 2.0 \times 2 + 4.0 \times 1 + 3.0 \times 3}{3 + 2 + 2 + 1 + 3}= \frac{35}{11} \approx 3.18 \]
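The same computation can be written as a short Python sketch of the weighted mean formula, using the grade points as the values and the credits as the weights:

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of value * weight divided by the sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

grade_points = [4.0, 3.0, 2.0, 4.0, 3.0]  # A, B, C, A, B
credits = [3, 2, 2, 1, 3]                 # weight for each grade

gpa = weighted_mean(grade_points, credits)
print(round(gpa, 2))  # 3.18
```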
Outcome | Observed value for \( X=x \) | Frequency |
---|---|---|
\( \{H,H\} \) | \( x=2 \) | \( n=6 \) |
\( \{H,T\} \) | \( x=1 \) | \( n=3 \) |
\( \{T,H\} \) | \( x=1 \) | \( n=7 \) |
\( \{T,T\} \) | \( x=0 \) | \( n=4 \) |
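The frequency table above can also be summarized with a weighted mean: each observed value of \( X \) is weighted by how often it occurred. A minimal Python sketch, using the frequencies from the table:

```python
values = [2, 1, 1, 0]       # observed x for each outcome
frequencies = [6, 3, 7, 4]  # number of times each outcome occurred

# Weighted mean with frequencies as weights: the average observed value of X.
mean = sum(x * n for x, n in zip(values, frequencies)) / sum(frequencies)
print(mean)  # 1.1
```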
Outcome | Observed value for \( X=x \) | Probability |
---|---|---|
\( \{H,H\} \) | \( x=2 \) | \( f(x)=\frac{1}{4} \) |
\( \{H,T\}, \{T,H\} \) | \( x=1 \) | \( f(x)=\frac{2}{4} \) |
\( \{T,T\} \) | \( x=0 \) | \( f(x)=\frac{1}{4} \) |
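When the weights are probabilities, as in the table above, they sum to 1, so the weighted mean reduces to summing \( x \times f(x) \). A sketch using exact fractions:

```python
from fractions import Fraction

values = [2, 1, 0]
probs = [Fraction(1, 4), Fraction(2, 4), Fraction(1, 4)]

# Weighted mean with probability weights: the denominator sum(probs) is 1,
# so the mean of the distribution is simply the sum of x * f(x).
mean = sum(x * p for x, p in zip(values, probs))
print(mean)  # 1
```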
Courtesy of Mario Triola, Essentials of Statistics, 5th edition
Note that we could have measured the variation with something other than the standard deviation.
For example, if we wanted to measure the total deviation from the sample mean, we could write this as
\[ \sum_{i=1}^n \vert x_i - \overline{x}\vert \]
We could then divide this by the total number of observations, which gives
\[ \text{Mean absolute deviation} = \frac{\sum_{i=1}^n \vert x_i - \overline{x}\vert}{n} \]
This is a possible choice for a similar measure of the variation, but the main issue is that the absolute value is not an “algebraic operation”.
If we want to make calculations or inferences based on the formula above, this will become very difficult and there are few tools that work well with this statistic.
For this reason, using the square as in the sample standard deviation
\[ s = \sqrt{\frac{\sum_{i=1}^n\left(x_i - \overline{x}\right)^2}{n-1}} \]
we get a similar result, but one that is mathematically easier to manipulate.
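The two measures can be compared side by side. The following Python sketch, with a hypothetical sample, computes the mean absolute deviation (dividing by the number of observations) and the sample standard deviation (dividing by \( n - 1 \)):

```python
from math import sqrt

def mean_abs_deviation(xs):
    """Mean absolute deviation: average distance from the sample mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def sample_std(xs):
    """Sample standard deviation, using the n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

data = [2, 4, 4, 6]              # hypothetical sample
print(mean_abs_deviation(data))  # 1.0
print(round(sample_std(data), 3))
```

Both numbers describe spread about the mean, but the squared version is the one that most statistical tools are built around.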
We will not focus on calculating the sample standard deviation manually in this course; however, we will walk through one example by hand to see how the formula works.
Suppose we have the sample \( 22, 22, 26 \) and \( 24 \).
We wish to compute how much deviation there is in the data from the sample mean, so we will begin by computing this value
\[ \overline{x} = \frac{22 + 22 + 26 + 24 }{4} = \frac{94}{4}=23.5 \]
We now compute the raw deviation of each sample from the sample mean:
\[ \begin{align} x_1 - \overline{x} =& 22 - 23.5 = -1.5\\ x_2 - \overline{x} =& 22 - 23.5 = -1.5\\ x_3 - \overline{x} =& 26 - 23.5 = 2.5\\ x_4 - \overline{x} =& 24 - 23.5 = 0.5\\ \end{align} \]
Squaring each value, we obtain \( 2.25, 2.25, 6.25, 0.25 \), so that
\[ s = \sqrt{\frac{\sum_{i=1}^4 \left(x_i - \overline{x}\right)^2}{3}} = \sqrt{\frac{11}{3}}\approx 1.9 \]
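We can check this hand computation with Python's standard library, which implements the same \( n - 1 \) formula:

```python
import statistics

data = [22, 22, 26, 24]
s = statistics.stdev(data)  # sample standard deviation (n - 1 in the denominator)
print(round(s, 1))          # 1.9, matching sqrt(11/3) above
```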
This shows how the sample standard deviation can be computed, but we will want a few ways to interpret the value.
The word variance also has a specific meaning in statistics and is another tool for describing the variation / dispersion / spread of the data.
Suppose that the data has a population standard deviation of \( \sigma \) and a sample standard deviation of \( s \).
Then, the data has a population variance of \( \sigma^2 \).
Likewise, the data has a sample variance of \( s^2 \).
Therefore, for either a population parameter or a sample statistic, the variance is the square of the standard deviation.
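This relationship is easy to confirm with Python's statistics module, reusing the sample from the earlier worked example:

```python
import statistics

data = [22, 22, 26, 24]          # sample from the worked example above
s = statistics.stdev(data)       # sample standard deviation
var = statistics.variance(data)  # sample variance

print(round(var, 4))             # 3.6667, i.e. 11/3
assert abs(var - s ** 2) < 1e-9  # the variance is the square of s
```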
One caution: unlike the standard deviation, the variance does not carry the same units as the data. For example, when measuring the heights of students in inches, the standard deviation is in inches, while the variance is in square inches.
Courtesy of Melikamp CC via Wikimedia Commons