Measures of center and measures of spread

02/06/2020



Outline

The following topics will be covered in this lecture:
- Mean
- Median
- Mode
- Midrange
- Computing the mean from frequency distributions
- Weighted means
- Basic concepts of variation

Characteristics of data

Diagram of the percent of outcomes contained within each standard deviation of the mean
for a standard normal distribution.

Courtesy of M. W. Toews CC via Wikimedia Commons.

Recall, we try to characterize data by a number of the features that it exhibits.
Some of the key measures are:

Center: A representative value that indicates where the middle of the data set is located.
Variation: A measure of the amount that the data values vary.
Distribution: The nature or shape of the spread of the data over the range of values (such as bell-shaped).
Outliers: Sample values that lie very far away from the vast majority of the other sample values.
Time: Any change in the characteristics of the data over time.

We will now begin studying measures of center.
There are several main measures of center of a data set:

mean;
median;
mode; and
midrange.

Each of these usually gives a different view of where the “most central point” of the data lies.

Mean

The (arithmetic sample) mean is usually the most important measure of center.
Suppose we have \( n \) total sample measurements of some variable \( x \).
- We will denote these samples \( x_1, x_2, \cdots, x_n \)
Then, the (arithmetic sample) mean is defined

\[ \text{Sample mean} = \frac{x_1 +x_2 +\cdots + x_n}{n}= \frac{\sum_{i=1}^n x_i}{n} \]
Discuss with a neighbor: is the sample mean a statistic or a parameter?
- A: the sample mean is computed from samples and thus a statistic.
- For this reason, if we took new measurements from a new sample of the population, we could get a different value.
- The random difference between the sample mean and the true population mean is called sampling error.
An important property of the sample mean is that it tends to vary less over re-sampling than other statistics.
- That is, it tends to stay close to the same value.
However, the sample mean is very sensitive to outliers.
- If outliers exist in the data, the mean can be drawn far away from the “main” cluster of data.
A statistic is called resistant if it doesn't change very much with respect to outlier data.
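As a minimal sketch of the sample mean formula above, the following Python snippet computes the mean directly from the definition and compares it with the standard library's `statistics.mean`. The function name `sample_mean` and the data values are illustrative assumptions, not part of the lecture.

```python
import statistics


def sample_mean(samples):
    """Sum of the samples divided by the number of samples n."""
    n = len(samples)
    return sum(samples) / n


# Hypothetical measurements, chosen only for illustration.
x = [2.0, 4.0, 6.0, 8.0]

mean_by_definition = sample_mean(x)
mean_by_library = statistics.mean(x)
print(mean_by_definition, mean_by_library)
```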

Median

A different notion of center is the middle of the data.
For a numerical measurement, we can always order the data so that we go from low to high or high to low.
Median – the median is the middle of the ordered data set.

If there are an odd number of samples, the median is defined as the middle value exactly.
If there are an even number of samples, we split the data into the lower \( 50\% \) and upper \( 50\% \) of the samples;
then we take the median to be the mean of the:

largest of the lower \( 50\% \); and
smallest of the upper \( 50\% \).

Suppose we are given a list of the following samples \( 22, 22, 26, 24, 23 \).
- Discuss with a neighbor: what is the median of this list of samples?
- Ordering the values, we get \( 22, 22, 23, 24, 26 \) so that the middle value is obviously \( 23 \).
Suppose a new sample includes \( 22, 22, 26, 24, 23, 27 \).

Discuss with a neighbor: what is the median of this list of samples?
In this case, we have an even number of samples.
The lower \( 50\% \) is given by \( 22,22,23 \) and the upper \( 50\% \) is given by \( 24,26,27 \).
Therefore, the mean of the largest lower value and the smallest upper value is given by \[ \frac{23 + 24}{2} = 23.5. \]
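The odd and even cases above can be sketched in Python as follows. The function name `median_of` is an illustrative assumption; the standard library's `statistics.median` follows the same rule and is used here only as a cross-check.

```python
import statistics


def median_of(samples):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(samples)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # ordered[mid - 1] is the largest of the lower half,
    # ordered[mid] is the smallest of the upper half.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median_of([22, 22, 26, 24, 23])
even_median = median_of([22, 22, 26, 24, 23, 27])
print(odd_median, even_median)
```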

Median continued

Let us consider the last example once again.
Suppose our sample includes the values \( 22, 22, 26, 24, 23, 27 \).
If we compute the (arithmetic sample) mean, we find

\[ \frac{22+22+26+24+23+27}{6} = \frac{144}{6} = 24. \]
Now, suppose that we realize that the value \( 27 \) was obtained due to measurement error and our sample should have read \( 22, 22, 26, 24, 23, 1000 \).
Discuss with a neighbor: does replacing the value \( 27 \) with \( 1000 \) affect the median? Does it affect the mean? Which of these statistics is resistant to outliers?
- We note, this does not affect the median – indeed the actual numerical value of the final measurement does not change which value lies in the middle.
- The lower \( 50\% \) of the measurements are given by \( 22,22,23 \) and the upper \( 50\% \) are given by \( 24,26,1000 \).
- Once again, we compute the mean of the largest lower value and the smallest upper value, given by \( \frac{23 + 24}{2} = 23.5 \), so we say that the median is resistant to outliers.
- On the other hand, the sample mean is given as \[ \frac{22+22+26+24+23+1000}{6} = \frac{1117}{6} \approx 186.1667. \]
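As a quick check of this comparison, the sketch below measures how far each statistic moves when \( 27 \) is replaced by \( 1000 \). The variable names are illustrative assumptions.

```python
import statistics

original = [22, 22, 26, 24, 23, 27]
with_outlier = [22, 22, 26, 24, 23, 1000]

# How far does each statistic move when the outlier is introduced?
mean_shift = statistics.mean(with_outlier) - statistics.mean(original)
median_shift = statistics.median(with_outlier) - statistics.median(original)

print("shift in mean:  ", mean_shift)
print("shift in median:", median_shift)
```

The median does not move at all, while the mean moves by more than 160.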

Mode

Another notion of the most “central point” in the data can be the value that is measured most frequently.
Mode – the mode is the observed value that is most frequent in the data.
Consider the last example with samples of \( 22, 22, 26, 24, 23, 27 \). Q: What is the mode?
- In this case, we sampled \( 22 \) more than any other value, so this is the mode of the data.
When two values share the highest frequency, we call the data bi-modal; when more than two do, we call it multi-modal.
- An exception to the above rule is when no values are repeated.
- In this case, we say there is no mode to the data.
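The mode rules above, including the "no mode" convention, can be sketched as follows. The function name `modes` and the second and third data sets are illustrative assumptions. Note that the standard library's `statistics.multimode` returns every value when nothing repeats, so this sketch handles that case explicitly to match the convention used here.

```python
from collections import Counter


def modes(samples):
    """Return the sorted list of modes, or an empty list when no value repeats."""
    counts = Counter(samples)
    highest = max(counts.values())
    if highest == 1:
        return []  # convention from the lecture: no repeated values means no mode
    return sorted(value for value, count in counts.items() if count == highest)


single_mode = modes([22, 22, 26, 24, 23, 27])
two_modes = modes([1, 1, 2, 3, 3])  # hypothetical bi-modal data
no_mode = modes([1, 2, 3])          # hypothetical data with no repeats
print(single_mode, two_modes, no_mode)
```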

Differences in mean, median and mode

Differences between mean, median and mode with non-symmetric distributions.

Courtesy of Diva Jain CC via Wikimedia Commons.

Usually, the mean, median and mode tell us different characteristics of what we call the “center” of the data.
In the special case when data is normal, these coincide.
In the figure to the left, we see data that is uni-modal in each of three different cases.
In the left case, we have right skewness:
- Here, the mean and median are displaced to the right, away from the mode.
- Additionally, the mean and median do not match.

In the right case, we have left skewness:

In this case, the mean and the median are skewed to the left away from the mode.

Note: the precise locations of the mean and median do not need to fall this way for all skewed distributions; this is only one example of how this can look.
“Physically”, the mean corresponds to the center of mass of the distribution, if each observation is weighted by the measurement value.
Even if \( 50\% \) of values lie above and below the median, the weights of the observations can move the mean away from the median.

Differences in mean, median and mode example

11 football players from the Seattle Seahawks were randomly sampled for their weight in pounds.
The samples are \( 189, 254, 235, 225, 190, 305, 195, 202, 190, 252, 305 \).
Discuss with a neighbor: what are the mean, median and mode of this data? Does the data appear to be normally distributed? Why?
Here the mean is given by \[ \frac{189+254+235+225+190+305+195+202+190+252+305}{11} \approx 231.09. \]
The ordered data is given by \( 189, 190, 190, 195, 202, 225, 235, 252, 254, 305, 305 \).
The number of samples is odd, so the middle value can be identified as \( 225 \).
The data also has two modes, \( 190 \) and \( 305 \).
Overall, the data appears to be non-normal, as there are many values around \( 190 \), with a long tail into the upper values.
- Moreover, the probability of extreme values (weight over \( 300 \)) is relatively high.
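The three statistics for the player weights can be reproduced with a short Python sketch using the standard `statistics` module. The variable names are illustrative assumptions.

```python
import statistics

weights = [189, 254, 235, 225, 190, 305, 195, 202, 190, 252, 305]

mean_weight = statistics.mean(weights)
median_weight = statistics.median(weights)
# multimode returns every value that attains the highest frequency.
mode_weights = sorted(statistics.multimode(weights))

print(round(mean_weight, 2), median_weight, mode_weights)
```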

Differences in mean, median and mode example

Histogram of the player weights in example.

If we make a histogram of the data, we see indeed, there is non-normal structure.
Even though the mean and the median are close, having multiple modes is a strongly non-normal feature of data.
Examining each of the values together, along with the visual plot, tells us a lot about the data.
This also shows how each of the descriptions of center can be flawed and / or give a different picture.

Here, the modes aren’t really “central” values in some sense, especially the upper mode.

The median and mean give similar values, but the mean is more sensitive to the large outliers.

Therefore, the mean is larger than the median in this example.

When are mean, median and mode useful

Discuss with a neighbor: for each of the following, identify a major reason why the mean and median are not meaningful statistics.
- The zip codes of the White House, Air Force division of the Pentagon, Empire State Building, and Statue of Liberty: 20500, 20330, 10118, 10004.
- A: Zip codes are just category labels that have no actual quantitative meaning. For example, taking the mean of red, green and blue has no mathematical meaning in the same way.
- Rank (by sales) of selected statistics textbooks: 1, 4, 5, 3, 2, 15.
- A: although these are ordered values, so comparisons between them make sense, arithmetic operations are not useful. For example, it doesn't make any sense to say that the mean of the first-place textbook and the fifth-place textbook is a third-place textbook in sales.
- The best-selling textbook may sell many more copies than the second-place one, so the rankings don't correspond to a physical quantity of anything.

Midrange

As a final measure of center, we can consider what is the mid-point between the maximum observation and the minimum observation.
Midrange – suppose we have samples \( x_1,\cdots, x_n \) and

\[ \begin{align} x_\text{max} = \max_i(x_i) & & x_\text{min} = \min_i(x_i) \end{align} \]
Then, the midrange is computed as

\[ \text{midrange} = \frac{x_\text{max} + x_\text{min} }{2} \]
Discuss with your neighbor: can you give an example of when the midrange does not equal the median?
- A: a simple example is where we have data \( 0,0,0,0,100 \).
- In this case, the median is \( 0 \), while the midrange is \( \frac{100 + 0 }{2} = 50 \).
As we can see, midrange is extremely sensitive to outliers, both small and large.
Midrange is not used as often as the other measures in practice, but it can give a more complete picture of the data when used with the other measures.
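A minimal sketch of the midrange formula, applied to the example data \( 0,0,0,0,100 \) from above. The function name `midrange` is an illustrative assumption.

```python
import statistics


def midrange(samples):
    """Midpoint between the largest and smallest observations."""
    return (max(samples) + min(samples)) / 2


data = [0, 0, 0, 0, 100]
data_midrange = midrange(data)
data_median = statistics.median(data)
print(data_midrange, data_median)
```

A single extreme value moves the midrange to \( 50 \) while the median stays at \( 0 \).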

Round off rules

In many cases, we will need to use approximations to calculate these values;

usually some kind of rounding will be used for computing, e.g., the mean.

Some key considerations are the following:

Never round any values until the final step. We only want to use an approximation in the final computation estimating the statistic.

For example, with data \( 2.5, 3.7, 4.8 \) we should compute the mean as \[ \frac{2.5 + 3.7 + 4.8}{3} = \frac{11}{3} = 3 + \frac{2}{3} \approx 3.67. \]
If we round any values before the final step, such as \[ \frac{2.5 + 3.7 + 4.8}{3} \approx \frac{3 + 4 + 5}{3} = \frac{12}{3} = 4, \] we can get a very bad approximation.

For the mean, median, and midrange, carry one more decimal place than is present in the original set of values.
For the mode, keep the value as is without any kind of rounding.
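The effect of rounding too early can be checked numerically. The sketch below uses the values \( 2.5, 3.7, 4.8 \) that appear in the computation above. One assumption to flag: Python's built-in `round` rounds halves to the nearest even integer (so `round(2.5)` gives `2`), so the helper `round_half_up` is introduced here, as an illustrative assumption, to mimic the usual classroom rule.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with halves rounded up (classroom rule)."""
    return math.floor(value + 0.5)


data = [2.5, 3.7, 4.8]

# Correct: keep full precision and round only the final result,
# carrying one more decimal place than the original data.
round_last = round(sum(data) / len(data), 2)

# Incorrect: round every value first, then compute the mean.
round_first = sum(round_half_up(v) for v in data) / len(data)

print(round_last, round_first)
```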

A common mistake in computing means

There are additional ways in which we must be careful in how we calculate and interpret these statistics.
Let’s suppose that California has \( 6,421,880 \) students enrolled in primary and secondary school, with \( 307,267 \) total teachers at this level.

This leaves a mean of \( \frac{6,421,880}{307,267} \approx 20.9 \) students per teacher in California.

Let’s suppose that Alaska has \( 117,549 \) students enrolled in primary and secondary school, with \( 6,998 \) total teachers at this level.

This leaves a mean of \( \frac{117,549}{6,998}\approx 16.8 \) students per teacher in Alaska.

Suppose we want to compute the mean number of students per teacher for California and Alaska combined;

this should look like \[ \frac{\left(\text{Total number of students in California} \right)+ \left(\text{Total number of students in Alaska}\right)}{\left(\text{Total number of teachers in California}\right) + \left(\text{ Total number of teachers in Alaska}\right) }. \]

Discuss with a neighbor: can you obtain this same result by using the formula

\[ \frac{\text{mean of 20.9 students per teacher} + \text{mean of 16.8 students per teacher}}{2}? \] Why or why not?

A: No. There is a combined total of \( 6,539,429 \) students and \( 307,267 + 6,998 = 314,265 \) teachers, so the student/teacher ratio for California and Alaska combined is \( \frac{6,539,429}{314,265} \approx 20.8 \), not the value of \( 18.85 \) given by averaging the two state means.

A common mistake in computing means continued

The issue with this formula, \[ \frac{\text{mean of 20.9 students per teacher} + \text{mean of 16.8 students per teacher}}{2} \] lies in the fact that California is much more populous than Alaska, with more students and teachers by far;
by computing the mean from each of the individual state-means, we lose the information on how many students and teachers are in each state.
Whenever we compute a “mean of means” we need to take into account the size of the respective populations which we draw from.
We will discuss this further in the context of a weighted mean.
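The difference between the naive "mean of means" and the combined ratio can be sketched with the enrollment figures quoted above. The variable names are illustrative assumptions, and the unrounded state ratios are used rather than \( 20.9 \) and \( 16.8 \).

```python
ca_students, ca_teachers = 6_421_880, 307_267
ak_students, ak_teachers = 117_549, 6_998

ca_ratio = ca_students / ca_teachers
ak_ratio = ak_students / ak_teachers

# Naive approach: average the two state ratios, ignoring state size.
naive = (ca_ratio + ak_ratio) / 2

# Correct approach: total students divided by total teachers.
combined = (ca_students + ak_students) / (ca_teachers + ak_teachers)

print(round(naive, 1), round(combined, 1))
```

The naive value sits halfway between the two state ratios, while the combined value stays close to California's ratio because California contributes most of the students and teachers.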

Weighted means

Let us suppose that we have samples \( x_1, x_2, \cdots, x_n \).
Suppose each sample is given a corresponding weight \( w_i \) so that there are pairs of the form \[ \begin{matrix} x_1 & w_1\\ x_2 & w_2 \\ \vdots & \vdots \\ x_n & w_n \end{matrix} \]
We compute a weighted mean using the following formula,

\[ \frac{\sum_{i=1}^n x_i \times w_i}{\sum_{i=1}^n w_i} = \frac{x_1 \times w_1 + x_2 \times w_2 + \cdots + x_n \times w_n}{w_1 + w_2 + \cdots + w_n} \]
- i.e., this is the sum of the samples times the weights, divided by the sum of the weights.
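A minimal sketch of the weighted mean formula in Python. The function name `weighted_mean`, the input checks, and the sample values are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Sum of values times weights, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical data: equal weights reduce to the ordinary mean,
# while a larger weight pulls the result toward its value.
equal = weighted_mean([1.0, 2.0, 3.0], [1, 1, 1])
skewed = weighted_mean([1.0, 2.0, 3.0], [1, 1, 2])
print(equal, skewed)
```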

Weighted means example

Let's suppose that we want to compute the grade point average (GPA) for some student.
We will suppose that the student gets letter grades as follows: \( A, B, C, A, B \).
The letters are given point values as \( A=4.0, B=3.0, C=2.0, D=1.0 \).
The GPA is computed as a weighted mean of the grade points, weighted by the number of credits for the class.

\[ \begin{matrix} A & 3 \text{ credits} \\ B & 2 \text{ credits} \\ C & 2 \text{ credits} \\ A & 1 \text{ credit} \\ B & 3 \text{ credits} \end{matrix} \]
Discuss with a neighbor: how do we compute the weighted mean in this case? What is the GPA?
- The weighted mean (GPA) is computed as,
\[ \frac{4.0 \times 3 + 3.0 \times 2 + 2.0 \times 2 + 4.0 \times 1 + 3.0 \times 3}{3 + 2 + 2 + 1 + 3}= \frac{35}{11} \approx 3.18 \]
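The same GPA computation can be sketched in Python, using the grades and credits from the table above. The variable names are illustrative assumptions.

```python
# Point value of each letter grade, as given above.
points = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

# (letter grade, credits) for each class.
courses = [("A", 3), ("B", 2), ("C", 2), ("A", 1), ("B", 3)]

total_points = sum(points[grade] * credits for grade, credits in courses)
total_credits = sum(credits for _, credits in courses)
gpa = total_points / total_credits

print(total_points, total_credits, round(gpa, 2))
```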

Weighted means example

If we return to the question of how to compute the mean of means, we can consider how we can weight each value proportionally to the population from which it is drawn.
We can instead write a formula,

\[ \frac{\text{mean of 20.9 students per teacher}\times w_1 + \text{mean of 16.8 students per teacher} \times w_2}{w_1 + w_2}. \]
The units in the above equation are \( \frac{\text{students in state X}}{\text{teachers in state X}} \).
By understanding the units in this example, we can derive the proper weights – we want to end in the units \( \frac{\text{total of all students}}{\text{total of all teachers}} \).
If we choose weights in the units \( \frac{\text{teachers in state X}}{\text{total of all teachers}} \), we obtain:

\[ \begin{align} \frac{20.9 \frac{\text{Students in C}}{\text{Teachers in C}} \times \frac{307,267}{ 314,265} \frac{\text{ Teachers in C}}{ \text{ Total all teachers}} + 16.8 \frac{\text{Students in A}}{\text{Teachers in A}} \times \frac{6,998}{ 314,265 } \frac{\text{ Teachers in A}}{ \text{ Total all teachers}}}{ \frac{307,267}{ 314,265} \frac{\text{ Teachers in C}}{ \text{ Total all teachers}} + \frac{6,998}{ 314,265 } \frac{\text{ Teachers in A}}{ \text{ Total all teachers}}}\end{align} \]
\[ \begin{align} =\frac{20.9 \times \frac{307,267}{ 314,265} \frac{\text{Students in C}}{ \text{ Total all teachers}} + 16.8 \times \frac{6,998}{ 314,265 } \frac{\text{Students in A} }{ \text{ Total all teachers}}}{ \frac{\text{314,265 Total all teachers}}{ \text{314,265 Total all teachers}}} \end{align} \]
\[ \begin{align} =\frac{20.9 \times 307,267 \text{ Students in C} + 16.8 \times 6,998 \text{ Students in A} }{\text{ 314,265 Total all teachers}} \end{align} \]

Weighted means example

From the last slide, using the analysis of the units, we arrived at the formula which matches the formula for the combined state average

\[ \frac{20.9 \times 307,267 \text{ Students in C} + 16.8 \times 6,998 \text{ Students in A} }{\text{ 314,265 Total all teachers}}\approx 20.8 \]
In this case, the properly chosen weights were given by

\[ \begin{align} w_1 &= \frac{307,267}{ 314,265} \frac{\text{ Teachers in C}}{ \text{ Total all teachers}}\\ w_2 &= \frac{6,998}{ 314,265 } \frac{\text{ Teachers in A}}{ \text{ Total all teachers}} \end{align} \]
That is, we were able to find the true combined state mean by finding weights proportional to the sub-populations.
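As a numerical check of this conclusion, the sketch below weights each state's ratio by its number of teachers and compares the result with the combined ratio computed from the totals. The variable names are illustrative assumptions. Dividing both weights by the total number of teachers, as in the derivation above, rescales them without changing the weighted mean.

```python
ca_students, ca_teachers = 6_421_880, 307_267
ak_students, ak_teachers = 117_549, 6_998

# Students-per-teacher ratio in each state (unrounded).
ratios = [ca_students / ca_teachers, ak_students / ak_teachers]

# Weights proportional to the number of teachers in each state.
weights = [ca_teachers, ak_teachers]

weighted = sum(r * w for r, w in zip(ratios, weights)) / sum(weights)
combined = (ca_students + ak_students) / (ca_teachers + ak_teachers)

print(round(weighted, 1), round(combined, 1))
```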

Calculating the mean from a frequency distribution

IQ scores table, including the mean estimate computed from the table.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Computing the mean from a frequency distribution is very similar to computing a weighted mean.

The biggest difference lies in the fact that we do not know the true values of the measurements.

Instead, we will approximate the true values with the class midpoint.
We then weight the class midpoint by the frequency of observations in the class.

The formula to compute the mean of a frequency distribution is as follows: \[ \text{Mean of a frequency distribution} = \frac{\sum_\text{Classes} \text{(Class midpoint)} \times \text{(Class frequency)}}{\sum_\text{Classes} \text{(Class frequency)}}. \]
In the above table, we will denote \( x=\text{Class midpoint} \) and \( f=\text{Class frequency} \).
The formula thus becomes \[ \text{Mean of a frequency distribution} = \frac{\sum_\text{Classes} x \times f}{\sum_\text{Classes} f}. \]

Calculating the mean from a frequency distribution continued

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

In the table to the left, we can thus compute the mean of the frequency distribution as follows.

For each class, we multiply the frequency \( f \) in the second column with the class midpoint \( x \) in the third column to get the fourth column value.
The total at the bottom row of the second column is the sum of the weights \( \sum f \).
The total at the bottom row of the fourth column is the sum of the weighted class midpoints \( \sum (f \times x) \).

The formula thus becomes \[ \text{Mean of a frequency distribution} = \frac{\sum_\text{Classes} x \times f}{\sum_\text{Classes} f} = \frac{7201.0}{78} \approx 92.32 \]
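The column-by-column procedure can be sketched in Python. The table itself is an image and is not reproduced in the text, so the class limits and frequencies below are assumed values, chosen so that they reproduce the totals \( \sum f = 78 \) and \( \sum (f \times x) = 7201.0 \) quoted above; they should not be read as the actual table.

```python
# ASSUMED table: ((lower class limit, upper class limit), frequency f).
# These values are illustrative and only match the totals quoted in the text.
classes = [
    ((50, 69), 2),
    ((70, 89), 33),
    ((90, 109), 35),
    ((110, 129), 7),
    ((130, 149), 1),
]

sum_f = 0      # total of the frequency column
sum_fx = 0.0   # total of the (frequency x midpoint) column
for (lower, upper), f in classes:
    midpoint = (lower + upper) / 2  # class midpoint x
    sum_f += f
    sum_fx += f * midpoint

frequency_mean = sum_fx / sum_f
print(sum_f, sum_fx, round(frequency_mean, 2))
```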

Basic concepts of variation

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

In the figure to the left, we examine the frequency of wait times in seconds at a bank to see a teller.
On the top, the customers were fed into a single line to wait for an open teller among multiple tellers;
on the bottom, customers are fed into one of multiple lines to see an open teller at the front of the line.
We note that both frequency plots have the same mean, median and mode of 100 seconds.
If we only characterize data in terms of the center, we don’t have a very complete picture; indeed, we can’t distinguish the two scenarios by these statistics.
Particularly, the outcomes with multiple lines have much more variation than the outcomes with a single line.

In this example, the bank actually chose to combine the multiple lines into a single line.
Note: this does not actually reduce the mean waiting time for customers.
However, customers were happier with this option because the variability in waiting time was reduced.
In practice, our statistics for variation / spread / dispersion of the data can be the most important statistics.
We will study range, standard deviation and variance to quantify this.
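As a preview, the sketch below compares two small data sets that share the same mean, median and mode but differ in spread. The wait times are hypothetical values invented for illustration, not the bank data from the figure, and the sample standard deviation is computed with the standard library's `statistics.stdev`.

```python
import statistics

# HYPOTHETICAL wait times in seconds; both sets are centered at 100.
single_line = [90, 95, 100, 100, 105, 110]
multiple_lines = [40, 70, 100, 100, 130, 160]


def data_range(samples):
    """Difference between the largest and smallest observations."""
    return max(samples) - min(samples)


for name, data in [("single line", single_line), ("multiple lines", multiple_lines)]:
    print(
        name,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
        "range:", data_range(data),
        "std dev:", round(statistics.stdev(data), 2),
    )
```

The measures of center agree exactly, while the range and standard deviation differ by a factor of six.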