Measures of center and measures of spread

02/06/2020

Outline

  • The following topics will be covered in this lecture:
    • Mean
    • Median
    • Mode
    • Midrange
    • Computing the mean from frequency distributions
    • Weighted means
    • Basic concepts of variation

Characteristics of data

Diagram of the percent of outcomes contained within each standard deviation of the mean for a standard normal distribution.

Courtesy of M. W. Toews CC via Wikimedia Commons.

  • Recall that we characterize data by several of the features that it exhibits.
  • Some of the key measures are:
    1. Center: A representative value that indicates where the middle of the data set is located.
    2. Variation: A measure of the amount that the data values vary.
    3. Distribution: The nature or shape of the spread of the data over the range of values (such as bell-shaped).
    4. Outliers: Sample values that lie very far away from the vast majority of the other sample values.
    5. Time: Any change in the characteristics of the data over time.
  • We will now begin studying measures of center.
  • There are several main measures of center of a data set:
    1. mean;
    2. median;
    3. mode; and
    4. midrange.
  • Each of these usually gives a different view of where the “most central point” of the data lies.

Mean

  • The (arithmetic sample) mean is usually the most important measure of center.

  • Suppose we have \( n \) total sample measurements of some variable \( x \).

    • We will denote these samples \( x_1, x_2, \cdots, x_n \).
  • Then, the (arithmetic sample) mean is defined as

    \[ \text{Sample mean} = \frac{x_1 +x_2 +\cdots + x_n}{n}= \frac{\sum_{i=1}^n x_i}{n} \]

  • Discuss with a neighbor: is the sample mean a statistic or a parameter?

    • A: the sample mean is computed from samples and is thus a statistic.
    • For this reason, if we took new measurements from a new sample of the population, we could get a different value.
    • The random difference between the sample mean and the true population mean is called sampling error.
  • An important property of the sample mean is that it tends to vary less over re-sampling than other measures of center.

    • That is, it tends to stay close to the same value.
  • However, the sample mean is very sensitive to outliers.

    • If outliers exist in the data, the mean can be drawn far away from the “main” cluster of data.
  • A statistic is called resistant if the presence of outliers in the data does not change its value very much.
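
  • The definition of the sample mean above can be sketched in Python. This is only a minimal illustration: the function name sample_mean and the data values are hypothetical and are not part of the lecture.

```python
def sample_mean(samples):
    """Return the arithmetic sample mean (x_1 + ... + x_n) / n."""
    n = len(samples)
    if n == 0:
        # The mean is undefined when there are no measurements.
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(samples) / n


# Hypothetical measurements, chosen only for illustration.
samples = [2, 4, 6, 8]
mean_value = sample_mean(samples)
print(mean_value)  # 5.0
```

  • Python's standard library also provides statistics.mean, which could be used in place of this hand-written function.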

Median

  • A different notion of center is the middle of the data.
  • For a numerical measurement, we can always order the data so that we go from low to high or high to low.
  • Median – the median is the middle of the ordered data set.
    • If there are an odd number of samples, the median is defined as the middle value exactly.
    • If there are an even number of samples, we split the data into the lower \( 50\% \) and upper \( 50\% \) of the samples;
    • then we take the median to be the mean of the:
      1. largest of the lower \( 50\% \); and
      2. smallest of the upper \( 50\% \).
  • Suppose we are given a list of the following samples \( 22, 22, 26, 24, 23 \).
    • Discuss with a neighbor: what is the median of this list of samples?
    • Ordering the values, we get \( 22, 22, 23, 24, 26 \) so that the middle value is obviously \( 23 \).
  • Suppose a new sample includes \( 22, 22, 26, 24, 23, 27 \).
    • Discuss with a neighbor: what is the median of this list of samples?
    • In this case, we have an even number of samples.
    • The lower \( 50\% \) is given by \( 22,22,23 \) and the upper \( 50\% \) is given by \( 24,26,27 \).
    • Therefore, the mean of the largest lower value and the smallest upper value is given by \[ \frac{23 + 24}{2} = 23.5. \]
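
  • The two cases above (odd and even numbers of samples) can be sketched in Python as follows. The function name sample_median is a hypothetical helper introduced only for this illustration; the data are the two lists from the examples above.

```python
def sample_median(samples):
    """Return the median: the middle of the ordered data set."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty sample")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of samples: the middle value exactly.
        return ordered[mid]
    # Even number of samples: mean of the largest value in the lower half
    # and the smallest value in the upper half.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = sample_median([22, 22, 26, 24, 23])
even_median = sample_median([22, 22, 26, 24, 23, 27])
print(odd_median)   # 23
print(even_median)  # 23.5
```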

Median continued

  • Let us consider the last example once again.

  • Suppose our sample includes the values \( 22, 22, 26, 24, 23, 27 \).

  • If we compute the (arithmetic sample) mean, we find

    \[ \frac{22+22+26+24+23+27}{6} = \frac{144}{6} = 24. \]

  • Now suppose we realize that the value \( 27 \) was obtained due to measurement error, and that our sample should have read \( 22, 22, 26, 24, 23, 1000 \).

  • Discuss with a neighbor: does replacing the value \( 27 \) with \( 1000 \) affect the median? Does it affect the mean? Which of these statistics is resistant to outliers?

    • We note, this does not affect the median – indeed the actual numerical value of the final measurement does not change which value lies in the middle.
    • The lower \( 50\% \) of the measurements are given by \( 22,22,23 \) and the upper \( 50\% \) are given by \( 24,26,1000 \).
    • Once again, we compute the mean of the largest lower value and the smallest upper value, given by \( \frac{23 + 24}{2} = 23.5 \), so that we say the median is resistant to outliers.
    • On the other hand, the sample mean is given as \[ \frac{22+22+26+24+23+1000}{6} = \frac{1117}{6} \approx 186.1667, \] which is far from the earlier value of \( 24 \), so the sample mean is not resistant to outliers.
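
  • The comparison above can be checked with a short Python sketch that uses the standard library's statistics module. The variable names are hypothetical and chosen only for this illustration.

```python
from statistics import mean, median

original = [22, 22, 26, 24, 23, 27]
with_outlier = [22, 22, 26, 24, 23, 1000]

# How far each statistic moves when 27 is replaced by 1000.
mean_shift = mean(with_outlier) - mean(original)
median_shift = median(with_outlier) - median(original)

print(round(mean(with_outlier), 4))  # 186.1667
print(median(with_outlier))          # 23.5
print(median_shift)                  # 0.0
```

  • The median does not move at all, while the mean moves by roughly 162.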

Mode

  • Another notion of the most “central point” in the data can be the value that is measured most frequently.

  • Mode – the mode is the observed value that is most frequent in the data.

  • Consider the last example with samples of \( 22, 22, 26, 24, 23, 27 \). Q: What is the mode?

    • In this case, we sampled \( 22 \) more than any other value, so this is the mode of the data.
  • When two values tie for the highest frequency, we call the data bimodal; when more than two values tie, we call the data multimodal.

    • An exception to this rule is when no values are repeated.
    • In this case, we say there is no mode to the data.
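
  • A minimal Python sketch of these conventions follows. The function name sample_modes is hypothetical, and returning an empty list for "no mode" is an assumption made for this illustration. The second and third test lists are also made up.

```python
from collections import Counter


def sample_modes(samples):
    """Return a sorted list of the most frequent values.

    Following the convention above, an empty list means "no mode",
    which happens when no value is repeated.
    """
    counts = Counter(samples)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # No value is repeated, so the data has no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(sample_modes([22, 22, 26, 24, 23, 27]))  # [22]
print(sample_modes([1, 1, 2, 2, 3]))           # [1, 2]
print(sample_modes([1, 2, 3]))                 # []
```

  • Note that the standard library function statistics.multimode follows a different convention: when no value is repeated, it returns every value rather than reporting no mode.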

Differences in mean, median and mode