02/11/2020
Courtesy of M. W. Toews CC via Wikimedia Commons.
The simplest measure of variation is the range, which measures the width of the data values.
Suppose we have samples \( x_1,\cdots, x_n \) and
\[ \begin{align} x_\text{max} = \max_i(x_i) & & x_\text{min} = \min_i(x_i) \end{align} \]
Then the range is computed as
\[ \text{Range} = x_\text{max} - x_\text{min} \]
For example, if our samples are \( 22, 22, 26, \) and \( 24 \) then
\[ \text{Range} = 26 - 22 = 4.0 \]
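The range calculation above can be sketched in a few lines of Python. This is only an illustration; the function name `sample_range` is a hypothetical helper and not part of the course material.

```python
def sample_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)


samples = [22, 22, 26, 24]
print(sample_range(samples))  # 4
```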
Discuss with a neighbor: is the range resistant to outliers? Why or why not?
For example, suppose we have the sample values \( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1000 \). The range will be 1000 due to the outlier, even though the other nine values are identical, so the range is not resistant to outliers.
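A short sketch of how a single extreme value drives the range, using the ten sample values above. The variable names are illustrative only.

```python
values = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1000]
without_outlier = values[:-1]  # drop the final value, 1000

# The range depends only on the two most extreme values
range_all = max(values) - min(values)
range_trimmed = max(without_outlier) - min(without_outlier)

print(range_all, range_trimmed)  # 1000 0
```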
Note that we could have chosen a different way to measure the variation than the standard deviation.
If we want to measure the total deviation of the samples from the sample mean, we could instead sum the absolute deviations:
\[ \sum_{i=1}^n \vert x_i - \overline{x}\vert \]
We could then divide this by the total number of observations, which gives
\[ \text{Mean absolute deviation} = \frac{\sum_{i=1}^n \vert x_i - \overline{x}\vert}{n} \]
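As a minimal sketch, the mean absolute deviation formula above can be evaluated directly. The function name `mean_absolute_deviation` and the choice of sample values \( 22, 22, 26, 24 \) are assumptions made for illustration.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


mad = mean_absolute_deviation([22, 22, 26, 24])
print(mad)  # 1.5
```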
This is a possible choice for a similar measure of the variation, but the main issue is that the absolute value is not an “algebraic operation”.
Calculations or inferences based on the formula above therefore become very difficult, and few tools work well with this statistic.
For this reason, using the square as in the sample standard deviation
\[ s = \sqrt{\frac{\sum_{i=1}^n\left(x_i - \overline{x}\right)^2}{n-1}} \]
we get a similar result, but one that is mathematically easier to manipulate.
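One illustration of why the squared form is easier to manipulate (offered here as an example, not as a result used elsewhere in these notes) is that the sum of squared deviations expands with ordinary algebra. Since \( \sum_{i=1}^n x_i = n\overline{x} \),

\[ \begin{align} \sum_{i=1}^n\left(x_i - \overline{x}\right)^2 &= \sum_{i=1}^n x_i^2 - 2\overline{x}\sum_{i=1}^n x_i + n\overline{x}^2 \\ &= \sum_{i=1}^n x_i^2 - n\overline{x}^2 \end{align} \]

No comparable simplification is available for \( \sum_{i=1}^n \vert x_i - \overline{x}\vert \).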
We will not focus on calculating the sample standard deviation manually in this course.
Suppose we have the samples \( 22, 22, 26 \) and \( 24 \).
We wish to compute how much the data deviate from the sample mean, so we begin by computing the sample mean:
\[ \overline{x} = \frac{22 + 22 + 26 + 24 }{4} = \frac{94}{4}=23.5 \]
We now compute the raw deviation of each sample from the sample mean:
\[ \begin{align} x_1 - \overline{x} &= 22 - 23.5 = -1.5\\ x_2 - \overline{x} &= 22 - 23.5 = -1.5\\ x_3 - \overline{x} &= 26 - 23.5 = 2.5\\ x_4 - \overline{x} &= 24 - 23.5 = 0.5 \end{align} \]
Squaring each value, we obtain \( 2.25, 2.25, 6.25, 0.25 \), so that
\[ s = \sqrt{\frac{\sum_{i=1}^4 \left(x_i - \overline{x}\right)^2}{3}} = \sqrt{\frac{11}{3}}\approx 1.9 \]
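The same steps can be sketched in Python, assuming the standard library `math` and `statistics` modules are available. The final line checks the manual result against `statistics.stdev`, which also divides by \( n - 1 \).

```python
import math
import statistics

samples = [22, 22, 26, 24]
n = len(samples)

mean = sum(samples) / n                    # 23.5
deviations = [x - mean for x in samples]   # [-1.5, -1.5, 2.5, 0.5]
squared = [d ** 2 for d in deviations]     # [2.25, 2.25, 6.25, 0.25]
s = math.sqrt(sum(squared) / (n - 1))      # sqrt(11 / 3)

print(round(s, 1))  # 1.9
# The library routine uses the same n - 1 divisor
print(math.isclose(s, statistics.stdev(samples)))  # True
```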
This shows how the sample standard deviation can be computed, but we will want a few ways to interpret the value.
Courtesy of Mario Triola, Essentials of Statistics, 6th edition
As a very rough estimate, we can approximate the sample standard deviation with the range rule.
The only time we should consider using this approximation is when we have no computer or calculator on hand, and need a quick “back-of-an-envelope” calculation.
The range rule of thumb for estimating the standard deviation is given as
\[ s \approx \frac{\text{Range}}{4} \]
Suppose we have the samples \( 22, 22, 26 \) and \( 24 \) once again.
The sample standard deviation of the data is \( s \approx 1.9 \).
Discuss with a neighbor: what is the range rule of thumb estimate for the sample standard deviation? Is this very accurate in this case?
The range rule of thumb gives \( \frac{26 - 22}{4} = \frac{4}{4}=1 \), which is only about half of the computed value \( s \approx 1.9 \), so it is not very accurate in this case.
This shows that we should only consider this as a very loose approximation, and in practice we should compute the sample standard deviation directly whenever possible.
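A brief sketch comparing the range rule of thumb with the directly computed value for the same four samples. The helper name `range_rule_estimate` is hypothetical.

```python
import statistics


def range_rule_estimate(values):
    """Rough estimate of the standard deviation: range divided by 4."""
    return (max(values) - min(values)) / 4


samples = [22, 22, 26, 24]
estimate = range_rule_estimate(samples)
actual = statistics.stdev(samples)  # sample standard deviation (n - 1 divisor)

print(estimate, round(actual, 1))  # 1.0 1.9
```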
The amount of variation in data is commonly described as the dispersion or spread of the data.
The word variance also has a specific meaning in statistics and is another tool for describing the variation / dispersion / spread of the data.
Suppose that the data has a population standard deviation of \( \sigma \) and a sample standard deviation of \( s \).
Then, the data has a population variance of \( \sigma^2 \).
Likewise, the data has a sample variance of \( s^2 \).
Therefore, for either a population parameter or a sample statistic, the variance is the square of the standard deviation.
For example, when measuring the heights of students in inches, the standard deviation is in units of inches, while the variance is in units of square inches.
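The relationship between the sample standard deviation and the sample variance can be checked numerically. This sketch reuses the sample values \( 22, 22, 26, 24 \) from the earlier example as an assumption and relies on `statistics.stdev` and `statistics.variance` from the Python standard library.

```python
import math
import statistics

samples = [22, 22, 26, 24]

s = statistics.stdev(samples)             # sample standard deviation
s_squared = statistics.variance(samples)  # sample variance, 11 / 3

print(round(s_squared, 2))  # 3.67
# The variance is the square of the standard deviation
print(math.isclose(s ** 2, s_squared))  # True
```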
Courtesy of Melikamp CC via Wikimedia Commons
The proportion (or fraction) of any set of data lying within \( K \) standard deviations of the mean is always at least \( 1-\frac{1}{K^2} \), where \( K>1 \).
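This bound is commonly known as Chebyshev's theorem. The sketch below compares it with the observed proportion for one data set. The helper names are hypothetical, the data are illustrative, and the sketch assumes the population standard deviation (`statistics.pstdev`) is the one being used.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only stated for k > 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)


values = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1000]
print(chebyshev_lower_bound(2))      # 0.75
print(proportion_within(values, 2))  # 0.9
```

Even with an extreme outlier, the observed proportion (0.9) is no smaller than the guaranteed minimum (0.75).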
Courtesy of Mario Triola, Essentials of Statistics, 6th edition
Coefficient of variation: the coefficient of variation (or CV) for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean.
The CV thus puts all standard deviations on a relative scale and in percentage units so that they are all comparable.
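Written as formulas, with \( \mu \) denoting the population mean (a symbol assumed here, following the usual convention), the definition reads

\[ \begin{align} \text{Sample: } CV = \frac{s}{\overline{x}} \times 100\% & & \text{Population: } CV = \frac{\sigma}{\mu} \times 100\% \end{align} \]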
When you compare populations of a similar size with similar means, it is preferable to look at the standard deviation directly as you keep the original units of the data.
However, the coefficient of variation will work effectively in any case.
Note: we will typically round the coefficient of variation to one decimal place.
Listed below are amounts (in millions of dollars) collected from parking meters by Brinks and others in New York City.
\[ \begin{matrix} \text{Collection contractor was Brinks:} & 1.3 & 1.5& 1.3& 1.5& 1.4& 1.7& 1.8& 1.7& 1.7& 1.6\\ \text{Collection contractor was not Brinks:}& 2.2 & 1.9& 1.5& 1.6& 1.5& 1.7& 1.9& 1.6& 1.6& 1.8 \end{matrix} \]
A larger data set was used to convict five Brinks employees of grand larceny.
The data were provided by the attorney for New York City, and they are listed on the DASL Web site.
The sample means and sample standard deviations are given by
\[ \begin{matrix} \overline{x}_\text{Brinks} =1.55 & &s_\text{Brinks} = 0.178 \\ \overline{x}_\text{not Brinks} =1.73 & & s_\text{not Brinks} = 0.221 \end{matrix} \]
Discuss with a neighbor: what is the coefficient of variation for the two sample data sets? Does the data listed here show evidence of stealing by Brinks employees?
The coefficients of variation are given as
\[ \begin{matrix} CV_\text{Brinks} = \frac{ 0.178}{1.55}\times 100\% \approx 11.5\% \\ CV_\text{not Brinks} =\frac{0.221}{1.73} \times 100\% \approx 12.8\% \end{matrix} \]
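These values can be reproduced from the listed data with a short Python sketch. The function name `coefficient_of_variation` is a hypothetical helper, and the sketch assumes the sample standard deviation (`statistics.stdev`) is the one intended.

```python
import statistics

# Amounts collected, in millions of dollars
brinks = [1.3, 1.5, 1.3, 1.5, 1.4, 1.7, 1.8, 1.7, 1.7, 1.6]
not_brinks = [2.2, 1.9, 1.5, 1.6, 1.5, 1.7, 1.9, 1.6, 1.6, 1.8]


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, as a percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100


print(round(coefficient_of_variation(brinks), 1))      # 11.5
print(round(coefficient_of_variation(not_brinks), 1))  # 12.8
```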
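Both data sets are on the same scale of millions of dollars, and their coefficients of variation differ by only about \( 1.3 \) percentage points, so the relative variation in the two sets of collections is similar. The coefficient of variation is unitless, so this difference is not a dollar amount. What does stand out is the sample mean, which is lower for Brinks (\( 1.55 \) versus \( 1.73 \) million dollars), suggesting that something may be amiss.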