03/01/2021
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
The following topics will be covered in this lecture:
Our goal in this course is to use statistics from a small, representative sample to say something general about the larger, unobservable population or phenomenon.
Recall that measures of the population are what we refer to as parameters.
Parameters are generally unknown and unknowable.
Random variables and probability distributions give us the model for estimating population parameters.
Note: we can only “find” the parameters exactly in very simple examples like games of chance.
Generally, we will have to be satisfied with estimates of the parameters that are uncertain, but also include measures of “how uncertain”.
Courtesy of M. W. Toews CC via Wikimedia Commons.
The (arithmetic sample) mean is usually the most important measure of center.
Suppose we have a sample of \( n \) total measurements of some random variable \( X \).
Then, the (arithmetic sample) mean is defined
\[ \text{Sample mean} = \overline{x} = \frac{x_1 +x_2 +\cdots + x_n}{n}= \frac{\sum_{i=1}^n x_i}{n} \]
Q: is the sample mean a statistic or a parameter?
An important property of the sample mean is that it tends to vary less over re-sampling than other statistics.
However, the sample mean is very sensitive to outliers.
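To make the outlier sensitivity concrete, here is a minimal Python sketch of the sample mean formula above (the sample values are hypothetical):

```python
def sample_mean(xs):
    """Arithmetic sample mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

data = [10, 12, 11, 13, 9]        # hypothetical sample
print(sample_mean(data))          # 11.0
print(sample_mean(data + [100]))  # a single outlier pulls the mean up sharply
```

Appending the single outlier 100 moves the mean from 11.0 to roughly 25.8, illustrating why the mean is not a resistant statistic.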
A statistic is called resistant if it doesn't change very much with respect to outlier data.
Courtesy of Diva Jain CC via Wikimedia Commons.
Let us suppose that we have a sample \( x_1, x_2, \cdots, x_n \).
Suppose each measurement is given a corresponding weight \( w_i \) so that there are pairs of the form \[ \begin{matrix} x_1 & w_1\\ x_2 & w_2 \\ \vdots & \vdots \\ x_n & w_n \end{matrix} \]
We compute a weighted mean using the following formula,
\[ \frac{\sum_{i=1}^n x_i \times w_i}{\sum_{i=1}^n w_i} = \frac{x_1 \times w_1 + x_2 \times w_2 + \cdots + x_n \times w_n}{w_1 + w_2 + \cdots + w_n} \]
Let's suppose that we want to compute the grade point average (GPA) for some student.
We will suppose that the student gets letter grades as follows: \( A, B, C, A, B \).
The letters are given point values as \( A=4.0, B=3.0, C=2.0, D=1.0 \)
The GPA is computed as a weighted mean of the grade points, weighted by the number of credits for the class.
\[ \begin{matrix} A & 3 \text{ credits} \\ B & 2 \text{ credits} \\ C & 2 \text{ credits} \\ A & 1 \text{ credit} \\ B & 3 \text{ credits} \end{matrix} \]
Q: how do we compute the weighted mean in this case? What is the GPA?
\[ \frac{4.0 \times 3 + 3.0 \times 2 + 2.0 \times 2 + 4.0 \times 1 + 3.0 \times 3}{3 + 2 + 2 + 1 + 3}= \frac{35}{11} \approx 3.18 \]
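The same computation can be written as a short Python sketch of the weighted mean formula, using the grade points as the values and the credits as the weights:

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of value * weight divided by the sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

grade_points = [4.0, 3.0, 2.0, 4.0, 3.0]  # A, B, C, A, B
credits = [3, 2, 2, 1, 3]                 # weight for each grade

gpa = weighted_mean(grade_points, credits)
print(round(gpa, 2))  # 3.18
```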
Outcome | Observed value for \( X=x \) | Frequency |
---|---|---|
\( \{H,H\} \) | \( x=2 \) | \( n=6 \) |
\( \{H,T\} \) | \( x=1 \) | \( n=3 \) |
\( \{T,H\} \) | \( x=1 \) | \( n=7 \) |
\( \{T,T\} \) | \( x=0 \) | \( n=4 \) |
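The frequency table above can also be summarized with a weighted mean: each observed value of \( X \) is weighted by how often it occurred. A minimal Python sketch, using the frequencies from the table:

```python
values = [2, 1, 1, 0]       # observed x for each outcome
frequencies = [6, 3, 7, 4]  # number of times each outcome occurred

# Weighted mean with frequencies as weights: the average observed value of X.
mean = sum(x * n for x, n in zip(values, frequencies)) / sum(frequencies)
print(mean)  # 1.1
```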
Outcome | Observed value for \( X=x \) | Probability |
---|---|---|
\( \{H,H\} \) | \( x=2 \) | \( f(x)=\frac{1}{4} \) |
\( \{H,T\}, \{T,H\} \) | \( x=1 \) | \( f(x)=\frac{2}{4} \) |
\( \{T,T\} \) | \( x=0 \) | \( f(x)=\frac{1}{4} \) |
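When the weights are probabilities, as in the table above, they sum to 1, so the weighted mean reduces to summing \( x \times f(x) \). A sketch using exact fractions:

```python
from fractions import Fraction

values = [2, 1, 0]
probs = [Fraction(1, 4), Fraction(2, 4), Fraction(1, 4)]

# Weighted mean with probability weights: the denominator sum(probs) is 1,
# so the mean of the distribution is simply the sum of x * f(x).
mean = sum(x * p for x, p in zip(values, probs))
print(mean)  # 1
```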
Courtesy of Mario Triola, Essentials of Statistics, 5th edition
Note that we could have measured the variation with something other than the standard deviation.
For example, if we wanted to measure the total deviation from the sample mean, we could write this as
\[ \sum_{i=1}^n \vert x_i - \overline{x}\vert \]
We could then divide this by the total number of observations, which gives
\[ \text{Mean absolute deviation} = \frac{\sum_{i=1}^n \vert x_i - \overline{x}\vert}{n} \]
This is a possible choice for a similar measure of the variation, but the main issue is that the absolute value is not an “algebraic operation”.
If we want to make calculations or inferences based on the formula above, this will become very difficult and there are few tools that work well with this statistic.
For this reason, using the square as in the sample standard deviation
\[ s = \sqrt{\frac{\sum_{i=1}^n\left(x_i - \overline{x}\right)^2}{n-1}} \]
we get a similar result, but one that is mathematically easier to manipulate.
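The two measures can be compared side by side. The following Python sketch, with a hypothetical sample, computes the mean absolute deviation (dividing by the number of observations) and the sample standard deviation (dividing by \( n - 1 \)):

```python
from math import sqrt

def mean_abs_deviation(xs):
    """Mean absolute deviation: average distance from the sample mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def sample_std(xs):
    """Sample standard deviation, using the n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

data = [2, 4, 4, 6]              # hypothetical sample
print(mean_abs_deviation(data))  # 1.0
print(round(sample_std(data), 3))
```

Both numbers describe spread about the mean, but the squared version is the one that most statistical tools are built around.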
We will not focus on calculating the sample standard deviation manually in this course; however, we will walk through one example by hand to see how the formula works.
Suppose we have the sample \( 22, 22, 26 \) and \( 24 \).
We wish to compute how much deviation there is in the data from the sample mean, so we will begin by computing this value
\[ \overline{x} = \frac{22 + 22 + 26 + 24 }{4} = \frac{94}{4}=23.5 \]
We now compute the raw deviation of each sample from the sample mean:
\[ \begin{align} x_1 - \overline{x} =& 22 - 23.5 = -1.5\\ x_2 - \overline{x} =& 22 - 23.5 = -1.5\\ x_3 - \overline{x} =& 26 - 23.5 = 2.5\\ x_4 - \overline{x} =& 24 - 23.5 = 0.5\\ \end{align} \]
Squaring each value, we obtain \( 2.25, 2.25, 6.25, 0.25 \), so that
\[ s = \sqrt{\frac{\sum_{i=1}^4 \left(x_i - \overline{x}\right)^2}{3}} = \sqrt{\frac{11}{3}}\approx 1.9 \]
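We can check this hand computation with Python's standard library, which implements the same \( n - 1 \) formula:

```python
import statistics

data = [22, 22, 26, 24]
s = statistics.stdev(data)  # sample standard deviation (n - 1 in the denominator)
print(round(s, 1))          # 1.9, matching sqrt(11/3) above
```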
This shows how the sample standard deviation can be computed, but we will want a few ways to interpret the value.
The word variance also has a specific meaning in statistics and is another tool for describing the variation / dispersion / spread of the data.
Suppose that the data has a population standard deviation of \( \sigma \) and a sample standard deviation of \( s \).
Then, the data has a population variance of \( \sigma^2 \).
Likewise, the data has a sample variance of \( s^2 \).
Therefore, for either a population parameter or a sample statistic, the variance is the square of the standard deviation.
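This relationship is easy to confirm with Python's statistics module, reusing the sample from the earlier worked example:

```python
import statistics

data = [22, 22, 26, 24]          # sample from the worked example above
s = statistics.stdev(data)       # sample standard deviation
var = statistics.variance(data)  # sample variance

print(round(var, 4))             # 3.6667, i.e. 11/3
assert abs(var - s ** 2) < 1e-9  # the variance is the square of s
```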
One caution: unlike the standard deviation, the variance does not carry the same units as the data. For example, when measuring the heights of students in inches, the standard deviation is in inches, while the variance is in square inches.
Courtesy of Melikamp CC via Wikimedia Commons