Introduction to estimation part II

04/09/2020

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:
    • Review of point estimates, confidence intervals and critical values
    • Margin of error
    • Estimating a population proportion
    • Finding the right sample size
    • Estimating a population mean
    • The student t distribution
    • Confidence intervals for the mean
    • The special case when \( \sigma \) is known
    • Finding the right sample size

Review of point estimates and confidence intervals

Sample proportions tend to be normally distributed about the true parameter as the mean.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In the last lecture, we saw how a sample proportion generates a random variable.
  • That is, when we take a sample of a population and compute the proportion of the sample for which some statement is true.
  • Suppose we want to replicate this sampling procedure infinitely many times
    • It is impossible to replicate the sampling infinitely many times, but we can construct a probabilistic model for this replication process with a probability distribution.
  • Formally, we will define \( \hat{p} \) to be the random variable equal to the proportion derived from a random sample of \( n \) observations.
    • For each replication, \( \hat{p} \) attains a different value based on chance.
  • Then, for random, independent samples, \( \hat{p} \) tends to be normally distributed about \( p \).
    • We can thus use the value of \( \hat{p} \) and the distribution of \( \hat{p} \) to estimate \( p \) and how close we are to it.
  • We know that \( \hat{p} \) is an unbiased estimator of the true population proportion \( p \).
    • That is, over infinitely many resamplings, the expected value (mean of the probability distribution) for \( \hat{p} \) is equal to \( p \).
  • When we have a specific sample data set, and a specific value for \( \hat{p} \) associated to it, \( \hat{p} \) is called a point estimate for \( p \).
    • The measure of “how-close” we think this is to the true value is called a confidence interval.

Review of confidence intervals

Common confidence levels.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Let’s recall how we constructed confidence intervals (CI) in the last lecture.
    • Suppose that we want to estimate the true proportion \( p \) with some level of confidence:
      • if we replicated the sampling procedure infinitely many times, the proportion of replications in which our confidence interval contained \( p \) would be equal to the level of confidence.
  • Let’s take an example confidence level of \( 95\% \) – this corresponds to a rate of failure of \( 5\% \) over infinitely many replications.
  • Generally, we will write the confidence level as, \[ (1 - \alpha) \times 100\% \] so that we can associate this confidence level with its rate of failure \( \alpha \).
  • Recall, we earlier studied ways that we can compute the critical value associated to some \( \alpha \) for the normal distribution.
  • We will use the same principle here to find how wide is the interval around \( p \) for which \( \hat{p} \) will lie in this interval \( (1-\alpha)\times 100\% \) of the time.
  • This is equivalent to a two-sided measure of extremeness in the normal distribution, i.e.,
    • we want to find the critical value \( z_\frac{\alpha}{2} \) for which:
      • \( (1-\frac{\alpha}{2})\times 100\% \) of the area under the normal density lies to the left of \( z_\frac{\alpha}{2} \); and
      • \( (1-\frac{\alpha}{2})\times 100\% \) of the area under the normal density lies to the right of \( -z_\frac{\alpha}{2} \).
    • Put together, \( (1-\alpha)\times 100\% \) of values lie within \( [-z_\frac{\alpha}{2},z_\frac{\alpha}{2}] \).
    Area between alpha over two critical value.

    Confidence intervals for population proportions and critical values

    Confidence interval for alpha value.

    Courtesy of Mario Triola, Essentials of Statistics, 6th edition

    • In the figure to the left, we see exactly how to find the width of the region for a given confidence level.
      • For a given confidence level \( (1 -\alpha)\times 100\% \), we will find the particular \( \alpha \).
      • We then find the associated two-sided measure of extremeness with the \( z_\frac{\alpha}{2} \) critical value.
      • This critical value is associated to the measure of extremeness of finding an observation that lies far away from the mean.
      • Particularly, only \( \alpha \times 100\% \) of the population lies outside of the region \( [-z_\frac{\alpha}{2},z_\frac{\alpha}{2}] \).
    • Put another way, if you randomly draw a population member, there is a \( (1 -\alpha)\times 100\% \) chance that the mean lies at most \( z_\frac{\alpha}{2} \) standard deviations away from this observation.
    • The sample proportion point estimate \( \hat{p} \) can be considered a random draw from the population of all point estimates distributed around \( p \).
    • We do not know how close a particular point estimate (depending on a particular sample) is to \( p \).
    • However, we know that over all possible replications of the sampling procedure, the point estimate \( \hat{p} \) will lie within a distance of \( z_\frac{\alpha}{2} \) standard deviations of \( p \) exactly \( (1-\alpha)\times 100\% \) of the time.
    • Therefore, we have \( (1-\alpha)\times 100\% \) confidence that the true \( p \) lies within this region about \( \hat{p} \).

    Standard error

    • Consider from the last slide, the interval \[ [- z_\frac{\alpha}{2}, z_\frac{\alpha}{2}] \] corresponds to the \( (1-\alpha)\times 100\% \) confidence interval of the standard normal with mean \( \mu = 0 \) and standard deviation \( \sigma=1 \).
    • However, we know that \( \hat{p} \) is actually distributed with mean \( p \).
    • Therefore, to compute a confidence interval for where we believe \( p \) lies, the natural place to center the confidence interval is around \( \hat{p} \).
      • The sample proportion \( \hat{p} \) is our best guess for \( p \); if we make the construction correctly, we can be assured that our \( (1-\alpha)\times 100\% \) confidence interval centered at \( \hat{p} \) will contain \( p \) \( (1-\alpha)\times 100\% \) of the time under resampling.
    • However, the standard deviation for the distribution of \( \hat{p} \) with \( n \) observations intuitively should depend on how many observations we have.
      • Particularly, we should generally get better estimates of \( p \) with \( \hat{p} \) when \( n \) is large and the standard deviation of \( \hat{p} \), \( \sigma_\hat{p} \), should get smaller when we take a larger number of observations \( n \).
    • Therefore, taking the interval \[ [\hat{p} - z_\frac{\alpha}{2}, \hat{p} + z_\frac{\alpha}{2}] \] is not in and of itself enough.
    • We will need to scale the width of confidence interval \( z_\frac{\alpha}{2} \) by the standard deviation of \( \hat{p} \).
    • This will result in the confidence interval of the form, \[ [\hat{p} - z_\frac{\alpha}{2} \times \sigma_\hat{p}, \hat{p} + z_\frac{\alpha}{2} \times \sigma_\hat{p} ]. \]
    • The standard deviation \( \sigma_\hat{p} \) of the sampling variable \( \hat{p} \) is called the standard error.

    Standard error continued

    • Computing a sample proportion for which some statement about the population holds true can be considered a sequence of simple yes / no or success / failure sampling trials.
    • Each observation in a sample of \( n \) observations will be either a yes or no, and we find that \( \hat{p} \) of the sample corresponds to yes.
    • We can then consider this as a binomial sampling, for which there is \( p \) probability of success and \( n \) trials.
      • This also means that there is a \( 1 -p = q \) probability of failure.
    • Let \( X \) be a new random variable equal to the number of successful trials with the above binomial distribution.
    • This means that \[ \hat{p} = \frac{X}{n}, \] because \( \hat{p} \) is the sample proportion of successful trials out of all trials.
    • For the binomial random variable \( X \), the expected value is given by \[ \mu_X = n \times p \] and variance is given by \[ \sigma_X^2 = n \times p \times q. \]
    • Then the central limit theorem can be used to show that, in terms of the true parameters, \[ \frac{X - n \times p}{\sqrt{n \times p\times q}} \] is very close to a standard normal in distribution.

    Standard error continued

    • On the last slide, we showed that \[ \frac{X - n \times p}{\sqrt{n \times p\times q}} \] is very close to a standard normal in distribution.
    • Therefore, with some re-arrangements we can say \[ \frac{X - n \times p}{\sqrt{n \times p\times q}} = \frac{\frac{X}{n} - p}{\sqrt{\frac{ p \times q }{n}}} = \frac{\hat{p} - p}{\sqrt{\frac{ p \times q }{n}}} \] is very close to standard normal.
    • The above gives us our approximation for \( \sigma_\hat{p} \), which we can write in terms of the sample-based estimates \[ \sigma_\hat{p} \approx \sqrt{\frac{\hat{p} \times \hat{q} }{n}} \]
    • Recall now our confidence interval, \[ [\hat{p} - z_\frac{\alpha}{2} \times \sigma_\hat{p}, \hat{p} + z_\frac{\alpha}{2} \times \sigma_\hat{p} ]. \]
    • For \( \sigma_\hat{p} \) approximated as above, we have a good approximation of the interval around \( \hat{p} \) which corresponds to \( z_\frac{\alpha}{2} \) standard deviations in either direction.
    • The radius around \( \hat{p} \), \( z_\frac{\alpha}{2} \times \sigma_\hat{p} \), has a special name, the margin of error.

    Margin of error

    Diagram of the percent of outcomes contained within each standard deviation of the mean
for a normal distribution.

    Courtesy of Fadethree Public domain via Wikimedia Commons

    • Formally, we will define \[ E = z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}\times \hat{q}}{n}} \] to be the margin of error.
    • This is precisely the radius of the confidence interval \[ [\hat{p} - z_\frac{\alpha}{2} \times \sigma_\hat{p}, \hat{p} + z_\frac{\alpha}{2} \times \sigma_\hat{p} ] \] around \( \hat{p} \).
    • Our intuition here is correct in that we get a more accurate estimate with \( \hat{p} \) when we have larger numbers of observations \( n \).
    • Notice that as \( n \) gets larger, the term \[ \sqrt{\frac{\hat{p}\times \hat{q}}{n}} \] gets smaller.
    • On the other hand, for any fixed confidence level \( (1 - \alpha)\times 100\% \), the term \( z_\frac{\alpha}{2} \) remains fixed.
    • On the left, we see an example of this phenomenon, where \( p=0.5 \) and the number of samples \( n \) is varied for \( \alpha=0.05 \) fixed; a short numerical sketch of the same effect follows below.
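To see this effect numerically, the following is a minimal Python sketch (assuming the SciPy library is available; this is not part of the course's StatCrunch workflow) that prints the margin of error for \( \hat{p} = 0.5 \), \( \alpha = 0.05 \), and several sample sizes.

```python
# A minimal sketch showing how the margin of error
# E = z_{alpha/2} * sqrt(p_hat * q_hat / n) shrinks as n grows,
# with p_hat = 0.5 and a 95% confidence level (alpha = 0.05).
from math import sqrt
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)          # two-sided critical value, about 1.96
p_hat = 0.5
q_hat = 1 - p_hat

for n in (100, 400, 1600, 6400):
    E = z * sqrt(p_hat * q_hat / n)  # margin of error
    print(f"n = {n:5d}  E = {E:.4f}")
```

Note that quadrupling \( n \) roughly halves \( E \), since \( E \) scales like \( 1/\sqrt{n} \).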

    Estimating a population proportion summary

    • We have now established how to estimate a population proportion from a sample.
    • We will review the steps:
      1. Suppose we have \( n \) total observations generated by simple random sampling, where each selection can be considered (at least up to approximation) independent.
      2. Then let \( \hat{p} \) be the proportion of observations for which some statement holds true;
        • this is our point estimate for \( p \), but we need a confidence interval to tell us “how close” we think \( \hat{p} \) is to \( p \) with some level of confidence \( (1 - \alpha)\times 100\% \).
      3. \( \hat{p} \) is approximately normally distributed around \( p \) when \( n \times p \geq 5 \) and \( q \times n \geq 5 \) – usually we say this is satisfied if there are at least \( 5 \) successes and \( 5 \) failures.
      4. The standard error is the standard deviation of this approximately normal distribution, approximated as \[ \sigma_\hat{p} \approx \sqrt{\frac{ \hat{p} \times \hat{q}}{n}}. \]
      5. Suppose we have some level of confidence \( (1 -\alpha)\times 100\% \) (usually \( 90\%, 95\%, 99\% \)), and \( z_\frac{\alpha}{2} \) is the associated two-sided critical value for the standard normal.
      6. The confidence interval for \( (1-\alpha)\times 100\% \) confidence is \[ \left[\hat{p} - z_\frac{\alpha}{2} \times \sqrt{\frac{\hat{p}\times\hat{q}}{n}},\hat{p} + z_\frac{\alpha}{2} \times \sqrt{\frac{\hat{p}\times\hat{q}}{n}} \right] = \left[\hat{p} - E, \hat{p} + E\right] \]
    • The true population proportion \( p \) lies in this interval \( (1- \alpha)\times 100\% \) of the time under replications of samples of size \( n \).
    • As a rule-of-thumb, the values above should be rounded to three significant digits; a short code sketch of the full procedure follows below.
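As a rough illustration of these steps outside of StatCrunch, here is a hedged Python sketch (SciPy assumed); the function name proportion_ci and its interface are illustrative choices, not anything defined in the course materials.

```python
# A sketch of the procedure above: point estimate, standard error,
# critical value, margin of error, and confidence interval for a proportion.
from math import sqrt
from scipy.stats import norm

def proportion_ci(successes, n, confidence=0.95):
    """Return (p_hat, E, lower, upper) for a confidence interval on a proportion."""
    alpha = 1 - confidence
    p_hat = successes / n            # point estimate
    q_hat = 1 - p_hat
    se = sqrt(p_hat * q_hat / n)     # estimated standard error
    z = norm.ppf(1 - alpha / 2)      # two-sided critical value z_{alpha/2}
    E = z * se                       # margin of error
    return p_hat, E, p_hat - E, p_hat + E
```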

    Estimating a population proportion summary continued

    • We should note a few things about the last procedure:
      1. It is important to understand the meaning of each of the following:
        • \( z_\frac{\alpha}{2} \) – the two-sided critical value for the standard normal;
        • \( \sigma_\hat{p} \approx \sqrt{\frac{\hat{p} \times \hat{q}}{n}} \) – our estimate for the standard deviation of \( \hat{p} \); and
        • \( E = z_\frac{\alpha}{2} \times \sqrt{\frac{\hat{p} \times \hat{q}}{n}} \) – the margin of error, i.e., the radius of the confidence interval.
      2. However, with modern statistical software we will not need to compute all of these values individually, and in StatCrunch you can compute the entire confidence interval at once.
    • Because some questions in the homework will ask you to compute these pieces individually, in this lecture I will show you how to compute the confidence interval both ways.
    • However, in general you are recommended to compute the confidence interval directly in statistical software whenever possible.
    • Let’s now consider the example of the Gallup poll result where there were \( 1487 \) adults surveyed to see if they had a Facebook account.
    • A total of \( 639 \) adults surveyed responded they had an account.
    • Consider the following: what assumptions does the sample need to satisfy if we are going to apply this procedure?
      • Recall, the sampling should follow a simple random procedure, and usually it is sufficient that there are at least \( 5 \) successes and \( 5 \) failures.
    • Consider the following: what is the approximate standard error for the distribution for the sample proportions?

    Estimating a population proportion example

    • Recall that the standard error approximation is given as \[ \sigma_\hat{p} \approx \sqrt{\frac{\hat{p} \times \hat{q}}{n}}. \]
    • In our example, we have that:
      1. \( n=1487 \);
      2. \( \hat{p} = \frac{639}{1487} \approx 43\% \);
      3. \( \hat{q} = 1 - \hat{p} \approx 57\% \).
    • Therefore, we can write \[ \sigma_\hat{p} \approx \sqrt{\frac{0.43 \times 0.57 }{1487}} \approx 0.0128. \]
    • Consider the following: how can we use StatCrunch to compute the correct critical value \( z_{0.025} \) so that we can compute the margin of error?
    • Using StatCrunch, we found that \( z_{0.025} \approx 1.96 \), so that \[ E = z_{0.025} \times \sqrt{\frac{0.43 \times 0.57 }{1487}} \approx 0.0251636. \]
    • Consider the following: with the above values for \( E \) and \( \hat{p} \), what is the \( 95\% \) confidence interval of \( p \)?

    Estimating a population proportion example continued

    • Using the values for \[ E \approx 0.0251636, \]
    • and \( \hat{p} \approx 0.43 \) we can find that
    • \[ \begin{align} \left[\hat{p} - z_\frac{\alpha}{2}\times \sigma_\hat{p} , \hat{p} + z_\frac{\alpha}{2} \times \sigma_\hat{p} \right] &= [ \hat{p} - E , \hat{p} + E] \\ &\approx [0.43 - 0.0251636, 0.43 + 0.0251636] \\ &\approx [0.405, 0.455] \end{align} \]
    • While it is possible to go through and calculate the confidence interval like this manually, it is faster and more accurate to do so with statistical software.
    • To do so, we will only need to know the total number of trials (\( n = 1487 \) observations in the sample) and the total number of successful trials (\( 639 \) respondents who had a Facebook account).
    • We will show this in the following.
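For readers following along outside of StatCrunch, a minimal, self-contained Python sketch of the same computation is given below (SciPy assumed available); the numbers match the manual calculation above up to rounding.

```python
# The Gallup example: 639 successes out of 1487 trials, 95% confidence.
from math import sqrt
from scipy.stats import norm

n, successes = 1487, 639
p_hat = successes / n                   # about 0.430
q_hat = 1 - p_hat
se = sqrt(p_hat * q_hat / n)            # about 0.0128
E = norm.ppf(0.975) * se                # about 0.0252
print(f"95% CI: [{p_hat - E:.3f}, {p_hat + E:.3f}]")   # about [0.405, 0.455]
```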

    Analyzing polls

    • Now that we know how to estimate proportions of a population, we should make some notes about how to analyze polling data.
      1. When a poll is conducted, it should be conducted using some kind of sampling method that keeps draws random and for practical purposes independent.
      2. The proportion of the sample for which some statement (will vote yes) holds true should be considered a point estimate.
      3. This point estimate is a random variable that will change in value when samples are replicated, and almost surely does not equal the true population proportion.
      4. A level of confidence and confidence interval or margin of error should be included to understand how uncertain the point estimate is.
      5. The sample size should be provided with the estimates.
      6. Provided the sample size is sufficiently large, usually on the order of 100’s to 1000’s, the results should be reliable, though not guaranteed to correctly contain the true population proportion.
    • Regarding the last point, the results will be reliable in the sense that under replication of samples by the same method, the true proportion should lie in the confidence interval \( (1-\alpha)\times 100\% \) of the time.
    • This does not actually depend on the size of the population, only on the sample size.
      • That is, the number of samples necessary is actually independent of the size of the population itself.

    Analyzing polls example

    • Let’s consider the following example where we will try to find the values of the point estimate and margin of error from a confidence interval.
    • In the data from “High-dose nicotine patch therapy” in the textbook, we know that \( 70 \) subjects were given a nicotine patch treatment.
    • The article states that the confidence interval for the proportion of subjects who did not smoke during the therapy is equal to \[ [0.58, 0.81]. \]
    • Consider the following: can you deduce what the value of the point estimate (the sample proportion) is given the above confidence interval?
      • Recall, the point estimate \( \hat{p} \) is just the midpoint of the interval \[ [\hat{p}- E , \hat{p} + E]. \]
      • This says that if we take the midpoint formula for the two endpoints, \[ \begin{align} \frac{\text{Right endpoint} + \text{Left endpoint}}{2} &= \frac{\hat{p} + E + \hat{p} - E}{2}\\ &= \frac{2 \hat{p}}{2} = \hat{p} \end{align} \]
    • Therefore, we have \( \hat{p} = 0.695 \).

    Analyzing polls example continued

    • Recall, the confidence interval from the article for the proportion of subjects who did not smoke during the therapy is equal to \[ [0.58, 0.81]. \]
    • Consider the following: can you now deduce the margin of error from the confidence interval?
      • We will first note, \( E \) is half of the width of the confidence interval.
      • Thus, using half of the width formula for the interval, \[ \begin{align} \frac{\text{Right endpoint} - \text{Left endpoint}}{2} &= \frac{\hat{p} + E - (\hat{p} - E) }{2}\\ &= \frac{2 E }{2} = E \end{align} \]
      • Using this observation, we find that the margin of error is given as \( E= 0.115 \).
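These midpoint and half-width calculations are simple enough to verify in a couple of lines; the following is a small, purely illustrative Python sketch (not part of the lecture materials).

```python
# Recover the point estimate and margin of error from a reported confidence
# interval (the nicotine patch example): p_hat is the midpoint, E the half-width.
lower, upper = 0.58, 0.81
p_hat = (upper + lower) / 2   # 0.695
E = (upper - lower) / 2       # 0.115
print(p_hat, E)
```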

    Determining the necessary sample size

    • Instead of just analyzing results from a survey, we may also be interested in how to plan a survey to reduce errors.
    • A vital question is “how many samples are needed to get a certain level of error in the estimated parameters?”
    • With the formula for the margin of error, \[ E = z_\frac{\alpha}{2} \times \sqrt{\frac{\hat{p} \times \hat{q}}{n}} \] we can also solve for the number of samples given a desired margin of error.
    • If we solve for \( n \), we get \[ n=\frac{z_\frac{\alpha}{2}^2 \times \hat{p} \times \hat{q}}{E^2} \] where \( E \) is our target margin of error.
    • However, the number of samples \( n \) depends on whether we already have an estimate for \( \hat{p} \) in hand before this study is conducted.
    • We might have another estimate \( \hat{p} \) before we conduct our replication, if there are other surveys already conducted.
    • If not, we should include more samples to compensate for not already knowing this value.
    • For a target margin of error \( E \), the largest number of samples corresponds to the case when \( \hat{p}=\hat{q} =0.5 \), so that our conservative estimate is, \[ n=\frac{z_\frac{\alpha}{2}^2 \times 0.5 \times 0.5}{E^2} = \frac{z_\frac{\alpha}{2}^2 \times 0.25}{E^2}. \]
    • Notice, this expression doesn’t depend on the size of the population, only on desired margin of error and the confidence level.

    Determining the necessary sample size example

    • Recall, solving the margin of error formula for the number of samples \( n \), we get \[ n=\frac{z_\frac{\alpha}{2}^2 \times \hat{p} \times \hat{q}}{E^2} \] where \( E \) is our target margin of error.
    • Let’s suppose that we want to conduct a survey to find the number of US adults that make online purchases of household goods.
    • Suppose we had seen another survey stating this was approximately \( \hat{p}\approx 0.8 \) or \( 80\% \) of US adults.
    • Consider the following: if we want to obtain a \( 3\% \) margin of error in the \( 95\% \) confidence level estimate of the true population proportion, what is the necessary number of samples to ensure this?
      • Notice, if \( 3\% \) is our target margin of error, then \[ E^2 = (0.03)^2 =0.0009. \]
      • We also have that \[ \begin{align} \hat{p} = 0.8 & & \hat{q} = 0.2 \end{align} \]
      • Putting these pieces together, and noting that \( z_\frac{\alpha}{2} \approx 1.96 \) we have \[ n =\frac{1.96^2 \times 0.8 \times 0.2}{0.0009} \approx 683 \] total samples.
      • Once again, this didn’t depend on the size of the US population, only the target margin of error and confidence level.

    Determining the necessary sample size example

    • Recall, solving the margin of error formula for the number of samples \( n \), we get \[ n=\frac{z_\frac{\alpha}{2}^2 \times \hat{p} \times \hat{q}}{E^2} \] where \( E \) is our target margin of error.
    • Let’s suppose that we want to conduct a survey to find the number of US adults that make online purchases of household goods but we do not have any estimate for \( \hat{p} \) already.
    • Remember, our conservative estimate is then to take \( \hat{p}=\hat{q}=0.5 \).
    • Consider the following: if we want to obtain a \( 3\% \) margin of error in the \( 95\% \) confidence level estimate of the true population proportion, what is the necessary number of samples to ensure this?
      • Notice, if \( 3\% \) is our target margin of error, then \[ E^2 = (0.03)^2 =0.0009. \]
      • We also have that \[ \begin{align} \hat{p} = 0.5 & & \hat{q} = 0.5 \end{align} \]
      • Putting these pieces together, and noting that \( z_\frac{\alpha}{2} \approx 1.96 \) we have \[ n =\frac{1.96^2 \times 0.5 \times 0.5}{0.0009} \approx 1068 \] total samples.
      • In this case the number of necessary samples went up, because we had to choose this conservatively without already knowing an estimate for \( \hat{p} \).
      • Notice, however, if we already had an estimate of \( \hat{p}=0.5 \), this would have remained the same; a short code sketch of this calculation follows below.
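A short Python sketch of the sample-size calculation is given below (SciPy assumed); the function name samples_needed is an illustrative choice, and the result is rounded up with math.ceil because the sample size must be a whole number of observations.

```python
# Required sample size for a target margin of error E on a proportion;
# p_hat = 0.5 is the conservative choice when no prior estimate exists.
from math import ceil
from scipy.stats import norm

def samples_needed(E, confidence=0.95, p_hat=0.5):
    z = norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    return ceil(z**2 * p_hat * (1 - p_hat) / E**2)

print(samples_needed(0.03, p_hat=0.8))   # about 683 (prior estimate available)
print(samples_needed(0.03))              # about 1068 (conservative case)
```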

    Sampling distributions for means review

    Sample proportions tend to be normally distributed about the true parameter as the mean.

    Courtesy of Mario Triola, Essentials of Statistics, 6th edition

    • We will now recall how to estimate a population mean \( \mu \).
      • Suppose there is some population and there is some numerical measure of the population \( x \) that we wish to find the true population mean \( \mu \) for.
        • For example,
          • the population can be all US adults,
          • the numerical measure can be \( x = \) “age”, and
          • the true mean would be the average age of all US adults.
        • Let’s suppose that we will draw exactly \( n \) observations of the population by random sampling.
    • Suppose we want to replicate this sampling procedure infinitely many times
      • It is once again impossible to replicate the sampling infinitely many times, but we can construct a probabilistic model for this replication process with a probability distribution.
    • Formally, we will define \( \overline{x} \) to be the random variable equal to the mean derived from a random sample of \( n \) observations.
      • For each replication, \( \overline{x} \) attains a different value based on chance.
    • Then, for large numbers of random, independent samples, \( \overline{x} \) tends to be normally distributed about \( \mu \).
      • We can thus use the value of \( \overline{x} \) and the distribution of \( \overline{x} \) to estimate \( \mu \) and how close we are to it.
    • Our method for estimating \( \mu \) will be very similar to estimating \( p \), but there will be a practical difference depending on whether or not we know \( \sigma \) beforehand.

    Estimating population means

    • We usually do not know the true population standard deviation \( \sigma \) when we begin sampling, and therefore we will focus on this case which is most common.
    • Like \( \hat{p} \), the sample mean \( \overline{x} \) is our best estimate for the true population mean \( \mu \).
    • Suppose we choose some level of confidence \( (1-\alpha)\times 100\% \), we will similarly compute a confidence interval with a margin of error \( E \), \[ \left[ \overline{x} - E, \overline{x} + E\right]. \]
    • However, our definition of the margin of error \( E \) will change in this context, where:
      1. The standard error is given by, \[ \sigma_\overline{x} = \frac{\sigma}{\sqrt{n}}; \]
        • however, with \( \sigma \) unknown we will approximate this with, \[ \sigma_\overline{x} \approx \frac{s}{\sqrt{n}} \] i.e., with the sample standard deviation \( s \) over the square root of the number of observations.
      2. The critical value is given by \( t_\frac{\alpha}{2} \), the two-sided critical value of the student t distribution in \( n-1 \) degrees of freedom.
    • Together, the margin of error for the estimate of the population mean is given by \[ \begin{align} E &= t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}} \end{align} \]
    • We will discuss what the student t distribution is in the following, and how we can use it to find the above margin of error.

    Student t distribution

    Student t distribution depends on the degrees of freedom.

    Courtesy of Mario Triola, Essentials of Statistics, 6th edition

    • The “student t” distribution became fully developed when the statistician and brewer William Sealy Gosset was trying to model the quality of raw material like barley for beermaking with very few samples.
    • Gosset worked at Guinness brewery and was not allowed to publish under his own name, so instead published under the name “student”.
    • The student t is very similar to a normal distribution, but has wider variability, especially in the tails.
    • Let’s suppose that we have \( x_1, \cdots, x_n \) total observations sampled from a normal distribution with mean \( \mu \) and standard deviation \( \sigma \).
    • We can compute the sample mean \( \overline{x} \) and the sample standard deviation \( s \) from the above observations.
    • Then, it is an extremely important result that the random variable, \[ \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} \] is distributed according to a student t with \( n-1 \) degrees of freedom.
    • The number of “degrees of freedom” in the student t is a parameter that determines the shape, as shown on the diagram above.
    • We will not need to belabor the details, but you should remember that for \( n \) observations, the number of degrees of freedom for the associated student t is \( n-1 \).

    Estimating population means continued

    • Returning to the problem, we choose some level of confidence \( (1-\alpha)\times 100\% \) and we will similarly compute a confidence interval with a margin of error \( E \), \[ \left[ \overline{x} - E, \overline{x} + E\right]. \]
    • Our definition of the margin of error \( E \) for the estimate of the population mean is given in terms of:
      1. The estimated standard error, \[ \sigma_\overline{x}\approx \frac{s}{\sqrt{n}}. \]
      2. And the critical value is given by \( t_\frac{\alpha}{2} \), the two-sided critical value of the student t distribution in \( n-1 \) degrees of freedom.
    • Together, the margin of error for the estimate of the population mean is given by \[ \begin{align} E &= t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}} \end{align} \]
    • When we have \( n>30 \) observations or observations that are drawn from a normal population, we can use the central limit theorem to approximate the distribution of \( \overline{x} \) as close-to-normal, distributed around \( \mu \).
    • If we do not know \( \sigma \) already, then \( \frac{s}{\sqrt{n}} \) is the best estimate for the standard deviation, and it turns out that \[ \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} \] is distributed according to a student t with \( n-1 \) degrees of freedom.
    • We can calculate \( \overline{x} \), \( s \) and \( \sqrt{n} \) directly from the observations, but we will generally find \( t_\frac{\alpha}{2} \) using technology like StatCrunch.

    Estimating population means example

    • It is once again important to understand the meaning of all the pieces that we work with here:
      1. \( \overline{x} \) – the point estimate for the true population mean \( \mu \);
      2. \( \frac{s}{\sqrt{n}} \) – the estimated standard deviation of the distribution of \( \overline{x} \) around \( \mu \);
      3. \( t_\frac{\alpha}{2} \) – the student t, two-sided critical value associated to the confidence level \( (1-\alpha)\times 100\% \); and
      4. \( E = t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}} \) – the radius of the confidence interval for the estimate of the population mean, i.e., the margin of error.
    • These pieces together give us the \( (1-\alpha)\times 100\% \) confidence interval for the mean \( \mu \), \[ \left[ \overline{x} - t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}}, \overline{x} + t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}} \right] = \left[ \overline{x} - E, \overline{x} + E\right] \]
    • The homework will ask questions that require us to compute each of these pieces individually, but note, with modern statistical software it is faster and more accurate to compute the entire confidence interval at once.
    • For this reason, we will go over both approaches in the examples but you are encouraged whenever possible to compute the confidence interval in modern statistical software directly.
    • We will begin by showing how to find the \( t_\frac{\alpha}{2} \) critical value for some \( (1-\alpha)\times 100\% \) confidence level in StatCrunch.
    • Consider the following: suppose we have \( 15 \) observations \( x_1,\cdots, x_{15} \) from a normally distributed population and we want to find a \( 95\% \) confidence interval for the population mean. What assumptions do we need to have satisfied to use the \( t_\frac{\alpha}{2} \) critical value? What is the number of degrees of freedom?
      • Note, we should have \( n>30 \) observations or they should come from a normally distributed population. With \( n=15 \) we have \( n-1=14 \) degrees of freedom in the student t distribution.
      • We will now show how to compute the critical value in StatCrunch.
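As an alternative to the StatCrunch lookup, the same two-sided critical value can be obtained from the Student t distribution in SciPy; the following is a minimal sketch under that assumption.

```python
# The t critical value for a 95% confidence level with n = 15 observations,
# i.e. 14 degrees of freedom.
from scipy.stats import t

alpha = 0.05
df = 15 - 1
t_crit = t.ppf(1 - alpha / 2, df)   # two-sided critical value t_{alpha/2}
print(round(t_crit, 3))             # about 2.145
```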

    Estimating population means example continued

    • From the last example, we saw that the associated student t, two-sided critical value is \[ t_\frac{\alpha}{2} \approx 2.145. \]
    • Note: for a different number of degrees of freedom, we will need to re-compute this as the critical value depends on the shape parameter.
    • Let’s now suppose that we have actual samples of the weights of \( 15 \) girls randomly selected at birth – the population weights are assumed to be normally distributed.
    • Let’s suppose that the sample mean is given as \( \overline{x} =30.9 \) hectograms (hg) with a sample standard deviation of \( s= 2.9 \) hg.
    • Consider the following: what is the standard error for the distribution of \( \overline{x} \) around \( \mu \) given the above values?
      • Given the above, we have \( \sigma_\overline{x} = \frac{s}{\sqrt{n}} = \frac{2.9}{\sqrt{15}} \approx 0.749 \).
    • Consider the following: what is the \( 95\% \) confidence interval for the population mean birth weight of baby girls given the above values?
      • Given the above, our margin of error is given by, \[ E = t_\frac{\alpha}{2} \times \sigma_\overline{x} = 2.145 \times \frac{2.9}{\sqrt{15}} \approx 1.606 \]
      • The \( 95\% \) confidence interval can then be computed as, \[ [\overline{x} - E, \overline{x} + E] \approx [30.9 - 1.606, 30.9 + 1.606] = [29.3, 32.5] \]
    • We will now demonstrate how this can be computed at once in StatCrunch. We will only need the values of \( \overline{x}=30.9 \), \( s=2.9 \) and \( n=15 \) to compute this directly.
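For those working outside of StatCrunch, a minimal, self-contained Python sketch (SciPy assumed) of the same interval computed from the summary statistics \( \overline{x}=30.9 \), \( s=2.9 \), and \( n=15 \) is given below.

```python
# Confidence interval for the mean from summary statistics, sigma unknown,
# using the Student t critical value with n - 1 degrees of freedom.
from math import sqrt
from scipy.stats import t

n, x_bar, s = 15, 30.9, 2.9
alpha = 0.05
se = s / sqrt(n)                        # about 0.749
t_crit = t.ppf(1 - alpha / 2, n - 1)    # about 2.145
E = t_crit * se                         # about 1.606
print(f"95% CI: [{x_bar - E:.1f}, {x_bar + E:.1f}]")   # about [29.3, 32.5]
```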

    The case where \( \sigma \) is known

    • Note that the use of the student t distribution in calculating the standard error arises precisely due to the fact that we are estimating \( \sigma \) with \( s \).
    • In the unusual case when we actually do know the population \( \sigma \) in advance, we can take an approach that is similar to estimating a population proportion.
    • That is, if we know \( \sigma \) already, we can calculate the standard error exactly as \[ \sigma_\overline{x} = \frac{\sigma}{\sqrt{n}}. \]
    • This is actually the exact standard deviation of the sample mean \( \overline{x} \) distributed around the true population mean \( \mu \).
    • Then, if \( n>30 \) or the observations are drawn from a normal distribution, the quantity \[ \frac{\overline{x} - \mu}{\frac{\sigma}{\sqrt{n}}} \] is distributed approximately according to a standard normal.
    • When \( \sigma \) is known, we can use the \( z_\frac{\alpha}{2} \) critical value and the true standard deviation \( \frac{\sigma}{\sqrt{n}} \) directly to compute the margin of error, \[ E = z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}}. \]
    • The \( (1-\alpha)\times 100\% \) confidence interval, in the case where we know \( \sigma \) in advance, is then written \[ \left[\overline{x} - z_\frac{\alpha}{2}\times \frac{\sigma}{\sqrt{n}} , \overline{x} + z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}} \right] = \left[ \overline{x} - E, \overline{x} + E\right] \]

    The case where \( \sigma \) is known continued

    • Recall then our problem in which we have the weights of \( 15 \) girls randomly selected at birth – the population weights are assumed to be normally distributed.
    • Suppose that the sample mean is given as \( \overline{x} =30.9 \) hectograms (hg) with a population standard deviation of \( \sigma= 2.9 \) hg.
    • Consider the following: what is the standard error for the distribution of \( \overline{x} \) around \( \mu \) given the above values?
      • Given the above, we have \( \sigma_\overline{x} = \frac{\sigma}{\sqrt{n}} = \frac{2.9}{\sqrt{15}} \approx 0.749 \).
    • Notice that in the above, the calculation of the standard error only now changed by using \( \sigma \) directly.
    • Consider the following: what is the \( 95\% \) confidence interval for the population mean birth weight of baby girls given the above values?
      • In this case, we again substitute \( \sigma \) for \( s \) in the equation, but most importantly, we now use the standard normal critical value \( z_\frac{\alpha}{2}\approx 1.96 \).
      • Notice that \( z_\frac{\alpha}{2} \approx 1.96 < 2.145 \approx t_\frac{\alpha}{2} \)
      • In particular, we have the confidence interval given as \[ \left[\overline{x} - z_\frac{\alpha}{2}\times \frac{\sigma}{\sqrt{n}} , \overline{x} + z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}} \right] \approx [30.9 - 1.96 \times 0.749, 30.9 + 1.96 \times 0.749] \approx [29.4, 32.4] \]
    • Consider the following: if \( z_\frac{\alpha}{2} \approx 1.96 < 2.145 \approx t_\frac{\alpha}{2} \), what does this say about the width of the confidence interval when \( \sigma \) is known versus when \( \sigma \) is unknown?
      • This says that the confidence interval is wider when \( \sigma \) is unknown, because \( t_\frac{\alpha}{2} \) is larger than \( z_\frac{\alpha}{2} \).
      • This is related to how we said the t distribution is like the normal but with wider variability.
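A short Python sketch (illustrative only, SciPy assumed) that computes both intervals for this example makes the comparison concrete.

```python
# Compare the sigma-known (z) interval with the sigma-estimated (t) interval
# for the birth-weight example; the printed values are approximate.
from math import sqrt
from scipy.stats import norm, t

n, x_bar = 15, 30.9
se = 2.9 / sqrt(n)                  # 2.9 taken as sigma (known) or s (estimated)
E_z = norm.ppf(0.975) * se          # about 1.47
E_t = t.ppf(0.975, n - 1) * se      # about 1.61
print(f"z-interval: [{x_bar - E_z:.1f}, {x_bar + E_z:.1f}]")  # about [29.4, 32.4]
print(f"t-interval: [{x_bar - E_t:.1f}, {x_bar + E_t:.1f}]")  # about [29.3, 32.5]
```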

    Review of estimating the population mean

    We should use critical values of the student t when the population standard deviation is unknown.

    Courtesy of Mario Triola, Essentials of Statistics, 6th edition

    • To summarize when and how we can estimate the population mean:
      • In general, we should have \( n>30 \) observations in our data sample or the observations \( x_i \) should come from a normally distributed population.
      • We will select some level of confidence \( (1-\alpha)\times 100\% \) at which we want to estimate the population mean.
    • If the true population standard deviation \( \sigma \) is unknown, we need to use the student t critical value \( t_\frac{\alpha}{2} \).
      • We will also estimate the standard error \( \sigma_\overline{x} \approx \frac{s}{\sqrt{n}} \) so that the confidence interval is given as \[ \left[ \overline{x} - E, \overline{x} + E\right] = \left[\overline{x} - t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}}, \overline{x} + t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}} \right]. \]
    • If the true population standard deviation \( \sigma \) is known, we can use the standard normal critical value \( z_\frac{\alpha}{2} \) for a more accurate confidence interval.
      • We can compute the standard error \( \sigma_\overline{x}=\frac{\sigma}{\sqrt{n}} \) exactly, so that the confidence interval is given as \[ \left[ \overline{x} - E, \overline{x} + E\right] = \left[\overline{x} - z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}}, \overline{x} + z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}} \right] . \]

    Finding the necessary sample size

    • When designing a survey or a collection of data, we may once again be interested in how many samples are necessary to collect to estimate the mean within a desired margin of error.
    • In the same way that we estimated the number of samples necessary to estimate a population proportion, we can solve the margin of error formula \( E = z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}} \) for \( n \), giving \[ n = \left(\frac{z_\frac{\alpha}{2} \times \sigma}{E}\right)^2. \]
    • Thus, if we have a desired margin of error \( E \) and some estimate for \( \sigma \) from previous experience or from a small exploratory study, we can solve for \( n \) as the unknown above.
    • The process from this point is effectively the same, though we should keep in mind that \( n>30 \) or normally distributed population assumptions still apply.
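A hedged Python sketch of this calculation follows; the target \( E = 0.5 \) hg and \( \sigma = 2.9 \) hg (borrowed from the earlier birth-weight example) are illustrative assumptions, not values from the lecture.

```python
# Required sample size to estimate a mean within margin of error E when an
# estimate of sigma is available; math.ceil rounds up to whole observations.
from math import ceil
from scipy.stats import norm

def samples_for_mean(E, sigma, confidence=0.95):
    z = norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    return ceil((z * sigma / E) ** 2)

# Illustrative values only: sigma = 2.9 hg, target margin of error 0.5 hg.
print(samples_for_mean(E=0.5, sigma=2.9))    # about 130 observations
```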