Normal probability distributions part III and introduction to estimation

03/31/2020

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

  • The following topics will be covered in this lecture:
    • Assessing normality of data
    • Histograms
    • Q-Q plots
    • Examples in StatCrunch
    • Point estimates for population proportions
    • Confidence intervals for population proportions
    • Critical values again

Motivation

Diagram of the percent of outcomes contained within each standard deviation of the mean
for a normal distribution.

Courtesy of Melikamp CC via Wikimedia Commons

  • Recall the bell curve picture that we often consider; we will suppose we have a population whose distribution is bell-shaped.
  • We suppose that the population mean is \( \mu \) and the population standard deviation is \( \sigma \).
  • Normally distributed data is characterized by the following features:
    1. The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency.
    2. The distribution is approximately symmetric.
    3. There are few if any extreme values.
  • These features can be understood as a direct application of the empirical rule.
  • We suppose that the histogram represents sample data that is mostly bell-shaped; because the sample is smaller than the population, the shape is not exact.
    • In particular, any data set is subject to sampling error and we cannot expect a perfect bell shape from a small sample even when the population is perfectly normally distributed.
  • So far, we have used histograms to examine these features in data and to assess normality.
    • However, histograms sometimes have technical issues that make them unreliable.

Motivation continued

Histogram of IQ scores.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Let’s consider a histogram we saw earlier with IQ scores.
    • Over the population of US adults, IQ scores are normally distributed, exhibiting the bell shape discussed earlier.
    • However, depending on the choice of bin widths for the histogram, we can see a very different shape.
    • Here the bin widths are quite wide and so we collect many scores together – this obscures the bell shape.
    • At the other extreme, if we chose bins so finely that every bin had at most one observation, the histogram would be flat.
    • With a “good” choice in between, we can see the bell shape more clearly; the sketch below illustrates how the bin width changes the picture.
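To see the effect concretely, here is a minimal sketch in Python (using numpy and matplotlib, with simulated IQ-like scores rather than the book's data) that draws the same sample with bins that are too wide, reasonable, and too narrow:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
iq = rng.normal(loc=100, scale=15, size=200)  # simulated IQ-like scores

# The same data with three bin counts: too few (wide bins), a reasonable
# middle choice, and far too many (narrow bins).
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [3, 15, 200]):
    ax.hist(iq, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()
```

The wide-bin panel hides the bell shape, the narrow-bin panel fragments it, and only the middle choice makes it clear.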

Histogram of IQ scores.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • There is generally no “perfect” choice for how wide the bins in a histogram should be, and this makes histograms less reliable for assessing normality than other methods.
  • For this reason, we will introduce a more advanced visual method for assessing normality.
  • If we want to examine if data is “close-enough-to-normal”, it makes sense to compare our data to the normal distribution directly;
    • to do so, we will introduce the Q-Q or quantile-quantile plot.

Q-Q plots

Normal data q-q plot.

Courtesy of Glen_b via the Stack Exchange

  • We now discuss formally how Q-Q plots are constructed.
  • Note: we will discuss the version of Q-Q plots described in the book for simplicity, but typically the Q-Q plots used in practice have a slightly different construction.
    • Suppose we have some collection of sample data \( \{x_i\} \) for \( i=1,\cdots, n \).
    • We will order our observations, \( x_1, x_2, \cdots, x_n \) (changing the index if necessary) such that, \[ x_i \leq x_{i+1} \] for every \( i=1,\cdots, n-1 \).
  • On the horizontal axis of the Q-Q plot, we use the scale of the original data.
  • On the vertical axis, we use the scale of z scores.
  • Combined, the Q-Q plot places the z-score of each observation above its raw value on the horizontal axis.
  • If the sample data is “perfectly” normal, then the z scores should form a straight line against the ordered data.
  • However, sampling error will lead to small deviations from this straight line in practice; a minimal version of this construction is sketched below.
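The following Python sketch carries out this construction; the cumulative-area convention \( (i - 0.5)/n \) is one common choice and is an assumption here, since the book's exact recipe may differ slightly:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = np.sort(rng.normal(loc=100, scale=15, size=50))  # ordered sample data

# Assign the i-th ordered observation the cumulative left area (i - 0.5)/n,
# then convert each area to its z score with the inverse normal CDF.
n = len(x)
z = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

plt.scatter(x, z)                    # data on the horizontal axis,
plt.xlabel("ordered sample values")  # z scores on the vertical axis,
plt.ylabel("z score")                # matching the book's orientation
plt.show()
```

Because this sample is drawn from a normal distribution, the points fall close to a straight line, with small wiggles due to sampling error.
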
Histogram of IQ scores.
QQ plot of IQ scores.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • For the IQ data histogram earlier, we see the associated Q-Q plot to its right.
  • Here, the plot is mostly a straight line with only slight deviations in the tails.
  • However, the data doesn’t have many extreme values, so these slight deviations aren’t a concern.

Analyzing Q-Q plots

QQ plot of uniform data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Generally, we will not be concerned with slight deviations from a straight line in a Q-Q plot;
    • if there is only one outlier, and it is not too extreme, we can usually say the data is “close-enough-to-normal”.
  • However, when there is some systematic structure to the data, this should raise an alarm.
  • Consider the following: given the Q-Q plot on the left for some sample data, does this data look “close-enough-to-normal”? That is, do we detect any systematic structure other than a straight line?
    • In fact, the data is very non-normal; it is drawn from a uniform distribution.

Histogram of uniform data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

    • What we should detect in this Q-Q plot is the following:
      1. The distribution is symmetric, as can be seen by the symmetry in the Q-Q plot.
      2. However, we can see a smaller relative concentration of values towards the mean, as evidenced by the “bowing” of the curve above and below the line.
    • Near the center of the data, the z scores increase faster than they would for normal data, reflecting this smaller concentration of values around the mean.
    • At the tails, there are very extreme values that we would not expect to see in normal data; the short sketch below reproduces this pattern with simulated uniform data.
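A short simulation sketch, assuming numpy, scipy, and matplotlib are available, reproduces this S-shaped bowing with simulated uniform data:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
x = np.sort(rng.uniform(low=0, high=100, size=200))  # uniform, non-normal data

# Same book-style construction as before: data on the horizontal axis,
# z scores of the cumulative areas (i - 0.5)/n on the vertical axis.
n = len(x)
z = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

plt.scatter(x, z)  # the points bow above and below a straight line in an S shape
plt.xlabel("ordered sample values")
plt.ylabel("z score")
plt.show()
```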

Analyzing Q-Q plots continued

Q-Q plot of skewed data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • At the left is the Q-Q plot of real data: the rainfall in inches in Boston, measured once a week for an entire year.
  • Consider the following: given the Q-Q plot on the left for some sample data, does this data look “close-enough-to-normal”? That is, do we detect any systematic structure other than a straight line?
    • This example is right-skewed, with many values concentrated close to zero and a few observations very far away with high values.
  • The associated histogram is plotted below.

Histogram of skewed data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • What we should detect in this Q-Q plot is the following:
    1. The plot is strongly asymmetric and lies far off the line, which immediately disqualifies the data from “normality”.
    2. In addition, there are many extreme values in the right tail, as evidenced by the “floating points” towards the right.
  • This kind of non-normality is relatively easy to detect with a histogram, but we can read the same information from the Q-Q plot in this case, as the short sketch below demonstrates.
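The same construction applied to simulated right-skewed data (an exponential sample is used here as a stand-in for the rainfall data, purely as an illustrative assumption) shows the pattern described above:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=3)
x = np.sort(rng.exponential(scale=0.5, size=52))  # right-skewed, rainfall-like data

n = len(x)
z = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

plt.scatter(x, z)  # many points crowded near zero, a few far out to the right
plt.xlabel("ordered sample values")
plt.ylabel("z score")
plt.show()
```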

Analyzing Q-Q plots in StatCrunch

Normal quantiles on y-axis must be selected to match the book method.

Courtesy of Pearson

  • Constructing Q-Q plots manually is tedious, so in practice they should be produced with computer software.
  • For this reason, we will only consider how to construct and analyze Q-Q plots in StatCrunch.
  • NOTE: to construct the Q-Q plot as in the book, you must select “Normal quantiles on y-axis” as highlighted in the figure to the left.
  • In the video, we will go through some examples of how to open a book data set and analyze it with a histogram and with a Q-Q plot.
  • On the midterm, you will be expected to be able to analyze data in StatCrunch as in the following.
  • You will likely be asked to produce plots of one of the variables and analyze the shape of the distribution for signs of:
    1. normality;
    2. skewness; or
    3. uniformity.
  • You may use any combination of Q-Q plots and histograms, and we will consider both in the following.

Motivation for point estimates and confidence intervals

Sample proportions tend to be normally distributed about the true parameter as the mean.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In the last lecture, we saw how a sample proportion generates a random variable.
  • That is, we take a sample of a population and compute the proportion of the sample for which some statement is true.
  • Suppose we want to replicate this sampling procedure infinitely many times.
    • It is impossible to replicate the sampling infinitely many times, but we can construct a probabilistic model for this replication process with a probability distribution.
  • Formally, we will define \( \hat{p} \) to be the random variable equal to the proportion derived from a random sample of \( n \) observations.
    • For each replication, \( \hat{p} \) attains a different value based on chance.
  • Then, for random, independent samples, \( \hat{p} \) tends to be normally distributed about \( p \).
    • We can thus use the value of \( \hat{p} \) and the distribution of \( \hat{p} \) to estimate \( p \) and how close we are to it.
  • We know that \( \hat{p} \) is an unbiased estimator of the true population proportion \( p \).
    • That is, over infinitely many resamplings, the expected value (the mean of the probability distribution) of \( \hat{p} \) is equal to \( p \); the simulation sketch at the end of this slide illustrates this.
  • When we have a specific sample data set, and a specific value for \( \hat{p} \) associated to it, \( \hat{p} \) is called a point estimate for \( p \).
    • The measure of “how-close” we think this is to the true value is called a confidence interval.
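A minimal simulation sketch (the true \( p \) and sample size below are arbitrary choices made for illustration) shows the behavior described above: the replicated values of \( \hat{p} \) center on \( p \):

```python
import numpy as np

rng = np.random.default_rng(seed=4)
p, n, reps = 0.5, 1487, 10_000   # assumed true proportion, sample size, replications

# Each replication: count successes in a sample of n trials, divide by n.
p_hat = rng.binomial(n, p, size=reps) / n

print(p_hat.mean())  # close to p = 0.5, illustrating that p-hat is unbiased
print(p_hat.std())   # small spread about p, due only to sampling error
```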

Point estimates for population proportions

Sample proportions tend to be normally distributed about the true parameter as the mean.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In addition to the review on the last slide, we will briefly discuss why \( \hat{p} \) is a “best” estimate for \( p \).
    • Recall, \( \hat{p} \) as a random variable tends to have a normal distribution around \( p \).
    • This distribution can be characterized in the same way as any other normal distribution, in terms of:
      1. its mean \( \mu_\hat{p} = p \); and
      2. its standard deviation \( \sigma_\hat{p} \).
  • Let’s suppose that \( \tilde{p} \) is some other estimate for \( p \) that is unbiased, i.e., \[ \mu_\tilde{p} = p. \]
  • It is an extremely important property that the standard deviation of the other estimate is at least as large as \( \sigma_\hat{p} \), i.e., \[ \sigma_\tilde{p} \geq \sigma_\hat{p}. \]
  • In plain English, this says that even though \( \hat{p} \) tends to vary around \( p \) due to sampling error, the amount it varies away from \( p \) tends to be less than for any other unbiased estimator; the simulation sketch at the end of this slide illustrates this property.
  • We can think of the sample proportion as the most accurate point estimate, if we were to replicate samples arbitrarily many times.
    • This is fortunate because it is also the most “natural” choice for an estimate in some sense.
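To illustrate the minimum-variance property, here is a hedged simulation sketch comparing \( \hat{p} \) to one hypothetical competing unbiased estimator, the proportion computed from only the first half of each sample (this competitor is an invented example for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
p, n, reps = 0.5, 1000, 10_000

# reps independent samples, each a row of n Bernoulli(p) trials.
samples = rng.binomial(1, p, size=(reps, n))

p_hat = samples.mean(axis=1)                 # the usual sample proportion
p_tilde = samples[:, : n // 2].mean(axis=1)  # competitor: first half of the sample only

print(p_hat.mean(), p_tilde.mean())  # both close to p: both unbiased
print(p_hat.std(), p_tilde.std())    # the competitor varies more about p
```

Both estimators average to \( p \), but the competitor's standard deviation is larger, matching \( \sigma_\tilde{p} \geq \sigma_\hat{p} \).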

Confidence intervals for population proportions

  • Let’s consider an example.
    • A Gallup poll was given to assess the population of US adults.
    • The poll randomly selected \( 1487 \) adults for their sample and found that \( 43\% \) of respondents had a Facebook account.
    • We should not believe that exactly \( 43\% \) of US adults actually have a Facebook account.
      • This proportion \( \hat{p} \) cannot even be exact for their sample size: \( 43\% \) of \( 1487 \) would be \( 639.41 \) respondents, which is impossible.
    • However, \( \hat{p} \) is the best estimate (given the sample) for the true \( p \).
  • Recall, \( \hat{p} \) tends to be normally distributed about \( p \);
    • therefore we can describe the probability of finding such a \( \hat{p} \) in terms of how far \( \hat{p} \) deviates from its mean \( p \).
  • The issue is, of course, that we know \( \hat{p} \) but we do not know \( p \).
  • For this reason, we will construct a region in which we have a level of confidence for where \( p \) lies.
    • This region will be based on how far values like \( \hat{p} \) tend to lie from the true mean, based on the standard deviation \( \sigma_\hat{p} \).
  • We will suppose for the moment that the region can be constructed as, \[ 0.405 < {\color{#1b9e77} p } < 0.455, \] and we have \( 95\% \) confidence that \( p \) lies here; the short computation at the end of this slide recovers these numbers.
  • Notice, \( p \) is a fixed, non-random (but unknown) value and it doesn’t make sense to describe the “probability” of where it lies in relation to the value \( \hat{p} \).
  • A \( 95\% \) confidence interval thus describes a procedure for producing an interval that will work \( 95\% \) of the time when samples are replicated and \( \hat{p} \) is recomputed.
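As a check of the numbers above, here is a short computation using the standard large-sample margin of error \( z_\frac{\alpha}{2}\sqrt{\hat{p}(1-\hat{p})/n} \) (a standard formula assumed here; the slides have not yet derived it):

```python
import numpy as np
from scipy import stats

p_hat, n = 0.43, 1487            # the Gallup poll values from this slide
z = stats.norm.ppf(0.975)        # 95% critical value, about 1.96

# Margin of error for the sample proportion, then the interval endpoints.
moe = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - moe, p_hat + moe)  # approximately 0.405 and 0.455
```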

Confidence intervals for population proportions continued

Confidence intervals under replication.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • As a schematic of this, consider the figure to the left.
    • In the figure, each vertical line will represent a different replication of the sampling procedure.
    • We will suppose that the true population proportion is actually \( p=0.50 \).
    • Notice that our Gallup poll sample confidence interval, \[ 0.405 < {\color{#1b9e77} p } < 0.455, \] does not actually contain the true population parameter – this is due to sampling error.
  • However, if we take enough replications of the sampling procedure, \( p=0.50 \) should lie in the associated confidence interval about \( 95\% \) of the time.
  • Remember, in this “frequentist” statistical framework, \( p \) is not random; it is a fixed but unknown value.
  • On the other hand, our confidence intervals are random and depend on a particular outcome of the sampling replication.
  • Note: when we have a particular sample data set in hand, \( \hat{p} \) is also just a fixed point estimate, and its confidence interval will also be fixed.
  • Therefore, we say that the \( 95\% \) confidence interval represents a procedure that should work \( 95\% \) of the time;
    • we do not guarantee, however, that \( p \) actually lies within this range; the simulation sketch below replicates this procedure to illustrate the \( 95\% \) success rate.
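A simulation sketch of this replication idea (using the same assumed margin-of-error formula as above, with the true \( p = 0.50 \) from the figure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)
p, n, reps = 0.50, 1487, 10_000             # true proportion, sample size, replications
z = stats.norm.ppf(0.975)                   # 95% critical value

p_hat = rng.binomial(n, p, size=reps) / n   # one sample proportion per replication
moe = z * np.sqrt(p_hat * (1 - p_hat) / n)  # one margin of error per replication

covered = (p_hat - moe < p) & (p < p_hat + moe)
print(covered.mean())                       # close to 0.95, the coverage rate
```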

Confidence intervals for population proportions continued

Common confidence levels.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • We will now discuss confidence intervals (CI) more formally.
    • Suppose that we want to estimate the true proportion \( p \) with some level of confidence:
      • if we replicated the sampling procedure infinitely many times, the long-run fraction of replications in which we found \( p \) in our confidence interval would equal the level of confidence.
  • Let’s take an example confidence level of \( 95\% \) – this corresponds to a rate of failure of \( 5\% \) over infinitely many replications.
  • Generally, we will write the confidence level as, \[ (1 - \alpha) \times 100\% \] so that we can associate this confidence level with its rate of failure \( \alpha \).
  • Recall, we earlier studied ways that we can compute the critical value associated to some \( \alpha \) for the normal distribution.
  • We will use the same principle here to find how wide the interval around \( p \) must be so that \( \hat{p} \) lies within it \( (1-\alpha)\times 100\% \) of the time.
  • This is equivalent to a two-sided measure of extremeness in the normal distribution, i.e.,
    • we want to find the critical value \( z_\frac{\alpha}{2} \) for which:
      • \( (1-\frac{\alpha}{2})\times 100\% \) of the area under the normal density lies to the left of \( z_\frac{\alpha}{2} \); and
      • \( (1-\frac{\alpha}{2})\times 100\% \) of the area under the normal density lies to the right of \( -z_\frac{\alpha}{2} \).
    • Put together, \( (1-\alpha)\times 100\% \) of values lie within \( [-z_\frac{\alpha}{2},z_\frac{\alpha}{2}] \); the short sketch below computes these critical values for common confidence levels.
Area between alpha over two critical values.
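A short sketch computing \( z_\frac{\alpha}{2} \) for some common confidence levels, by inverting the standard normal CDF at \( 1 - \frac{\alpha}{2} \):

```python
from scipy import stats

# z_(alpha/2) leaves an area of alpha/2 in the upper tail, so it is the
# inverse of the standard normal CDF evaluated at 1 - alpha/2.
for conf in [0.90, 0.95, 0.99]:
    alpha = 1 - conf
    z = stats.norm.ppf(1 - alpha / 2)
    print(f"{conf:.0%} confidence: z = {z:.3f}")
# prints approximately 1.645, 1.960, and 2.576
```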

Confidence intervals for population proportions and critical values

Confidence interval for alpha value.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In the figure to the left, we see exactly how to find the width of the region for a given confidence level.
    • For a given confidence level \( (1 -\alpha)\times 100\% \), we first find the particular \( \alpha \).
    • We then find the associated two-sided measure of extremeness with the \( z_\frac{\alpha}{2} \) critical value.
    • This critical value measures how extreme it is to find an observation that lies far away from the mean.
    • Particularly, only \( \alpha \times 100\% \) of the population lies outside of the region \( [-z_\frac{\alpha}{2},z_\frac{\alpha}{2}] \).
  • Put another way, if you randomly draw a population member, there is a \( (1 -\alpha)\times 100\% \) chance that the mean lies at most \( z_\frac{\alpha}{2} \) standard deviations away from this observation.
  • The sample proportion point estimate \( \hat{p} \) can be considered a random draw from the population of all point estimates distributed around \( p \).
  • We do not know how close a particular point estimate (depending on a particular sample) is to \( p \).
  • However, we know that over all possible replications of the sampling procedure, the point estimate \( \hat{p} \) will lie within \( z_\frac{\alpha}{2} \) standard deviations of \( p \) exactly \( (1-\alpha)\times 100\% \) of the time.
  • Therefore, we have \( (1-\alpha)\times 100\% \) confidence that the true \( p \) lies within this region about \( \hat{p} \); the display below makes the region explicit.
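  • As a preview of where this leads, the region can be written out explicitly; the formula for \( \sigma_\hat{p} \) below is a standard result that these slides have not yet derived, and since \( p \) is unknown it is estimated by substituting \( \hat{p} \): \[ \hat{p} - z_\frac{\alpha}{2} \sigma_\hat{p} < {\color{#1b9e77} p } < \hat{p} + z_\frac{\alpha}{2} \sigma_\hat{p}, \qquad \sigma_\hat{p} = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]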