Sampling distributions and the central limit theorem

03/31/2021

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- Random samples
- Sampling distributions
- Point estimators
- Central limit theorem
- Applications of the central limit theorem
- Approximate sampling distribution of a difference in sample means

Motivation

We have now learned about the fundamentals of theoretical probabilistic models.
Particularly, we have learned about the:
- probability distribution;
- probability mass / density function; and
- cumulative probability distribution function;
for discrete and continuous random variables.
We have also learned about several fundamental probability distributions:
- the binomial;
- the uniform; and
- the normal.
We will begin to combine these models with data to produce statistical inference.
Our goal in this course is to use statistics from a small, representative sample to say something general about the larger, unobservable population or phenomena.
The process of saying something general from the smaller representative sample, while qualifying our uncertainty, is what we mean by statistical inference.

Motivation continued

The link between the probability models in the earlier chapters and the data is made as follows.
Suppose we take a sample of \( n = 10 \) observations \( \{x_{1,i}\}_{i=1}^{10} \) from a population and compute the sample average,

\[ \overline{x}_1 = \frac{1}{n} \sum_{i=1}^n x_{1,i} = \frac{1}{10}\sum_{i=1}^{10} x_{1,i} \]

getting the result \( \overline{x}_1 = 10.2 \).
Now we repeat this process, taking a second sample of \( n = 10 \) observations from the same population,

\[ \{x_{2,i}\}_{i=1}^{10} \]

and the resulting sample average is \( \overline{x}_2=10.4 \).
This discrepancy is what we call sampling error, in which the random variation in a sample of a fixed size \( n \) upon replication produces differences in the computation of a statistic.
The sample average depends on the observations in the sample, which differ from sample to sample because they are random variables.
Consequently, the sample average (or any other function of the sample data) is a random variable.
Because a statistic is a random variable, it has a probability distribution.

Motivation continued

Specifically, suppose that we want to obtain an estimate of a population parameter, where the population is modeled with a random variable \( X \).
We know that before the data are collected, the observations are considered to be random variables,
- i.e., we treat an independent sequence of measurements of \( X \),
\[ X_1, X_2, \cdots , X_n \]
- as random variables all drawn from a parent distribution \( X \sim F(x) \) (where the CDF will define the distribution).
Random sample
The random variables \( X_1 , X_2, \cdots , X_n \) are a random sample of size \( n \) if the \( X_i \)’s are independent random variables and every \( X_i \) has the same probability distribution.
We then say that the measurements we obtain are possible outcomes of the sample variables \( \{X_i\}_{i=1}^n \); particularly, if we make a computation of the sample mean,

\[ \overline{X} = \frac{1}{n} \sum_{i=1}^n X_i \]

the above is treated as a random variable (a linear combination of random variables) which has a random outcome, dependent on the realizations of the \( X_i \).

Motivation continued

More generally, any function of the observations, i.e., any statistic, is also modeled as a random variable.
If \( h \) is a general function used to compute some statistic, we thus define

\[ \tilde{X} = h(X_1, \cdots, X_n) \]

to be a random variable that will depend on the particular realizations of \( X_1,\cdots, X_n \).
We call the probability distribution of a statistic a sampling distribution.

Sampling Distribution
The probability distribution of a statistic is called a sampling distribution.

Motivation continued

Given particular realizations of the sample random variables, we obtain a fixed numerical value.
Each numerical value in a data set is treated as the observed realization of a random variable.
Given particular realizations \( x_1,\cdots,x_n \) of the random variables \( X_1, \cdots, X_n \), the value

\[ \overline{x} = \frac{1}{n}\sum_{i=1}^n x_i \]

is not a random variable, as this is a fixed numerical value.
Given some particular, observed realizations \( x_1, \cdots,x_n \),

\[ \tilde{x} = h(x_1, \cdots, x_n) \]

is a fixed numerical value, based on the fixed, observed data values \( x_1, \cdots, x_n \).

Motivation continued

When discussing inference problems, it is convenient to have a general symbol to represent the parameter of interest – we use the Greek symbol \( \theta \) (theta) to represent the parameter.
The symbol \( \theta \) can represent the mean \( \mu \), the variance \( \sigma^2 \), or any parameter of interest to us.
The objective of point estimation is to estimate a single number based on sample data that is the most plausible value for \( \theta \).
The numerical value of a sample statistic is used as the point estimate.
Once we describe the process of point estimation, the next step is to describe how we quantify the uncertainty of the estimate.
If \( X \) is a random variable with probability distribution \( F(x) \), characterized by the unknown parameter \( \theta \),

and if \( X_1 , X_2, \cdots , X_n \) is a random sample of size \( n \) from \( X \),
the statistic \( \hat{\Theta} = h(X_1 , X_2 , ... , X_n ) \) given as a function of the sample is called a point estimator of \( \theta \).
Note that \( \hat{\Theta} \) is a random variable because it is a function of random variables.
After the sample has been selected, \( \hat{\Theta} \) takes on a particular numerical value \( \hat{\Theta} \) called the point estimate of \( \theta \).
The uncertainty of the point estimate \( \hat{\Theta} \) can be understood as how much will the sampling error cause a discrepancy between \( \hat{\Theta} \) and the true \( \theta \).

Point estimators

We will now introduce some formal definitions:
Point estimators
A point estimate of some population parameter \( \theta \) is a single numerical value \( \hat{\Theta} \) of a statistic \( \hat{\Theta} \). This is a particular realization of the random variable \( \hat{\Theta} \), viewed as a random variable; \( \hat{\Theta} \) is called the point estimator.

Estimation problems modeled as above occur frequently in engineering.
We often need to estimate
- The mean \( \mu \) of a single population
- The variance \( \sigma^2 \) (or standard deviation \( \sigma \)) of a single population
- The proportion \( p \) of items in a population that belong to a class of interest
- The difference in means of two populations, \( \mu_1 - \mu_2 \)
- The difference in two population proportions, \( p_1 − p_2 \)

Point estimators continued

Reasonable point estimates of these parameters are as follows:
- For \( \mu \),
  - the estimate is \( \hat{\mu}=\overline{x} \), the sample mean.
- For \( \sigma^2 \),
  - the estimate is \( \hat{\sigma}^2 = s^2 \), the sample variance.
- For \( p \),
  - the estimate is \( \hat{p}=\frac{x}{n} \), the sample proportion, where \( x \) is the number of items in a random sample of size \( n \) that belong to the class of interest.
- For \( \mu_1 -\mu_2 \),
  - the estimate \( \hat{\mu}_1 - \hat{\mu}_2 = \overline{x}_1 - \overline{x}_2 \), the difference between the sample means of two independent random samples.
- For \( p_1 − p_2 \) ,
  - the estimate is \( \hat{p}_1 - \hat{p}_2 \) , the difference between two sample proportions computed from two independent random samples.
Generally, however, we may have several different choices for the point estimator of a parameter.
To decide which point estimator of a particular parameter is the best one to use, we need to examine their statistical properties and develop some criteria for comparing estimators.

Central limit theorem

Let's consider a simple argument for the sampling distribution of the sample mean \( X \).
Suppose that a random sample of size \( n \) is taken from a normal population with mean \( \mu \) and variance \( \sigma^2 \).
By definition of a random sample each observation in this sample, say, \( X_1, X_2, \cdots, X_n \), is a normally and independently distributed random variable with mean \( \mu \) and variance \( \sigma^2 \).
A special property of the normal distribution is that it can be translated and rescaled while remaining normal;
- similarly, a sum of independent, normally distributed random variables are also normally distributed.
We conclude that the sample mean

\[ \overline{X}= \frac{X_1 + X_2 + \cdots + X_n}{n} \]

has a normal distribution with mean

\[ \mu_\overline{X} = \frac{\mu + \mu + \cdots + \mu}{n} = \mu \]
- and variance
\[ \sigma^2_\overline{X} = \frac{\sigma^2 + \sigma^2 + \cdots + \sigma^2}{n^2} = \frac{\sigma^2}{n} \]

Central limit theorem continued

More generally, if we are sampling from a population that has an unknown probability distribution, the sampling distribution of the sample mean will still be approximately normal with mean \( \mu \) and variance \( \frac{\sigma^2}{n} \) if the sample size \( n \) is large.
This is one of the most useful theorems in statistics, called the central limit theorem:

The central limit theorem
Let \( X_1 , X_2 , \cdots , X_n \) be a random sample of size \( n \) taken from a population with mean \( \mu \) and finite variance \( \sigma^2 \) and \( \overline{X} \) be the sample mean. Then the limiting form of the distribution of \[ Z = \frac{X - \mu}{\frac{\sigma}{\sqrt{n}}} \] as \( n \rightarrow \infty \) is the standard normal distribution.
Put another way, for \( n \) sufficiently large, \( \overline{X} \) has approximately a \( N\left(\mu, \frac{\sigma^2}{n}\right) \) distribution – this says the following.
- Suppose we take a sample of size \( n \) and compute the sample mean \( \overline{X} \).
- Then suppose we replicate this sample and record the observed realizations for the sample mean \( \overline{x}_1, \overline{x}_2, \cdots \).
- If the sample size \( n \) is lage, these data points \( \overline{x}_1, \cdots \) will be approximately bell shaped with the following properties:
  - the bell will be centered approximately at \( \mu \), the true population mean;
  - the spread of the data around the center will be given by approximately by the standard deviation \( \frac{\sigma}{\sqrt{n}} \).
- Particularly, if \( n \) is very large, the observed sample means will be very close to the center (the true mean).

Central limit theorem continued

As a visualization of the concept, suppose again that we have a random sample indexed by \( j \) \[ X_{j,1}, \cdots, X_{j,n}. \]
We will make replications for \( j=1,\cdots,m \) and get a random variable for sample mean indexed by \( j \), \[ \overline{X}_j = \frac{1}{n}\sum_{i=1}^n X_{j,i}. \]
When we observe a realization of \( \overline{X}_j=\overline{x}_j \) or respectively the sample \[ X_{j,1}=x_{j,1}, \cdots, X_{j,n}=x_{j,n}, \] we record these fixed numerical values.

Courtesy of Mathieu ROUAUD, CC BY-SA 4.0, via Wikimedia Commons

The measurements \( X_{j,i} \) may be distributed according to any underlying distribution with mean \( \mu \) and standard deviation \( \sigma \).
However, if \( n \) is large, the \( \overline{X}_j \) is approximately normal with mean \( \mu \) and standard deviation \( \frac{\sigma}{\sqrt{n}} \).
The sample mean data from given realizations \( x_{i,j} \), \( \overline{x}_j \), will have approximately a bell shaped frequency, centered approximately at \( \mu \).
The spread of the data will be approximately \( \frac{\sigma}{\sqrt{n}} \).
Particularly, as \( n\rightarrow \infty \), the spread shrinks to zero, so that we get a better and better estimate (more peaked bell shape) with large sample sizes.

Central limit theorem continued

The central limit theorem is the underlying reason why many of the random variables encountered in engineering and science are normally distributed.
The observed variable results from a series of underlying disturbances that act together to create a central limit effect.
- This can be thought in terms of the sum of random disturbances averaged over a time interval will have an average effect like a normal variable.
It is important, however, to consider when the sample size large enough so that the central limit theorem can be assumed to apply.
The answer depends on how close the underlying distribution is to the normal:
- if the underlying distribution is normal, any sample size will work;
- if the underlying distribution is symmetric and unimodal (not too far from normal), the central limit theorem will apply for sample sizes as low as 4 or 5.
- if the sampled population is very nonnormal, if the sample size is greater than 30, the central limit theorem will usually apply; however, there are exceptions to this guideline.

Applications of central limit theorem

Suppose an electronics company manufactures resistors that have a mean resistance of \( \mu=100 \) ohms and a standard deviation of \( \sigma=10 \) ohms.
We will assume that the distribution of resistance is normal, (i.e., the sampling distribution of the sample mean is automatically normal).

I.e., the distribution for \( \overline{X} \) is the normal with mean, \[ \mu_\overline{X} = \mu = 100 \] and standard deviation \[ \sigma_\overline{X} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{n}}. \]

Suppose we want to find the probability that a random sample of \( n = 25 \) resistors will have an average resistance of fewer than \( 95 \) ohms.
Notice that for a sample size of \( n=25 \), the sampling distribution for \( \overline{X} \) is given by the normal with mean \( \mu=100 \) and standard deviation \( \frac{10}{5}=2 \).

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

Let's consider how to compute this probability in R.

Application of central limit theorem continued

Recall, we are trying to compute

\[ P\left(\overline{X} < 95\right) \]

where \( \overline{X} \) is normally distributed with \( \mu_\overline{X}=100 \) and \( \sigma_\overline{X}=2 \).
We can compute the standard normal z-scores as

\[ \frac{95-100}{2} = -2.5 \]
In R, we can use the pnorm from last time to compute

pnorm(-2.5)

[1] 0.006209665

Application of central limit theorem continued

Let's note that pnorm also has alternative settings that allow us to make the probability computation for a general normal.
pnorm can use keyword arguments mean and sd standing for the mean and standard deviation respectively.
Setting these values determines the normal distribution, so that we can compute the earlier probability directly as follows:

pnorm(95, mean=100, sd=2)

[1] 0.006209665

pnorm(-2.5)

[1] 0.006209665

The above demonstrates the equivalence of the approaches.
Generally, computing this directly is preferable so that we don't make errors in computing the z-score by hand.
This example shows that if the distribution of resistance is normal with mean \( \mu=100 \) ohms and standard deviation of \( \sigma=10 \) ohms, finding a random sample of resistors with a sample mean less than \( 95 \) ohms is a rare event.
If this actually happens, it casts doubt as to whether the true mean is really \( 100 \) ohms or if the true standard deviation is really \( 10 \) ohms.
We will come back to this idea when we introduce hypothesis testing.

Application of central limit theorem continued

Suppose that a random variable X has a continuous uniform distribution with density \[ f (x) = \begin{cases} 1∕2 & 4 ≤ x ≤ 6\\ 0 & \text{else} \end{cases} \]
We will find the distribution of the sample mean of a random sample of size \( n = 40 \).

Notice, the sample size \( n>30 \) and this is a unimodal distribution, so the central limit theorem will give a good approximation.

The mean and variance of \( X \) are \( \mu = 5 \) and \( \sigma^2 = \frac{(6 − 4)^2}{12} = 1/3 \).
The central limit theorem indicates that the distribution of \( X \) is approximately normal with mean \( \mu_X = 5 \) and variance \[ σ^2_\overline{X} = \frac{\sigma^2}{n} = \frac{1/3}{40} = \frac{1}{120}. \]
This says that the distribution for \( \overline{X} \) from the above uniform with a sample size \( n=40 \) will be extremely peaked at the mean \( \mu \).

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

Approximate sampling distribution of a difference in sample means

We will finally consider the case in which we have two independent populations.
Let the first population have mean \( \mu_1 \) and variance \( \sigma^2_1 \) and the second population have mean \( \mu_2 \) and variance \( \sigma^2_2 \).
Suppose that both populations are normally distributed.
Linear combinations of independent normal random variables follow a normal distribution, so that \( X_1 - X_2 \) is also normal.
Suppose that \( \overline{X}_1 \) is the sample mean for the distribution of \( X_1 \) with a sample size \( n_1 \);
- similarly, suppose that \( \overline{X}_1 \) is the sample mean for the distribution of \( X_1 \) with a sample size \( n_2 \).
Then, the sampling distribution of \( \overline{X}_1 − \overline{X}_2 \) is also normal with mean and variance

\[ \begin{align} \mu_{\overline{X}_1 - \overline{X}_2} &= \mu_{\overline{X}_1} - \mu_{\overline{X}_2} = \mu_{X_1} - \mu_{X_2}\\ \sigma^2_{\overline{X}_1 - \overline{X}_2} &= \sigma^2_{\overline{X}_1} - \sigma^2_{\overline{X}_2} = \frac{\sigma^2_{X_1}}{n_1} - \frac{\sigma^2_{X_2}}{n_2}\\ \end{align} \]
That is to say, we have a normal model for the difference of the two samples from two independent populations;
- in particular, the mean difference and the standard deviation of the difference can be computed like with the central limit theorem.

Approximate sampling distribution of a difference in sample means continued

More generally, we can use the above argument as an approximation when the sample size is large, i.e., usually when \( n>30 \).

Approximate sampling distribution of a difference in sample means
Suppose we have two independent populations with means \( \mu_1 \) and \( \mu_2 \) and variances \( \sigma_1^2 \) and \( \sigma_2^2 \) and if \( \overline{X}_1 \) and \( \overline{X}_2 \) are the sample means of two independent random samples of sizes \( n_1 \) and \( n_2 \) from these populations. Then the sampling distribution of \[ \begin{align} Z = \frac{\overline{X}_1 − \overline{X}_2 − (\mu_1 − \mu_2)}{\sigma_1^2/n_1 + \sigma_2^2 ∕n_2} \end{align} \] is approximately standard normal if the conditions of the central limit theorem apply. If the two populations are normal, the sampling distribution of \( Z \) is exactly standard normal.
To put this another way, we say that \( \overline{X}_1 - \overline{X}_2 \) has approximately a normal distribution with mean and variance

\[ \begin{align} \mu_{\overline{X}_1 - \overline{X}_2} &= \mu_{X_1} - \mu_{X_2}\\ \sigma^2_{\overline{X}_1 - \overline{X}_2} &= \frac{\sigma^2_{X_1}}{n_1} - \frac{\sigma^2_{X_2}}{n_2}\\ \end{align} \]

so that with technology, we can compute the probability directly (without z-scores).
To compute the probability of \( \overline{X}_1 - \overline{X}_2 \) being in some range, we can use pnorm with the appropriate parameters for mean and sd given as keyword arguments.