The central limit theorem continued and general concepts of point estimation

04/05/2021

Outline

The following topics will be covered in this lecture:
- A review of the central limit theorem
- Applications of the central limit theorem
- Approximate sampling distribution of a difference in sample means
- General concepts in point estimation
- Bias of estimators
- Variance of estimators
- Standard Error

A review of the central limit theorem

Suppose that we want to obtain an estimate of a population parameter, where the population is modeled with a random variable \( X \).
We know that before the data are collected, the observations are considered to be random variables,
- i.e., we treat an independent sequence of measurements of \( X \),
\[ X_1, X_2, \cdots , X_n \]
- as random variables all drawn from a parent distribution \( X \sim F(x) \) (where the CDF will define the distribution).
Random sample
The random variables \( X_1 , X_2, \cdots , X_n \) are a random sample of size \( n \) if the \( X_i \)’s are independent random variables and every \( X_i \) has the same probability distribution.
We then say that the measurements we obtain are possible outcomes of the sample variables \( \{X_i\}_{i=1}^n \); particularly, if we make a computation of the sample mean,

\[ \overline{X} = \frac{1}{n} \sum_{i=1}^n X_i \]

the above is treated as a random variable (a linear combination of random variables) which has a random outcome, dependent on the realizations of the \( X_i \).

A review of the central limit theorem

Generally, if we are sampling from a population that has an unknown probability distribution, the sampling distribution of the sample mean will still be approximately normal with mean \( \mu \) and variance \( \frac{\sigma^2}{n} \) if the sample size \( n \) is large.
This is one of the most useful theorems in statistics, called the central limit theorem:

The central limit theorem
Let \( X_1 , X_2 , \cdots , X_n \) be a random sample of size \( n \) taken from a population with mean \( \mu \) and finite variance \( \sigma^2 \) and \( \overline{X} \) be the sample mean. Then the limiting form of the distribution of \[ Z = \frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \] as \( n \rightarrow \infty \) is the standard normal distribution.
Put another way, for \( n \) sufficiently large, \( \overline{X} \) has approximately a \( N\left(\mu, \frac{\sigma^2}{n}\right) \) distribution – this says the following.
- Suppose we take a sample of size \( n \) and compute the sample mean \( \overline{X} \).
- Then suppose we replicate this sample and record the observed realizations for the sample mean \( \overline{x}_1, \overline{x}_2, \cdots \).
- If the sample size \( n \) is large, a histogram of these data points \( \overline{x}_1, \overline{x}_2, \cdots \) will be approximately bell shaped with the following properties:
  - the bell will be centered approximately at \( \mu \), the true population mean;
  - the spread of the data around the center will be approximately the standard deviation \( \frac{\sigma}{\sqrt{n}} \).
- Particularly, if \( n \) is very large, the observed sample means will be very close to the center (the true mean).

Central limit theorem continued

As a visualization of the concept, suppose again that we have a random sample indexed by \( j \) \[ X_{j,1}, \cdots, X_{j,n}. \]
We will make replications for \( j=1,\cdots,m \) and get a random variable for sample mean indexed by \( j \), \[ \overline{X}_j = \frac{1}{n}\sum_{i=1}^n X_{j,i}. \]
When we observe a realization of \( \overline{X}_j=\overline{x}_j \) or respectively the sample \[ X_{j,1}=x_{j,1}, \cdots, X_{j,n}=x_{j,n}, \] we record these fixed numerical values.

Courtesy of Mathieu ROUAUD, CC BY-SA 4.0, via Wikimedia Commons

The measurements \( X_{j,i} \) may be distributed according to any underlying distribution with mean \( \mu \) and standard deviation \( \sigma \).
However, if \( n \) is large, the \( \overline{X}_j \) is approximately normal with mean \( \mu \) and standard deviation \( \frac{\sigma}{\sqrt{n}} \).
The observed sample means \( \overline{x}_j \), computed from the realizations \( x_{j,i} \), will have an approximately bell-shaped frequency distribution, centered approximately at \( \mu \).
The spread of the data will be approximately \( \frac{\sigma}{\sqrt{n}} \).
Particularly, as \( n\rightarrow \infty \), the spread shrinks to zero, so that we get a better and better estimate (more peaked bell shape) with large sample sizes.
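
The replication experiment described above can be sketched in R. This is only an illustration under stated assumptions: the exponential parent distribution with rate 1 (so that \( \mu = 1 \) and \( \sigma = 1 \)), the values of `n` and `m`, the seed, and all variable names are choices made for this example and do not come from the lecture.

```r
# Simulate m replications of a random sample of size n from a skewed
# (exponential) parent distribution and inspect the sample means.
set.seed(1)    # fixed seed so that the run is reproducible
n <- 50        # sample size within each replication (assumed)
m <- 10000     # number of replications (assumed)
mu <- 1        # mean of the Exp(rate = 1) parent
sigma <- 1     # standard deviation of the Exp(rate = 1) parent

# Each column of the matrix holds one replicated sample of size n.
samples <- matrix(rexp(n * m, rate = 1), nrow = n, ncol = m)
xbar <- colMeans(samples)   # one observed sample mean per replication

mean(xbar)          # should be close to mu
sd(xbar)            # should be close to sigma / sqrt(n)
sigma / sqrt(n)     # theoretical spread of the sample mean

# The histogram should look roughly bell shaped and centered near mu.
hist(xbar, breaks = 50, main = "Replicated sample means")
```

Increasing `n` in this sketch should make the histogram narrower, in line with the \( \frac{\sigma}{\sqrt{n}} \) spread.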

Central limit theorem continued

The central limit theorem is the underlying reason why many of the random variables encountered in engineering and science are normally distributed.
The observed variable results from a series of underlying disturbances that act together to create a central limit effect.
- This can be thought of as follows: the sum of many random disturbances, averaged over a time interval, behaves approximately like a normal variable.
It is important, however, to consider when the sample size is large enough for the central limit theorem to apply.
The answer depends on how close the underlying distribution is to the normal:
- if the underlying distribution is normal, any sample size will work;
- if the underlying distribution is symmetric and unimodal (not too far from normal), the central limit theorem will apply for sample sizes as low as 4 or 5;
- if the sampled population is very nonnormal, the central limit theorem will usually apply when the sample size is greater than 30; however, there are exceptions to this guideline.

Applications of central limit theorem

Suppose an electronics company manufactures resistors that have a mean resistance of \( \mu=100 \) ohms and a standard deviation of \( \sigma=10 \) ohms.
We will assume that the distribution of resistance is normal, so that the sampling distribution of the sample mean is exactly normal for any sample size.

That is, the distribution for \( \overline{X} \) is normal with mean \[ \mu_{\overline{X}} = \mu = 100 \] and standard deviation \[ \sigma_{\overline{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{n}}. \]

Suppose we want to find the probability that a random sample of \( n = 25 \) resistors will have an average resistance of less than \( 95 \) ohms.
Notice that for a sample size of \( n=25 \), the sampling distribution for \( \overline{X} \) is given by the normal with mean \( \mu=100 \) and standard deviation \( \frac{10}{5}=2 \).

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

Let's consider how to compute this probability in R.

Application of central limit theorem continued

Recall, we are trying to compute

\[ P\left(\overline{X} < 95\right) \]

where \( \overline{X} \) is normally distributed with \( \mu_\overline{X}=100 \) and \( \sigma_\overline{X}=2 \).
We can compute the standard normal z-score as

\[ \frac{95-100}{2} = -2.5 \]
In R, we can use the pnorm function from last time to compute

pnorm(-2.5)

[1] 0.006209665

Application of central limit theorem continued

Let's note that pnorm also has alternative settings that allow us to make the probability computation for a general normal.
pnorm can use keyword arguments mean and sd standing for the mean and standard deviation respectively.
Setting these values determines the normal distribution, so that we can compute the earlier probability directly as follows:

pnorm(95, mean=100, sd=2)

[1] 0.006209665

pnorm(-2.5)

[1] 0.006209665

The above demonstrates the equivalence of the approaches.
Generally, computing this directly is preferable so that we don't make errors in computing the z-score by hand.
This example shows that if the distribution of resistance is normal with mean \( \mu=100 \) ohms and standard deviation of \( \sigma=10 \) ohms, finding a random sample of resistors with a sample mean less than \( 95 \) ohms is a rare event.
If this actually happens, it casts doubt as to whether the true mean is really \( 100 \) ohms or if the true standard deviation is really \( 10 \) ohms.
We will come back to this idea when we introduce hypothesis testing.
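
As a rough sanity check, the same probability can also be approximated by simulation. The sketch below assumes the normal model stated above; the number of replications `m`, the seed, and the variable names are arbitrary choices for illustration.

```r
# Approximate P(Xbar < 95) by simulating many samples of 25 resistors.
set.seed(1)   # fixed seed so that the run is reproducible
n <- 25       # sample size from the example
m <- 40000    # number of simulated samples (assumed)

# Each column is one simulated sample from N(mean = 100, sd = 10).
samples <- matrix(rnorm(n * m, mean = 100, sd = 10), nrow = n)
xbar <- colMeans(samples)

p_sim <- mean(xbar < 95)                              # simulated proportion
p_exact <- pnorm(95, mean = 100, sd = 10 / sqrt(n))   # value computed above
c(simulated = p_sim, exact = p_exact)
```

The simulated proportion will vary with the seed, but it should land near the pnorm value.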

Application of central limit theorem continued

Suppose that a random variable \( X \) has a continuous uniform distribution with density \[ f(x) = \begin{cases} 1/2 & 4 \leq x \leq 6\\ 0 & \text{else} \end{cases} \]
We will find the distribution of the sample mean of a random sample of size \( n = 40 \).

Notice, the sample size is \( n>30 \) and the underlying distribution is symmetric, so the central limit theorem will give a good approximation.

The mean and variance of \( X \) are \( \mu = 5 \) and \( \sigma^2 = \frac{(6 - 4)^2}{12} = 1/3 \).
The central limit theorem indicates that the distribution of \( \overline{X} \) is approximately normal with mean \( \mu_{\overline{X}} = 5 \) and variance \[ \sigma^2_{\overline{X}} = \frac{\sigma^2}{n} = \frac{1/3}{40} = \frac{1}{120}. \]
This says that the distribution for \( \overline{X} \) from the above uniform with a sample size \( n=40 \) will be extremely peaked at the mean \( \mu \).

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition
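
The numbers in this example can be reproduced in R, and the normal approximation can be checked with a small simulation. This is a sketch only: the number of replications `m`, the seed, and the variable names are assumptions made for illustration.

```r
# Parent distribution: continuous uniform on [a, b] = [4, 6].
a <- 4
b <- 6
n <- 40                      # sample size from the example

mu <- (a + b) / 2            # mean of the uniform: 5
sigma2 <- (b - a)^2 / 12     # variance of the uniform: 1/3
var_xbar <- sigma2 / n       # variance of the sample mean: 1/120
sd_xbar <- sqrt(var_xbar)    # standard deviation of the sample mean

# Simulation check: replicate the sample m times and compare.
set.seed(1)
m <- 10000                   # number of replications (assumed)
xbar <- colMeans(matrix(runif(n * m, min = a, max = b), nrow = n))

c(theory = mu, simulated = mean(xbar))
c(theory = var_xbar, simulated = var(xbar))
```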

Approximate sampling distribution of a difference in sample means

We will finally consider the case in which we have two independent populations.
Let the first population have mean \( \mu_1 \) and variance \( \sigma^2_1 \) and the second population have mean \( \mu_2 \) and variance \( \sigma^2_2 \).
Suppose that both populations are normally distributed.
Linear combinations of independent normal random variables follow a normal distribution, so that \( X_1 - X_2 \) is also normal.
Suppose that \( \overline{X}_1 \) is the sample mean for the distribution of \( X_1 \) with a sample size \( n_1 \);
- similarly, suppose that \( \overline{X}_2 \) is the sample mean for the distribution of \( X_2 \) with a sample size \( n_2 \).
Then, the sampling distribution of \( \overline{X}_1 - \overline{X}_2 \) is also normal with mean and variance

\[ \begin{align} \mu_{\overline{X}_1 - \overline{X}_2} &= \mu_{\overline{X}_1} - \mu_{\overline{X}_2} = \mu_{X_1} - \mu_{X_2}\\ \sigma^2_{\overline{X}_1 - \overline{X}_2} &= \sigma^2_{\overline{X}_1} + \sigma^2_{\overline{X}_2} = \frac{\sigma^2_{X_1}}{n_1} + \frac{\sigma^2_{X_2}}{n_2}\\ \end{align} \]
Note that the variances add even though the means subtract: for independent random variables \( X \) and \( Y \), \( \mathrm{var}(aX + bY) = a^2\mathrm{var}(X) + b^2\mathrm{var}(Y) \), and here \( a = 1 \) and \( b = -1 \).
That is to say, we have a normal model for the difference of the two sample means from two independent populations;
- in particular, the mean and the standard deviation of the difference can be computed as with the central limit theorem.

Approximate sampling distribution of a difference in sample means continued

More generally, we can use the above argument as an approximation when the sample sizes are large, i.e., usually when both \( n_1>30 \) and \( n_2>30 \).

Approximate sampling distribution of a difference in sample means
Suppose we have two independent populations with means \( \mu_1 \) and \( \mu_2 \) and variances \( \sigma_1^2 \) and \( \sigma_2^2 \), and that \( \overline{X}_1 \) and \( \overline{X}_2 \) are the sample means of two independent random samples of sizes \( n_1 \) and \( n_2 \) from these populations. Then the sampling distribution of \[ \begin{align} Z = \frac{\overline{X}_1 - \overline{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} \end{align} \] is approximately standard normal if the conditions of the central limit theorem apply. If the two populations are normal, the sampling distribution of \( Z \) is exactly standard normal.
To put this another way, we say that \( \overline{X}_1 - \overline{X}_2 \) has approximately a normal distribution with mean and variance

\[ \begin{align} \mu_{\overline{X}_1 - \overline{X}_2} &= \mu_{X_1} - \mu_{X_2}\\ \sigma^2_{\overline{X}_1 - \overline{X}_2} &= \frac{\sigma^2_{X_1}}{n_1} + \frac{\sigma^2_{X_2}}{n_2}\\ \end{align} \]

so that with technology, we can compute the probability directly (without z-scores).
To compute the probability of \( \overline{X}_1 - \overline{X}_2 \) being in some range, we can use pnorm with the appropriate parameters for mean and sd given as keyword arguments.
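
A minimal sketch of such a computation follows. All of the numerical values below (means, standard deviations, sample sizes, and the threshold of 25) are hypothetical and chosen only for illustration. The sketch relies on the fact that, for independent samples, the variance of the difference in sample means is the sum \( \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \).

```r
# Hypothetical population parameters and sample sizes (illustration only).
mu1 <- 5050; sigma1 <- 30; n1 <- 25
mu2 <- 5000; sigma2 <- 40; n2 <- 16

# Mean and standard deviation of Xbar1 - Xbar2 for independent samples.
mean_diff <- mu1 - mu2
sd_diff <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)   # variances add

# P(Xbar1 - Xbar2 >= 25), computed directly without z-scores.
p_direct <- 1 - pnorm(25, mean = mean_diff, sd = sd_diff)

# The same probability via the z-score, for comparison.
z <- (25 - mean_diff) / sd_diff
p_z <- 1 - pnorm(z)

c(direct = p_direct, via_z = p_z)   # both approximately 0.984
```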

General concepts of point estimation

Recall, any function of a random sample, i.e., any statistic, is modeled as a random variable.
If \( h \) is a general function used to compute some statistic, we thus define

\[ \hat{\Theta} = h(X_1, \cdots, X_n) \]

to be a random variable that will depend on the particular realizations of \( X_1,\cdots, X_n \).
We call the probability distribution of a statistic a sampling distribution.

Sampling Distribution
The probability distribution of a statistic is called a sampling distribution.
The sample mean

\[ \hat{\Theta} = \overline{X} = h(X_1, \cdots, X_n)= \frac{1}{n}\sum_{i=1}^n X_i \]

is now one example for which we have a model of the sampling distribution.
Specifically, the sampling distribution of the sample mean is exactly \( \overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \) when \( X \) is normal, and the central limit theorem says that this holds approximately if \( n \) is sufficiently large.

General concepts of point estimation continued

Recall, we had a special name for \( \hat{\Theta} \) in relation to the true parameter value \( \theta \):
Point estimators
A point estimate of some population parameter \( \theta \) is a single numerical value \( \hat{\theta} \) of a statistic \( \hat{\Theta} \). This value is a particular realization of \( \hat{\Theta} \); viewed as a random variable, \( \hat{\Theta} \) is called the point estimator.

We want an estimator to be “close” in some sense to the true value of the unknown parameter, but we know that it happens to be a random variable.
In this way, we need to describe how close this estimator is to the true value in a probabilistic sense.
As we have seen before, there are important parameters that describe a probability distribution or a data set:
1. the “center” of the data / distribution; and
2. the “spread” of the data / distribution.
The central limit theorem actually provided both of these (and the sampling distribution) for the sample mean:
1. the “center” of the distribution for \( \hat{\Theta}=\overline{X} \) was given by \( \mu \), the true population mean;
2. the “spread” of the distribution for \( \hat{\Theta}=\overline{X} \) was given by \( \frac{\sigma}{\sqrt{n}} \), the standard deviation of the population, divided by the square-root of the sample size.
The two above parameters thus give us a means of describing “how close” the sample mean \( \overline{X} \) tends to be to the population mean \( \mu \) in a probabilistic sense.

Bias of estimators

The notion of the “center” of the sampling distribution can be useful as a general criterion for estimators.
Formally, we say that \( \hat{\Theta} \) is an unbiased estimator of \( \theta \) if the expected value of \( \hat{\Theta} \) is equal to \( \theta \).
This is equivalent to saying that the mean of the probability distribution of \( \hat{\Theta} \) (or the mean of the sampling distribution of \( \hat{\Theta} \)) is equal to \( \theta \).

Bias of an Estimator
The point estimator \( \hat{\Theta} \) is an unbiased estimator for the parameter \( \theta \) if \[ \mathbb{E}\left[\hat{\Theta}\right] = \theta \] If the estimator is not unbiased, then the difference \[ \mathbb{E}\left[\hat{\Theta}\right] - \theta \] is called the bias of the estimator \( \hat{\Theta} \). When an estimator is unbiased, the bias is zero; that is, \[ \begin{align} \mathbb{E}\left[\hat{\Theta}\right] - \theta &= \theta - \theta \\ &=0 \end{align} \]

If we consider the expected value to represent the average value over infinite replications,
- the above says that “over infinite replications of a random sample of size \( n \), the average value of the point estimator \( \hat{\Theta} \) will equal the true population parameter \( \theta \)”.
A particular realization of \( \hat{\Theta} \) will generally not equal the true value \( \theta \).
However, the average over many replications of the experiment will give a good approximation of the true value \( \theta \).
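
As a short worked example of the definition, the standard argument that the sample mean is unbiased for \( \mu \) uses only the linearity of the expected value and the fact that every \( X_i \) has mean \( \mu \):

\[ \begin{align} \mathbb{E}\left[\overline{X}\right] &= \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] \\ &= \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[X_i\right] \\ &= \frac{1}{n} \cdot n \mu = \mu \end{align} \]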

Bias of estimators continued

Both of the
1. sample mean
\[ \overline{X}= \frac{1}{n}\sum_{i=1}^n X_i; \]
and
2. sample variance

\[ s^2 = \frac{\sum_{i=1}^n \left(X_i - \overline{X}\right)^2}{n-1} \]
are unbiased estimators, of the population mean \( \mu \) and the population variance \( \sigma^2 \) respectively.
However, there are theoretical reasons that we can use to show that the sample standard deviation is a biased estimator of the population standard deviation, i.e.,

\[ \mathbb{E}\left[ s\right] \leq \sigma \]

so that, on average, it underestimates the true standard deviation.
The bias tends to be small, however, and it is still the most practical estimate most of the time for the population standard deviation.
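
The bias of \( s \) can be seen in a small simulation. This is a sketch under assumptions: a normal population with \( \sigma = 2 \), a deliberately small sample size \( n = 5 \) so that the bias is visible, and an arbitrary seed and number of replications `m`.

```r
# Compare the average of s^2 and of s over many replicated samples.
set.seed(1)   # fixed seed so that the run is reproducible
n <- 5        # small sample size, where the bias of s is easier to see
m <- 20000    # number of replications (assumed)
sigma <- 2    # true population standard deviation (assumed)

# Each column is one replicated sample of size n from N(0, sigma^2).
samples <- matrix(rnorm(n * m, mean = 0, sd = sigma), nrow = n)
s2 <- apply(samples, 2, var)   # sample variance, with n - 1 in the denominator
s <- sqrt(s2)                  # sample standard deviation

c(mean_s2 = mean(s2), sigma2 = sigma^2)   # averages should nearly agree
c(mean_s = mean(s), sigma = sigma)        # mean of s tends to fall below sigma
```

Repeating the experiment with a larger `n` should shrink the gap between the average of `s` and `sigma`.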

Variance of estimators

We use the bias, as discussed already, to measure the center of a sampling distribution.

An unbiased estimator will have a distribution centered at the true population parameter.

Yet suppose we have two estimators of the same parameter \( \theta \), which we will denote \( \hat{\Theta}_1 \) and \( \hat{\Theta}_2 \) respectively.
It is possible that they are both unbiased (the sampling distributions have the same center), yet they have different spread.
That is to say, one estimator might tend to vary more than the other.

Sampling distributions with same mean and different variance.

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

The spread is a critical measure of how much variation is encountered with respect to resampling.
We might describe the two concepts with an estimator as follows:

Accuracy of an estimator - this is represented by the estimator being unbiased, so that we expect it to give an accurate result on average.
Precision of an estimator - this is represented by the estimator having a small spread, so that the estimates don’t differ wildly from sample to sample.

It is possible, in general, for an estimator to be either, both or neither of the above.
We are often interested, thus, in unbiased estimators with a minimum variance as a first choice.
In some situations biased estimators will actually be preferred, though a general discussion of the tradeoffs is beyond our scope.

Variance of estimators continued

As a formal definition, we will introduce the following idea:

Minimum Variance Unbiased Estimator
If we consider all unbiased estimators of \( \theta \), the one with the smallest variance is called the minimum variance unbiased estimator (MVUE).

The practical interpretation is again demonstrated by the last figure:

Courtesy of Montgomery & Runger, Applied Statistics and Probability for Engineers, 7th edition

Suppose that \( \hat{\Theta}_1 \) is the MVUE, and \( \hat{\Theta}_2 \) is any other unbiased estimator.
Then, \[ \mathrm{var}\left(\hat{\Theta}_1\right) \leq \mathrm{var}\left(\hat{\Theta}_2\right). \]

Practically speaking, the MVUE is the most precise unbiased estimator, as its value changes the least with respect to resampling.
An important example of a MVUE is actually the sample mean.

If \( X_1, X_2 , \cdots , X_n \) is a random sample of size \( n \) from a normal distribution with mean \( \mu \) and variance \( \sigma^2 \), the sample mean \( \overline{X} \) is the MVUE for \( \mu \).

Again, other choices exist to estimate \( \mu \), but among all unbiased estimators, the sample mean is the most precise.
For non-normal distributions, however, a better choice might be, e.g., a biased estimator.
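
To illustrate what "most precise" means for a normal sample, the sketch below compares the sample mean with the sample median, which is also an unbiased estimator of \( \mu \) for a normal population because the distribution is symmetric. The population parameters, sample size, number of replications, and seed are all assumptions made for this example.

```r
# Compare the spread of two unbiased estimators of mu for normal data.
set.seed(1)   # fixed seed so that the run is reproducible
n <- 25       # sample size (assumed)
m <- 10000    # number of replications (assumed)
mu <- 10      # true mean (assumed)
sigma <- 1    # true standard deviation (assumed)

# Each column is one replicated sample of size n from N(mu, sigma^2).
samples <- matrix(rnorm(n * m, mean = mu, sd = sigma), nrow = n)
means <- colMeans(samples)
medians <- apply(samples, 2, median)

c(mean_of_means = mean(means), mean_of_medians = mean(medians))  # both near mu
c(var_of_means = var(means), var_of_medians = var(medians))      # mean varies less
```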

Standard error of an estimator

As noted before, the variance is a “natural” measure of spread mathematically for theoretical reasons, but it is expressed in the square of the original units.
For this reason, when we talk about the spread of an estimator's sampling distribution, we typically discuss the standard error.

The standard error
Let \( \hat{\Theta} \) be an estimator of \( \theta \). The standard error of \( \hat{\Theta} \) is its standard deviation given by \[ \sigma_{\hat{\Theta}} = \sqrt{\mathrm{var}\left(\hat{\Theta}\right)}. \] If the standard error involves unknown parameters that can be estimated, substitution of those values into the equation above produces an estimated standard error denoted \( \hat{\sigma}_{\hat{\Theta}} \). It is also common to write the standard error as \( \mathrm{SE}\left(\hat{\Theta}\right) \).
Q: can anyone recall what the standard error is of the sample mean? That is, what is the standard deviation of the sampling distribution (for a normal sample or \( n \) large)?
- A: the central limit theorem states that \( \overline{X} \) follows (exactly for a normal sample, or approximately for \( n \) large) a sampling distribution
\[ \overline{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right). \]
- Therefore, the standard error of the sample mean is precisely,
\[ \sigma_{\overline{X}} = \frac{\sigma}{\sqrt{n}}. \]
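
For reference, the variance of the sample mean follows from the independence of the \( X_i \) alone, by the standard argument below; normality is only needed for the shape of the distribution, not for this formula:

\[ \begin{align} \mathrm{var}\left(\overline{X}\right) &= \mathrm{var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) \\ &= \frac{1}{n^2}\sum_{i=1}^n \mathrm{var}\left(X_i\right) \\ &= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \end{align} \]

Taking the square root gives \( \sigma_{\overline{X}} = \frac{\sigma}{\sqrt{n}} \).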

Standard error of an estimator

As was discussed before, there are times that we may not know all the parameters that describe the standard error.
For example, suppose we draw \( X_1, \cdots, X_n \) from a normal population, for which we know neither the mean nor the variance.
Let the unknown and unobservable theoretical parameters be denoted \( \mu \) and \( \sigma \) as usual.
The sample mean has the sampling distribution,

\[ \overline{X} \sim N\left( \mu, \frac{\sigma^2}{n}\right), \]

and therefore standard error \( \sigma_{\overline{X}} = \frac{\sigma}{\sqrt{n}} \).
However, we stated that \( \sigma \) itself is unknown.
In this case, we will estimate the standard error as

\[ \hat{\sigma}_\overline{X} = \frac{s}{\sqrt{n}} \] with the sample standard deviation \( s \).
This is what is meant by estimating the standard error.
This particular example will be extremely important for confidence intervals, discussed next time.

Standard error of an estimator – example

An article in the Journal of Heat Transfer (Trans. ASME, Sec. C, 96, 1974, p. 59) described a new method of measuring the thermal conductivity of Armco iron.
Using a temperature of \( 100^\circ \) F and a power input of 550 watts, the following 10 measurements of thermal conductivity (in Btu/hr-ft-°F) were obtained:

\[ 41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04 \]
A point estimate of the mean thermal conductivity at \( 100^\circ \) F and 550 watts is the sample mean or

\[ \overline{x} = 41.924 \]
The standard error of the sample mean is \( \sigma_\overline{X}=\frac{\sigma}{\sqrt{n}} \);
- however, \( \sigma \) is unknown so that we estimate it by the sample standard deviation \( s = 0.284 \) to obtain
\[ \hat{\sigma}_\overline{X} = \frac{s}{\sqrt{n}}= \frac{0.284}{\sqrt{10}} \approx 0.0898 \]
Notice that the standard error is about 0.2 percent of the sample mean, implying that we have obtained a relatively precise point estimate of thermal conductivity.
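
The computations in this example can be reproduced in R from the ten measurements listed above. The variable names below are chosen for illustration.

```r
# Thermal conductivity measurements from the example (Btu/hr-ft-°F).
k <- c(41.60, 41.48, 42.34, 41.95, 41.86,
       42.18, 41.72, 42.26, 41.81, 42.04)

n <- length(k)           # sample size: 10
xbar <- mean(k)          # point estimate of the mean: 41.924
s <- sd(k)               # sample standard deviation: approximately 0.284
se_hat <- s / sqrt(n)    # estimated standard error: approximately 0.0898

# Estimated standard error as a percentage of the sample mean (about 0.2).
100 * se_hat / xbar
```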

Standard error of an estimator – example

If we assume that thermal conductivity is normally distributed, then two times the standard error is

\[ 2\hat{\sigma}_\overline{X} = 2(0.0898) = 0.1796. \]
The empirical rule says that about 95% of realizations of the sample mean lie within two standard deviations of the true mean \( \mu \).
Therefore, we are highly confident that the true mean thermal conductivity is within 41.924 ± 0.1796, that is, in the interval \( [41.744 , 42.104] \).
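
A minimal sketch of this interval computation in R, using the same ten measurements; the variable names are assumptions for illustration. The standard error is not rounded before doubling here, so the last digit of the half-width may differ slightly from the hand computation.

```r
# Thermal conductivity measurements from the example (Btu/hr-ft-°F).
k <- c(41.60, 41.48, 42.34, 41.95, 41.86,
       42.18, 41.72, 42.26, 41.81, 42.04)

xbar <- mean(k)                    # sample mean
se_hat <- sd(k) / sqrt(length(k))  # estimated standard error of the mean

# Interval of two estimated standard errors around the sample mean.
lower <- xbar - 2 * se_hat
upper <- xbar + 2 * se_hat
round(c(lower = lower, upper = upper), 3)   # 41.744 and 42.104
```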
We will formalize this logic into confidence intervals next time.
For now, we will discuss how to import data into RStudio to solve the homework questions.