A review of sampling distributions and the univariate Gaussian


Outline

  • The following topics will be covered in this lecture:
    • Sample statistics
    • Sample random variables
    • Sampling distributions
    • The univariate Gaussian distribution
    • Properties of the univariate Gaussian
    • The central limit theorem

Sample statistics

  • The goal of statistics is to use a numerical summary of data from a small, representative sample to say something general about the larger, unobservable population or phenomenon.

  • The measures of the population are referred to as parameters.

  • Parameters are generally unknown and unknowable.

    • For example, we cannot exactly compute the mean sea-surface temperature globally, as it is impossible to take all such measurements.
  • However, if we have a representative sample, we can compute the sample mean.

    • Numerical values like the sample mean computed from data are referred to as statistics.
  • The sample mean will almost surely not equal the population mean, due to the natural variation (sampling error) that occurs in any given sample.

    • However, if we have a good probabilistic model for the population, we can use the sample statistic to estimate the general, unknown population parameter.
  • RVs and probability distributions give us the model for estimating population parameters.

  • Note: we can only “find” the parameters exactly in very simple examples like games of chance.

  • Generally, we will have to be satisfied with estimates of the parameters that are uncertain, but also include measures of “how uncertain”.

Sample mean

  • Suppose we have a sample of \( n \) total measurements of some RV \( X \).

    • We will denote these measurements \( x_1, x_2, \cdots, x_n \in \mathbb{R} \), where these refer to fixed numerical values.
    • These may correspond to the values that \( X \) attains over \( n \) independently replicated trials.
The (arithmetic sample) mean
Given measurements \( x_1,\cdots,x_n \) of the RV \( X \), we say that the sample mean is defined \[ \text{Sample mean} = \hat{x} = \frac{x_1 +x_2 +\cdots + x_n}{n}= \frac{\sum_{i=1}^n x_i}{n} \]
  • We remark that \( \hat{x} \) is a fixed numerical value depending on the particular sequence of outcomes \( x_1,\cdots, x_n \) observed.

    • Due to this fact, with respect to a new sample of size \( n \), we may attain a new value for the sample mean.
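  • As a minimal sketch of this computation (assuming NumPy is available; the data values are hypothetical), the defining formula agrees with np.mean:

```python
import numpy as np

# A fixed sample of n = 5 hypothetical measurements x_1, ..., x_n.
x = np.array([2.1, 1.8, 2.4, 2.0, 1.9])

# Sample mean from the defining formula: the sum of the x_i over n.
x_hat = x.sum() / len(x)

# np.mean implements the same arithmetic mean.
assert np.isclose(x_hat, np.mean(x))
print(x_hat)  # 2.04
```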

Sample variance and standard deviation

  • We can similarly define the sample variance and standard deviation as follows
Sample standard deviation
Given measurements \( x_1,\cdots,x_n \) of the RV \( X \), we say that the sample standard deviation is defined \[ \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n\left(x_i - \hat{x}\right)^2}{n-1}} \]
  • Note that the \( n-1 \) in the denominator accounts for the fact that one degree of freedom has been utilized in the computation of \( \hat{x} \).
Sample variance
Given measurements \( x_1,\cdots,x_n \) of the RV \( X \), we say that the sample variance is defined \[ \hat{\sigma}^2 = \frac{\sum_{i=1}^n\left(x_i - \hat{x}\right)^2}{n-1} \]
  • For the same reasons discussed for the sample mean, the sample standard deviation and variance will tend to differ depending on the particular sequence of outcomes \( x_1,\cdots, x_n \) measured.

  • This discrepancy is what we call sampling error: random variation across replicated samples of the same fixed size \( n \) produces different computed values of a statistic.

  • For this reason, we may also consider a probabilistic model for the sample statistic, depending on the replication of measurements.
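  • A minimal sketch of these computations (assuming NumPy; hypothetical data): the \( n-1 \) denominator corresponds to the ddof=1 argument of np.var and np.std:

```python
import numpy as np

x = np.array([2.1, 1.8, 2.4, 2.0, 1.9])
x_hat = np.mean(x)

# Sample variance with the n - 1 denominator (one degree of freedom is
# used by x_hat), matching np.var with ddof=1.
sigma2_hat = np.sum((x - x_hat) ** 2) / (len(x) - 1)
assert np.isclose(sigma2_hat, np.var(x, ddof=1))

# The sample standard deviation is its square root.
sigma_hat = np.sqrt(sigma2_hat)
assert np.isclose(sigma_hat, np.std(x, ddof=1))
```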

Sample random variables

  • Specifically, suppose that we want to obtain an estimate of a population parameter, where the population is modeled with a RV \( X \).

  • We know that before the data are collected, the observations are considered to be RVs,

    • i.e., we treat an independent sequence of measurements of \( X \),

    \[ X_1, X_2, \cdots , X_n \]

    • as RVs all drawn from a parent distribution \( X \sim P \) (where the CDF \( P \) defines the distribution).
    Random sample
    The RVs \( X_1 , X_2, \cdots , X_n \) are a random sample of size \( n \) if the \( X_i \)’s are independent RVs and every \( X_i \) has the same probability distribution.
  • We then say that the measurements we obtain are possible outcomes of the sample variables \( \{X_i\}_{i=1}^n \);

    • particularly, if we make a computation of the sample mean, \[ \hat{X} = \frac{1}{n} \sum_{i=1}^n X_i \]

    the above is treated as a RV (a linear combination of RVs) which has a random outcome, dependent on the realizations of the \( X_i \).
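  • A small simulation sketch (assuming NumPy; the Gaussian parent and its parameters are arbitrary choices) illustrates this: two replicated samples of the same size yield two different realizations of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Two independent random samples of size n from the same parent
# distribution, here taken to be N(5, 2^2).
sample_1 = rng.normal(loc=5.0, scale=2.0, size=n)
sample_2 = rng.normal(loc=5.0, scale=2.0, size=n)

# Each replication gives a different realization of the RV X-hat.
print(sample_1.mean(), sample_2.mean())  # two distinct values near 5
```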

Sampling distributions

  • More generally, any function of the observations, i.e., any statistic, is also modeled as a RV.
Point estimators
Let \( \{X_j\}_{j=1}^n \) be a random sample. Let \( \theta \) be a parameter of the parent population, defined by the CDF \( P \). If \( h \) is a general function used to compute some statistic estimating \( \theta \), we thus define the RV \[ \hat{\Theta} = h(X_1, \cdots, X_n) \] to be a point estimator for \( \theta \).
  • We call the probability distribution of a statistic or estimator as above a sampling distribution.

    Sampling Distribution
    The probability distribution of a statistic is called a sampling distribution.
  • In this framework, we will then distinguish between the estimator (a random variable) and the numerical value it might attain on a sample of measurements.

    Point estimate
    A point estimate of some population parameter \( \theta \) is a single numerical value
    \[ \hat{\theta} = h(x_1, \cdots,x_n) \] attained as a particular realization of the RV \( \hat{\Theta} \).

Sampling distributions

  • The notion of the “center” of the sampling distribution can be useful as a general criterion for evaluating estimators.

  • Formally, we say that \( \hat{\Theta} \) is an unbiased estimator of \( \theta \) if the expected value of \( \hat{\Theta} \) is equal to \( \theta \).

  • This is equivalent to saying that the mean of the sampling distribution of \( \hat{\Theta} \) is equal to \( \theta \).

Bias of an Estimator
The point estimator \( \hat{\Theta} \) is an unbiased estimator for the parameter \( \theta \) if \[ \mathbb{E}\left[\hat{\Theta}\right] = \theta \] If the estimator is not unbiased, then the difference \[ \mathbb{E}\left[\hat{\Theta}\right] - \theta \] is called the bias of the estimator \( \hat{\Theta} \). When an estimator is unbiased, the bias is zero; that is, \[ \begin{align} \mathbb{E}\left[\hat{\Theta}\right] - \theta &= \theta - \theta \\ &=0 \end{align} \]
  • If we consider the expected value to represent the average value over infinite replications;

    • the above says that “over infinite replications of a random sample of size \( n \), the average value of the point estimator \( \hat{\Theta} \) will equal the true population parameter \( \theta \)”.
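  • We can probe this interpretation numerically (a sketch assuming NumPy; all parameters are arbitrary): averaging the sample-variance estimator over many replications approaches \( \sigma^2 \) when the \( n-1 \) denominator is used, but not when dividing by \( n \).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 100_000        # sample size and number of replications
sigma2 = 4.0              # true population variance

# m replicated samples of size n from N(0, sigma2).
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(m, n))

# Averaging the estimator over replications: ddof=1 is (approximately)
# unbiased, while dividing by n systematically underestimates sigma2.
print(np.var(samples, axis=1, ddof=1).mean())  # approx 4.0
print(np.var(samples, axis=1, ddof=0).mean())  # approx 3.6 = (n-1)/n * 4
```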

Sampling distributions

  • Both the
    1. sample mean \[ \hat{X}= \frac{1}{n}\sum_{i=1}^n X_i; \] and
    2. sample variance \[ \hat{\sigma}^2 = \frac{\sum_{i=1}^n \left(X_i - \hat{X}\right)^2}{n-1} \]
  • are unbiased estimators, i.e., \[ \begin{align} \mathbb{E}\left[\hat{X}\right] = \overline{x}, & & \mathbb{E}\left[\hat{\sigma}^2\right] = \sigma^2. \end{align} \]

  • However, it can be shown theoretically that the sample standard deviation is a biased estimator of the population standard deviation, i.e.,

    \[ \mathbb{E}\left[ \hat{\sigma}\right] \leq \sigma \]

    and it consistently underestimates the true standard deviation.

  • The bias tends to be small, however, and \( \hat{\sigma} \) remains the most practical estimator of the population standard deviation in most settings.
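  • The downward bias of \( \hat{\sigma} \) is easy to see by simulation (a sketch assuming NumPy; the Gaussian parent, sample size, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 5, 200_000, 2.0

# m replicated samples of size n from N(0, sigma^2).
samples = rng.normal(loc=0.0, scale=sigma, size=(m, n))

# The sample variance is unbiased, but the square root is a concave
# transformation, so the sample std underestimates sigma on average.
print(np.std(samples, axis=1, ddof=1).mean())  # noticeably below 2.0
print(np.var(samples, axis=1, ddof=1).mean())  # approx 4.0
```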

Sampling distributions

  • Recalling that the expected value gives the center of mass of the probability distribution, we should also be interested in the spread of the sampling distribution.

  • As noted before, the variance is a “natural” measure of spread mathematically for theoretical reasons, but it is expressed in the square of the original units.

  • For this reason, when we talk about the spread of an estimator's sampling distribution, we typically discuss the standard error.

    The standard error
Let \( \hat{\Theta} \) be a point estimator of \( \theta \). The standard error of \( \hat{\Theta} \) is its standard deviation given by \[ \sigma_\hat{\Theta} = \sqrt{\mathrm{var}\left(\hat{\Theta}\right)}. \] If the standard error involves unknown parameters that can be estimated, substitution of those values into the equation above produces an estimated standard error denoted \( \hat{\sigma}_\hat{\Theta} \). It is also common to write the standard error as \( \mathrm{SE}\left(\hat{\Theta}\right) \).
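  • As a sketch (assuming NumPy; all parameters arbitrary), we can compute the estimated standard error of the sample mean from a single sample, anticipating the result \( \sigma_\hat{X} = \sigma / \sqrt{n} \) derived on a later slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 25, 3.0

# One sample of size n; in practice sigma is unknown, so the estimated
# standard error substitutes the sample standard deviation for sigma.
x = rng.normal(loc=10.0, scale=sigma, size=n)
se_hat = np.std(x, ddof=1) / np.sqrt(n)
print(se_hat)  # near the true value sigma / sqrt(n) = 0.6
```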
  • With these constructions in mind, we will now introduce one of the most fundamental results of classical statistics.

  • This result establishes the normal or Gaussian distribution in its central importance among distributions.

The univariate Gaussian distribution

  • The Gaussian distribution is considered the most prominent distribution in statistics.
  • It is a continuous probability distribution that has a bell-shaped probability density function.
  • The Gaussian distribution arises from the central limit theorem (CLT),
    • under weak conditions, the sum of a large number of RVs drawn from the same distribution is distributed approximately normally irrespective of the form of the original distribution.
  • This gives mathematical justification to why we see normally distributed data quite often in practice; as was noted by Henri Poincaré
  • “Everybody believes in the exponential law of errors [i.e., the normal / Gaussian distribution]: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation.” — Henri Poincaré, Calcul des Probabilités
  • In addition to the ubiquity of the normal distribution, it can be easily manipulated analytically in equations,
    • this enables one to derive a large number of results in explicit form.
  • Due to these two aspects, the normal distribution is used extensively in theory and practice.

The univariate Gaussian distribution continued

  • Previously we defined the density function \( p \) and used it to compute \( \overline{x} \) and \( \sigma \); for the normal, we will reverse this.
  • That is, we will use \( \overline{x} \) and \( \sigma \) to define the density of the normal and parametrize the distribution.
  • Let us use the following notation for compactness where \[ \exp(x) = e^{x}. \]
  • The univariate Gaussian distribution
    Let the Gaussian RV \( X \) have mean \( \overline{x} \) and standard deviation \( \sigma \). The probability density function is given as \[ \begin{align} p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\left(x - \overline{x}\right)^2}{2\sigma^2}\right). \end{align} \] We will write \( X \sim N\left(\overline{x}, \sigma^2\right) \) to denote that \( X \) has the density described above.
  • Recall how we considered \( \overline{x} \) to be a measure of center and \( \sigma \) a measure of spread.
  • If we vary these two values, we can change the center of mass and the spread of the normal distribution:
[Figure: shapes of the normal density under varying \( \overline{x} \) and \( \sigma \).]
  • In the case that \( \overline{x}=0 \) and \( \sigma=1 \), we call \( N(0, 1) \) the standard normal distribution.
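  • As a sketch (assuming NumPy and SciPy are available), the density formula can be evaluated directly and checked against scipy.stats.norm.pdf, which parametrizes the normal by its mean loc and standard deviation scale:

```python
import numpy as np
from scipy.stats import norm

x_bar, sigma = 1.0, 2.0          # arbitrary mean and standard deviation
x = np.linspace(-5.0, 7.0, 7)

# The density evaluated from the formula above.
p = np.exp(-(x - x_bar) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# scipy's norm takes the standard deviation (not the variance) as scale.
assert np.allclose(p, norm.pdf(x, loc=x_bar, scale=sigma))
```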

The univariate Gaussian distribution continued

  • Another useful property of the family of Gaussian distributions is that it is closed under linear transformations.
Closure of the Gaussian under linear transformations
Let \( X_1 \) and \( X_2 \) be independent, Gaussian RVs defined \[ \begin{align} X_1\sim N\left(\overline{x}_1 , \sigma_1^2 \right) & & X_2 \sim N\left(\overline{x}_2, \sigma_2^2 \right). \end{align} \] Then for \( a,b,c \in \mathbb{R} \), the linear combination satisfies \[ aX_1 + bX_2 + c \sim N\left(a \overline{x}_1 + b\overline{x}_2 + c, a^2 \sigma_1^2 + b^2 \sigma_2^2\right) \]
  • This is actually a general property of the family of stable distributions.

  • The closure property above implies that a Gaussian variable can always be “standardized” as,

    \[ \begin{align} X \sim N(\overline{x}, \sigma^2) && \Rightarrow && \frac{X - \overline{x}}{\sigma} \sim N(0, 1). \end{align} \]
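  • A quick Monte Carlo check of both properties (a sketch assuming NumPy; the coefficients and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500_000
a, b, c = 2.0, -1.0, 3.0

x1 = rng.normal(loc=1.0, scale=2.0, size=m)   # X1 ~ N(1, 2^2)
x2 = rng.normal(loc=-2.0, scale=0.5, size=m)  # X2 ~ N(-2, 0.5^2)

# Closure: a*X1 + b*X2 + c should be N(2 + 2 + 3, 4*4 + 1*0.25).
y = a * x1 + b * x2 + c
print(y.mean(), y.var())  # approx 7.0 and 16.25

# Standardization: (X1 - mean) / std should be approximately N(0, 1).
z = (x1 - 1.0) / 2.0
print(z.mean(), z.std())  # approx 0.0 and 1.0
```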

  • The closure of the Gaussian under linear transformations has extremely important implications when we introduce a mechanistic model later.

  • This is at the basis of results for estimators defined in a class of models known as Gauss-Markov models.

    • We will return to this subject shortly.

Central limit theorem

  • Suppose that a random sample of size \( n \) is taken from a normal population with mean \( \overline{x} \) and variance \( \sigma^2 \).

  • By definition of a random sample each observation in this sample, say, \( X_1, X_2, \cdots, X_n \), is a normally and independently distributed RV with mean \( \overline{x} \) and variance \( \sigma^2 \).

  • We conclude that, due to closure of the Gaussian, the sample mean

    \[ \hat{X}= \frac{X_1 + X_2 + \cdots + X_n}{n} \]

    has a normal distribution with mean

    \[ \begin{align} \mathbb{E}\left[\hat{X}\right] &= \frac{\mathbb{E}\left[X_1\right] + \cdots + \mathbb{E}\left[X_n\right]}{n} = \overline{x} \end{align} \]

    • and variance

    \[ \sigma^2_\hat{X}:= \mathbb{E}\left[\left(\hat{X} - \overline{x}\right)^2\right] = \frac{\sigma^2 + \sigma^2 + \cdots + \sigma^2}{n^2} = \frac{\sigma^2}{n} \]
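  • Both identities are easy to verify numerically (a sketch assuming NumPy; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 200_000
x_bar, sigma = 5.0, 2.0

# m replicated samples of size n from N(x_bar, sigma^2), reduced to
# m realizations of the sample mean X-hat.
x_hats = rng.normal(loc=x_bar, scale=sigma, size=(m, n)).mean(axis=1)

print(x_hats.mean())       # approx x_bar = 5.0
print(x_hats.var(ddof=1))  # approx sigma^2 / n = 0.25
```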

Central limit theorem continued

  • More generally, if we are sampling from a population that has an unknown probability distribution, the sampling distribution of the sample mean will still be approximately Gaussian with mean \( \overline{x} \) and variance \( \frac{\sigma^2}{n} \) if the sample size \( n \) is large.

  • This is one of the most useful theorems in statistics, called the central limit theorem:

    The central limit theorem (CLT)
    Let \( X_1 , X_2 , \cdots , X_n \) be a random sample of size \( n \) taken from a population with mean \( \overline{x} \) and finite variance \( \sigma^2 \) and \( \hat{X} \) be the sample mean. Then the limiting form of the distribution of \[ Z = \frac{\hat{X} - \overline{x}}{\frac{\sigma}{\sqrt{n}}} \] as \( n \rightarrow \infty \) is the standard normal distribution.
  • Put another way, for \( n \) sufficiently large, \( \hat{X} \) has approximately a \( N\left(\overline{x}, \frac{\sigma^2}{n}\right) \) distribution – this says the following.

    • Suppose we take a sample of size \( n \) and compute the sample mean \( \hat{x} \).
    • Then suppose we replicate this sample and record the observed realizations for the sample mean \( \hat{x}_1, \hat{x}_2, \cdots \).
    • If the sample size \( n \) is large, these data points \( \hat{x}_1, \cdots \) will be approximately bell shaped with the following properties:
      • the bell will be centered approximately at \( \overline{x} \), the true population mean;
      • the spread of the data around the center will be given approximately by the standard deviation \( \frac{\sigma}{\sqrt{n}} \).
    • Particularly, if \( n \) is very large, the observed sample means will tend to be very close to the center (the true mean).
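  • A sketch of this replication experiment (assuming NumPy; the exponential parent is an arbitrary non-Gaussian choice whose mean and standard deviation both equal 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 50_000

# m replicated samples of size n from a (skewed) exponential parent
# with mean 1 and standard deviation 1.
x_hats = rng.exponential(scale=1.0, size=(m, n)).mean(axis=1)

# Standardized sample means should be approximately N(0, 1): about
# 95% of them fall within +/- 1.96.
z = (x_hats - 1.0) / (1.0 / np.sqrt(n))
print(np.mean(np.abs(z) < 1.96))  # approx 0.95
```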

Central limit theorem continued

  • As a visualization of the concept, suppose that we have a random sample indexed by \( j \) \[ X_{1,j}, \cdots, X_{n,j}, \] where \( j \) refers to the replication number.
  • We will make replications for \( j=1,\cdots,m \) and obtain a RV for the sample mean indexed by \( j \), \[ \hat{X}_j = \frac{1}{n}\sum_{i=1}^n X_{i,j}. \]
  • When we observe a realization of \( \hat{X}_j=\hat{x}_j \) or respectively the sample \[ X_{1,j}=x_{1,j}, \cdots, X_{n,j}=x_{n,j}, \] we record these fixed numerical values.
[Figure: illustration of the central limit theorem. Courtesy of Mathieu ROUAUD, CC BY-SA 4.0, via Wikimedia Commons.]

  • The measurements \( X_{i,j} \) may be distributed according to any underlying distribution with mean \( \overline{x} \) and standard deviation \( \sigma \).
  • However, if \( n \) is large, each \( \hat{X}_j \) is approximately normal with mean \( \overline{x} \) and standard deviation \( \frac{\sigma}{\sqrt{n}} \).
  • The sample mean replications, defined by the realizations of \( x_{i,j} \), will have an approximately bell-shaped frequency distribution, centered approximately at \( \overline{x} \).
  • The spread of the data will be approximately \( \frac{\sigma}{\sqrt{n}} \).
  • Particularly, as \( n\rightarrow \infty \), the spread shrinks to zero, so that we get a better and better estimate (more peaked bell shape) with large sample sizes.
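  • A final sketch (assuming NumPy; the parent and the sample sizes are arbitrary) shows the spread of the replicated sample means shrinking like \( \sigma / \sqrt{n} \):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 10_000, 2.0

for n in (10, 100, 1000):
    # Spread of m replicated sample means at each sample size n.
    x_hats = rng.normal(loc=0.0, scale=sigma, size=(m, n)).mean(axis=1)
    print(n, x_hats.std(ddof=1))  # approx sigma / sqrt(n)
```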