Simple linear regression -- part 3

09/09/2020


FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, and research.

Outline

  • The following topics will be covered in this lecture:

    • Gaussian error regression models
    • Estimation by Maximum Likelihood
    • Gaussian correlation models

Gaussian Error Regression Models

  • Recall now our standard regression equation, \[ \begin{align} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i. \end{align} \]
  • Up to now we have only made assumptions about the first two moments of the variation \( \epsilon_i \) around the signal, and the decorrelation between cases:
    1. \( \mathbb{E}[ \epsilon_i ] = 0 \) for each \( i \);
    2. \( \mathbb{E}[ \epsilon_i^2 ] = \sigma^2 \) for each \( i \); and
    3. \( \mathbb{E}[ \epsilon_i \epsilon_j ] = 0 \) for every \( i\neq j \).
  • Regardless of the actual distribution of the error/variation \( \epsilon_i \), the Gauss-Markov theorem holds and the solution by least squares is the best linear unbiased estimator (BLUE).
  • However, to create confidence intervals and perform uncertainty quantification, we will need to assume some additional structure.
  • It is often assumed that the error distribution for \( \epsilon_i \) is Gaussian for several reasons:
    1. Gaussian errors simplify the regression analysis significantly;
    2. Gaussian distributions are quite common in real applications;
    3. it can be shown that linear combinations of Gaussian random variables are also Gaussian so that, in a linear model, many other quantities will also be Gaussian; and finally
    4. even when errors are not strictly Gaussian, a Gaussian distribution is often a “good” approximation.
  • The notion of a Gaussian distribution being a “good” approximation is formalized with the Central Limit Theorem, which we now recall.

Central limit theorem

  • Suppose we have a sequence of random variables, \( \left\{x_i \right\}_{i=1}^\infty \), which are independent and identically distributed from any distribution such that, for all \( i \),
    1. \( \mathbb{E}[x_i] = \mu \); and
    2. \( \mathrm{var}(x_i) = \sigma^2 \), for some finite \( \sigma^2 \).
  • For each \( n\geq 1 \) define the sample-based mean of the sequence \( \left\{x_i \right\}_{i=1}^n \) \[ \begin{align} \overline{x}_n \triangleq \frac{1}{n} \sum_{i=1}^n x_i \end{align} \]
  • Then, as the number of random variables \( n\rightarrow \infty \), the sample-based means satisfy \[ \begin{align} \sqrt{n}\left(\overline{x}_n - \mu\right) {\xrightarrow {d}} N(0, \sigma^2), \end{align} \] where the convergence is in distribution.
  • Put another way, for \( n \) sufficiently large, \( \overline{x}_n \) has approximately a \( N\left(\mu, \frac{\sigma^2}{n}\right) \) distribution.
  • Heuristically,
    • for large sample sizes, we can produce confidence intervals for the mean of an unbiased estimator (e.g., the true \( \beta_0,\beta_1 \)),
    • from the sample-based estimate (the least squares solution \( \hat{\beta}_0 ,\hat{\beta}_1 \)) under a Gaussian assumption;
    • this is a good approximation even if not strictly accurate (see the simulation sketch below).
  • The “goodness” of this approximation can be made more formal in another course.
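As an aside not in the original slides, the following is a minimal simulation sketch of the central limit theorem in Python; all parameter choices are illustrative.

```python
import numpy as np

# Minimal CLT sketch (illustrative values): sample means of an exponential
# distribution, which is not Gaussian, standardized as sqrt(n) * (xbar - mu).
rng = np.random.default_rng(0)

mu, sigma2 = 1.0, 1.0         # mean and variance of the Exp(1) distribution
n, replicates = 100, 10_000   # sample size and number of repeated experiments

samples = rng.exponential(scale=1.0, size=(replicates, n))
xbar = samples.mean(axis=1)   # one sample-based mean per replicate

z = np.sqrt(n) * (xbar - mu)  # should be approximately N(0, sigma2)
print(z.mean(), z.var())      # close to 0 and 1 for large n
```

Plotting a histogram of z against the \( N(0, \sigma^2) \) density gives a visual check of the approximation.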

Gaussian Error Regression

  • The Gaussian assumption will be introduced on top of our existing assumptions as follows:
    1. We suppose that we have a signal in the data, linear in the parameters \( \beta_0,\beta_1 \), \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
    2. \( X_i \) is a known constant, while \( \beta_0,\beta_1 \) are unknown but fixed values, for all \( i \).
    3. \( \epsilon_i \) are drawn independently and identically distributed (iid) as, \[ \epsilon_i \sim N\left(0, \sigma^2\right) \] for each \( i \).
  • Independence of the \( \epsilon_i \) implies that the cases are once again uncorrelated.
  • Additionally, as we have assumed the distribution to be Gaussian and specified the form of its first two moments, we have entirely parametrized the distribution of the variation; a simulation sketch of this model follows below.
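To make the assumptions concrete, here is a minimal sketch (not from the slides; all parameter values are assumed for illustration) of simulating data from this model in Python:

```python
import numpy as np

# Illustrative simulation of Y_i = beta_0 + beta_1 * X_i + eps_i with
# eps_i ~ N(0, sigma^2) iid; beta_0, beta_1, sigma are assumed values.
rng = np.random.default_rng(1)

beta_0, beta_1, sigma = 2.0, 0.5, 1.0   # fixed but (in practice) unknown
X = np.linspace(0.0, 10.0, 50)          # known constants
eps = rng.normal(loc=0.0, scale=sigma, size=X.shape)
Y = beta_0 + beta_1 * X + eps           # the observed responses
```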

Gaussian Error Implications – continued

  • As was mentioned before, Gaussianity thus simplifies the regression analysis significantly.
  • It is easy to construct confidence intervals for the mean of a Gaussian random variable from a sample-based estimate of the mean, using the Student t distribution.
  • Likewise, the t-test derived from the Student t distribution can be used to perform hypothesis tests.
  • We will return to both of these points when we introduce multiple regression more formally.
  • For the moment, we will return to the addendum of the Gauss-Markov theorem:
    • If in addition, \( \epsilon\sim N\left(0,\sigma^2\right) \), then the solution by least squares is the “maximum likelihood estimator”.

Estimation of parameters by maximum likelihood

  • A concept widely used in statistical inference is a “likelihood” function.
  • Consider a probability density, denoted \( p_\theta(y) \), which depends on the value of the parameter \( \theta \).
    • Suppose we are investigating some random variable \( Y \), for which there is some “true” parameter \( \theta_0 \), such that \[ P(a \leq Y \leq b)= \int_a^b p_{\theta_0}(y) \mathrm{d}y. \]
  • We assume that even though \( \theta_0 \) is not known, we can evaluate the density \[ \begin{align} p_\theta(Y=y) \end{align} \] for any choice of \( \theta \) and some observed piece of data \( y \).
  • Then, the likelihood of \( \theta \) based on an observed \( y \) is defined \[ \begin{align} \mathcal{L}(\theta \vert y) = p_\theta(Y=y) \end{align} \] where we evaluate the probability density of \( Y \) at the observed data \( y \) given a choice of \( \theta \).
  • The “likelihood” is thus a measure of how well our choice of parameter makes the distribution describe the observed data.
  • Maximum likelihood estimation is thus the process of finding a \( \hat{\theta} \) which maximizes the likelihood,
    • i.e., \( \hat{\theta} \) which maximizes the probability density for the observed data \( y \) given the density \( p_\theta \).

Estimation of parameters by maximum likelihood – an example

Image of Gaussian distribution with mean 230, standard deviation 10, and samples at Y_1 = 250, Y_2=265, Y_3=259

Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition

  • Suppose that we have three samples \( Y_1,Y_2,Y_3 \) from what we know is a population that is Gaussian distributed.
    • Moreover, we suppose that while we know the standard deviation \( \sigma=10 \), we do not know the true mean \( \mu \).
  • Let \( \mathrm{exp}\left\{x\right\} \) denote the function \( e^x \).
  • By varying the choice of \( \mu \), we can evaluate the density of any particular sample with the Gaussian density \[ f_j = \frac{1}{\sqrt{2\pi}10}\mathrm{exp}\left\{-\frac{1}{2}\left(\frac{Y_j - \mu}{10}\right)^2\right\}, \] denoted \( f_j \) for sample \( j \), based on the choice of \( \mu \).
  • We suppose that the observed data points are \( Y_1 = 250, Y_2=265,Y_3=259 \)
  • Q: given the above data points, and the associated graph to the left, does \( \mu=230 \) appear to be the “most likely” choice for the true mean?

Estimation of parameters by maximum likelihood – an example continued

Image of Gaussian distribution with mean 259, standard deviation 10, and samples at Y_1 = 250, Y_2=265, Y_3=259

Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition

  • Intuitively, we can tell that there are better choices for the “center of mass” of the data, given by our choice of \( \mu \).
  • One particular choice may be to set \( \mu= 259= Y_3 \) as on the left.
  • Indeed, we can compare the values for the density function for each choice of \( \mu \) as in the table below:
    Density      \( \mu=230 \)    \( \mu=259 \)
    \( f_1 \)    .005399          .026609
    \( f_2 \)    .000087          .033322
    \( f_3 \)    .000595          .039894
  • The likelihood value for the data above, assuming that these are independent samples of the true Gaussian distribution, is given by the product of the respective densities.
  • By performing a numerical optimization, we can recover the “most likely” choice of \( \mu \) given our observations.
  • However, for a Gaussian distribution as above, it can be demonstrated that the sample mean is the maximum likelihood estimate of \( \mu \), as illustrated in the sketch below.
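The table and the remark above can be reproduced with a short computation; the following is a sketch in Python using scipy, with the slide's observed values:

```python
import numpy as np
from scipy.stats import norm

# Reproduce the slide's example: three observations, known sigma = 10,
# comparing candidate means mu = 230, mu = 259, and the sample mean.
Y = np.array([250.0, 265.0, 259.0])
sigma = 10.0

for mu in (230.0, 259.0, Y.mean()):
    densities = norm.pdf(Y, loc=mu, scale=sigma)  # f_1, f_2, f_3
    likelihood = densities.prod()                 # product over independent samples
    print(f"mu = {mu}: densities = {densities}, likelihood = {likelihood:.3e}")
```

The sample mean (258 here) gives the largest likelihood of the three, consistent with the statement that it is the maximum likelihood estimate of \( \mu \) for a Gaussian with known \( \sigma \).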

Estimation of the regression function by maximum likelihood

  • As with the toy example before, we can apply the same method to estimating the regression parameters \( \beta_0,\beta_1 \).
  • Recall that we assume \( \epsilon_i \sim N(0,\sigma^2) \), such that
    • each observation is independently drawn from a Gaussian distribution, \[ Y_i \sim N(\beta_0+ \beta_1 X_i, \sigma^2) \]
  • To put this in the framework of a likelihood function, we suppose that \( \overline{\beta}_0,\overline{\beta}_1,\overline{\sigma} \) are free parameters, such that \[ \begin{align}\mathcal{L}\left(\overline{\beta}_0,\overline{\beta}_1,\overline{\sigma}\vert Y_i\right)& = p_{\overline{\beta}_0,\overline{\beta}_1,\overline{\sigma}}\left(Y_i\right) \\ &=\frac{1}{\sqrt{2\pi}\overline{\sigma}}\mathrm{exp}\left\{-\frac{1}{2}\left(\frac{Y_i - \overline{\beta}_0 - \overline{\beta}_1 X_i}{\overline{\sigma}}\right)^2\right\} \end{align} \]
  • Then, due to the independence, the likelihood over the set of the \( n \) observations is given simply as the product of the individual likelihoods: \[ \mathcal{L}\left(\overline{\beta}_0,\overline{\beta}_1,\overline{\sigma}\vert Y_{i=1,\cdots,n}\right) =\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\overline{\sigma}}\mathrm{exp}\left\{-\frac{1}{2}\left(\frac{Y_i - \overline{\beta}_0 - \overline{\beta}_1 X_i}{\overline{\sigma}}\right)^2\right\} \]
  • By maximizing the above product over all the observed data, with respect to the choice of the free parameters \( \overline{\beta}_0,\overline{\beta}_1,\overline{\sigma} \), we obtain the maximum likelihood estimate.

Estimation of the regression function by maximum likelihood – continued

  • Q: Suppose we want to maximize the likelihood function, \[ \mathcal{L}\left(\overline{\beta}_0,\overline{\beta}_1,\overline{\sigma}\vert Y_{i=1,\cdots,n}\right) =\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\overline{\sigma}}\mathrm{exp}\left\{-\frac{1}{2}\left(\frac{Y_i - \overline{\beta}_0 - \overline{\beta}_1 X_i}{\overline{\sigma}}\right)^2\right\}; \] what approach can we take to make this tractable?
  • The function \( \log() \) is monotonic, so that an increase in the argument corresponds to an increase in the output.
  • Using this fact, we can reduce our analysis of the likelihood function to something simpler, the “log-likelihood”, \[ \log(\mathcal{L}) = \mathrm{constant} - n\log{\overline{\sigma}} - \frac{1}{2\overline{\sigma}^2}\sum_{i=1}^n \left(Y_i - \overline{\beta}_0 - \overline{\beta}_1 X_i \right)^2; \]
  • finding the maximum value of the above equation is equivalent to finding the maximum of the likelihood, due to the monotonicity.
  • Note, we collect a number of terms into “constant”, because for the purposes of optimization in the free parameters \( \overline{\beta}_0,\overline{\beta}_1,\overline{\sigma} \), these make no difference in the outcome.
  • Q: what would be a plausible approach of finding the maximum of the above?
  • A: one approach is to take the derivative in each free parameter, as before.
  • In this case, this will result in three equations resembling the normal equations of the least squares approach; a numerical illustration follows below.
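The claim that the maximum of the log-likelihood coincides with the least squares solution can also be checked numerically; the following sketch (with assumed simulated data and a generic optimizer from scipy) is illustrative, not part of the slides:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize the Gaussian log-likelihood numerically (by minimizing its
# negative) and compare with the closed-form least squares estimates.
rng = np.random.default_rng(2)
X = np.linspace(0.0, 10.0, 30)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=X.shape)

def neg_log_likelihood(params):
    b0, b1, log_sigma = params   # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    resid = Y - b0 - b1 * X
    # The additive constant from the slide is dropped; it does not affect the optimum.
    return len(Y) * np.log(sigma) + 0.5 * np.sum((resid / sigma) ** 2)

mle_b0, mle_b1, mle_log_sigma = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0]).x

# Closed-form least squares estimates for comparison.
b1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0_hat = Y.mean() - b1_hat * X.mean()
print((mle_b0, mle_b1), (b0_hat, b1_hat))  # agree up to numerical tolerance
```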

Estimation of the regression function by maximum likelihood – a remark

  • We stated that, as an addendum to the Gauss-Markov theorem, the parameter estimates by least squares \( \hat{\beta}_0,\hat{\beta}_1 \) are also the maximum likelihood parameters in the case of Gaussian errors.
  • However, we note, the maximum likelihood estimate above simultaneously estimates the unknown variance \( \sigma^2 \).
  • As opposed to the unbiased estimate of the variance, \[ \hat{\sigma}^2 \triangleq \frac{RSS}{n-p}, \] the maximum likelihood estimate is biased.
  • Indeed, the maximum likelihood estimate for the variance is given by the naive normalization, \[ \sum_{i=1}^n \frac{\left(Y_i - \hat{Y}_i\right)^2}{n}. \]
  • For large values of \( n \), the difference in the denominator between \( n \) and \( n-p \) may become small (with a reasonable number of parameters), but for small sample sizes this is quite an important difference.
  • We can see that the maximum likelihood estimate will severely underestimate \( \sigma^2 \) when \( n \) is small, or whenever \( p \) is close to \( n \), as the comparison sketched below illustrates.
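A small numerical comparison (illustrative, with assumed simulated data) makes the bias concrete for \( p = 2 \) and a small \( n \):

```python
import numpy as np

# Compare the biased MLE variance estimate (divide by n) with the
# unbiased estimate RSS / (n - p) on simulated data with p = 2.
rng = np.random.default_rng(3)
n, p, sigma_true = 10, 2, 1.0
X = np.linspace(0.0, 5.0, n)
Y = 1.0 + 2.0 * X + rng.normal(scale=sigma_true, size=n)

# Least squares fit and the residual sum of squares (RSS).
b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0 = Y.mean() - b1 * X.mean()
rss = np.sum((Y - b0 - b1 * X) ** 2)

print("MLE estimate (RSS / n):           ", rss / n)
print("Unbiased estimate (RSS / (n - p)):", rss / (n - p))
```

Averaged over many such simulations, RSS / n falls below the true \( \sigma^2 = 1 \), while RSS / (n - p) does not, since \( \mathbb{E}[RSS] = (n-p)\sigma^2 \).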

Gaussian correlation models

  • So far, we have taken the relatively simple formulation of the regression problem in which the explanatory variable \( X \) is assumed to be a controllable constant value.
    • In this context, the variation that we refer to in \( \epsilon_i \) is prototypically defined in terms of repeated sampling of replicated cases.
    • Particularly, we are considering scenarios in which we can hold the value of \( X_i \) constant, and take multiple independent samples of the associated value of the response \( Y_i \).
  • However, there are situations in which we want to consider both the response \( Y \) and the predictor \( X \) to be random variables.
    • For example, an analyst may be interested in two variables, “height of person” and “weight of person”, in a study of a sample of persons, with each variable being taken as random.
    • The analyst might wish to study the relation between the two variables or might be interested in
      • making inferences about weight of a person on the basis of the person’s height;
      • making inferences about height on the basis of weight;
      • or in both.
  • In this context, we can treat our regression as a correlation model.
  • In the case of a joint Gaussian distribution for \( Y \) and \( X \), our techniques in regression remain essentially the same, but the framework for our analysis and the meaning of parameters will change.
  • We will discuss this change of framework in the remainder of the lecture.

Bivariate Gaussian distribution

Image of bivariate Gaussian in the real 2-D plane, with vertical direction corresponding to the density value at each point

Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition

  • To begin, we will recall the basis of the Gaussian correlation model — the multivariate Gaussian distribution.
  • Reframing slightly, we suppose that \( Y_1 \) and \( Y_2 = X \) are jointly distributed with respect to a Gaussian distribution as on the left.
    • In this case, we re-write \( X \) as \( Y_2 \), as its meaning is symmetric with the response.
  • The bivariate Gaussian has the functional form of the below, where
    1. \( \mu_i,\sigma_i \) are the mean and standard deviation of \( Y_i \) respectively;
    2. \( \rho_{12} \triangleq \frac{\sigma_{12}}{\sigma_1 \sigma_2} \) is the coefficient of correlation between \( Y_1,Y_2 \), where \( \sigma_{12} \) is their covariance;
    3. in the figure on the left, we identify \( f(Y_1,Y_2) \) as \( p_{Y_1,Y_2}(y_1,y_2) \).
\[ \begin{align} p_{Y_1,Y_2}(y_1,y_2) = \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1 - \rho_{12}^2}}\mathrm{exp}\left\{-\frac{1}{2\left(1 - \rho_{12}^2\right)} \left[\left(\frac{y_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y_2 - \mu_2}{\sigma_2}\right)^2 - 2\rho_{12}\left( \frac{y_1 - \mu_1}{\sigma_1}\right)\left( \frac{y_2 - \mu_2}{\sigma_2}\right)\right] \right\} \end{align} \]
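For reference, a minimal sketch of evaluating this density with scipy (parameter values are assumed for illustration), using the fact that the covariance is \( \sigma_{12} = \rho_{12}\sigma_1\sigma_2 \):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate the bivariate Gaussian density via its mean vector and
# covariance matrix; all parameter values below are illustrative.
mu_1, mu_2 = 0.0, 0.0
sigma_1, sigma_2, rho_12 = 1.0, 2.0, 0.6
cov = np.array([
    [sigma_1 ** 2,                 rho_12 * sigma_1 * sigma_2],
    [rho_12 * sigma_1 * sigma_2,   sigma_2 ** 2],
])

density = multivariate_normal(mean=[mu_1, mu_2], cov=cov)
print(density.pdf([0.5, -1.0]))  # value of p_{Y_1, Y_2} at one point
```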

Marginal Distributions