Simple linear regression in matrices

09/16/2020


Outline

  • The following topics will be covered in this lecture:

    • A matrix approach to simple regression
    • Reviewing random vectors and matrices
    • Redefining regression in matrix notation
    • Consequences of the matrix / geometric interpretation

A matrix approach to simple regression

  • We have now introduced the basic framework that will underpin our regression analysis;
  • most of the ideas encountered will generalize to higher dimensions (multiple predictors) without significant changes.
  • In particular, it will be convenient now to re-introduce our simple regression in terms of vectors and matrices.
  • This will transition directly to the case where, rather than a line, a hyper-plane parametrizes the mean of the response.

A review of random vectors

  • We will consider again the vector of all observed cases, \mathbf{Y} = \begin{pmatrix} Y_1, & \cdots, & Y_n \end{pmatrix}^\mathrm{T}.
  • When we want to take the expectation of a random vector, we can do so component-wise;
    • here, each component function is integrated with respect to the random outcomes, such that, \mathbb{E}\left[\mathbf{Y}\right] \triangleq \begin{pmatrix} \mathbb{E}\left[Y_1\right], & \cdots, & \mathbb{E}\left[Y_n\right] \end{pmatrix}^\mathrm{T}.
    • Likewise, the component-wise definition of the expectation extends to random matrices.
  • The covariance matrix is defined similarly to the variance of a scalar random variable, but in terms of a matrix product, cov(\mathbf{Y}) \triangleq \mathbb{E}\left\{ \left(\mathbf{Y} -\mathbb{E}\left[ \mathbf{Y} \right] \right) \left(\mathbf{Y} - \mathbb{E}\left[ \mathbf{Y} \right]\right)^\mathrm{T} \right\}.
  • Q: recall that the covariance of two scalar random variables Y_1,Y_2 is defined as cov(Y_1, Y_2) = \sigma_{Y_1,Y_2} = \mathbb{E}\left[ \left( Y_1 - \mu_{Y_1} \right) \left(Y_2 - \mu_{Y_2}\right)\right]
  • Suppose that the random vector \mathbf{Y} is given as, \mathbf{Y} \triangleq \begin{pmatrix} Y_1, & Y_2, & Y_3 \end{pmatrix}^\mathrm{T}; work with a partner to determine the entries of cov(\mathbf{Y}) . Is this the same as \mathbb{E}\left\{ \left(\mathbf{Y} -\mathbb{E}\left[ \mathbf{Y} \right] \right)^\mathrm{T} \left(\mathbf{Y} - \mathbb{E}\left[ \mathbf{Y} \right]\right) \right\}?
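As a numerical illustration of the distinction raised in the activity, the following is a minimal NumPy sketch (the mean, covariance, and sample size are illustrative choices, not part of the lecture); it estimates \mathbb{E}\left[\mathbf{Y}\right] and cov(\mathbf{Y}) component-wise from simulated draws and contrasts the matrix-valued outer product with the scalar-valued inner product.

```python
import numpy as np

# A minimal sketch, assuming a hypothetical random vector Y = (Y1, Y2, Y3)^T
# drawn from a multivariate normal with an illustrative mean and covariance.
rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])

# Draw many realizations and estimate E[Y] and cov(Y) component-wise.
samples = rng.multivariate_normal(mu, Sigma, size=100_000)   # shape (100000, 3)
E_Y = samples.mean(axis=0)                                    # approximates E[Y]
centered = samples - E_Y
cov_Y = (centered.T @ centered) / (samples.shape[0] - 1)      # approximates E[(Y - E[Y])(Y - E[Y])^T]

print(cov_Y)   # a 3 x 3 matrix, close to Sigma

# By contrast, (Y - E[Y])^T (Y - E[Y]) is a scalar for each realization:
# the squared Euclidean norm of the centered vector, not a matrix.
print(centered[0] @ centered[0])
```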

Remarks on the residuals – continued

  • We also have a new interpretation for the RSS given our matrix form of the equation, \begin{align} RSS &= \hat{\boldsymbol{\epsilon}}^\mathrm{T} \hat{\boldsymbol{\epsilon}} \\ & = \left[ \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y}\right]^\mathrm{T} \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y} \\ & =\mathbf{Y}^\mathrm{T} \left(\mathbf{I} - \mathbf{H}\right) \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y} \\ &=\mathbf{Y}^\mathrm{T} \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y}, \end{align} due to the properties of symmetry and idempotence.
  • Specifically, \left(\mathbf{I} - \mathbf{H}\right) can be shown to be a symmetric matrix, i.e., \left(\mathbf{I} - \mathbf{H}\right)^\mathrm{T} = \left(\mathbf{I} - \mathbf{H}\right) .
  • Likewise, any projection operator can be shown to be idempotent, i.e., \mathbf{P}^2 = \mathbf{P}.
  • Taken together, we can also interpret the RSS as a weighted norm for the observation vector \mathbf{Y} .
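The following is a minimal NumPy sketch of these properties (the simulated data and variable names are illustrative): it builds the hat matrix \mathbf{H} = \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} for a simple regression, checks the symmetry and idempotence of \mathbf{I} - \mathbf{H} numerically, and verifies that \mathbf{Y}^\mathrm{T}\left(\mathbf{I} - \mathbf{H}\right)\mathbf{Y} matches the sum of squared residuals.

```python
import numpy as np

# A minimal sketch, assuming simulated data for a simple regression
# y = beta_0 + beta_1 * x + eps; the names and parameter values are illustrative.
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
Y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

# Design matrix with an intercept column and the hat matrix H = X (X^T X)^{-1} X^T.
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H              # the projection I - H onto the residual space

# Symmetry and idempotence of I - H (up to floating point error).
print(np.allclose(M, M.T))     # True: (I - H)^T = I - H
print(np.allclose(M @ M, M))   # True: (I - H)^2 = I - H

# RSS as a weighted norm of Y: Y^T (I - H) Y equals the sum of squared residuals.
residuals = M @ Y
print(np.isclose(residuals @ residuals, Y @ M @ Y))   # True
```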

Remarks on the parameters

  • We have shown already in the activities that the least squares estimate \hat{\boldsymbol{\beta}} is an unbiased estimator for the true \boldsymbol{\beta}.
  • In the next activity, we will show that this is the case for an arbitrary number of parameters using the matrix form for regression.
  • We will also identify that \begin{align} cov\left(\hat{\boldsymbol{\beta}}\right) &= \sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \end{align}
  • The above quantity gives an exact description of how much spread exists in the estimate \hat{\boldsymbol{\beta}} about its mean \boldsymbol{\beta}.
  • Notice, while the error covariance \sigma^2 \mathbf{I} is diagonal, the covariance of the parameter estimates \sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} is not guaranteed to be so.
  • That is to say, the parameter estimates themselves are generally correlated.
    • It is in the special case in which \mathbf{X}^\mathrm{T}\mathbf{X} is diagonal that we have uncorrelated estimates for the parameters;
    • we will return to this idea in a subsequent lecture.
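For concreteness, the sketch below (again with a simulated, illustrative design) evaluates \sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} directly and shows that its off-diagonal entries are generally nonzero; it also previews the special case above by centering the predictor, which makes \mathbf{X}^\mathrm{T}\mathbf{X} diagonal so that the estimates become uncorrelated.

```python
import numpy as np

# A minimal sketch, assuming a simulated simple-regression design; the goal is
# only to illustrate that cov(beta_hat) = sigma^2 (X^T X)^{-1} need not be diagonal.
rng = np.random.default_rng(2)
n = 50
sigma = 1.0
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])

cov_beta = sigma**2 * np.linalg.inv(X.T @ X)
print(cov_beta)   # off-diagonal entries are nonzero: the estimates are correlated

# Centering the predictor makes the columns of X orthogonal, so X^T X is diagonal
# and the intercept and slope estimates become uncorrelated.
Xc = np.column_stack([np.ones(n), x - x.mean()])
print(sigma**2 * np.linalg.inv(Xc.T @ Xc))   # (numerically) diagonal
```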

Remarks on the parameters – continued

  • Notice, we now also have a clear form for the uncertainty in our estimated parameters.
  • Specifically, if we want to examine the standard deviation of an individual parameter, we can approximate this by the standard error, se\left(\hat{\beta}_i \right) \triangleq \hat{\sigma} \sqrt{ \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)_{ii}^{-1}}, where:
    1. \hat{\sigma} is the biased estimate for the standard deviation \sigma ;
    2. we define \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)_{ii}^{-1} to be the i -th diagonal entry of the matrix \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} .
  • Under the condition that \boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2\mathbf{I}) , we can thus create confidence intervals for each \hat{\beta}_i based on the Student t distribution.
    • Even when the above assumption does not hold, we will often use this as an approximation where it is deemed appropriate.
  • Particularly, while the t test is designed for the mean of a Gaussian distribution, it tends to be robust as long as departures from normality aren’t extreme.
    • This is why we will typically still appeal to this test on our parameters provided that the sample size is sufficiently large.
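To make the computation concrete, the following is a minimal sketch (the simulated data, variable names, and 95% level are illustrative choices) that computes \hat{\sigma}, the standard errors se\left(\hat{\beta}_i\right) from the diagonal of \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}, and t-based confidence intervals for the parameters.

```python
import numpy as np
from scipy import stats

# A minimal sketch, assuming simulated simple-regression data; it computes the
# standard errors se(beta_i) = sigma_hat * sqrt[(X^T X)^{-1}_{ii}] and t-based
# confidence intervals for the parameters.
rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
Y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])

# Least squares estimate and residual sum of squares.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
residuals = Y - X @ beta_hat
p = X.shape[1]
sigma_hat = np.sqrt(residuals @ residuals / (n - p))   # sqrt of RSS / (n - p)

# Standard errors from the diagonal of (X^T X)^{-1}.
se = sigma_hat * np.sqrt(np.diag(XtX_inv))

# 95% confidence intervals based on the Student t distribution with n - p d.o.f.
t_crit = stats.t.ppf(0.975, df=n - p)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
for i in range(p):
    print(f"beta_{i}: {beta_hat[i]:.3f}  95% CI: ({lower[i]:.3f}, {upper[i]:.3f})")
```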