Simple linear regression in matrices

09/16/2020

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- A matrix approach to simple regression
- Reviewing random vectors and matrices
- Redefining regression in matrix notation
- Consequences of the matrix / geometric interpretation

A matrix approach to simple regression

We have introduced now the basic framework that will underpin our regression analysis;
most of the ideas encountered will generalize into higher dimensions (multiple predictors) without significant changes.
Particularly, it will be convenient now to re-introduce our simple regression in terms of vectors and matrices.
This will transition directly into the case where we have, rather than a line parametrizing the mean of the response, a hyper-plane.

A review of random vectors

We will consider again the vector of all observed cases, \[ \mathbf{Y} = \begin{pmatrix} Y_1 , & \cdots, & Y_n \end{pmatrix}^\mathrm{T}. \]
When we want to take the expectation of a random vector, we can do so component-wise;

here, each component function is being integrated with respect to the random outcomes, such that, \[ \begin{align}\mathbb{E}\left[ \mathbf{Y}\right] &\triangleq \begin{pmatrix} \mathbb{E}\left[Y_1\right], & \cdots, & \mathbb{E}\left[Y_n \right] \end{pmatrix}^\mathrm{T} \\ & = \begin{pmatrix} \beta_0 + \beta_1 X_1 , & \cdots, & \beta_0 + \beta_1 X_n \end{pmatrix}^\mathrm{T} \end{align} \]
Likewise, the component-wise definition of the expectation extends to random matrices.

The covariance matrix is defined similarly to the variance of a scalar random variable, but in terms of a matrix product, \[ cov(\mathbf{Y}) \triangleq \mathbb{E}\left\{ \left(\mathbf{Y} -\mathbb{E}\left[ \mathbf{Y} \right] \right) \left(\mathbf{Y} - \mathbb{E}\left[ \mathbf{Y} \right]\right)^\mathrm{T} \right\}. \]
Q: recall that the covariance of two scalar random variables \( Y_1,Y_2 \) is defined \[ cov(Y_1, Y_2) = \sigma_{Y_1,Y_2} = \mathbb{E}\left[ \left( Y_1 - \mu_{Y_1} \right) \left(Y_2 - \mu_{Y_2}\right)\right] \]
Suppose that the random vector \( \mathbf{Y} \) is given as, \[ \mathbf{Y} \triangleq \begin{pmatrix} Y_1, & Y_2, & Y_3 \end{pmatrix}^\mathrm{T}; \] work with a partner to determine the entries of \( cov(\mathbf{Y}) \). Is this the same as \[ \mathbb{E}\left\{ \left(\mathbf{Y} -\mathbb{E}\left[ \mathbf{Y} \right] \right)^\mathrm{T} \left(\mathbf{Y} - \mathbb{E}\left[ \mathbf{Y} \right]\right) \right\}? \]

A review of random vectors – continued

Solution: we note that the definition of the covariance is in terms of the outer product of the vectors, i.e., for a vector \( \mathbf{X} = \begin{pmatrix} X_1, & X_2, & X_3\end{pmatrix}^\mathrm{T} \) we have \[ \begin{align} \mathbf{X}\mathbf{X}^\mathrm{T} = \begin{pmatrix} X_1 * X_1 & X_1 * X_2 & X_1 * X_3 \\ X_2 * X_1 & X_2 * X_2 & X_2 * X_3 \\ X_3 * X_1 & X_3 * X_2 & X_3 * X_3 \end{pmatrix}. \end{align} \]
Therefore, the \( ij \)-th entry of, \[ \left(\mathbf{Y} -\mathbb{E}\left[ \mathbf{Y} \right] \right) \left(\mathbf{Y} - \mathbb{E}\left[ \mathbf{Y} \right]\right)^\mathrm{T} , \] is precisely given by, \[ \left(Y_i - \mu_{Y_i}\right) \left(Y_j - \mu_{Y_j}\right). \]
Following the definition of the expectation component-wise on the matrix, \[ cov(\mathbf{Y}) = \begin{pmatrix} \sigma_{1}^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{2}^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32}^2 & \sigma_{3}^2 \end{pmatrix}. \]

A review of random vectors – continued

Comparatively, interchanging the transpose, we find, \[ \mathbf{X}^\mathrm{T}\mathbf{X} = \sum_{i=1}^3 X_i^2, \] i.e., the standard inner product.
If we compute, \[ \begin{align}\mathbb{E}\left\{ \left(\mathbf{Y} -\mathbb{E}\left[ \mathbf{Y} \right] \right)^\mathrm{T} \left(\mathbf{Y} - \mathbb{E}\left[ \mathbf{Y} \right]\right) \right\} &= \mathbb{E}\left\{ \sum_{i=1}^3 \left(Y_i - \mu_{Y_i}\right)^2 \right\} \\ &= \sum_{i=1}^3 \sigma^2_{i}. \end{align} \]
Q: based on the previous question, if we take the assumptions of the Gauss-Markov theorem for the response variable \( \mathbf{Y} \), what are the components of, \[ cov(\mathbf{Y})? \]
A: this will be matrix with \( \sigma^2 \) on the diagonal (constant variance assumption) and zeros on the off-diagonal (uncorrelated assumption), \[ cov(\mathbf{Y} ) = \begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix}= \sigma^2 \mathbf{I}; \] in the above \( \mathbf{I} \) is the identity matrix.

A review of random vectors – continued

Supoose that \( \mathbf{A} \) is a constant matrix, \( \mathbf{Y} \) is a random vector and \( \mathbf{W} = \mathbf{A} \mathbf{Y} \) is a random vector by the identity.
We will recall a few basic results to use throughout:

\( \mathbb{E}\left[ \mathbf{A}\right] = \mathbf{A} \);
\( \mathbb{E}\left[ \mathbf{W}\right] = \mathbf{A} \mathbb{E}\left[\mathbf{Y}\right] \)
\( cov(\mathbf{W}) = \mathbf{A} cov(\mathbf{Y})\mathbf{A}^\mathrm{T} \)

The above results can all be found directly from the definition of the expectation being taken component-wise, and by the definition of matrix multiplication.

Taking the right rearrangement of terms, we can find each identity, and we will not stress this algebra.

Simple regression in matrices

We recall again our usual regression model and assumptions, but we will frame this in terms of a system of matrix equations: \[ \begin{align} Y_1 &= \beta_0 + \beta_1 X_1 + \epsilon_1 \\ Y_2 &= \beta_0 + \beta_1 X_2 + \epsilon_2 \\ \vdots & \\ Y_n &= \beta_0 + \beta_1 X_n + \epsilon_n \end{align} \]
We have several natural choices for vectors in this model, i.e., \[ \begin{align} \mathbf{Y}&\triangleq \begin{pmatrix} Y_1, & \cdots, & Y_n \end{pmatrix}^\mathrm{T}; & \boldsymbol{\epsilon}& \triangleq \begin{pmatrix} \epsilon_1, & \cdots, & \epsilon_n \end{pmatrix}^\mathrm{T} ;\\ \boldsymbol{\beta}& \triangleq \begin{pmatrix} \beta_0, & \beta_1 \end{pmatrix}^\mathrm{T}. \end{align} \]
Suppose we define an extended matrix for the explantory variable \( X \), which will include the intercept term, \[ \mathbf{X} \triangleq \begin{pmatrix} X_1 & X_2 & \cdots & X_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}^\mathrm{T}; \]
This extended matrix will be called the design matrix.
Then by the definition of matrix multiplication, we recover \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}. \]

Regression in matrices

Our general formula for a linear model will thus be of the form, \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}. \]
Q: how do we state our usual assumptions in matrix form?
A: we can write the following:

\( \mathbb{E}\left[\boldsymbol{\epsilon}\right] = \boldsymbol{0} \);
\( cov(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I} \).

In general, we will now assume that \( \mathbf{X} \in \mathbb{R}^{n\times p} \) where \( p \) is equal to the number of the parameters in the model.

Whenever, the model contains an intercept \( \beta_0 \), \( p \) equals the number of explanatory variables plus one.
In the case where we exclude an intercept, \( \beta_0 = 0 \), \( p \) equals the number of explanatory variables.

Normal equations in matrices

The normal equations likewise have a matrix form, for which the geometric meaning will come out more clearly.
Particularly, we consider if we wish the minimize the objective function, \[ \begin{align} J \left( \overline{\boldsymbol{\beta}} \right) &= \left(\mathbf{Y} - \mathbf{X}\overline{\boldsymbol{\beta}}\right)^\mathrm{T} \left(\mathbf{Y} - \mathbf{X}\overline{\boldsymbol{\beta}}\right), \end{align} \] in the free parameters \( \overline{\boldsymbol{\beta}} \),
we see that the least squares solution \( \hat{\boldsymbol{\beta}} \) minimizes the (squared) Euclidean norm of the residuals, \[ J\left(\hat{\boldsymbol{\beta}}\right) = \hat{\boldsymbol{\epsilon}}^\mathrm{T} \hat{\boldsymbol{\epsilon}}. \]
Consider that the vector of fitted values is defined as, \[ \hat{\mathbf{Y}} = \mathbf{X} \hat{\boldsymbol{\beta}}; \]
This says that the predicted values of \( Y \) are in the subspace spanned by the columns of the explanatory variables.

Normal equations in matrices – continued

We consider, thus, differentiating the objective function and setting to zero, \( \hat{\boldsymbol{\beta}} \) satisfies, \[ \mathbf{X}^\mathrm{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathrm{T} \mathbf{Y}. \]
let’s assume for the moment that \( \mathbf{X}^\mathrm{T}\mathbf{X} \) is invertible;
In this case, we find, \[ \hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \mathbf{Y}. \]
Q: if we suppose that \( \mathbf{X} \) is invertible, what does this tell us about the predicted values \( \hat{\mathbf{Y}} \)?
A: if \( \mathbf{X}^{-1} \) exists, then we find \[ \begin{align} \hat{\mathbf{Y}} &= \mathbf{X} \hat{\boldsymbol{\beta}} \\ &= \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \mathbf{Y} \\ & = \mathbf{X} \mathbf{X}^{-1} \mathbf{X}^{-\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{Y}\\ &= \mathbf{Y}. \end{align} \]
Q: under what conditions can we suppose that \( \mathbf{X}^{-1} \) exists?
A: the matrix \( \mathbf{X} \) must have linearly independent columns and be square, i.e., \( n=p \).
Therefore, if we increase our parameters to \( p=n \), we can find a solution by fitting the regression function to every case.
Recall, when \( n=p \) we also said we have no degrees of freedom to estimate our uncertainty — this is part of what is meant by “overfitting” the data;

particularly, we will create a regression function for which we cannot estimate the uncertainty and, almost surely, won’t generalize well to cases outside of the data at hand.

Normal equations in matrices – continued

Visualization of the orthogonal projection of the vector of observations into the span of the predictors.

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

Let’s suppose that \( p \) is strictly less than \( n \) such that \( \mathbf{X}^{-1} \) does not exist.
Q: what is the vector of fitted values \( \mathbf{Y} \) that looks most like the vector \( \mathbf{Y} \), yet is confined to the span of \( \mathbf{X} \)?
A: geometrically, we should reason that \( \hat{\mathbf{Y}} \) should be the orthogonal projection of \( \mathbf{Y} \) into the span of \( \mathbf{X} \).
Indeed, for the least squares solution \( \hat{\beta} \), \[ \hat{\mathbf{Y}} = \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \mathbf{Y} = \mathbf{H} \mathbf{Y}, \]
the matrix \[ \mathbf{H}\triangleq \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T}, \] which we denote the “hat” matrix, is precisely the orthogonal projection operator.

Therefore, taking the fitted value as the orthogonal projection minimizes the (squared) Euclidean norm of the residual, \[ RSS = \hat{\boldsymbol{\epsilon}}^\mathrm{T}\hat{\boldsymbol{\epsilon}}, \] as this is naturally the shortest path.

Remarks on the hat matrix

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

We note that the hat matrix is formed by multiplying \( \mathbf{X} \) on the left of the estimated parameters \( \hat{\boldsymbol{\beta}} \).
Particularly, we see that \( \hat{\boldsymbol{\beta}} \in \mathbb{R}^{p} \), such that multiplication takes \[ \begin{align} &\mathbb{R}^{p}& & \mathbb{R}^n \\ \mathbf{X}: &\hat{\boldsymbol{\beta}} & \mapsto & \hat{\mathbf{Y}} \end{align}, \]
while on the other hand, left multiplication of \( \mathbf{Y} \) by the matrix \( \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \) has the effect \[ \begin{align} &\mathbb{R}^{n}& & \mathbb{R}^p \\ \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T}: &\mathbf{Y} & \mapsto & \hat{\boldsymbol{\beta}} \end{align}. \]

The matrix \( \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \) is known as the (left) pseudo-inverse of \( \mathbf{X} \).
This operation has the effect of transferring \( \mathbf{Y} \) into its projected representation in the invariant coordinates for the space defined by the span of the columns of \( \mathbf{X} \).
Multiplying again by \( \mathbf{X} \) gives the representation of this coordinate system back in the full space, where \( span(\mathbf{X}) \) is embedded.
Pseudo-inverses are extremely powerful tools to use in general, and come out frequently in advanced mathematical and statistical tools.

We remark here, that a deep discussion of pseudo-inverses goes beyond the focus of the course, but this should be of great interest to advanced students and curious readers.

Remarks on the residuals

In matrix form, we now have another interpretation for the residuals, \[ \begin{align} \hat{\boldsymbol{\epsilon}} &= \mathbf{Y} - \hat{\mathbf{Y}} \\ & = \mathbf{Y} - \mathbf{H} \mathbf{Y} \\ & = \left( \mathbf{I} - \mathbf{H}\right) \mathbf{Y}. \end{align} \]
We can interpret the projection operator into the \( span(\mathbf{X}) \) as the map which projects an arbitrary vector \( \mathbf{V} \) into the subspace defined by \( span(\mathbf{X}) \), leaving only the components which reside there from the orginal vector.
Q: what is the effect thus of the operator \( \left(\mathbf{I}- \mathbf{H}\right) \) on an arbitrary vector?
A: the identity operator maps \( \mathbf{V} \) to itself, and thus the difference of \( \left(\mathbf{I}- \mathbf{H}\right) \) leaves only the components that are orthogonal to the subspace \( span(\mathbf{X}) \).
Therefore, \( \left(\mathbf{I}- \mathbf{H}\right) \) is the orthogonal projection operator to the space complimentary to \( span(\mathbf{X}) \).

Remarks on the residuals – continued

The new interpretation to the residuals sheds some light on the properties we have previously explored.
Exercise (2 minutes): we have shown in the homework now, the following properties:

\( \sum_{i=1}^n \hat{\epsilon}_i X_i = \hat{\boldsymbol{\epsilon}}^\mathrm{T}\mathbf{X}^{(2)} = 0 \), where \( \mathbf{X}^{(2)} \) denotes the second column of the matrix \( \mathbf{X} \).
\( \sum_{i=1}^n \hat{\epsilon}_i \hat{Y}_i = \hat{\boldsymbol{\epsilon}}\hat{\mathbf{Y}}= 0 \)

Discuss with a partner the new geometric meaning of these statements.
Solution: we know that by construction \( \hat{\boldsymbol{\epsilon}} \) is orthogonal to \( span(\mathbf{X}) \).
Thus, the two statements are direct consequences of the orthogonality.

Remarks on the residuals – continued

We also have a new interpretation for the \( RSS \) given our matrix form of the equation, \[ \begin{align} RSS &= \hat{\boldsymbol{\epsilon}}^\mathrm{T} \hat{\boldsymbol{\epsilon}} \\ & = \left[ \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y}\right]^\mathrm{T} \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y} \\ & =\mathbf{Y}^\mathrm{T} \left(\mathbf{I} - \mathbf{H}\right) \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y} \\ &=\mathbf{Y}^\mathrm{T} \left(\mathbf{I} - \mathbf{H}\right) \mathbf{Y}, \end{align} \] due to the properties of symmetry and idempotence.
Specifically, \( \left(\mathbf{I} - \mathbf{H}\right) \) can be shown to be a symmetric matrix, i.e., \( \left(\mathbf{I} - \mathbf{H}\right)^\mathrm{T} = \left(\mathbf{I} - \mathbf{H}\right) \).
Likewise, any projection operator can also be shown to be idempotent, i.e, \[ \mathbf{P}^2 = \mathbf{P}. \]
Taken together, we can also interpret the \( RSS \) as a weighted norm for the observation vector \( \mathbf{Y} \).

Remarks on the parameters

We have shown already in the activities that the least squares estimate \( \hat{\boldsymbol{\beta}} \) are an unbiased estimator for the true \( \boldsymbol{\beta} \).
In the next activity, we will show that this is the case for an arbitrary number of parameters using the matrix form for regression.
We will also identify that \[ \begin{align} cov\left(\hat{\boldsymbol{\beta}}\right) &= \sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \end{align} \]
The above quantity is an exact value for how much spread exists within the estimate of \( \hat{\boldsymbol{\beta}} \) about the mean \( \boldsymbol{\beta} \).
Notice, while the error covariance \( \sigma^2 \mathbf{I} \) is diagonal, the covariance of the parameter estimates \[ \sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \] is not guaranteed to be so.
That is to say, the parameter estimates themselves are generally correlated.

It is in the special case in which \( \mathbf{X}^\mathrm{T}\mathbf{X} \) is diagonal that we have uncorrelated estimates for the parameters;
we will return to this idea in a subsequent lecture.

Remarks on the parameters – continued

Notice, we now also have a clear form for the uncertainty in our estimated parameters.
Specifically, if we want to examine the standard deviation of an individual parameter, we can approximate this by the standard error, \[ se\left(\hat{\beta}_i \right) \triangleq \hat{\sigma} \sqrt{ \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)_{ii}^{-1}}, \] where:
1. \( \hat{\sigma} \) is the biased estimate for the standard deviation \( \sigma \);
2. we define \( \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)_{ii}^{-1} \) to be the \( i \)-th diagonal entry of the matrix \( \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \).
Under the condition that \( \boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2\mathbf{I}) \), we can thus create confidence intervals for each \( \hat{\beta}_i \) based on the student t distribution.

Even when the above assumption does not hold, we will often use this as an approximation where it is deemed appropriate.

Particularly, while the t test is designed for the mean of a Gaussian distribution, it tends to be robust as long as departures from normality aren’t extreme.

This is why we will typically still appeal to this test on our parameters provided that the sample size is sufficiently large.