11/02/2020
When the errors have non-zero correlation (or non-constant variance), in certain cases we can reduce the problem back to our normal regression framework with generalized least squares.
Specifically, let's assume that the error covariance takes the more general form,
\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]
for a known structure matrix \( \boldsymbol{\Sigma} \) for the error, but perhaps an unknown scale coefficient \( \sigma^2 \).
This describes a situation in which we have knowledge of the correlation and relative variance between the errors, but we don't know the absolute scale of the variation.
Let us recall two properties of positive-semidefinite, symmetric matrices: their eigenvalues are real and non-negative, and their eigenvectors can be chosen to form an orthonormal basis.
All covariance matrices satisfy the above two properties by their construction. Therefore, it makes sense to say that \( \boldsymbol{\Sigma} \) has a well-defined square root,
\[ \begin{align} \sqrt{\boldsymbol{\Sigma}} \triangleq \mathbf{S} \end{align} \]
A simple choice of a square root is the symmetric square root,
\[ \begin{align} \mathbf{S} &\triangleq \mathbf{E} \mathrm{diag}\left(\sqrt{\lambda_i}\right) \mathbf{E}^\mathrm{T} \\ &= \mathbf{E} \begin{pmatrix} \sqrt{\lambda_1} & 0 & 0 &\cdots & 0 \\ 0 & \sqrt{\lambda_2} & 0 &\cdots & 0 \\ 0 & 0 & \sqrt{\lambda_3} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & \cdots & \sqrt{\lambda_n} \end{pmatrix} \mathbf{E}^\mathrm{T} \end{align} \]
where \( \lambda_i \) are the non-negative eigenvalues, and \( \mathbf{E} \) are the associated eigenvectors, of \( \boldsymbol{\Sigma} \).
By the orthogonality of the eigenvectors, it is easy to verify that \( \mathbf{S}^2 = \boldsymbol{\Sigma} \).
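To make this concrete, the following is a minimal sketch in R with an arbitrary, made-up \( 3 \times 3 \) matrix (not tied to any data in this lecture): it forms the symmetric square root from the eigendecomposition and verifies that \( \mathbf{S}^2 = \boldsymbol{\Sigma} \).
# a small, made-up symmetric positive-definite matrix standing in for Sigma
Sigma <- matrix(c(2.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.5), nrow=3, byrow=TRUE)
e <- eigen(Sigma, symmetric=TRUE)                          # eigenvalues and orthonormal eigenvectors
S <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)   # symmetric square root
all.equal(S %*% S, Sigma)                                  # TRUE, up to numerical precision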
However, for matrices, we can define the square root in several different ways with different properties. A special alternative type of square root is given by a Cholesky factorization…
When \( \boldsymbol{\Sigma} \) is positive-definite (all strictly positive \( \lambda_i \)), there is a unique decomposition of \( \boldsymbol{\Sigma} \) as,
\[ \begin{align} \boldsymbol{\Sigma} \triangleq \mathbf{L} \mathbf{L}^\mathrm{T} \end{align} \]
such that \( \mathbf{L} \) is lower triangular, with positive entries on the diagonal.
That is, the matrix \( \mathbf{L} \) is of the form,
\[ \begin{align} \mathbf{L} = \begin{pmatrix} L_{11} & 0 & 0 &\cdots & 0 \\ L_{21} & L_{22} & 0 &\cdots & 0 \\ L_{31} & L_{32} & L_{33} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ L_{n1} & \cdots & \cdots & \cdots & L_{nn} \end{pmatrix} \end{align} \]
When we allow \( \lambda_i=0 \) for some \( i \), the decomposition will not be unique but will still exist.
This is a mild generalization of the idea of a “square root,” where we now only ask that \( \mathbf{S} \mathbf{S}^\mathrm{T} = \boldsymbol{\Sigma} \).
Cholesky decompositions are extremely useful in many areas of mathematics, and especially for our case where the covariance matrix will be strictly positive-definite.
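In R, the Cholesky factor can be computed with the base function chol; note that chol returns the upper-triangular factor \( \mathbf{R} \) with \( \mathbf{R}^\mathrm{T}\mathbf{R} = \boldsymbol{\Sigma} \), so the lower-triangular factor discussed here is its transpose. A minimal sketch, reusing the small made-up matrix from before:
# a small, made-up symmetric positive-definite matrix standing in for Sigma
Sigma <- matrix(c(2.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.5), nrow=3, byrow=TRUE)
L <- t(chol(Sigma))           # chol() returns upper-triangular R; its transpose is lower triangular
all.equal(L %*% t(L), Sigma)  # TRUE: L has positive diagonal entries and L L^T recovers Sigma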
Returning to the regression analysis, we suppose that
\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]
We will write \( \boldsymbol{\Sigma} = \mathbf{S} \mathbf{S}^\mathrm{T} \) in a Cholesky decomposition.
Then, considering our form for the signal
\[ \begin{align} \mathbf{Y}& = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}&\\ \Leftrightarrow \mathbf{S}^{-1} \mathbf{Y}& = \mathbf{S}^{-1}\mathbf{X} \boldsymbol{\beta} + \mathbf{S}^{-1}\boldsymbol{\epsilon}\\ \Leftrightarrow \mathbf{Y}' &= \mathbf{X}' \boldsymbol{\beta} + \boldsymbol{\epsilon}' \end{align} \]
for \( \mathbf{Y}',\mathbf{X}', \boldsymbol{\epsilon}' \) as defined by the relationship above.
Q: what is the mean of \( \boldsymbol{\epsilon}' \)?
A: this can be performed as usual, distributing the expectation through the non-random component,
\[ \begin{align} \mathbb{E}\left[ \boldsymbol{\epsilon}' \right] &= \mathbb{E} \left[ \mathbf{S}^{-1}\boldsymbol{\epsilon}\right] \\ &= \mathbf{S}^{-1} \mathbb{E}\left[ \boldsymbol{\epsilon}\right] \\ &= \boldsymbol{0} \end{align} \]
Q: assuming that \( \mathrm{cov}(\boldsymbol{\epsilon}) = \sigma^2 \boldsymbol{\Sigma} \), compute the covariance of \( \boldsymbol{\epsilon}'\triangleq \mathbf{S}^{-1}\boldsymbol{\epsilon} \).
A: noting that \( \boldsymbol{\epsilon}' \) is mean zero, then we can compute,
\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) &= \mathbb{E}\left[ \left(\boldsymbol{\epsilon}'\right) \left(\boldsymbol{\epsilon}'\right)^\mathrm{T}\right] \\ &=\mathbb{E}\left[\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)^\mathrm{T}\right]\\ &=\mathbf{S}^{-1}\mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\right] \mathbf{S}^{-\mathrm{T}}\\ &= \mathbf{S}^{-1} \sigma^2 \boldsymbol{\Sigma} \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2\mathbf{S}^{-1}\left(\mathbf{S} \mathbf{S}^\mathrm{T} \right) \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2 \mathbf{I} \end{align} \]
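This whitening property can also be checked by simulation. Below is a minimal sketch under made-up values (a \( 3 \times 3 \) \( \boldsymbol{\Sigma} \), \( \sigma^2 = 4 \), and many replicated draws), using mvrnorm from the MASS package to sample correlated errors; the sample covariance of the transformed errors comes out approximately equal to \( \sigma^2 \mathbf{I} \).
library(MASS)                           # provides mvrnorm for multivariate normal draws
set.seed(1)
Sigma <- matrix(c(2.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.5), nrow=3, byrow=TRUE)
sigma2 <- 4
S <- t(chol(Sigma))                                          # a Cholesky square root of Sigma
eps <- mvrnorm(100000, mu=rep(0, 3), Sigma=sigma2 * Sigma)   # rows are draws of epsilon
eps_prime <- t(solve(S, t(eps)))                             # apply S^{-1} to each draw
round(cov(eps_prime), 2)                                     # approximately sigma2 times the identity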
Q: How might we use this analysis to transform the variables into something easier to work with? In particular, what variables might satisfy the assumptions of ordinary least squares?
A: Supposing that we can compute the variables,
\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} \\ \mathbf{Y}' &\triangleq \mathbf{S}^{-1} \mathbf{Y} \end{align} \]
the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies our usual Gauss-Markov assumptions, for error \( \boldsymbol{\epsilon}' \) such that \( \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I} \).
Because the assumptions of the Gauss-Markov theorem hold for the transformed variables, we can re-derive the least-squares solution minimizing the associated residual sum of squares error.
Specifically, we can write,
\[ \begin{align} \hat{\boldsymbol{\epsilon}}' &= \mathbf{Y}' - \hat{\mathbf{Y}}'\\ & =\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}'\right) \end{align} \]
for some choice of \( \hat{\boldsymbol{\beta}}' \).
This choice of \( \hat{\boldsymbol{\beta}}' \) should be the one that minimizes the equation,
\[ \begin{align} J\left(\overline{\boldsymbol{\beta}}\right)&= \left(\hat{\boldsymbol{\epsilon}}'\right)^\mathrm{T}\left(\hat{\boldsymbol{\epsilon}}'\right) \\ & = \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\mathbf{S}^{-\mathrm{T}}\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)\\ &= \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\boldsymbol{\Sigma}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right) \end{align} \] for all choices of \( \overline{\boldsymbol{\beta}} \).
By substitution for our usual solution for the minimizer, with \( \mathbf{X}' \) and \( \mathbf{Y}' \), we find
\[ \begin{align} \hat{\boldsymbol{\beta}}' &\triangleq \left( \left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1}\left(\mathbf{X}'\right)^\mathrm{T} \mathbf{Y}'\\ &=\left(\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1}\mathbf{Y} \\ & =\left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1}\mathbf{Y}. \end{align} \]
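As a numerical sanity check, the sketch below simulates a small made-up example (one predictor, \( n = 100 \), and an AR(1)-style structure matrix chosen purely for illustration) and computes the estimate twice: once by the closed form above, and once by ordinary least squares on the whitened variables. The two coincide.
set.seed(2)
n <- 100
X <- cbind(1, runif(n))                          # design matrix with an intercept column
beta <- c(2, -1)                                 # made-up true coefficients
Sigma <- 0.5^abs(outer(1:n, 1:n, "-"))           # made-up AR(1)-type structure matrix
S <- t(chol(Sigma))                              # Cholesky square root
Y <- drop(X %*% beta + S %*% rnorm(n))           # errors have covariance Sigma (sigma^2 = 1)
# closed-form generalized least squares estimate
beta_gls <- drop(solve(t(X) %*% solve(Sigma, X), t(X) %*% solve(Sigma, Y)))
# ordinary least squares on the whitened variables gives the same estimate
Xp <- solve(S, X); Yp <- drop(solve(S, Y))
beta_whitened <- coef(lm(Yp ~ Xp - 1))
cbind(gls=beta_gls, whitened=beta_whitened)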
Q: recalling that in ordinary least squares \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}\right)= \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\sigma^2 \), how can we compute \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) \)?
A: knowing that the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies the hypotheses of ordinary least squares, the simple method is substitution. I.e.,
\[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) &= \left(\left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1} \sigma^2\\ &=\left(\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1} \sigma^2\\ &= \left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X} \right)^{-1} \sigma^2 \end{align} \]
Because the error \( \boldsymbol{\epsilon}' = \mathbf{S}^{-1} \boldsymbol{\epsilon} \) satisfies \[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I}, \end{align} \] the transformation allows us to perform our usual diagnostics.
Specifically, the residuals are likewise transformed along with the other variables to give,
\[ \begin{align} \hat{\boldsymbol{\epsilon}}'&\triangleq \mathbf{Y}' - \hat{\mathbf{Y}}' \\ &=\mathbf{S}^{-1} \hat{\boldsymbol{\epsilon}} \end{align} \]
where the residuals \( \hat{\boldsymbol{\epsilon}}' \) will have the same properties as usual when \( \boldsymbol{\Sigma} \) is chosen correctly.
However, the issue is that \( \boldsymbol{\Sigma} \) is unknown in practice, except in special cases, and it typically must be estimated.
As a special case of generalized least squares, we can consider the case when the errors are uncorrelated, but have unequal variances.
That is to say, \( \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \boldsymbol{\Sigma} \) where,
\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \] for possibly \( w_i \neq w_j \), but \( \boldsymbol{\Sigma} \) is strictly diagonal.
Here, we will refer to the reciprocals of the individual variances, \( w_i \), as the weights.
Q: in the case
\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \]
what is the square root of \( \boldsymbol{\Sigma} \)?
A: in this case, we can compute the square root simply as the component-wise square root, since diagonal matrices are a special class of matrices that are already in their eigendecomposition,
\[ \mathbf{S} = \begin{pmatrix} \frac{1}{\sqrt{w_1}}& 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{w_2}} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{w_n}} \end{pmatrix} \]
As a special case of generalized least squares, we can compute the transformation into the variables \( \mathbf{X}', \mathbf{Y}', \boldsymbol{\epsilon}' \) as
\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} = \mathrm{diag}\left(\sqrt{w_i} \right)\mathbf{X}, \\ \mathbf{Y}' & \triangleq \mathbf{S}^{-1} \mathbf{Y} = \mathrm{diag}\left(\sqrt{w_i} \right) \mathbf{Y}, \\ \boldsymbol{\epsilon}' & \triangleq \mathbf{S}^{-1} \boldsymbol{\epsilon} = \mathrm{diag}\left(\sqrt{w_i} \right)\boldsymbol{\epsilon}; \end{align} \]
We note, the column corresponding to the intercept term in the matrix \( \mathbf{X} \) will be a vector of the form,
\[ \begin{align} \begin{pmatrix} \sqrt{w_1}, &\cdots, & \sqrt{w_n}\end{pmatrix}^\mathrm{T}, \end{align} \] instead of the usual vector of ones.
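In this diagonal case, the transformation is exactly what lm performs internally when supplied a weights argument. The sketch below uses made-up weights on the cars data (chosen only for illustration, and unrelated to the variance model fitted later) to show that the explicit transformation and the weights argument give the same coefficients.
w <- 1 / (1 + cars$speed)                 # hypothetical weights: variance taken to grow with speed
# explicit transformation: scale each row of the design matrix and the response by sqrt(w_i)
Xp <- sqrt(w) * cbind(1, cars$speed)      # note the first column is now sqrt(w_i), not ones
Yp <- sqrt(w) * cars$dist
coef(lm(Yp ~ Xp - 1))                     # regression on the transformed variables
# equivalently, weighted least squares through the weights argument
coef(lm(dist ~ speed, data=cars, weights=w))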
Qualitatively, we can note that observations with large variance,
\[ \begin{align} \frac{1}{\sqrt{w_i}} \gg 1, \end{align} \]
will receive low weight in the regression (\( \sqrt{w_i} \ll 1 \)), while observations with small variance,
\[ \begin{align} \frac{1}{\sqrt{w_i}} \ll 1, \end{align} \]
will receive high weight in the regression (\( \sqrt{w_i} \gg 1 \)).
In examples where the variances are unknown, we may consider an ansatz for the form of the dependence of the variance on the observation.
Let's consider data on the stopping distance of cars with respect to the observed speed.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))                              # widen the plot margins for the slide figure
lmod <- lm(dist ~ speed, cars)                                       # ordinary least squares fit of stopping distance on speed
plot(residuals(lmod) ~ speed, cars, cex=3, cex.lab=3, cex.axis=1.5)  # residuals versus speed, to inspect how their spread changes
As an ansatz, we may consider a dependence relationship as,
\[ \begin{align} \sigma_i^2 = \gamma_0 + X_i^{\gamma_1} \end{align} \]
These coefficients, representing a power-law increase in the variance with the speed of the vehicle, can be estimated simultaneously with the parameters for the regression.
Here, we will use the generalized least squares function gls, provided by the nlme library (attached when mgcv is loaded), supplying a variance function through the weights argument.
library("mgcv")
wlmod <- gls(dist ~ speed, data=cars, weight = varConstPower(1, form= ~ speed))
summary(wlmod)
Generalized least squares fit by REML
Model: dist ~ speed
Data: cars
AIC BIC logLik
412.8352 422.1912 -201.4176
Variance function:
Structure: Constant plus power of variance covariate
Formula: ~speed
Parameter estimates:
const power
3.160444 1.022368
Coefficients:
Value Std.Error t-value p-value
(Intercept) -11.085378 4.052378 -2.735524 0.0087
speed 3.484162 0.320237 10.879947 0.0000
Correlation:
(Intr)
speed -0.9
Standardized residuals:
Min Q1 Med Q3 Max
-1.4520579 -0.6898209 -0.1308277 0.6375029 3.0757014
Residual standard error: 0.7636833
Degrees of freedom: 50 total; 48 residual
summary(lm(dist ~ speed, data=cars))
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Compared to the weighted least squares formulation, which accounts for the power-law increase in variance, the ordinary least squares fit above is significantly poorer.