11/02/2020

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:

This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

- The following topics will be covered in this lecture:
- Correlated errors
- Matrix square roots
- Generalized least squares
- Weighted least squares
- Simple example of weighted least squares
- Detailed example of weighted least squares

When there is non-zero correlation of the error (or non-constant variance), in certain cases we can successfully reduce this back to our normal regression framework with generalized least squares.

Particularly, let's assume that the error takes a particular (but more general) form,

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]

for a

*known structure matrix*for the error \( \boldsymbol{\Sigma} \) but perhaps an*unkown scale coefficient*\( \sigma^2 \).This describes a sittuation in which we have knowledge of the correlation and relative variance between the errors, but we don't know the absolute scale of the variation.

Let us recall some properties of postive-semidefinite, symmetric matrices:

- Positive-semidefinite matrices (by definition) have all non-negative eigenvalues.
- Symmetric matrices (by the spectral theorem) must have a decomposition into an orthogonal eigenbasis, one for which each basis vector is a principle axis corresponding to one of the eigenvalue scales of growth or decay.

All covariance matrices satisfy the above two properties by their construction. Therefore, it makes sense to say that \( \boldsymbol{\Sigma} \) has a well defined square root,

\[ \begin{align} \sqrt{\boldsymbol{\Sigma}} \triangleq \mathbf{S} \end{align} \]

A simple choice of a square root is the

**symmetric**square root,\[ \begin{align} \mathbf{S} &\triangleq \mathbf{E} \mathrm{diag}\left(\sqrt{\lambda_i}\right) \mathbf{E}^\mathrm{T} \\ &= \mathbf{E} \begin{pmatrix} \sqrt{\lambda_1} & 0 & 0 &\cdots & 0 \\ 0 & \sqrt{\lambda_2} & 0 &\cdots & 0 \\ 0 & 0 & \sqrt{\lambda_3} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & \cdots & \sqrt{\lambda_n} \end{pmatrix} \mathbf{E}^\mathrm{T} \end{align} \]

where \( \lambda_i \) are the non-negative eigenvalues, and \( \mathbf{E} \) are the associated eigenvectors, of \( \boldsymbol{\Sigma} \).

By the orthogonality of the eigenvectors, it is easy to verify that \( \mathbf{S}^2 = \boldsymbol{\Sigma} \).

However, for matrices, we can define the square root in several different ways with different properties. A special alternative type of square root is given by a Cholesky factorization…

When \( \boldsymbol{\Sigma} \) is positive-definite (all strictly positive \( \lambda_i \)) there is a

**unique**decompostition of \( \mathbf{\Sigma} \) as,\[ \begin{align} \boldsymbol{\Sigma} \triangleq \mathbf{L} \mathbf{L}^\mathrm{T} \end{align} \]

such that \( \mathbf{L} \) is lower triangular, with positive entries on the diagonal.

That is, the matrix \( \mathbf{L} \) is of the form,

\[ \begin{align} \mathbf{L} = \begin{pmatrix} L_{11} & 0 & 0 &\cdots & 0 \\ L_{21} & L_{22} & 0 &\cdots & 0 \\ L_{31} & L_{31} & L_{33} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ L_{n1} & \cdots & \cdots & \cdots & L_{nn} \end{pmatrix} \end{align} \]

When we allow \( \lambda_i=0 \) for some \( i \), the decomposition

**will not be unique**but will still exist.This is a weak generalization of the idea of a “square-root” where now we ask that \( \mathbf{S} \mathbf{S}^\mathrm{T} = \boldsymbol{\Sigma} \).

Cholesky decompositions are extremely useful in many areas of mathematics, and especially for our case where the covariance matrix will be strictly positive-definite.

Returning to the regression analysis, we suppose that

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]

We will write \( \boldsymbol{\Sigma} = \mathbf{S} \mathbf{S}^\mathrm{T} \) in a Cholesky decomposition.

Then, considering our form for the signal

\[ \begin{align} \mathbf{Y}& = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}&\\ \Leftrightarrow \mathbf{S}^{-1} \mathbf{Y}& = \mathbf{S}^{-1}\mathbf{X} \boldsymbol{\beta} + \mathbf{S}^{-1}\boldsymbol{\epsilon}\\ \Leftrightarrow \mathbf{Y}' &= \mathbf{X}' \boldsymbol{\beta} + \boldsymbol{\epsilon}' \end{align} \]

for \( \mathbf{Y}',\mathbf{X}', \boldsymbol{\epsilon}' \) as defined by the relationship above.

**Q:**what is the mean of \( \boldsymbol{\epsilon}' \)?**A:**this can be performed as usual, distributing the expectation through the non-random component,\[ \begin{align} \mathbb{E}\left[ \boldsymbol{\epsilon}' \right] &= \mathbb{E} \left[ \mathbf{S}^{-1}\boldsymbol{\epsilon}\right] \\ &= \mathbf{S}^{-1} \mathbb{E}\left[ \boldsymbol{\epsilon}\right] \\ &= \boldsymbol{0} \end{align} \]

**Q:**assuming that \( \mathrm{cov}(\boldsymbol{\epsilon}) = \sigma^2 \boldsymbol{\Sigma} \), compute the covariance of \( \boldsymbol{\epsilon}'\triangleq \mathbf{S}^{-1}\boldsymbol{\epsilon} \).**A:**noting that \( \boldsymbol{\epsilon}' \) is mean zero, then we can compute,\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) &= \mathbb{E}\left[ \left(\boldsymbol{\epsilon}'\right) \left(\boldsymbol{\epsilon}'\right)^\mathrm{T}\right] \\ &=\mathbb{E}\left[\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)^\mathrm{T}\right]\\ &=\mathbf{S}^{-1}\mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\right] \mathbf{S}^{-\mathrm{T}}\\ &= \mathbf{S}^{-1} \sigma^2 \boldsymbol{\Sigma} \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2\mathbf{S}^{-1}\left(\mathbf{S} \mathbf{S}^\mathrm{T} \right) \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2 \mathbf{I} \end{align} \]

**Q:**How might we use this analysis to transform the variables into something easier to work with? In particular, what variables might satisfy the assumptions of ordinary least squares?**A:**Supposing that we can compute the variables,\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} \\ \mathbf{Y}' &\triangleq \mathbf{S}^{-1} \mathbf{Y} \end{align} \]

the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies our usual Gauss-Markov assumptions, for error \( \boldsymbol{\epsilon}' \) such that \( \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I} \).

Because the assumptions of the Gauss-Markov theorem hold for the transformed variables, we can re-derive the least-squares solution minimizing the associated residual sum of squares error.

Specifically, we can write,

\[ \begin{align} \hat{\boldsymbol{\epsilon}}' &= \mathbf{Y}' - \hat{\mathbf{Y}}'\\ & =\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}'\right) \end{align} \]

for some choice of \( \hat{\boldsymbol{\beta}}' \).

This choice of \( \hat{\boldsymbol{\beta}}' \) should be the one that minimizes the equation,

\[ \begin{align} J\left(\overline{\boldsymbol{\beta}}\right)&= \left(\hat{\boldsymbol{\epsilon}}'\right)^\mathrm{T}\left(\hat{\boldsymbol{\epsilon}}'\right) \\ & = \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\mathbf{S}^{-\mathrm{T}}\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)\\ &= \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\boldsymbol{\Sigma}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right) \end{align} \] for all choices of \( \overline{\boldsymbol{\beta}} \).

- Note that the least-squares objective function is now a weighted norm-squared in terms of the inverse structure matrix.

By substitution for our usual solution for the minimizer, with \( \mathbf{X}' \) and \( \mathbf{Y}' \), we find

\[ \begin{align} \hat{\boldsymbol{\beta}}' &\triangleq \left( \left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1}\left(\mathbf{X}'\right)^\mathrm{T} \mathbf{Y}'\\ &=\left(\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1}\mathbf{Y} \\ & =\left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1}\mathbf{Y}. \end{align} \]

**Q:**recalling that in ordinary least squares, \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}\right)= \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\sigma^2 \) is given how can we compute \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) \)**A:**knowing that the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies the hypotheses of ordinary least squares, the simple method is substitution. I.e.,\[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) &= \left(\left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1} \sigma^2\\ &=\left(\mathbf{X} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1} \sigma^2\\ &= \left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X} \right)^{-1} \sigma^2 \end{align} \]

Because the error \( \boldsymbol{\epsilon}' = \mathbf{S}^{-1} \boldsymbol{\epsilon} \), \[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I} \end{align} \] the transformation allows us to perform the our usual diagnostics.

Specifically, the residuals are likewise transformed along with the other variables to give,

\[ \begin{align} \hat{\boldsymbol{\epsilon}}'&\triangleq \mathbf{Y}' - \hat{\mathbf{Y}}' \\ &=\mathbf{S}^{-1} \hat{\boldsymbol{\epsilon}} \end{align} \]

where the residuals \( \hat{\boldsymbol{\epsilon}}' \) will have the same properties as usual when \( \boldsymbol{\Sigma} \) is chosen correctly.

- Therefore, our normal diagnostics applied to \( \hat{\boldsymbol{\epsilon}}' \) will remain the same.

However, the issue is typically that \( \boldsymbol{\Sigma} \) is unkown in practice, except in special cases, and typically must be estimated.

- As noted before, the correlation structure will have much to do with the problem at hand, e.g.,
- spatial correlations in spatial data;
- auto-correlations in time series data;
- batch correlations in batch collected data.

As a special case of generalized least squares, we can consider the case when the errors are uncorrelated, but have unequal variances.

That is to say, \( \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \boldsymbol{\Sigma} \) where,

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \] for possibly \( w_i \neq w_j \), but \( \boldsymbol{\Sigma} \) is strictly diagonal.

Here, we will refer to the reciprocals of the individual variances, \( w_i \), as the

**weights**.

**Q:**in the case

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \] what is the square root of \( \boldsymbol{\Sigma} \)?**A:**in this case, we can compute the square root identically by the component-wise square root, as diagonal matrices are a special class of matrices that are already in eigen-decomposition,\[ \mathbf{S} = \begin{pmatrix} \frac{1}{\sqrt{w_1}}& 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{w_2}} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{w_n}} \end{pmatrix} \]

As a special case of generalized least squares, we can compute the transformation into the variables \( \mathbf{X}', \mathbf{Y}', \boldsymbol{\epsilon}' \) as

\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} = \mathrm{diag}\left(\sqrt{w_i} \right)\mathbf{X}, \\ \mathbf{Y}' & \triangleq \mathbf{S}^{-1} \mathbf{Y} = \mathrm{diag}\left(\sqrt{w_i} \right) \mathbf{Y}, \\ \boldsymbol{\epsilon}' & \triangleq \mathbf{S}^{-1} \boldsymbol{\epsilon} = \mathrm{diag}\left(\sqrt{w_i} \right)\boldsymbol{\epsilon}; \end{align} \]

- By direct matrix computation, we see that the
**\( i \)-th row**of the matrix \( \mathbf{X} \) is scaled by the inverse of the standard deviation of the \( i \)-th observation.

- By direct matrix computation, we see that the
We note, the column corresponding to the intercept term in the matrix \( \mathbf{X} \) will be a vector of the form,

\[ \begin{align} \begin{pmatrix} \sqrt{w_1}, &\cdots, & \sqrt{w_n}\end{pmatrix}^\mathrm{T}, \end{align} \] instead of the usual vector of ones.

Qualitatively, we can note that

- the cases of standard deviation
\[ \begin{align} \frac{1}{\sqrt{w_i}} \gg 1 \end{align} \]

will receive low weight in the regression \( \sqrt{w_i} \ll 1 \), while

- the cases of low standard deviation
\[ \begin{align} \frac{1}{\sqrt{w_i}} \ll 1 \end{align} \]

will receive high weight in the regression \( \sqrt{w_i} \gg 1 \)

- the cases of standard deviation

- As some general considerations with weighted least squares, we have the following heuristic rules:
- When the errors are proportional to the predictor \( X_i \), i.e., \( \mathrm{var}(\epsilon_i) \propto X_i \) suggests the weights \( w_i = X_i^{-1} \). This might be suggested by, e.g., seeing a positive relationship in \( \vert \hat{\epsilon}_i \vert \) versus the predictor \( X_i \).
- This would thus result in the transformation by \( \mathbf{S}^{-1} = \mathrm{diag}\left(\frac{1}{\sqrt{X_i}}\right) \).
- When the response \( Y_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij} \) is the average of \( n_i \) independent sub-observations, each with variance \( \sigma^2 \), then \( \mathrm{var}(Y_i) =\mathrm{var}\left(\frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}\right) = \frac{\sigma^2}{n_i} \).
- This suggests using \( w_i = n_i \) for each case, with the transformation by \( \mathbf{S}^{-1} = \mathrm{diag}\left(\frac{1}{\sqrt{n_i}}\right) \).
- Responses that are averages arise quite commonly, but take care that the variance in the response really is proportional to the group size.
- For example, consider the life expectancies for different countries.
- At first glance, one might consider setting the weights equal to the populations of the countries, but notice that there are many other sources of variation in life expectancy that would dwarf the population size effect.
- When the observed responses are already known to be of varying quality, weights may be assigned \( w_i = \frac{1}{s_i^2} \), the empirical variance of the response at predictor level \( X_i \).
- This would result in the transformation by \( \mathbf{S}^{-1} = \mathrm{diag}\left(\frac{1}{s_i}\right) \)
- We will examine how the re-weighting of the least-squares fit affects the outcome in an example…

In examples where the variances are unknown, we may consider an ansatz for the form of the

**dependence**of the variance on the observation.Let's consider data on the stopping distance of cars with respect to the observed speed.

```
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
lmod <- lm(dist ~ speed, cars)
plot(residuals(lmod) ~ speed, cars, cex=3, cex.lab=3, cex.axis=1.5)
```

- Plotting the residuals versus the predictor values, we see what appears to be increasing variance with respect to the increasing speed.

As an ansatz, we may consider a dependence relationship as,

\[ \begin{align} \sigma_i^2 = \gamma_0 + X_i^{\gamma_1} \end{align} \]

These coefficients, representing a power-law increase in the variance with the speed of the vehicle, can be estimated simultaneously with the parameters for the regression.

Here, we will use the generalized least squares function

`gls`

in the`mgcv`

library, with a function in the argument for the weights.

```
library("mgcv")
wlmod <- gls(dist ~ speed, data=cars, weight = varConstPower(1, form= ~ speed))
```

- The above assigns weights in terms of the assumption that the variance of the observations is a function of a constant, plus a power rule in relation to the value of the explanatory variable, as in the ansatz.

```
summary(wlmod)
```

```
Generalized least squares fit by REML
Model: dist ~ speed
Data: cars
AIC BIC logLik
412.8352 422.1912 -201.4176
Variance function:
Structure: Constant plus power of variance covariate
Formula: ~speed
Parameter estimates:
const power
3.160444 1.022368
Coefficients:
Value Std.Error t-value p-value
(Intercept) -11.085378 4.052378 -2.735524 0.0087
speed 3.484162 0.320237 10.879947 0.0000
Correlation:
(Intr)
speed -0.9
Standardized residuals:
Min Q1 Med Q3 Max
-1.4520579 -0.6898209 -0.1308277 0.6375029 3.0757014
Residual standard error: 0.7636833
Degrees of freedom: 50 total; 48 residual
```

```
summary(lm(dist ~ speed, data=cars))
```

```
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
```

Compared to the weighted least squares formulation, with the power-law increase in variance, we have a significantly poorer fit.

- This is evidenced in terms of the drastic difference in the residual standard error for each model:
- weighted \( \approx 0.76 \)
- ordinary \( \approx 15.38 \)