11/02/2020
When there is non-zero correlation of the errors (or non-constant variance), in certain cases we can successfully reduce the problem back to our normal regression framework with generalized least squares.
Particularly, let's assume that the error takes a particular (but more general) form,
\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]
for a known structure matrix \( \boldsymbol{\Sigma} \) for the error, but perhaps an unknown scale coefficient \( \sigma^2 \).
This describes a situation in which we have knowledge of the correlation and relative variance between the errors, but we don't know the absolute scale of the variation.
Let us recall some properties of positive-semidefinite, symmetric matrices: a matrix \( \boldsymbol{\Sigma} \) is symmetric when \( \boldsymbol{\Sigma}^\mathrm{T} = \boldsymbol{\Sigma} \), and positive semi-definite when \( \mathbf{x}^\mathrm{T} \boldsymbol{\Sigma} \mathbf{x} \geq 0 \) for every vector \( \mathbf{x} \).
All covariance matrices satisfy the above two properties by their construction. Therefore, it makes sense to say that \( \boldsymbol{\Sigma} \) has a well-defined square root,
\[ \begin{align} \sqrt{\boldsymbol{\Sigma}} \triangleq \mathbf{S} \end{align} \]
A simple choice of a square root is the symmetric square root,
\[ \begin{align} \mathbf{S} &\triangleq \mathbf{E} \mathrm{diag}\left(\sqrt{\lambda_i}\right) \mathbf{E}^\mathrm{T} \\ &= \mathbf{E} \begin{pmatrix} \sqrt{\lambda_1} & 0 & 0 &\cdots & 0 \\ 0 & \sqrt{\lambda_2} & 0 &\cdots & 0 \\ 0 & 0 & \sqrt{\lambda_3} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & \cdots & \sqrt{\lambda_n} \end{pmatrix} \mathbf{E}^\mathrm{T} \end{align} \]
where \( \lambda_i \) are the non-negative eigenvalues of \( \boldsymbol{\Sigma} \), and the columns of \( \mathbf{E} \) are the associated orthonormal eigenvectors.
By the orthogonality of the eigenvectors, it is easy to verify that \( \mathbf{S}^2 = \boldsymbol{\Sigma} \).
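As a quick numerical check, the following is a minimal sketch in R constructing the symmetric square root from the eigendecomposition; the matrix Sigma below is an arbitrary positive-definite choice made purely for illustration.
Sigma <- matrix(c(2, 1, 0,
                  1, 2, 1,
                  0, 1, 2), nrow=3)       # an arbitrary symmetric, positive-definite matrix
ev <- eigen(Sigma, symmetric=TRUE)        # Sigma = E diag(lambda) E^T
S  <- ev$vectors %*% diag(sqrt(ev$values)) %*% t(ev$vectors)
max(abs(S %*% S - Sigma))                 # numerically zero, verifying S^2 = Sigma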
However, for matrices, we can define the square root in several different ways with different properties. A special alternative type of square root is given by a Cholesky factorization…
When \( \boldsymbol{\Sigma} \) is positive-definite (all strictly positive \( \lambda_i \)) there is a unique decomposition of \( \boldsymbol{\Sigma} \) as,
\[ \begin{align} \boldsymbol{\Sigma} \triangleq \mathbf{L} \mathbf{L}^\mathrm{T} \end{align} \]
such that \( \mathbf{L} \) is lower triangular, with positive entries on the diagonal.
That is, the matrix \( \mathbf{L} \) is of the form,
\[ \begin{align} \mathbf{L} = \begin{pmatrix} L_{11} & 0 & 0 &\cdots & 0 \\ L_{21} & L_{22} & 0 &\cdots & 0 \\ L_{31} & L_{32} & L_{33} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ L_{n1} & \cdots & \cdots & \cdots & L_{nn} \end{pmatrix} \end{align} \]
When we allow \( \lambda_i=0 \) for some \( i \), the decomposition will not be unique but will still exist.
This is a weak generalization of the idea of a “square-root” where now we ask that \( \mathbf{S} \mathbf{S}^\mathrm{T} = \boldsymbol{\Sigma} \).
Cholesky decompositions are extremely useful in many areas of mathematics, and especially for our case where the covariance matrix will be strictly positive-definite.
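In R, a minimal sketch of the Cholesky factorization uses the built-in chol function; note that chol returns the upper-triangular factor, so the lower-triangular factor \( \mathbf{L} \) above corresponds to its transpose (here we reuse the illustrative Sigma from the previous sketch).
R <- chol(Sigma)                          # upper-triangular factor with t(R) %*% R = Sigma
L <- t(R)                                 # lower-triangular Cholesky factor
max(abs(L %*% t(L) - Sigma))              # numerically zero, verifying L L^T = Sigma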
Returning to the regression analysis, we suppose that
\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]
We will write \( \boldsymbol{\Sigma} = \mathbf{S} \mathbf{S}^\mathrm{T} \) in a Cholesky decomposition.
Then, considering our form for the signal
\[ \begin{align} \mathbf{Y}& = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}&\\ \Leftrightarrow \mathbf{S}^{-1} \mathbf{Y}& = \mathbf{S}^{-1}\mathbf{X} \boldsymbol{\beta} + \mathbf{S}^{-1}\boldsymbol{\epsilon}\\ \Leftrightarrow \mathbf{Y}' &= \mathbf{X}' \boldsymbol{\beta} + \boldsymbol{\epsilon}' \end{align} \]
for \( \mathbf{Y}',\mathbf{X}', \boldsymbol{\epsilon}' \) as defined by the relationship above.
Q: what is the mean of \( \boldsymbol{\epsilon}' \)?
A: this can be performed as usual, distributing the expectation through the non-random component,
\[ \begin{align} \mathbb{E}\left[ \boldsymbol{\epsilon}' \right] &= \mathbb{E} \left[ \mathbf{S}^{-1}\boldsymbol{\epsilon}\right] \\ &= \mathbf{S}^{-1} \mathbb{E}\left[ \boldsymbol{\epsilon}\right] \\ &= \boldsymbol{0} \end{align} \]
Q: assuming that \( \mathrm{cov}(\boldsymbol{\epsilon}) = \sigma^2 \boldsymbol{\Sigma} \), compute the covariance of \( \boldsymbol{\epsilon}'\triangleq \mathbf{S}^{-1}\boldsymbol{\epsilon} \).
A: noting that \( \boldsymbol{\epsilon}' \) is mean zero, then we can compute,
\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) &= \mathbb{E}\left[ \left(\boldsymbol{\epsilon}'\right) \left(\boldsymbol{\epsilon}'\right)^\mathrm{T}\right] \\ &=\mathbb{E}\left[\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)^\mathrm{T}\right]\\ &=\mathbf{S}^{-1}\mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\right] \mathbf{S}^{-\mathrm{T}}\\ &= \mathbf{S}^{-1} \sigma^2 \boldsymbol{\Sigma} \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2\mathbf{S}^{-1}\left(\mathbf{S} \mathbf{S}^\mathrm{T} \right) \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2 \mathbf{I} \end{align} \]
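As a quick simulation check of this identity (with an arbitrary structure matrix and \( \sigma^2 = 1 \)), we can draw many errors with covariance \( \boldsymbol{\Sigma} \), whiten them with \( \mathbf{S}^{-1} \), and inspect the sample covariance; this sketch is purely illustrative.
set.seed(3)
n <- 4
A <- matrix(rnorm(n * n), n)
Sigma <- crossprod(A) + diag(n)           # an arbitrary positive-definite structure matrix
S <- t(chol(Sigma))                       # lower-triangular square root, S %*% t(S) = Sigma
eps  <- S %*% matrix(rnorm(n * 1e5), n)   # 1e5 error draws, each with covariance Sigma
epsp <- solve(S, eps)                     # whitened errors S^{-1} eps
round(cov(t(epsp)), 2)                    # approximately the identity matrix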
Q: How might we use this analysis to transform the variables into something easier to work with? In particular, what variables might satisfy the assumptions of ordinary least squares?
A: Supposing that we can compute the variables,
\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} \\ \mathbf{Y}' &\triangleq \mathbf{S}^{-1} \mathbf{Y} \end{align} \]
the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies our usual Gauss-Markov assumptions, for error \( \boldsymbol{\epsilon}' \) such that \( \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I} \).
Because the assumptions of the Gauss-Markov theorem hold for the transformed variables, we can re-derive the least-squares solution minimizing the associated residual sum of squares error.
Specifically, we can write,
\[ \begin{align} \hat{\boldsymbol{\epsilon}}' &= \mathbf{Y}' - \hat{\mathbf{Y}}'\\ & =\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}'\right) \end{align} \]
for some choice of \( \hat{\boldsymbol{\beta}}' \).
This choice of \( \hat{\boldsymbol{\beta}}' \) should be the one that minimizes the objective function,
\[ \begin{align} J\left(\overline{\boldsymbol{\beta}}\right)&= \left(\hat{\boldsymbol{\epsilon}}'\right)^\mathrm{T}\left(\hat{\boldsymbol{\epsilon}}'\right) \\ & = \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\mathbf{S}^{-\mathrm{T}}\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)\\ &= \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\boldsymbol{\Sigma}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right) \end{align} \] for all choices of \( \overline{\boldsymbol{\beta}} \).
By substitution for our usual solution for the minimizer, with \( \mathbf{X}' \) and \( \mathbf{Y}' \), we find
\[ \begin{align} \hat{\boldsymbol{\beta}}' &\triangleq \left( \left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1}\left(\mathbf{X}'\right)^\mathrm{T} \mathbf{Y}'\\ &=\left(\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1}\mathbf{Y} \\ & =\left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1}\mathbf{Y}. \end{align} \]
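As a sanity check on the formula, the following is a minimal sketch comparing the explicit generalized least squares estimator with ordinary least squares on the whitened variables; the data and the correlation structure below are simulated purely for illustration.
set.seed(1)
n <- 50
X <- cbind(1, runif(n))                   # design matrix with an intercept column
Sigma <- 0.5^abs(outer(1:n, 1:n, "-"))    # an illustrative correlation structure
S <- t(chol(Sigma))                       # lower-triangular square root of Sigma
Y <- X %*% c(1, 2) + S %*% rnorm(n)       # errors with covariance Sigma
Sigma_inv <- solve(Sigma)
beta_gls  <- solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% Y)
Xp <- solve(S, X)                         # whitened variables X' = S^{-1} X
Yp <- drop(solve(S, Y))                   # and Y' = S^{-1} Y
beta_whitened <- coef(lm(Yp ~ Xp - 1))    # ordinary least squares on the transformed variables
cbind(beta_gls, beta_whitened)            # the two estimates agree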
Q: recalling that in ordinary least squares \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}\right)= \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\sigma^2 \), how can we compute \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) \)?
A: knowing that the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies the hypotheses of ordinary least squares, the simple method is substitution. I.e.,
\[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) &= \left(\left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1} \sigma^2\\ &=\left(\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1} \sigma^2\\ &= \left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X} \right)^{-1} \sigma^2 \end{align} \]
Because the error \( \boldsymbol{\epsilon}' = \mathbf{S}^{-1} \boldsymbol{\epsilon} \) satisfies \[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I}, \end{align} \] the transformation allows us to perform our usual diagnostics.
Specifically, the residuals are likewise transformed along with the other variables to give,
\[ \begin{align} \hat{\boldsymbol{\epsilon}}'&\triangleq \mathbf{Y}' - \hat{\mathbf{Y}}' \\ &=\mathbf{S}^{-1} \hat{\boldsymbol{\epsilon}} \end{align} \]
where the residuals \( \hat{\boldsymbol{\epsilon}}' \) will have the same properties as usual when \( \boldsymbol{\Sigma} \) is chosen correctly.
However, the issue is that \( \boldsymbol{\Sigma} \) is unknown in practice, except in special cases, and typically must be estimated.
As a special case of generalized least squares, we can consider the case when the errors are uncorrelated, but have unequal variances.
That is to say, \( \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \boldsymbol{\Sigma} \) where,
\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \] for possibly \( w_i \neq w_j \), but \( \boldsymbol{\Sigma} \) is strictly diagonal.
Here, we will refer to the reciprocals of the individual variances, \( w_i \), as the weights.
Q: in the case
\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \]
what is the square root of \( \boldsymbol{\Sigma} \)?
A: in this case, we can compute the square root simply as the component-wise square root, because diagonal matrices are a special class of matrices that are already in their eigendecomposition,
\[ \mathbf{S} = \begin{pmatrix} \frac{1}{\sqrt{w_1}}& 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{w_2}} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{w_n}} \end{pmatrix} \]
As a special case of generalized least squares, we can compute the transformation into the variables \( \mathbf{X}', \mathbf{Y}', \boldsymbol{\epsilon}' \) as
\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} = \mathrm{diag}\left(\sqrt{w_i} \right)\mathbf{X}, \\ \mathbf{Y}' & \triangleq \mathbf{S}^{-1} \mathbf{Y} = \mathrm{diag}\left(\sqrt{w_i} \right) \mathbf{Y}, \\ \boldsymbol{\epsilon}' & \triangleq \mathbf{S}^{-1} \boldsymbol{\epsilon} = \mathrm{diag}\left(\sqrt{w_i} \right)\boldsymbol{\epsilon}; \end{align} \]
We note that the column corresponding to the intercept term in the transformed matrix \( \mathbf{X}' \) will be a vector of the form,
\[ \begin{align} \begin{pmatrix} \sqrt{w_1}, &\cdots, & \sqrt{w_n}\end{pmatrix}^\mathrm{T}, \end{align} \] instead of the usual vector of ones.
Qualitatively, we can note that observations with large standard deviation,
\[ \begin{align} \frac{1}{\sqrt{w_i}} \gg 1, \end{align} \]
will receive low weight in the regression (\( \sqrt{w_i} \ll 1 \)), while observations with small standard deviation,
\[ \begin{align} \frac{1}{\sqrt{w_i}} \ll 1, \end{align} \]
will receive high weight in the regression (\( \sqrt{w_i} \gg 1 \)).
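In R, this rescaling is effectively what the weights argument of lm performs; the following minimal sketch, with simulated data and arbitrary illustrative weights w, shows the equivalence with ordinary least squares on the transformed variables, including the rescaled intercept column.
set.seed(2)
n <- 40
x <- runif(n)
w <- runif(n, 0.5, 2)                     # known weights, with var(eps_i) proportional to 1/w_i
y <- 1 + 2 * x + rnorm(n, sd=1/sqrt(w))
coef(lm(y ~ x, weights=w))                # weighted least squares directly
# ordinary least squares on the rescaled variables; sqrt(w) plays the role of the intercept column
coef(lm(I(sqrt(w) * y) ~ 0 + sqrt(w) + I(sqrt(w) * x)))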
In examples where the variances are unknown, we may consider an ansatz for the form of the dependence of the variance on the observation.
Let's consider data on the stopping distance of cars with respect to the observed speed.
# widen the plot margins and adjust axis label placement for the slide figure
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
# ordinary least squares fit of stopping distance on speed
lmod <- lm(dist ~ speed, cars)
# plot the residuals against speed; the spread appears to grow with speed
plot(residuals(lmod) ~ speed, cars, cex=3, cex.lab=3, cex.axis=1.5)
As an ansatz, we may consider a dependence relationship in which the standard deviation of the error is a constant plus a power of the predictor,
\[ \begin{align} \mathrm{sd}\left(\epsilon_i\right) = \gamma_0 + X_i^{\gamma_1} \end{align} \]
These coefficients, representing a power-law increase in the variability with the speed of the vehicle, can be estimated simultaneously with the parameters for the regression.
Here, we will use the generalized least squares function gls from the nlme library (attached automatically when we load mgcv, which depends on it), with a variance function supplied in the argument for the weights.
library("mgcv")
wlmod <- gls(dist ~ speed, data=cars, weight = varConstPower(1, form= ~ speed))
summary(wlmod)
Generalized least squares fit by REML
Model: dist ~ speed
Data: cars
AIC BIC logLik
412.8352 422.1912 -201.4176
Variance function:
Structure: Constant plus power of variance covariate
Formula: ~speed
Parameter estimates:
const power
3.160444 1.022368
Coefficients:
Value Std.Error t-value p-value
(Intercept) -11.085378 4.052378 -2.735524 0.0087
speed 3.484162 0.320237 10.879947 0.0000
Correlation:
(Intr)
speed -0.9
Standardized residuals:
Min Q1 Med Q3 Max
-1.4520579 -0.6898209 -0.1308277 0.6375029 3.0757014
Residual standard error: 0.7636833
Degrees of freedom: 50 total; 48 residual
summary(lm(dist ~ speed, data=cars))
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Compared to the weighted least squares formulation with the power-law increase in variability, the ordinary least squares fit is significantly poorer, with noticeably larger standard errors on the coefficients.
In a second example, we will try to infer, based on the published vote totals of the 1981 French presidential election, how votes were transferred between candidates from the first round to the second.
In the dataset fpe we find the published vote totals in every fourth department of France.
library("faraway")
head(fpe)
EI A B C D E F G H J K A2 B2 N
Ain 260 51 64 36 23 9 5 4 4 3 3 105 114 17
Alpes 75 14 17 9 9 3 1 2 1 1 1 32 31 5
Ariege 107 27 18 13 17 2 2 2 1 1 1 57 33 6
Bouches.du.Rhone 1036 191 204 119 205 29 13 13 10 10 6 466 364 30
Charente.Maritime 367 71 76 47 37 8 34 5 4 4 2 163 142 17
Cotes.du.Nord 396 93 90 57 54 13 5 9 4 3 5 193 155 15
The variables A and B correspond to the vote totals for Mitterrand and Giscard in the first round, respectively; A2 and B2 correspond to their second-round vote totals. The variable EI is the number of registered voters, and all numbers are measured in thousands.
The difference between the number of voters in the second round and the first is given by the variable N.
The variable that we will model as the response is A2, the number of second-round votes given to Mitterrand.
Suppose we can write the signal as an exact equation for the national vote totals
\[ \begin{align} A_2 = \beta_A A+ \beta_B B+\beta_C C+\beta_D D+\beta_E E+\beta_F F+\beta_G G+\beta_H H+\beta_J J+\beta_K K+\beta_N N \end{align} \] where each \( \beta_i \) represents a true parameter, giving the proportion of votes transferred to Mitterrand in the second round.
This is different from the regression equation, defined by
\[ \begin{align} Y_i = \mathbf{x}_i^\mathrm{T} \boldsymbol{\beta} + \epsilon_i, \end{align} \]
where for \( \mathbf{x}_i \) (the vote totals of a particular department) there will be a mismatch (\( \epsilon_i \)) between the proportion of votes transferred according to the national totals (\( \beta_i \)) and the local totals.
The error term will likely have different variance with respect to different departments.
Since the weights for each department should be inversely proportional to the error variance, and the variance of the vote totals plausibly scales with the size of the department, we may set the weights equal to 1/EI.
Note, it is natural in this equation to neglect the intercept, as there should be no vote transfer when there are no voters.
# weighted least squares with weights 1/EI and no intercept term
lmod <- lm(A2 ~ A+B+C+D+E+F+G+H+J+K+N-1, fpe, weights=1/EI)
coef(lmod)
A B C D E F G
1.0671302 -0.1050507 0.2459577 0.9261878 0.2493970 0.7551100 1.9722124
H J K N
-0.5662165 0.6116417 1.2106584 0.5293527
# for comparison, the unweighted (ordinary) least squares coefficients
lm(A2 ~ A+B+C+D+E+F+G+H+J+K+N-1, fpe)$coef
A B C D E F G
1.0751478 -0.1245589 0.2574465 0.9045371 0.6706768 0.7825333 2.1656554
H J K N
-0.8542876 0.1444185 0.5181332 0.5582718
However, there is an issue with the physical interpretability of the model: the transfer proportions should lie between zero and one.
To address this, we can logically remove the voters for Giscard from the equation, and likewise automatically set the vote transfer for Mitterrand's own first-round voters to one (assuming that each bloc votes once again for the same candidate in the second round).
As an ad hoc fix for the other parameters greater than one, we set these exactly to one (along with the parameter for Mitterrand) using the offset function; likewise, for the other coefficients that are negative, we set these to zero by removing them from the regression equation:
# offset(A+G+K) fixes those transfer proportions at exactly one
lmod <- lm(A2 ~ offset(A+G+K)+C+D+E+F+N-1, fpe, weights=1/EI)
coef(lmod)
C D E F N
0.2257726 0.9699765 0.3902044 0.7442401 0.6085392
The above coefficients suggest a pattern of vote transfers that is roughly true to the historical understanding of the election.
We note that the analysis we performed on the 1981 French presidential election can be carried out more rigorously with a form of constrained least squares; using, e.g., the mgcv library, we can explicitly restrict the parameters to the range \( 0 \leq \hat{\boldsymbol{\beta}}_i \leq 1 \), as sketched below.
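The following is a minimal sketch of such a constrained fit, using box-constrained optimization (optim with method "L-BFGS-B") on the weighted residual sum of squares rather than mgcv's own constrained solver; it illustrates the idea only, and is not the approach of the original analysis.
X <- as.matrix(fpe[, c("A","B","C","D","E","F","G","H","J","K","N")])
y <- fpe$A2
w <- 1 / fpe$EI
wrss <- function(beta) sum(w * (y - X %*% beta)^2)    # weighted residual sum of squares
fit <- optim(rep(0.5, ncol(X)), wrss, method="L-BFGS-B", lower=0, upper=1)
round(setNames(fit$par, colnames(X)), 3)              # transfer proportions constrained to [0, 1]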
The main idea of generalized least squares is that the Gauss-Markov theorem holds theoretically for any system for which we have \( \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \),
as long as we can make a linear transformation of the variables by an inverse-square-root of the structure matrix.
Least squares is thus still an optimal solution in the sense of the BLUE, with respect to the transformed variables.
Generally, however, a structure matrix as above is unknown and difficult to estimate in practice, except in special cases like those discussed in the examples above.
Next time, we will discuss a more systematic approach to nonlinear transformations.