Generalized least squares

11/02/2020

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- Correlated errors
- Matrix square roots
- Generalized least squares
- Weighted least squares
- Simple example of weighted least squares
- Detailed example of weighted least squares

Correlated errors

When there is non-zero correlation of the error (or non-constant variance), in certain cases we can successfully reduce this back to our normal regression framework with generalized least squares.
Particularly, let's assume that the error takes a particular (but more general) form,

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]

for a known structure matrix for the error \( \boldsymbol{\Sigma} \) but perhaps an unkown scale coefficient \( \sigma^2 \).
This describes a sittuation in which we have knowledge of the correlation and relative variance between the errors, but we don't know the absolute scale of the variation.
Let us recall some properties of postive-semidefinite, symmetric matrices:
1. Positive-semidefinite matrices (by definition) have all non-negative eigenvalues.
2. Symmetric matrices (by the spectral theorem) must have a decomposition into an orthogonal eigenbasis, one for which each basis vector is a principle axis corresponding to one of the eigenvalue scales of growth or decay.
All covariance matrices satisfy the above two properties by their construction. Therefore, it makes sense to say that \( \boldsymbol{\Sigma} \) has a well defined square root,

\[ \begin{align} \sqrt{\boldsymbol{\Sigma}} \triangleq \mathbf{S} \end{align} \]

Cholesky decomposition

A simple choice of a square root is the symmetric square root,

\[ \begin{align} \mathbf{S} &\triangleq \mathbf{E} \mathrm{diag}\left(\sqrt{\lambda_i}\right) \mathbf{E}^\mathrm{T} \\ &= \mathbf{E} \begin{pmatrix} \sqrt{\lambda_1} & 0 & 0 &\cdots & 0 \\ 0 & \sqrt{\lambda_2} & 0 &\cdots & 0 \\ 0 & 0 & \sqrt{\lambda_3} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & \cdots & \sqrt{\lambda_n} \end{pmatrix} \mathbf{E}^\mathrm{T} \end{align} \]

where \( \lambda_i \) are the non-negative eigenvalues, and \( \mathbf{E} \) are the associated eigenvectors, of \( \boldsymbol{\Sigma} \).
By the orthogonality of the eigenvectors, it is easy to verify that \( \mathbf{S}^2 = \boldsymbol{\Sigma} \).
However, for matrices, we can define the square root in several different ways with different properties. A special alternative type of square root is given by a Cholesky factorization…

Cholesky decomposition continued…

When \( \boldsymbol{\Sigma} \) is positive-definite (all strictly positive \( \lambda_i \)) there is a unique decompostition of \( \mathbf{\Sigma} \) as,

\[ \begin{align} \boldsymbol{\Sigma} \triangleq \mathbf{L} \mathbf{L}^\mathrm{T} \end{align} \]

such that \( \mathbf{L} \) is lower triangular, with positive entries on the diagonal.
That is, the matrix \( \mathbf{L} \) is of the form,

\[ \begin{align} \mathbf{L} = \begin{pmatrix} L_{11} & 0 & 0 &\cdots & 0 \\ L_{21} & L_{22} & 0 &\cdots & 0 \\ L_{31} & L_{31} & L_{33} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ L_{n1} & \cdots & \cdots & \cdots & L_{nn} \end{pmatrix} \end{align} \]
When we allow \( \lambda_i=0 \) for some \( i \), the decomposition will not be unique but will still exist.
This is a weak generalization of the idea of a “square-root” where now we ask that \( \mathbf{S} \mathbf{S}^\mathrm{T} = \boldsymbol{\Sigma} \).
Cholesky decompositions are extremely useful in many areas of mathematics, and especially for our case where the covariance matrix will be strictly positive-definite.

Correlated errors continued

Returning to the regression analysis, we suppose that

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \end{align} \]
We will write \( \boldsymbol{\Sigma} = \mathbf{S} \mathbf{S}^\mathrm{T} \) in a Cholesky decomposition.
Then, considering our form for the signal

\[ \begin{align} \mathbf{Y}& = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}&\\ \Leftrightarrow \mathbf{S}^{-1} \mathbf{Y}& = \mathbf{S}^{-1}\mathbf{X} \boldsymbol{\beta} + \mathbf{S}^{-1}\boldsymbol{\epsilon}\\ \Leftrightarrow \mathbf{Y}' &= \mathbf{X}' \boldsymbol{\beta} + \boldsymbol{\epsilon}' \end{align} \]

for \( \mathbf{Y}',\mathbf{X}', \boldsymbol{\epsilon}' \) as defined by the relationship above.
Q: what is the mean of \( \boldsymbol{\epsilon}' \)?
A: this can be performed as usual, distributing the expectation through the non-random component,

\[ \begin{align} \mathbb{E}\left[ \boldsymbol{\epsilon}' \right] &= \mathbb{E} \left[ \mathbf{S}^{-1}\boldsymbol{\epsilon}\right] \\ &= \mathbf{S}^{-1} \mathbb{E}\left[ \boldsymbol{\epsilon}\right] \\ &= \boldsymbol{0} \end{align} \]

Correlated errors continued

Q: assuming that \( \mathrm{cov}(\boldsymbol{\epsilon}) = \sigma^2 \boldsymbol{\Sigma} \), compute the covariance of \( \boldsymbol{\epsilon}'\triangleq \mathbf{S}^{-1}\boldsymbol{\epsilon} \).
A: noting that \( \boldsymbol{\epsilon}' \) is mean zero, then we can compute,

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) &= \mathbb{E}\left[ \left(\boldsymbol{\epsilon}'\right) \left(\boldsymbol{\epsilon}'\right)^\mathrm{T}\right] \\ &=\mathbb{E}\left[\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)\left(\mathbf{S}^{-1}\boldsymbol{\epsilon}\right)^\mathrm{T}\right]\\ &=\mathbf{S}^{-1}\mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\right] \mathbf{S}^{-\mathrm{T}}\\ &= \mathbf{S}^{-1} \sigma^2 \boldsymbol{\Sigma} \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2\mathbf{S}^{-1}\left(\mathbf{S} \mathbf{S}^\mathrm{T} \right) \mathbf{S}^{-\mathrm{T}} \\ &= \sigma^2 \mathbf{I} \end{align} \]
Q: How might we use this analysis to transform the variables into something easier to work with? In particular, what variables might satisfy the assumptions of ordinary least squares?
A: Supposing that we can compute the variables,

\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} \\ \mathbf{Y}' &\triangleq \mathbf{S}^{-1} \mathbf{Y} \end{align} \]

the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies our usual Gauss-Markov assumptions, for error \( \boldsymbol{\epsilon}' \) such that \( \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I} \).

Generalized least squares

Because the assumptions of the Gauss-Markov theorem hold for the transformed variables, we can re-derive the least-squares solution minimizing the associated residual sum of squares error.
Specifically, we can write,

\[ \begin{align} \hat{\boldsymbol{\epsilon}}' &= \mathbf{Y}' - \hat{\mathbf{Y}}'\\ & =\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}'\right) \end{align} \]

for some choice of \( \hat{\boldsymbol{\beta}}' \).
This choice of \( \hat{\boldsymbol{\beta}}' \) should be the one that minimizes the equation,

\[ \begin{align} J\left(\overline{\boldsymbol{\beta}}\right)&= \left(\hat{\boldsymbol{\epsilon}}'\right)^\mathrm{T}\left(\hat{\boldsymbol{\epsilon}}'\right) \\ & = \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\mathbf{S}^{-\mathrm{T}}\mathbf{S}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)\\ &= \left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right)^\mathrm{T}\boldsymbol{\Sigma}^{-1}\left(\mathbf{Y} - \mathbf{X} \overline{\boldsymbol{\beta}}\right) \end{align} \] for all choices of \( \overline{\boldsymbol{\beta}} \).
- Note that the least-squares objective function is now a weighted norm-squared in terms of the inverse structure matrix.

Generalized least squares

By substitution for our usual solution for the minimizer, with \( \mathbf{X}' \) and \( \mathbf{Y}' \), we find

\[ \begin{align} \hat{\boldsymbol{\beta}}' &\triangleq \left( \left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1}\left(\mathbf{X}'\right)^\mathrm{T} \mathbf{Y}'\\ &=\left(\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1}\mathbf{Y} \\ & =\left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1}\mathbf{Y}. \end{align} \]
Q: recalling that in ordinary least squares, \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}\right)= \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\sigma^2 \) is given how can we compute \( \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) \)
A: knowing that the regression of \( \mathbf{Y}' \) in terms of \( \mathbf{X}' \) satisfies the hypotheses of ordinary least squares, the simple method is substitution. I.e.,

\[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\beta}}'\right) &= \left(\left(\mathbf{X}'\right)^\mathrm{T} \left(\mathbf{X}'\right)\right)^{-1} \sigma^2\\ &=\left(\mathbf{X} \mathbf{S}^{-\mathrm{T}} \mathbf{S}^{-1} \mathbf{X}\right)^{-1} \sigma^2\\ &= \left(\mathbf{X}^\mathrm{T} \boldsymbol{\Sigma}^{-1} \mathbf{X} \right)^{-1} \sigma^2 \end{align} \]

Generalized least squares

Because the error \( \boldsymbol{\epsilon}' = \mathbf{S}^{-1} \boldsymbol{\epsilon} \), \[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}'\right) = \sigma^2 \mathbf{I} \end{align} \] the transformation allows us to perform the our usual diagnostics.
Specifically, the residuals are likewise transformed along with the other variables to give,

\[ \begin{align} \hat{\boldsymbol{\epsilon}}'&\triangleq \mathbf{Y}' - \hat{\mathbf{Y}}' \\ &=\mathbf{S}^{-1} \hat{\boldsymbol{\epsilon}} \end{align} \]

where the residuals \( \hat{\boldsymbol{\epsilon}}' \) will have the same properties as usual when \( \boldsymbol{\Sigma} \) is chosen correctly.
- Therefore, our normal diagnostics applied to \( \hat{\boldsymbol{\epsilon}}' \) will remain the same.
However, the issue is typically that \( \boldsymbol{\Sigma} \) is unkown in practice, except in special cases, and typically must be estimated.

As noted before, the correlation structure will have much to do with the problem at hand, e.g.,

spatial correlations in spatial data;
auto-correlations in time series data;
batch correlations in batch collected data.

Weighted least squares

As a special case of generalized least squares, we can consider the case when the errors are uncorrelated, but have unequal variances.
That is to say, \( \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \boldsymbol{\Sigma} \) where,

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \] for possibly \( w_i \neq w_j \), but \( \boldsymbol{\Sigma} \) is strictly diagonal.
Here, we will refer to the reciprocals of the individual variances, \( w_i \), as the weights.

Weighted least squares

Q: in the case
\[ \boldsymbol{\Sigma} = \begin{pmatrix} \frac{1}{w_1}& 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix} \] what is the square root of \( \boldsymbol{\Sigma} \)?
A: in this case, we can compute the square root identically by the component-wise square root, as diagonal matrices are a special class of matrices that are already in eigen-decomposition,

\[ \mathbf{S} = \begin{pmatrix} \frac{1}{\sqrt{w_1}}& 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{w_2}} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{w_n}} \end{pmatrix} \]

Weighted least squares

As a special case of generalized least squares, we can compute the transformation into the variables \( \mathbf{X}', \mathbf{Y}', \boldsymbol{\epsilon}' \) as

\[ \begin{align} \mathbf{X}' &\triangleq \mathbf{S}^{-1} \mathbf{X} = \mathrm{diag}\left(\sqrt{w_i} \right)\mathbf{X}, \\ \mathbf{Y}' & \triangleq \mathbf{S}^{-1} \mathbf{Y} = \mathrm{diag}\left(\sqrt{w_i} \right) \mathbf{Y}, \\ \boldsymbol{\epsilon}' & \triangleq \mathbf{S}^{-1} \boldsymbol{\epsilon} = \mathrm{diag}\left(\sqrt{w_i} \right)\boldsymbol{\epsilon}; \end{align} \]
- By direct matrix computation, we see that the \( i \)-th row of the matrix \( \mathbf{X} \) is scaled by the inverse of the standard deviation of the \( i \)-th observation.
We note, the column corresponding to the intercept term in the matrix \( \mathbf{X} \) will be a vector of the form,

\[ \begin{align} \begin{pmatrix} \sqrt{w_1}, &\cdots, & \sqrt{w_n}\end{pmatrix}^\mathrm{T}, \end{align} \] instead of the usual vector of ones.

Weighted least squares

Qualitatively, we can note that
1. the cases of standard deviation
  
  \[ \begin{align} \frac{1}{\sqrt{w_i}} \gg 1 \end{align} \]
  
  will receive low weight in the regression \( \sqrt{w_i} \ll 1 \), while
2. the cases of low standard deviation
  
  \[ \begin{align} \frac{1}{\sqrt{w_i}} \ll 1 \end{align} \]
  
  will receive high weight in the regression \( \sqrt{w_i} \gg 1 \)

Weighted least squares – heuristics

As some general considerations with weighted least squares, we have the following heuristic rules:

When the errors are proportional to the predictor \( X_i \), i.e., \( \mathrm{var}(\epsilon_i) \propto X_i \) suggests the weights \( w_i = X_i^{-1} \). This might be suggested by, e.g., seeing a positive relationship in \( \vert \hat{\epsilon}_i \vert \) versus the predictor \( X_i \).

This would thus result in the transformation by \( \mathbf{S}^{-1} = \mathrm{diag}\left(\frac{1}{\sqrt{X_i}}\right) \).

When the response \( Y_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij} \) is the average of \( n_i \) independent sub-observations, each with variance \( \sigma^2 \), then \( \mathrm{var}(Y_i) =\mathrm{var}\left(\frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}\right) = \frac{\sigma^2}{n_i} \).

This suggests using \( w_i = n_i \) for each case, with the transformation by \( \mathbf{S}^{-1} = \mathrm{diag}\left(\frac{1}{\sqrt{n_i}}\right) \).
Responses that are averages arise quite commonly, but take care that the variance in the response really is proportional to the group size.
For example, consider the life expectancies for different countries.

At first glance, one might consider setting the weights equal to the populations of the countries, but notice that there are many other sources of variation in life expectancy that would dwarf the population size effect.

When the observed responses are already known to be of varying quality, weights may be assigned \( w_i = \frac{1}{s_i^2} \), the empirical variance of the response at predictor level \( X_i \).

This would result in the transformation by \( \mathbf{S}^{-1} = \mathrm{diag}\left(\frac{1}{s_i}\right) \)

We will examine how the re-weighting of the least-squares fit affects the outcome in an example…

A simple example of weighted least squares

In examples where the variances are unknown, we may consider an ansatz for the form of the dependence of the variance on the observation.
Let's consider data on the stopping distance of cars with respect to the observed speed.

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
lmod <- lm(dist ~ speed, cars)
plot(residuals(lmod) ~ speed, cars,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-1

Plotting the residuals versus the predictor values, we see what appears to be increasing variance with respect to the increasing speed.

A simple example of weighted least squares

As an ansatz, we may consider a dependence relationship as,

\[ \begin{align} \sigma_i^2 = \gamma_0 + X_i^{\gamma_1} \end{align} \]
These coefficients, representing a power-law increase in the variance with the speed of the vehicle, can be estimated simultaneously with the parameters for the regression.
Here, we will use the generalized least squares function gls in the mgcv library, with a function in the argument for the weights.

library("mgcv")
wlmod <- gls(dist ~ speed, data=cars, weight = varConstPower(1, form= ~ speed))

The above assigns weights in terms of the assumption that the variance of the observations is a function of a constant, plus a power rule in relation to the value of the explanatory variable, as in the ansatz.

summary(wlmod)

Generalized least squares fit by REML
  Model: dist ~ speed 
  Data: cars 
       AIC      BIC    logLik
  412.8352 422.1912 -201.4176

Variance function:
 Structure: Constant plus power of variance covariate
 Formula: ~speed 
 Parameter estimates:
   const    power 
3.160444 1.022368 

Coefficients:
                 Value Std.Error   t-value p-value
(Intercept) -11.085378  4.052378 -2.735524  0.0087
speed         3.484162  0.320237 10.879947  0.0000

 Correlation: 
      (Intr)
speed -0.9  

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-1.4520579 -0.6898209 -0.1308277  0.6375029  3.0757014 

Residual standard error: 0.7636833 
Degrees of freedom: 50 total; 48 residual

summary(lm(dist ~ speed, data=cars))


Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Compared to the weighted least squares formulation, with the power-law increase in variance, we have a significantly poorer fit.
- This is evidenced in terms of the drastic difference in the residual standard error for each model:

A detailed example of weighted least squares

Claude Monet, painting of Rue Montorgueil, Paris, Festival of June 30, 1878

Claude Monet, Rue Montorgueil,
Paris, Festival of June 30, 1878.
Public Domain.

France follows run-off elections in which the leading two candidates in the first election will be voted on once again to decide a final outcome.
The intention of this election style is to allow voters to pick a “second-choice” candidate, if their first choice wasn’t successful in the first round.
In 1981 there were 10 candidates in the first round which was narrowed down to François Mitterand and Valéry Giscard-d’Estaing, who held the two highest vote totals in the first round.

Courtesy of Philippe Roos CC 2.0

Courtesy of White House Staff Photographers. Public domain.
This kind of voting is the subject to poltical coalition building in which candidates who are elminated in the first round can encourage their supporters to vote for one of the final round candidates.
However, the voting is done privately and anonymously, and voters don’t have to follow the advice of their first round pick.

A detailed example of weighted least squares

We will try to infer, based on published vote totals, how votes were transferred between candidates.
In the the dataset fpe we find the published vote totals in every fourth department for France.

library("faraway")
head(fpe)

                    EI   A   B   C   D  E  F  G  H  J K  A2  B2  N
Ain                260  51  64  36  23  9  5  4  4  3 3 105 114 17
Alpes               75  14  17   9   9  3  1  2  1  1 1  32  31  5
Ariege             107  27  18  13  17  2  2  2  1  1 1  57  33  6
Bouches.du.Rhone  1036 191 204 119 205 29 13 13 10 10 6 466 364 30
Charente.Maritime  367  71  76  47  37  8 34  5  4  4 2 163 142 17
Cotes.du.Nord      396  93  90  57  54 13  5  9  4  3 5 193 155 15

The variables A and B correspond to the vote totals for Mitterand and Giscard in the first round, respectively.
- The variables A2 and B2 correspond to their second round vote totals.
The variable EI is the number of registered votes, and all numbers are measured in thousands.
The difference between the number of voters in the second round and the first is given by the variable N.
- In this case, we will treat this as any other explanatory variable, though it could be treated differently in principle.

A detailed example of weighted least squares

The varible that we will model for the response is A2, the number of second round votes given to Mitterand.
Suppose we can write the signal an exact equation for the national vote totals

\[ \begin{align} A_2 = \beta_A A+ \beta_B B+\beta_C C+\beta_D D+\beta_E E+\beta_F F+\beta_G G+\beta_H H+\beta_J J+\beta_K K+\beta_N N \end{align} \] where each \( \beta_i \) represents the true parameters, giving the proportion of votes transferred to Mitterand in the second round.
This is different from the regression equation, defined by

\[ \begin{align} Y_i = \mathbf{x}_i^\mathrm{T} \boldsymbol{\beta} + \epsilon_i, \end{align} \]

where for \( \mathbf{x}_i \) (the vote totals of a particular department) there will be miss-match (\( \epsilon_i \)) between the proportion of votes transferred from the national totals (\( \beta_i \)) and the local totals.
The error term will likely have different variance with respect to different departments.
- Assuming (naively) that the total votes for Mitterand in the second round for a particular district is the sum of uncorrelated random variables (the first round candidate totals for a particular department), then
- we might treat the variance of the second round outcome for Mitterand in a district as proportional to the number of voters in the district;
- this is because the variance of the left side of the equation (\( Y_i \)) will be equal to a weighted sum of the variances of the right side (\( \mathbf{x}_i \)).

A detailed example of weighted least squares

If the weights for each department are to be inversely proportional to their variance, we may set the weights equal to 1/EI.
Note, it is natural in this equation to neglect the intercept, as there should be no vote transfer when there are no voters.

lmod <- lm(A2 ~ A+B+C+D+E+F+G+H+J+K+N-1, fpe, weights=1/EI)
coef(lmod)

         A          B          C          D          E          F          G 
 1.0671302 -0.1050507  0.2459577  0.9261878  0.2493970  0.7551100  1.9722124 
         H          J          K          N 
-0.5662165  0.6116417  1.2106584  0.5293527

Fitting the model without weights, we find a substantial difference,

lm(A2 ~ A+B+C+D+E+F+G+H+J+K+N-1, fpe)$coef

         A          B          C          D          E          F          G 
 1.0751478 -0.1245589  0.2574465  0.9045371  0.6706768  0.7825333  2.1656554 
         H          J          K          N 
-0.8542876  0.1444185  0.5181332  0.5582718

However, there is an issue about the physicallity of the model, where proportions should be between zero and one.
To fix this, we can logically remove the voters for Giscard and likewise automatically set the vote transfer for Mitterand to one (assuming that each block votes once again for the same candidate in the second round).

A detailed example of weighted least squares

As an ad hoc fix for the other parameters greater than one, we set these to one exactly (along with the parameter for Mitterand) with the “offset” function;
Likewise, for the other coefficients that are negative, we set these to zero (removing them from the regression equation):

lmod <- lm(A2 ~ offset(A+G+K)+C+D+E+F+N-1, fpe, weights=1/EI)
coef(lmod)

        C         D         E         F         N 
0.2257726 0.9699765 0.3902044 0.7442401 0.6085392

The above coefficients suggest (roughly true to historical understanding) that:
1. Nearly all communist party voters (D) supported Mitterand in the second round;
2. Surprisingly, a significant number \( \approx 20\% \) of Gaullist voters ( C ) voted for Mitterand, shifting the overall balance;
3. Ecology party voters (E) split their votes almost evenly in the second round;
4. Non-voters in the first round split their vote roughly evenly.
We note, the analysis we had on the French 1981 presidential performed more rigorously with a form of constrained least squares;
- That is, using the package mgcv, we can explicitly set the range for the parameters \( 0 \leq \hat{\boldsymbol{\beta}}_i \leq 1 \).
- This is discussed explicitly by Faraway.

A summary of main ideas

The main idea of generalized least squares is that the Gauss-Markov theorem holds theoretically for any system for which we have,
1. a linear signal \( \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \) that is well-defined;
2. the variation in the signal \( \mathbb{E}\left[\boldsymbol{\epsilon}\right]=0 \) and \( \mathrm{cov}\left(\boldsymbol{\epsilon}\right) = \sigma^2 \boldsymbol{\Sigma} \);
as long as we can make a linear transformation of the variables by an inverse-square-root of the structure matrix.
Least squares is thus still an optimal solution in the sense of the BLUE, with respect to the transformed variables.
Generally, a structure matrix as above is difficult to estimate in practice;
- however, there are many instances where we can make a strong improvement of our model by weighting cases as in weighted least squares.
Next time, we will discuss a more systematic approach to nonlinear transformations.