10/21/2020
Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
We will now begin our discussion on diagnosing issues with our error covariance assumptions.
If we wish to check the assumptions on the error or variation in the signal \( \boldsymbol{\epsilon} \), we must account for the fact that \( \boldsymbol{\epsilon} \) itself is not observable.
Q: What proxy could we consider for the error?
Recall the definition of \( \hat{\mathbf{Y}} \):
\[ \begin{align} \hat{\mathbf{Y}} &\triangleq \mathbf{X}\left(\mathbf{X}^\mathrm{T} \mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \mathbf{Y} \\ & =\mathbf{H} \mathbf{Y} \end{align} \]
Therefore, if we compute the residuals,
\[ \begin{align} \hat{\boldsymbol{\epsilon}} & = \mathbf{Y} - \hat{\mathbf{Y}} \\ & =\left(\mathbf{I} - \mathbf{H}\right)\mathbf{Y} \\ & =\left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} \boldsymbol{\beta} + \left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon} \end{align} \]
From the last slide
\[ \begin{align} \hat{\boldsymbol{\epsilon}} & =\left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} \boldsymbol{\beta} + \left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon} \end{align} \]
Q: Recalling the definition of \( \mathbf{H} = \mathbf{X}\left(\mathbf{X}^\mathrm{T} \mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \), what does \( \mathbf{H}\mathbf{X} \) equal?
A: \( \mathbf{H} \) is the projection operator onto the span of the design matrix, and thus \( \mathbf{H}\mathbf{X} = \mathbf{X} \) by construction.
Q: Given the above, what does \( \left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} \) equal?
A: The above must equal zero, since \( \mathbf{I}\mathbf{X} = \mathbf{H}\mathbf{X} = \mathbf{X} \).
From the previous two exercises, we can deduce,
\[ \begin{align} \hat{\boldsymbol{\epsilon}} = \left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon} \end{align} \]
We take the assumption that \( \boldsymbol{\epsilon}\sim N(0, \mathbf{I} \sigma^2) \),
\[ \begin{align} \mathbb{E}[\hat{\boldsymbol{\epsilon}}] &= \mathbb{E}\left[\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\right] \\ &=\left(\mathbf{I} - \mathbf{H}\right)\mathbb{E}\left[\boldsymbol{\epsilon}\right]\\ &= 0 \end{align} \]
\[ \begin{align} \mathbb{E}\left[\left(\hat{\boldsymbol{\epsilon}}\right) \left(\hat{\boldsymbol{\epsilon}}\right)^\mathrm{T}\right] &= \mathbb{E}\left[\left(\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\right)\left(\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\right)^\mathrm{T}\right] \\ &=\mathbb{E}\left[\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\left(\mathbf{I} - \mathbf{H}\right)^\mathrm{T}\right]\\ &=\left(\mathbf{I} - \mathbf{H}\right)\mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\right]\left(\mathbf{I} - \mathbf{H}\right)^\mathrm{T}\\ & =\left(\mathbf{I} - \mathbf{H}\right)\left[\sigma^2 \mathbf{I}\right]\left(\mathbf{I} - \mathbf{H}\right)\\ &=\sigma^2\left(\mathbf{I} - \mathbf{H}\right) \end{align} \]
where the final two equalities use the fact that \( \mathbf{I} - \mathbf{H} \) is symmetric and idempotent.
Q: Why is the last slide relevant? Particularly, why should we be concerned with the covariance of the residuals?
Note, even though the errors are assumed to be independent and of equal variance \( \sigma^2 \), the same doesn't hold for the residuals.
Nonetheless, we use the residuals as a proxy to understand how the true errors, which are unobservable, are behaving.
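As a small illustration (a sketch with a simulated design matrix, not part of the lecture), we can verify numerically that \( \left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} = 0 \) and that \( \mathbf{I} - \mathbf{H} \) is neither diagonal nor of constant diagonal:
# Sketch: verify (I - H) X = 0 and inspect I - H for a small simulated design
set.seed(1)
n <- 10; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design matrix with intercept
H <- X %*% solve(t(X) %*% X) %*% t(X)                # hat (projection) matrix
max(abs((diag(n) - H) %*% X))                        # numerically zero
round((diag(n) - H)[1:4, 1:4], 3)                    # off-diagonals non-zero, diagonal unequal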
[Figures courtesy of: Faraway, J. Linear Models with R. 2nd Edition]
library("faraway")
lmod_savings <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
sumary(lmod_savings)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.56608654 7.35451611 3.8842 0.0003338
pop15 -0.46119315 0.14464222 -3.1885 0.0026030
pop75 -1.69149768 1.08359893 -1.5610 0.1255298
dpi -0.00033690 0.00093111 -0.3618 0.7191732
ddpi 0.40969493 0.19619713 2.0882 0.0424711
n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
We saw in the last lecture that the residuals do not diverge strongly from the Gaussian assumption, but some of the predictors themselves appear to have skew or multimodal distributions.
We will be interested now in examining our error covariance assumptions.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(fitted(lmod_savings),residuals(lmod_savings),xlab="Fitted",ylab="Residuals", cex=3, cex.lab=3, cex.axis=3)
abline(h=0)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(fitted(lmod_savings),sqrt(abs(residuals(lmod_savings))), xlab="Fitted",ylab=expression(sqrt(hat(epsilon))), cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(savings$pop15,residuals(lmod_savings), xlab="Population under 15", ylab="Residuals", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)
We note that there are two distinct groups visible in the plot of residuals against the population under 15.
This reflects the multi-modality that was previously observed in the Q-Q plots.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(savings$pop75,residuals(lmod_savings),xlab="Population over 75",ylab="Residuals", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)
Suppose we have two independent samples drawn from two Gaussian distributions with unknown variances, i.e.,
\[ \begin{align} &N(\mu_X , \sigma^2_X) & & N(\mu_Z , \sigma^2_Z) \end{align} \]
Then, we may compute the sample-based means of the random variables above as,
\[ \begin{align} \overline{X} &\triangleq \frac{1}{n}\sum_{i=1}^n X_i & & \overline{Z} \triangleq \frac{1}{m}\sum_{i=1}^m Z_i \end{align} \]
and the sample-based variances as,
\[ \begin{align} S_X^2 &\triangleq \frac{1}{n-1} \sum_{i=1}^n\left(X_i - \overline{X}\right)^2 & & S_Z^2 \triangleq \frac{1}{m-1}\sum_{i=1}^m \left(Z_i - \overline{Z}\right)^2 \end{align} \]
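The step the comparison relies on (a standard result, stated here for completeness) is that, under the null hypothesis \( \sigma^2_X = \sigma^2_Z \), the ratio of the sample variances follows an F distribution,
\[ \begin{align} F \triangleq \frac{S_X^2}{S_Z^2} \sim F_{n-1, m-1}, \end{align} \]
so we reject the hypothesis of equal variances when the observed ratio lies far in either tail of this distribution.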
Courtesy of IkamusumeFan CC BY-SA 4.0
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
# Compare residual variances between countries with pop15 > 35 and pop15 < 35
var.test(residuals(lmod_savings)[savings$pop15>35], residuals(lmod_savings)[savings$pop15<35])
F test to compare two variances
data: residuals(lmod_savings)[savings$pop15 > 35] and residuals(lmod_savings)[savings$pop15 < 35]
F = 2.7851, num df = 22, denom df = 26, p-value = 0.01358
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.240967 6.430238
sample estimates:
ratio of variances
2.785067
We see two dual measures of significance for the F-test of variances above: the p-value of approximately 0.014 lies below the \( 5\% \) threshold, and the \( 95\% \) confidence interval for the ratio of variances, roughly \( (1.24, 6.43) \), does not contain 1.
This suggests that weighted least squares may correct the issues with non-constant variance; we will discuss this later in the course.
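As a rough preview (not the lecture's method, and the weights below are only hypothetical), weighted least squares can be specified in R through the weights argument of lm, for example by down-weighting the higher-variance group identified above:
# Hypothetical illustration only: inverse-variance weights based on the
# estimated variance ratio from var.test above (about 2.79)
w <- ifelse(savings$pop15 > 35, 1 / 2.785, 1)
lmod_wls <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings, weights = w)
sumary(lmod_wls)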
There aren't very general methods for testing for correlation of the errors \( \boldsymbol{\epsilon} \); i.e., if we write
\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right)\triangleq &\mathbb{E}\left[\boldsymbol{\epsilon} \boldsymbol{\epsilon}^\mathrm{T}\right] = \boldsymbol{\Sigma}^2, \end{align} \]
it is generally difficult to test, based on the residuals, whether the off-diagonal entries of \( \boldsymbol{\Sigma}^2 \) are non-zero.
The case of non-constant variance considered earlier corresponds instead to
\[ \boldsymbol{\Sigma}^2 = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_n^2 \end{pmatrix} \] for possibly different \( \sigma_i \), but strictly diagonal.
Recall that even when \( \boldsymbol{\Sigma}^2 = \sigma^2 \mathbf{I} \), the residuals are not generally uncorrelated as,
\[ \begin{align} \mathrm{cov} \left( \hat{\boldsymbol{\epsilon}} \right) = \sigma^2 \left(\mathbf{I} - \mathbf{H}\right) \end{align} \]
Ice core sample hole. Image courtesy of NOAA - National Centers for Environmental Information
globwarm
head(globwarm, 4)
nhtemp wusa jasper westgreen chesapeake tornetrask urals mongolia tasman
1000 NA -0.66 -0.03 0.03 -0.66 0.33 -1.49 0.83 -0.12
1001 NA -0.63 -0.07 0.09 -0.67 0.21 -1.44 0.96 -0.17
1002 NA -0.60 -0.11 0.18 -0.67 0.13 -1.39 0.99 -0.22
1003 NA -0.55 -0.14 0.30 -0.68 0.08 -1.34 0.95 -0.26
year
1000 1000
1001 1001
1002 1002
1003 1003
nhtemp - is the average northern hemisphere temperature, in degrees Celsius, which only has data available back to 1856 (thus the NA values).
westgreen - is the ice core proxy information from western Greenland.
chesapeake - is sea shell proxy information from Chesapeake Bay.
tornetrask - is tree ring proxy information from Sweden.
urals - is tree ring proxy information from the Ural mountains.
tail(globwarm, 4)
nhtemp wusa jasper westgreen chesapeake tornetrask urals mongolia tasman
1997 0.48 0.82 0.35 -0.23 -0.32 0.56 0.6 1.33 0.50
1998 0.66 0.82 0.34 -0.21 -0.07 0.44 0.6 1.30 0.49
1999 0.46 0.84 0.32 -0.20 0.17 0.37 0.6 1.24 0.49
2000 0.40 0.88 0.31 -0.19 0.39 0.34 0.6 1.13 0.50
year
1997 1997
1998 1998
1999 1999
2000 2000
mongolia - is tree ring proxy information from Mongolia.
tasman - is tree ring proxy information from Tasmania.
In the tail, we have accurate temperature measurements into the year 2000.
The default behavior in R is to omit any rows with missing values, so the model below is fit directly using only the years 1856-2000:
lmod <- lm(nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman, globwarm)
Q: What kind of plot would make sense in this instance to study the correlation of the errors in time?
A: We can once again plot the residuals, but in this context versus the time variable.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(residuals(lmod) ~ year, na.omit(globwarm), ylab="Residuals", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)
Given the last diagnostic, it appears that the residuals are positively correlated in time.
Therefore, we will plot each residual \( \hat{\epsilon}_{i+1} \) against the previous residual \( \hat{\epsilon}_i \):
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
n <- length(residuals(lmod))
plot(tail(residuals(lmod),n-1) ~ head(residuals(lmod),n-1), xlab=expression(hat(epsilon)[i]),ylab=expression(hat(epsilon)[i+1]), cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0,v=0,col=grey(0.75))
The former plot strongly suggests a linear relationship between the residuals at subsequent time steps.
To verify our visual inspection numerically, we fit a linear model (without intercept) relating the residuals at consecutive time steps:
sumary(lm(head(residuals(lmod),n-1) ~ tail(residuals(lmod),n-1) -1))
Estimate Std. Error t value Pr(>|t|)
tail(residuals(lmod), n - 1) 0.571578 0.066575 8.5855 1.391e-14
n = 144, p = 1, Residual SE = 0.13640, R-Squared = 0.34
The \( R^2 \) value is somewhat small in this case, indicating that the variation in the residuals isn't fully explained by the lag-1 relationship alone.
However, the p-value indicates that there is some relationship between the residuals at consecutive time steps;
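As a quick complementary check (not part of the original analysis), the lag-1 sample correlation of the residuals can be computed directly:
# lag-1 sample correlation between consecutive residuals
cor(tail(residuals(lmod), n-1), head(residuals(lmod), n-1))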
If we wish to evaluate the previous question formally in terms of hypothesis testing, we can consider the Durbin-Watson test.
Generally speaking, the Durbin-Watson test uses a test statistic to study the lag-1 autocorrelation of the residuals from a regression analysis.
Suppose we have a linear relationship defined in terms of the lag-1 residuals, i.e.,
\[ \begin{align} \hat{\epsilon}_t = \rho \hat{\epsilon}_{t-1} + \nu_t \end{align} \]
where \( \rho \) is the lag-1 autocorrelation coefficient and \( \nu_t \) is a mean-zero noise term; the Durbin-Watson test evaluates the null hypothesis \( \rho = 0 \) against the alternative \( \rho > 0 \).
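For completeness (a standard definition, not displayed in the original slide), the Durbin-Watson statistic is computed from the residuals as
\[ \begin{align} DW \triangleq \frac{\sum_{t=2}^n \left(\hat{\epsilon}_t - \hat{\epsilon}_{t-1}\right)^2}{\sum_{t=1}^n \hat{\epsilon}_t^2}, \end{align} \]
taking values between 0 and 4; values near 2 indicate little lag-1 autocorrelation, while values toward 0 indicate positive autocorrelation.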
require("lmtest")
dwtest(nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman, data=globwarm)
Durbin-Watson test
data: nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman
DW = 0.81661, p-value = 1.402e-15
alternative hypothesis: true autocorrelation is greater than 0
The p-value is effectively zero, so we reject the null hypothesis at the \( 5\% \) significance level and conclude that the observations are positively correlated in time.
We will evaluate later how to deal with some of the issues with correlation in observations with generalized least squares.
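As a hedged preview of that later material (a sketch only, not this lecture's worked solution), one standard option in R is generalized least squares with an AR(1) error structure, available through gls in the nlme package:
require("nlme")
# Sketch: GLS fit with AR(1)-correlated errors indexed by year;
# gls does not drop missing values by default, so we apply na.omit explicitly
glmod <- gls(nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman, correlation=corAR1(form=~year), data=na.omit(globwarm))
summary(glmod)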
There aren't very general means to test for correlation between cases; instead, we will usually need to appeal to expert knowledge of the structure of the problem.
Particularly, we note there are common types of correlation between cases arising from how the data are collected: