Diagnostics part 2: error covariance assumptions

10/21/2020


FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:
    • A review of the residuals and their covariance
    • Visual diagnostics of non-constant variances
    • The F-test for non-equal variances
    • Correlation between cases
    • The Durbin-Watson test for time correlation

Diagnostics

  • We are considering 4 categories of potential issues with the model:
    1. Issues with the Gaussian error assumption: our hypothesis testing studied thus far relies on the Gaussian error assumption.
    2. Issues with the form of the hypothetical covariance: we have assumed that \[ \mathrm{cov}\left(\mathbf{Y}\right) = \sigma^2\mathbf{I} \] but for many cases this will not be true.
    3. Issues with unusual observations: some of the observations may not look like others, and they might change the choice and the fit of the model.
    4. Issues with the systematic part of the model: we have assumed that there is an actual signal in the data of the form \[ \begin{align} \mathbb{E}[\mathbf{Y}] = \mathbf{X} \boldsymbol{\beta}, \end{align} \] which may not be valid.

Checking error covariance assumptions

  • We will now begin our discussion on diagnosing issues with our error covariance assumptions.

  • If we wish to check the assumptions on the error or variation in the signal \( \boldsymbol{\epsilon} \), we need to consider that \( \boldsymbol{\epsilon} \) itself is not observable.

  • Q: What proxy could we consider for the error?

    • A: The residuals are related to the error functionally, but have slightly different properties.
  • Recall the definition of \( \hat{\mathbf{Y}} \):

    \[ \begin{align} \hat{\mathbf{Y}} &\triangleq \mathbf{X}\left(\mathbf{X}^\mathrm{T} \mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \mathbf{Y} \\ & =\mathbf{H} \mathbf{Y} \end{align} \]

  • Therefore, if we compute the residuals,

\[ \begin{align} \hat{\boldsymbol{\epsilon}} & = \mathbf{Y} - \hat{\mathbf{Y}} \\ & =\left(\mathbf{I} - \mathbf{H}\right)\mathbf{Y} \\ & =\left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} \boldsymbol{\beta} + \left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon} \end{align} \]

Checking error covariance assumptions

  • From the last slide

    \[ \begin{align} \hat{\boldsymbol{\epsilon}} & =\left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} \boldsymbol{\beta} + \left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon} \end{align} \]

  • Q: recalling the definition of \( \mathbf{H} = \mathbf{X}\left(\mathbf{X}^\mathrm{T} \mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \), what does \( \mathbf{H}\mathbf{X} \) equal?

  • A: \( \mathbf{H} \) is the projection operator onto the span of the design matrix, and thus \( \mathbf{H}\mathbf{X} = \mathbf{X} \) by construction.

  • Q: given the above, what does \( \left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} \) equal?

  • A: the above must equal zero, as \( \mathbf{I}\mathbf{X} = \mathbf{H}\mathbf{X} = \mathbf{X} \).
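  • As a quick numerical sanity check (a minimal sketch using a small random design matrix, not part of the lecture code), we can verify \( \mathbf{H}\mathbf{X} = \mathbf{X} \) and \( \left(\mathbf{I} - \mathbf{H}\right)\mathbf{X} = \mathbf{0} \) directly in R:

set.seed(1)
X <- cbind(1, matrix(rnorm(20), nrow = 10, ncol = 2))  # illustrative design matrix: n = 10, p = 3
H <- X %*% solve(t(X) %*% X) %*% t(X)                  # hat matrix, the projection onto the span of X
max(abs(H %*% X - X))                                  # effectively zero, up to rounding error
max(abs((diag(10) - H) %*% X))                         # effectively zero, up to rounding error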

Checking error covariance assumptions

  • From the previous two exercises, we can deduce,

    \[ \begin{align} \hat{\boldsymbol{\epsilon}} = \left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon} \end{align} \]

  • We assume that \( \boldsymbol{\epsilon}\sim N(0, \mathbf{I} \sigma^2) \),

    • Q: given the above assumption, what is the mean of \( \hat{\boldsymbol{\epsilon}} \)?
    • A:

    \[ \begin{align} \mathbb{E}[\hat{\boldsymbol{\epsilon}}] &= \mathbb{E}\left[\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\right] \\ &=\left(\mathbf{I} - \mathbf{H}\right)\mathbb{E}\left[\boldsymbol{\epsilon}\right]\\ &= 0 \end{align} \]

    • Q: given the above assumption, what is the covariance of \( \hat{\boldsymbol{\epsilon}} \)?
    • A:

    \[ \begin{align} \mathbb{E}\left[\left(\hat{\boldsymbol{\epsilon}}\right) \left(\hat{\boldsymbol{\epsilon}}\right)^\mathrm{T}\right] &= \mathbb{E}\left[\left(\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\right)\left(\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\right)^\mathrm{T}\right] \\ &=\mathbb{E}\left[\left(\mathbf{I} - \mathbf{H}\right)\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\left(\mathbf{I} - \mathbf{H}\right)^\mathrm{T}\right]\\ &=\left(\mathbf{I} - \mathbf{H}\right)\mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\right]\left(\mathbf{I} - \mathbf{H}\right)^\mathrm{T}\\ & =\left(\mathbf{I} - \mathbf{H}\right)\left[\sigma^2 \mathbf{I}\right]\left(\mathbf{I} - \mathbf{H}\right)\\ &=\sigma^2\left(\mathbf{I} - \mathbf{H}\right) \end{align} \]
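  • As a hedged illustration (a minimal simulation sketch, not part of the lecture code), we can check this covariance numerically by simulating errors with covariance \( \sigma^2 \mathbf{I} \) and comparing the empirical second moment of the residuals with \( \sigma^2\left(\mathbf{I} - \mathbf{H}\right) \):

set.seed(2)
n <- 10; sigma <- 2
X <- cbind(1, rnorm(n))                                   # illustrative design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)                     # hat matrix
res <- replicate(5000, as.numeric((diag(n) - H) %*% rnorm(n, sd = sigma)))  # residuals (I - H) eps
emp_cov <- tcrossprod(res) / ncol(res)                    # Monte Carlo estimate of E[eps_hat eps_hat^T]
max(abs(emp_cov - sigma^2 * (diag(n) - H)))               # small, and shrinks as the number of replicates grows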

Checking error covariance assumptions

  • Q: Why is the last slide relevant? Particularly, why should we be concerned with the covariance of the residuals?

    • A: If the assumptions hold for \( \boldsymbol{\epsilon} \), then we can compare the covariance of the residuals to its theoretical value \( \sigma^2\left(\mathbf{I} - \mathbf{H}\right) \) for consistency.
    • Occasionally, we might actually have prior knowledge about the value of \( \sigma^2 \), which we can use in this comparison directly.
    • Otherwise, we have an unbiased estimator for \( \sigma^2 \) in terms of \[ \hat{\sigma}^2 = \frac{RSS}{n-p} \] which we can compare with the observed covariance of the residuals (see the short sketch below).
  • Note, even though the errors are assumed to be independent and of equal variance \( \sigma^2 \), the same doesn't hold for the residuals.

    • In particular, the operator \( \mathbf{I} - \mathbf{H} \) is not generally diagonal (so the residuals are correlated), nor are its diagonal values equal (so the residual variances don't all match).
  • Nonetheless, we use the residuals to understand how the true errors, which are unobservable, are behaving.
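  • As a brief illustration (a sketch using R's built-in cars data rather than the lecture's examples), the estimate \( \hat{\sigma}^2 = RSS/(n-p) \) can be computed directly from a fitted model and matches the residual standard error reported by R:

fit <- lm(dist ~ speed, data = cars)   # an arbitrary fit on R's built-in cars data, with p = 2
RSS <- sum(residuals(fit)^2)           # residual sum of squares
n <- nrow(cars); p <- length(coef(fit))
sqrt(RSS / (n - p))                    # hand-computed residual standard error, i.e., sigma-hat
summary(fit)$sigma                     # agrees with the value reported by R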

Visual diagnostics

Image of residuals versus fitted values with constant variance in the residuals over cases.

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

  • To the left is a plot where:
    1. The horizontal axis (x-axis) represents the fitted value \( \hat{\mathbf{Y}} \), as in some model \( \hat{\mathbf{Y}}=\mathbf{X}\hat{\boldsymbol{\beta}} \).
    2. The vertical axis (y-axis) represents the corresponding residual \( \hat{\boldsymbol{\epsilon}} \) value, equal to \( \mathbf{Y} - \hat{\mathbf{Y}} \).
  • Suppose that the variances of the error \( \boldsymbol{\epsilon} \) are not fixed, i.e.,
    • suppose that some observations have more variation and some observations have less variation around the signal, \( \mathbf{X}\boldsymbol{\beta} \).
  • In this case, there is dependence of the variation (or error \( \boldsymbol{\epsilon} \)) on the observation.
  • To the left, this is actually a well-behaved situation. In particular, the residuals show variation that doesn’t seem to depend on the value of the fitted value/ observation.
  • The mean of the residuals appears to be zero as well, because there isn’t a clear preference for positive or negative residuals.
  • The situation to the left is referred to as homoscedasticity.
  • Later, we will develop tests to determine if the variance is “close-enough-to-constant”, so that we are satisfied.
  • On the other hand, if there are non-constant variances (as discussed above), this will often show up as a pattern in the plot to the left.
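  • The following is a minimal simulated sketch (not the textbook figures) contrasting constant-variance residuals with residuals whose spread grows with the fitted value; the particular variances chosen are purely illustrative:

set.seed(3)
fitted_vals <- runif(100, 1, 10)                  # illustrative fitted values
res_const <- rnorm(100, sd = 1)                   # constant-variance (homoscedastic) residuals
res_grow  <- rnorm(100, sd = 0.4 * fitted_vals)   # spread grows with the fitted value (heteroscedastic)
par(mfrow = c(1, 2))
plot(fitted_vals, res_const, xlab = "Fitted", ylab = "Residuals", main = "Constant variance")
abline(h = 0)
plot(fitted_vals, res_grow, xlab = "Fitted", ylab = "Residuals", main = "Non-constant variance")
abline(h = 0)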

Visual diagnostics

Image of residuals versus fitted values with non-constant variance in the residuals over cases.

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

  • The non-constant variance of \( \hat{\boldsymbol{\epsilon}} \) to the left is known as heteroscedasticity.
  • In this case, there is a clear dependence of the variation of \( \hat{\boldsymbol{\epsilon}} \) on the fitted value/ the observation.
  • When the constant variance assumption is broken, the Gauss-Markov theorem no longer applies.
  • Heteroscedasticity does not, in itself, cause ordinary least squares coefficient estimates to be biased;
  • however, the usual estimates of the error variance (and therefore the standard errors of the coefficients) become biased.

Visual diagnostics

Image of residuals versus fitted values with non-constant variance in the residuals over cases.

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

  • The bias in the standard errors complicates our ability to accurately quantify the uncertainty, and therefore to accurately:
    1. make hypothesis tests on the parameters for significance;
    2. provide confidence intervals for the parameters;
    3. provide accurate prediction intervals and confidence intervals for the mean response; and
    4. provide explanatory power in the relationship between the response and the explanatory variables.

Visual diagnostics

Image of residuals versus fitted values with nonlinear behavior in the residuals over cases.

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

  • Using the same plot as before, we can likewise determine if there is actually nonlinearity in the residuals.
  • The plot at the left exhibits a nonlinear dependence of the residual on the fitted/ observed values.
  • The same technique can also be used replacing the fitted values \( \hat{\mathbf{Y}} \) on the horizontal axis with any other variable in the model \( \mathbf{X}_i \), to determine dependence of the residual on the explanatory variables.
  • Additionally, we may consider how the residual varies across variables \( \mathbf{X}_i \) that we have data for, but have not included as explanatory variables on the response.
  • If there is a dependence structure for the residuals on the variable that was left out of the model, it suggests that we should consider its impact on the response and/or if it is a variable that is tightly correlated with our existing model variables.

An example of non-constant variance

  • Recall the “savings” dataset, listing the average savings rate, age demographics and per capita incomes of fifty countries averaged over 1960–1970.
library("faraway")
lmod_savings <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
sumary(lmod_savings)
               Estimate  Std. Error t value  Pr(>|t|)
(Intercept) 28.56608654  7.35451611  3.8842 0.0003338
pop15       -0.46119315  0.14464222 -3.1885 0.0026030
pop75       -1.69149768  1.08359893 -1.5610 0.1255298
dpi         -0.00033690  0.00093111 -0.3618 0.7191732
ddpi         0.40969493  0.19619713  2.0882 0.0424711

n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
  • We saw in the last lecture that the residuals do not diverge strongly from the Gaussian assumption, but some of the predictors themselves appear to have skew or multimodal distributions.

  • We will be interested now in examining our error covariance assumptions.

Savings data example

  • With this model, we plot the residuals versus the fitted values:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(fitted(lmod_savings),residuals(lmod_savings),xlab="Fitted",ylab="Residuals", cex=3, cex.lab=3, cex.axis=3) 
abline(h=0)

Plot of residuals versus fitted values for the savings model.

  • From the initial inspection, we do not notice anything particularly structured about the residuals – indeed, they are roughly symmetric around zero, with no apparent trend over the fitted values.

Half normal distributions

Image of the half-normal distribution. Courtesy of Nagelum CC BY-SA 4.0 via Wikimedia Commons
  • Therefore, to investigate the constant variance assumption more closely, we will “zoom” in on the residuals.
  • In particular, we will consider the plot of the square root of the absolute residuals, i.e., \[ \begin{align} \sqrt{\vert\hat{\boldsymbol{\epsilon}}\vert} \end{align} \]
  • Under the assumption \( \boldsymbol{\epsilon}\sim N(0, \mathbf{I}\sigma^2) \), the values of \( \vert\hat{\boldsymbol{\epsilon}}\vert \) are distributed according to the half-normal distribution;
    • the relationship between the normal distribution and the corresponding half-normal distribution is illustrated on the left.
  • Having excluded the nonlinearity expressed in other plots, we will focus on the absolute values to increase the resolution of the residuals as a function of the fitted values.
  • However, the half-normal distribution is very skewed, so we make a change of scale with the square root of the absolute values in order to keep it well behaved.
Savings data example

    • In the case below, we plot \( \sqrt{\vert\hat{\boldsymbol{\epsilon}}\vert} \) versus the fitted values
    par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
    plot(fitted(lmod_savings),sqrt(abs(residuals(lmod_savings))), xlab="Fitted",ylab=expression(sqrt(hat(epsilon))), cex=3, cex.lab=3, cex.axis=1.5)
    

    Plot of the square root of the absolute residuals versus fitted values for the savings model.

    • We note that in this case as well, irrespective of sign, the residuals show approximately even variation.

Plotting residuals versus predictors

    • We will now consider the plot of the residual versus the explanatory variables of the model:
    par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
    plot(savings$pop15,residuals(lmod_savings), xlab="Population under 15", ylab="Residuals", cex=3, cex.lab=3, cex.axis=1.5)
    abline(h=0)
    

    Plot of residuals versus the percentage of population under 15.

    • We note that there are two distinct groups in the residuals with respect to the percentage of the population under 15.

    • This reflects the multi-modality that was previously observed in the Q-Q plots.

Plotting residuals versus predictors

    • On the other hand, the residuals are of uniform variation with respect to the values for the population over 75.
    par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
    plot(savings$pop75,residuals(lmod_savings),xlab="Population over 75",ylab="Residuals", cex=3, cex.lab=3, cex.axis=1.5)
    abline(h=0)
    

    Plot of residuals versus the percentage of population over 75.

    • We are motivated thus to determine if there is non-constant variance in the residuals with respect to the pop15 predictor…

The F-test for variance

    • Suppose we have two independent samples from two Gaussian distributions of unknown variances respectively, i.e.,

      \[ \begin{align} &N(\mu_X , \sigma^2_X) & & N(\mu_Z , \sigma^2_Z) \end{align} \]

      • For \( m,n>1 \), the samples are given respectively as \( \left\{X_i\right\}_{i=1}^n \) and \( \left\{Z_i\right\}_{i=1}^m \).
    • Then, we may compute the sample-based means of the random variables above as,

    \[ \begin{align} \overline{X} &\triangleq \frac{1}{n}\sum_{i=1}^n X_i & & \overline{Z} \triangleq \frac{1}{m}\sum_{i=1}^m Z_i \end{align} \]

    • And likewise, we can compute the sample-based variances as,

    \[ \begin{align} S_X^2 \triangleq & \frac{1}{n-1} \sum_{i=1}^n\left(X_i - \overline{X}\right)^2 & & S_Z^2 \triangleq \frac{1}{m-1}\sum_{i=1}^m \left(Z_i - \overline{Z}\right)^2 \end{align} \]

F-test continued…

    Image of the F distribution.

    Courtesy of IkamusumeFan CC BY-SA 4.0

    • Consider the following null and alternative hypothesis: \[ \begin{align} H_0 : &\sigma^2_X = \sigma^2_Z \\ H_1 : & \sigma^2_X \neq \sigma^2_Z \end{align} \]
    • Under the previous assumptions, and the null hypothesis \( H_0 \), the ratio of the two sample based variances is an F-statistic \[ \begin{align} \frac{S_X^2}{S_Z^2}\sim F_{(n-1,m-1)} \end{align} \] a distribution defined by the ratio of two \( \chi^2 \) variables, divided by their respective degrees of freedom.
    • If the p-value of the above ratio is greater than the pre-specified significance level (typically \( 5\% \)), then we fail to reject the null hypothesis, i.e., \( \sigma^2_X = \sigma^2_Z \).
    • On the other hand, if we observe an extreme value for this ratio, this gives significant evidence to support inequality of the variances.
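    • As a minimal sketch (using arbitrary simulated samples, not the savings data), the F statistic and its two-sided p-value can be computed by hand and checked against R's var.test():

    set.seed(4)
    x <- rnorm(23, sd = 2)                       # illustrative sample with sigma_X = 2
    z <- rnorm(27, sd = 1)                       # illustrative sample with sigma_Z = 1
    F_stat <- var(x) / var(z)                    # ratio of the sample variances
    p_two_sided <- 2 * min(pf(F_stat, 22, 26), 1 - pf(F_stat, 22, 26))
    c(F_stat = F_stat, p_value = p_two_sided)
    var.test(x, z)                               # agrees with the manual computation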

Returning to the regression analysis…

    Image of two groups of residuals with different variance in their distributions.

    Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

    • We wish to determine if there is non-constant variance in the residuals with respect to the values of the explanatory variable “pop15”.
    • This is, of course, a violation of the Gauss-Markov theorem assumptions. We want to test this concretely with the F-test for variances.
    • For the F-test discussed earlier we take
      • \( \left\{X_i \right\}_{i=1}^n \) to be the residuals corresponding to countries with more than \( 35\% \) of the population under 15;
      • \( \left\{Z_i \right\}_{i=1}^m \) to be the residuals corresponding to countries with less than \( 35\% \) of the population under 15.
    • Assuming that both sample populations are normally distributed with unknown variance, we will use an F-test to determine if the variances are equal.
    • In our earlier analysis, we failed to reject the null hypothesis of Gaussian error distributions, so this assumption is justified.

Testing for constant variance

    • We perform the above F-test in the following code:
    var.test(residuals(lmod_savings)[savings$pop15>35], residuals(lmod_savings)[savings$pop15<35])
    
    
        F test to compare two variances
    
    data:  residuals(lmod_savings)[savings$pop15 > 35] and residuals(lmod_savings)[savings$pop15 < 35]
    F = 2.7851, num df = 22, denom df = 26, p-value = 0.01358
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     1.240967 6.430238
    sample estimates:
    ratio of variances 
              2.785067 
    
    • We see two complementary measures for the F-test of variances above:

      • Firstly, the p-value is less than \( 2\% \), so we reject the null hypothesis (equal variances) at the \( 5\% \) significance level.
      • Secondly, the \( 95\% \) confidence interval for the ratio of variances excludes one, such that with \( 95\% \) confidence we exclude the possibility that the variances are equal.
    • This suggests that weighted least squares can correct the issues with non-constant variance – we will discuss this later on in the course.

A summary of ideas so far

    • Though non-constant variance doesn’t bias the least squares estimate of the parameters \( \hat{\boldsymbol{\beta}} \), it does bias the estimates of standard errors.
    • In particular, our confidence intervals and uncertainty quantification can become unreliable.
    • We can’t observe the true error \( \boldsymbol{\epsilon} \) directly, so we must use the residuals \( \hat{\boldsymbol{\epsilon}} \) as a kind of proxy measurement.
      • Even when the errors \( \boldsymbol{\epsilon} \) are uncorrelated and of constant variance, the same does not hold generally for the residuals.
    • Nonetheless, plotting residuals versus fitted values and versus predictors reveals structures in the errors.
    • To ameliorate issues of non-constant variance, one common technique is to use a change of scale for either:
      1. the response variable; or
      2. one or more of the predictors.
    • So far, we have only considered two transformations, log and sqrt, which we have only explored informally.
    • Whenever we try a change of scale, we should repeat the diagnostics to see if this had the intended effect, or if other changes may be necessary.
    • Shortly, we will look at other more formal means of defining transformations, both to the response and the predictors.
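    • As a minimal sketch of this refit-and-recheck workflow (the square-root scale for the response below is chosen purely for illustration, not a recommendation from the lecture):

    lmod_sqrt <- lm(sqrt(sr) ~ pop15 + pop75 + dpi + ddpi, savings)   # refit on an example transformed scale
    plot(fitted(lmod_sqrt), residuals(lmod_sqrt), xlab="Fitted", ylab="Residuals")
    abline(h=0)                                                       # then repeat the diagnostic plots as before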

Correlated observations

    • There aren't very general methods for testing for the correlation of the observations \( \boldsymbol{\epsilon} \), i.e.,

      • let

      \[ \begin{align} \mathrm{cov}\left(\boldsymbol{\epsilon}\right)\triangleq &\mathbb{E}\left[\boldsymbol{\epsilon} \boldsymbol{\epsilon}^\mathrm{T}\right] = \boldsymbol{\Sigma}^2, \end{align} \]

      then it is generally difficult to test, based on the residuals, whether the off-diagonal entries of \( \boldsymbol{\Sigma}^2 \) are non-zero.

      • That is, whether we can even write

      \[ \boldsymbol{\Sigma}^2 = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_n^2 \end{pmatrix} \] for possibly different \( \sigma_i \), but strictly diagonal.

Correlated observations

    • Recall that even when \( \boldsymbol{\Sigma}^2 = \sigma^2 \mathbf{I} \), the residuals are not generally uncorrelated as,

      \[ \begin{align} \mathrm{cov} \left( \hat{\boldsymbol{\epsilon}} \right) = \sigma^2 \left(\mathbf{I} - \mathbf{H}\right) \end{align} \]

      • Therefore, we must usually anticipate the issues from the structure of the problem itself, e.g.,
        1. in time series problems, we should check for time correlation of the observations;
        2. in spatial data, we should check for spatial correlation of the observations;
        3. data collected in particular batches should be checked for correlations between observations within the same batch;
        4. etc…
      • Typically, this will include (as usual) a mixture of visual inspection to get a feel for issues in the data, and numerical verification of our intuition.

An example of time-correlated observations

    Image of ice core hole.

    Ice core sample hole. Image courtesy of NOAA - National Centers for Environmental Information

    • If we wish to model and understand the changing climate, we must bear in mind that accurate measurements of temperature are only available since the early 1850s.
    • To estimate the average land surface temperature in a region for dates further back in time, we can use proxy data including, e.g.,
      1. tree rings, which show the relative intensity of winters, summers and precipitation;
      2. ice core samples, which give stratified layers demonstrating the relative intensity of summer melting and precipitation events;
      3. corals, which by chemical properties of their skeletons can be used to estimate the ocean temperature in which they formed;
      4. and others…
    • In this context, we can consider building a linear model for the average northern hemisphere surface temperature as the response, with the proxy data as the explanatory variables.
    • Fitting the relationship where there are reliable observations of the temperature, we can predict the average surface temperature further back in time, based upon the proxies, which extend much further.

An example of correlated observations

    • We will use the data from Faraway, globwarm,
    head(globwarm, 4)
    
         nhtemp  wusa jasper westgreen chesapeake tornetrask urals mongolia tasman
    1000     NA -0.66  -0.03      0.03      -0.66       0.33 -1.49     0.83  -0.12
    1001     NA -0.63  -0.07      0.09      -0.67       0.21 -1.44     0.96  -0.17
    1002     NA -0.60  -0.11      0.18      -0.67       0.13 -1.39     0.99  -0.22
    1003     NA -0.55  -0.14      0.30      -0.68       0.08 -1.34     0.95  -0.26
         year
    1000 1000
    1001 1001
    1002 1002
    1003 1003
    
    • nhtemp - is the average northern hemisphere temperature, in degrees Celsius, which only has data available back to 1856 (thus the NA values).

    • westgreen - is the ice core proxy information from western Greenland.

    • chesapeake - is sea shell proxy information from Chesapeake Bay.

    • tornetrask - is tree ring proxy information from Sweden.

    • urals - is tree ring proxy information from the Ural mountains.

An example of correlated observations

    tail(globwarm, 4)
    
         nhtemp wusa jasper westgreen chesapeake tornetrask urals mongolia tasman
    1997   0.48 0.82   0.35     -0.23      -0.32       0.56   0.6     1.33   0.50
    1998   0.66 0.82   0.34     -0.21      -0.07       0.44   0.6     1.30   0.49
    1999   0.46 0.84   0.32     -0.20       0.17       0.37   0.6     1.24   0.49
    2000   0.40 0.88   0.31     -0.19       0.39       0.34   0.6     1.13   0.50
         year
    1997 1997
    1998 1998
    1999 1999
    2000 2000
    
    • mongolia - is tree ring proxy information from Mongolia.

    • tasman - is tree ring proxy information from Tasmania.

    • In the tail, we have accurate temperature measurements into the year 2000.

    • The default behavior in R is to omit any rows with missing values, so we can fit the model directly only on the years 1856-2000 as below:

    lmod <- lm(nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman, globwarm)
    

An example of correlated observations

    • Q: What kind of plot would make sense in this instance to study the time-correlation of the observations?

    • A: we can once again plot the residuals, but in this context versus the time variable:

    par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
    plot(residuals(lmod) ~ year, na.omit(globwarm), ylab="Residuals", cex=3, cex.lab=3, cex.axis=1.5)
    abline(h=0)
    

    Plot of residuals versus year for the globwarm model.

    • Q: do we detect any structure to the residuals?

An example of correlated observations

    • Given the last diagnostic, it appears that the residuals are positively correlated in time,

      • that is, the residual at one time step appears to be largely similar to the residual at the subsequent step.
    • Therefore, we will plot the residual one step in time forward as a function of the residual one step in time backward:

    par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
    n <- length(residuals(lmod))
    plot(tail(residuals(lmod),n-1) ~ head(residuals(lmod),n-1), xlab=expression(hat(epsilon)[i]),ylab=expression(hat(epsilon)[i+1]),  cex=3, cex.lab=3, cex.axis=1.5)
    abline(h=0,v=0,col=grey(0.75))
    

    Plot of the residual at the next time step, \( \hat{\boldsymbol{\epsilon}}_{i+1} \), versus the residual at the current step, \( \hat{\boldsymbol{\epsilon}}_{i} \).

An example of correlated observations

    • The former plot strongly suggests a linear relationship between the residuals at subsequent time steps.

      • We note that in the former plot, the intercept term should be zero, as the residuals have mean zero.
    • To verify our visual inspection numerically, we fit a linear model (without intercept) for the plot studied earlier:

    sumary(lm(head(residuals(lmod),n-1) ~ tail(residuals(lmod),n-1) -1))
    
                                 Estimate Std. Error t value  Pr(>|t|)
    tail(residuals(lmod), n - 1) 0.571578   0.066575  8.5855 1.391e-14
    
    n = 144, p = 1, Residual SE = 0.13640, R-Squared = 0.34
    
    • The \( R^2 \) value is somewhat small in this case, indicating that the variation of the trend isn't fully explained by the linear relationship.

    • However, the p-value indicates that there is some relationship between the residuals at consecutive time steps;

      • particularly, it strongly suggests time-correlation of the residuals (and observations).

Durbin-Watson test

    • If we wish to evaluate the previous question formally in terms of hypothesis testing, we can consider the Durbin-Watson test.

    • Generally speaking, the Durbin-Watson test uses a test statistic to study the lag-1 autocorrelation in the residuals from a regression analysis.

    • Suppose we have a linear relationship defined in terms of the lag 1 residuals, i.e.,

      \[ \begin{align} \hat{\boldsymbol{\epsilon}}_t = \rho \hat{\boldsymbol{\epsilon}}_{t-1} + \nu_t \end{align} \]

      where:

      1. \( \hat{\boldsymbol{\epsilon}}_t \) is the residual at time \( t \) in the time series;
      2. \( \hat{\boldsymbol{\epsilon}}_{t-1} \) is the residual at the time measured one step before;
      3. \( \rho \) is a constant scalar, giving the slope of the relationship; and
      4. \( \nu_t \) is the random variation around the linear signal between time steps.

Durbin-Watson test

    Image courtesy of Geek3 CC BY 3.0 via Wikimedia Commons.
    • Given the lag-one autocorrelation model in the residuals, \( \hat{\boldsymbol{\epsilon}}_t = \rho \hat{\boldsymbol{\epsilon}}_{t-1} + \nu_t \), we will consider the following hypotheses:
    • \[ \begin{align} H_0: &\rho = 0 \\ H_1: &\rho \neq 0 \end{align} \] where the null corresponds to the absence of a time-correlation of the residuals.
    • Under the null hypothesis, the following, \[ \begin{align} DW \triangleq \frac{\sum_{i=2}^n \left(\hat{\boldsymbol{\epsilon}}_{i} - \hat{\boldsymbol{\epsilon}}_{i-1}\right)^2 }{\sum_{i=1}^n \hat{\boldsymbol{\epsilon}}_i^2} \end{align} \] is distributed according to a linear combination of \( \chi^2 \) distributions (example Chi-squared distributions pictured left).
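    • As a short check (a sketch, assuming the globwarm model lmod fitted earlier), the statistic can be computed directly from the residuals and compared with the value reported by dwtest() on the next slide:

    e <- residuals(lmod)          # residuals from the globwarm fit above
    sum(diff(e)^2) / sum(e^2)     # hand-computed Durbin-Watson statistic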

Durbin-Watson test

    • In R, we can perform the test using the package “lmtest”
    require("lmtest")
    dwtest(nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman, data=globwarm)
    
    
        Durbin-Watson test
    
    data:  nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask +     urals + mongolia + tasman
    DW = 0.81661, p-value = 1.402e-15
    alternative hypothesis: true autocorrelation is greater than 0
    
    • The p-value is effectively zero, so we can conclude at the \( 5\% \) significance level that the observations are time-correlated.

    • We will evaluate later how to deal with some of the issues with correlation in observations with generalized least squares.

A summary of correlation between cases

    • There aren't very general means to test for correlation between cases, and instead we will usually need to appeal to expert knowledge of the structure of the problem.

    • In particular, we note there are common types of correlation between cases, arising from how the data is collected:

    1. in time, giving auto-correlation in time;
    2. in batches, giving correlation between cases within the same batch;
    3. in space, giving correlation structure between cases within spatial proximity of other cases.
    • Investigating for these correlations should be performed in exploratory analysis when possible, noting the structure of the data and the method of collection.