10/28/2020
We will now begin to consider the structural part of the model, and whether it is correctly specified.
In order to check the structural assumption on the model, i.e.,
\[ \begin{align} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, \end{align} \]
it is often a good first diagnostic to examine the 2D plot of the response \( Y \) versus each explanatory variable \( X_i \).
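As a minimal sketch, such marginal plots can be produced in a loop; this assumes the savings data from the "faraway" package, which the examples below rely on:

```r
# Sketch: one marginal 2D plot of the response (sr) versus each
# explanatory variable in the savings data.
library("faraway")
par(mfrow = c(2, 2))
for (v in c("pop15", "pop75", "dpi", "ddpi")) {
  plot(savings[[v]], savings$sr, xlab = v, ylab = "sr")
}
```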
However, when there are multiple explanatory variables of interest, this kind of plot is limited; we may instead consider partial regression plots.
Particularly, consider the following definitions: let \( \hat{d} \) be the residuals from regressing \( Y \) on all explanatory variables except \( X_i \), and let \( \hat{m} \) be the residuals from regressing \( X_i \) on those same explanatory variables.
Then together, plotting \( \hat{d} \) versus \( \hat{m} \) allows us to approximate the effect of \( X_i \) on \( Y \) with the effect of the other variables taken out.
We'll demonstrate this with the savings data…
library("faraway")
# full model for the savings rate
lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
# residuals of the response regressed on all variables except pop15
d <- residuals(lm(sr ~ pop75 + dpi + ddpi, savings))
# residuals of pop15 regressed on the other explanatory variables
m <- residuals(lm(pop15 ~ pop75 + dpi + ddpi, savings))
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(m, d, xlab="pop15 residuals", ylab="Savings residuals", cex=3, cex.lab=3, cex.axis=1.5)
abline(0, coef(lmod)['pop15'])  # slope is the pop15 coefficient from the full model
The slope of the line is given precisely by the coefficient for “pop15” in the full model, describing the trend of the partial regression.
This type of plot is typically used for detecting outliers and influential observations; here, nothing is particularly troubling.
Additionally, we can consider a partial residual plot, not to be confused with the last.
Partial residual plots are a similar, alternative type of plot that are more frequently used for nonlinearity detection.
We can consider, in a different derivation, the response with the effects of the other variables removed.
Subtracting the fitted contributions of all variables except \( X_k \) from the observation \( Y_i \), we obtain
\[ \begin{align} Y_i - \sum_{j\neq k}X_{i,j} \hat{\beta}_j &= \hat{Y}_i + \hat{\epsilon}_i - \sum_{j\neq k}X_{i,j} \hat{\beta}_j \\ &=X_{i,k} \hat{\beta}_k + \hat{\epsilon}_i \end{align} \]
Particularly, we can then plot \( X_{i,k} \hat{\beta}_k + \hat{\epsilon}_i \) versus \( X_{i,k} \), varying the case index \( i \); the slope of the trend line is again \( \hat{\beta}_k \), and the interpretation matches that of the partial regression plot.
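These partial residuals can also be computed by hand; a minimal sketch, assuming the savings data and the full model fit earlier:

```r
# Sketch: partial residuals for pop15 computed directly from the full model.
library("faraway")
lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
# X_{i,k} * beta_hat_k + eps_hat_i, varying the case index i
pr <- coef(lmod)["pop15"] * savings$pop15 + residuals(lmod)
plot(savings$pop15, pr, xlab = "pop15", ylab = "Partial residuals")
abline(0, coef(lmod)["pop15"])  # slope is the full-model pop15 coefficient
```

Because the full-model residuals are orthogonal to the pop15 column, regressing these partial residuals on pop15 recovers exactly the full-model coefficient.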
This is easy to do automatically in R, using the function “termplot”…
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=1, cex=3, cex.lab=3, cex.axis=1.5, col.res="black")
Distinct from the previous plot, the two sub-groups observed earlier emerge, corresponding to high and low percentages of the population below 15.
Though both plots are meant to describe the dependence of the response on the variable “pop15” with all other factors removed, the second plot here better isolates the structure.
mod1 <- lm(sr ~ pop15+pop75+dpi+ddpi,savings,subset=(pop15 > 35))
mod2 <- lm(sr ~ pop15+pop75+dpi+ddpi,savings,subset=(pop15 < 35))
sumary(mod1)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.43396890 21.15502778 -0.1151 0.9097
pop15 0.27385369 0.43919097 0.6235 0.5408
pop75 -3.54847686 3.03328065 -1.1698 0.2573
dpi 0.00042076 0.00500007 0.0842 0.9339
ddpi 0.39547422 0.29010124 1.3632 0.1896
n = 23, p = 5, Residual SE = 4.45441, R-Squared = 0.16
sumary(mod2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.96179499 8.08375019 2.9642 0.007164
pop15 -0.38589762 0.19536859 -1.9752 0.060921
pop75 -1.32774213 0.92606273 -1.4337 0.165705
dpi -0.00045881 0.00072372 -0.6340 0.532640
ddpi 0.88439443 0.29534055 2.9945 0.006679
n = 27, p = 5, Residual SE = 2.77167, R-Squared = 0.51
We have seen the very different fit of the model on the two blocks of countries, with different variances between the groups.
This kind of analysis is important: we saw with the teengamb data (a simple encoding versus the mixed effect of sex) how a mis-specified model can greatly change the conclusions we draw about the trend in the data.
Using the variety of diagnostics applied to this model, we have a more complete picture of how to approach these relationships.
The partial residual plots especially helped to determine whether the form of the model,
\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
is a reasonable hypothesis in the first place.
Recall that when our assumptions (e.g., the Gauss-Markov conditions) hold, the residual variance estimate,
\[ \begin{align} \hat{\sigma}^2 \triangleq \frac{\hat{\boldsymbol{\epsilon}}^\mathrm{T} \hat{\boldsymbol{\epsilon}}}{n-p} \end{align} \]
is an unbiased estimate of the true variance \( \sigma^2 \) of the error.
However, we have now seen many ways how these assumptions can break down.
In an ideal situation, where we actually knew \( \sigma^2 \), comparing our empirical estimate to the known value would tell us about the fit of the model.
By the same principle, we may more generally get a good estimate of \( \sigma^2 \) when we have multiple observations of the response for the same values of the explanatory variables.
We note that this requires an assumption of the independence of the measurements.
For example, we may be measuring blood pressure as the response variable and using age, height, weight, and other measurements as the explanatory variables in a model.
It won't suffice to re-measure the same individual to obtain an estimate of \( \sigma^2 \), as this will only measure the within-subject variability.
Rather, we would need to find multiple individuals with the same values of the explanatory variables to estimate the variance \( \sigma^2 \) of the error around the signal.
We call these multiple, independent observations of the response with identical explanatory variables replicates.
Suppose we are in the case of simple regression to simplify our discussion.
Let \( Y_{ij} \) be the \( i \)-th observation of the \( j \)-th group of replicates.
Define the mean over all replicates in the \( j \)-th group as \( \overline{Y}_j \); we can then estimate \( \sigma^2 \) independently of the regression model as
\[ \begin{align} \frac{\sum_{j}\sum_{i}\left(Y_{ij} - \overline{Y}_{j}\right)^2}{\sum_j \left( \{\text{number of replicates of type }j\} - 1\right)} &=\frac{\sum_{j}\sum_{i}\left(Y_{ij} - \overline{Y}_{j}\right)^2}{n - \text{number of groups}} \end{align} \]
The estimate given above is known as the pure error.
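As a sketch with a small hypothetical dataset of replicates (the x and y values here are invented for illustration), the pure error can be computed directly:

```r
# Hypothetical replicate data: three distinct design points, each
# measured several times.
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
y <- c(5.1, 4.8, 5.3, 7.0, 7.4, 9.2, 8.8, 9.1, 9.5)

groups <- split(y, x)
# within-group sum of squares around each group mean
ss <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))
# degrees of freedom: n minus the number of groups
df <- length(y) - length(groups)
pure_error <- ss / df  # model-free estimate of sigma^2
```

Equivalently, pure_error is the residual variance of the saturated model lm(y ~ factor(x)), anticipating the construction discussed below.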
We can formulate a test for lack of fit in terms of the ratio of the two estimates for variance, i.e.,
Q: what test statistic can we use to test if two empirical variances match?
A: this can be done with the F-test.
We may fit a model in which each value for a replicate group is treated as a factor;
by treating these as a categorical factor, our model becomes saturated as there will be one parameter in the model per replicate group;
the fitted value for each group of replicates will be given as the mean for this group of replicates;
while this doesn't provide a model for explaining the phenomenon, the residual variance estimate of this dummy model is exactly the pure error;
thus, we can use the F-test in an ANOVA table to compare the regression model standard error with the pure error.
library("faraway")
head(corrosion)
Fe loss
1 0.01 127.6
2 0.48 124.0
3 0.71 110.8
4 0.95 103.9
5 1.19 101.5
6 0.01 130.1
Each sample has a varying amount of iron, and the loss is measured in terms of milligrams per day.
We fit a simple regression model for the rate of loss per day based on the iron content:
lmod <- lm(loss ~ Fe, corrosion)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 129.7866 1.4027 92.524 < 2.2e-16
Fe -24.0199 1.2798 -18.769 1.055e-09
n = 13, p = 2, Residual SE = 3.05778, R-Squared = 0.97
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(loss ~ Fe, corrosion,xlab="Iron content",ylab="Weight loss", cex=3, cex.lab=3, cex.axis=1.5)
abline(coef(lmod))
lmod <- lm(loss ~ Fe, corrosion)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(residuals(lmod)))
qqline(scale(residuals(lmod)))
shapiro.test(residuals(lmod))
Shapiro-Wilk normality test
data: residuals(lmod)
W = 0.93304, p-value = 0.3733
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(corrosion$loss))
qqline(scale(corrosion$loss))
shapiro.test(corrosion$loss)
Shapiro-Wilk normality test
data: corrosion$loss
W = 0.91198, p-value = 0.1952
Here, in the ANOVA table, the null hypothesis is that the regression model's residual variance estimate and the pure error estimate are empirical variances that approximate the same \( \sigma^2 \).
The alternative hypothesis is that these do not estimate the same quantity, i.e., that the regression model lacks fit:
$$\begin{align} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align}$$
is not an accurate form for the model, as the pure error doesn't match the error variance implied by this formulation.
# saturated model: one parameter per replicate group
lmoda <- lm(loss ~ factor(Fe), corrosion)
sqrt(sum(lmoda$residuals^2)/lmoda$df.residual)  # pure error SD
[1] 1.401289
sqrt(sum(lmod$residuals^2)/lmod$df.residual)    # regression residual SE
[1] 3.05778
anova(lmod, lmoda)
Analysis of Variance Table
Model 1: loss ~ Fe
Model 2: loss ~ factor(Fe)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 11 102.850
2 6 11.782 5 91.069 9.2756 0.008623 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
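As a check, the F statistic in the table can be reproduced by hand from the reported residual sums of squares and degrees of freedom:

```r
# Manual lack-of-fit F statistic from the ANOVA table above.
rss_reg  <- 102.850; df_reg  <- 11  # simple regression model
rss_pure <- 11.782;  df_pure <- 6   # saturated factor model (pure error)

F_stat <- ((rss_reg - rss_pure) / (df_reg - df_pure)) / (rss_pure / df_pure)
p_val  <- pf(F_stat, df_reg - df_pure, df_pure, lower.tail = FALSE)
# F_stat and p_val match the table (9.2756 and 0.008623) up to rounding.
```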
Due to the small p-value (\( \approx 0.0086 \)), we reject the null at \( 5\% \) significance.
In particular, the pure-error estimate of the standard deviation is \( \approx 1.4 \), while the regression estimate is considerably larger at \( \approx 3.06 \).
In this case, it appears that there is still a structural issue in the model itself, which makes the \( R^2 \) misleading.
A more general question is how close a fit to the data is actually appropriate for the problem at hand.
By including polynomial terms up to the sixth degree in the explanatory variable, we can achieve an almost perfect fit, but the model is completely unphysical and will clearly lack any inferential power.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
lmodp <- lm(loss ~ Fe+I(Fe^2)+I(Fe^3)+I(Fe^4)+I(Fe^5)+I(Fe^6),corrosion)
plot(loss ~ Fe, data=corrosion,ylim=c(60,130), cex=3, cex.lab=3, cex.axis=1.5)
points(corrosion$Fe,fitted(lmoda),pch=3)
grid <- seq(0,2,len=50)
lines(grid,predict(lmodp, data.frame(Fe=grid)))
The issues to check for in diagnostics are diverse, but also of differing importance for the analysis.
Specifically, we will order these in terms of importance of consideration:
Remediation of the above can take many different forms.