10/26/2020
We will now consider diagnostics for identifying unusual observations.
Some observations can, by themselves, have a dramatic impact on the fit and performance of the model; these are known as influential observations.
Outliers are those observations that do not fit the overall trend of the linear model well.
An observation may be an outlier, influential, both, or neither.
A leverage point is an extreme value of one of the explanatory variables that has the potential to influence the fit of the model.
It is important to identify any such points, but what to do with them depends strongly on the problem at hand.
We will first study the leverage points.
Recall the “hat” matrix that projects the observations into the space of the explanatory variables,
\[ \begin{align} \mathbf{H}&\triangleq \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X} \right)^{-1}\mathbf{X}^\mathrm{T} \end{align} \]
Using the hat matrix, we define the fitted values via the relationship,
\[ \begin{align} \hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}. \end{align} \]
Therefore, consider the derivative of the \( i \)-th fitted value \( \hat{\mathbf{Y}}_i \) with respect to the \( j \)-th observation of the response,
\[ \begin{align} \partial_{\mathbf{Y}_j} \hat{\mathbf{Y}}_i &= \partial_{\mathbf{Y}_j} \left(\mathbf{H} \mathbf{Y}\right) \\ & =\mathbf{H}_{ij}. \end{align} \]
Thus, the sensitivity of the \( i \)-th fitted value (prediction) with respect to the \( j \)-th observed value is given precisely by the entry \( (i,j) \) of the hat matrix, \( \mathbf{H}_{ij} \).
For this reason, the entries of the projection operator \( \mathbf{H} \) can be considered to be,
\( \mathbf{H}_{ii} \) — the self-sensitivity or leverage of the fitted value \( \hat{\mathbf{Y}}_{i} \) with respect to the variation of the observation \( \mathbf{Y}_i \); while
\( \mathbf{H}_{ij} \) — the cross-sensitivity of the fitted value \( \hat{\mathbf{Y}}_{i} \) with respect to the variation of the observation \( \mathbf{Y}_j \).
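As a quick numerical check (a sketch only, using the faraway package and the savings data that we analyze later in this lecture), we can build \( \mathbf{H} \) directly from the design matrix and confirm that its diagonal gives the leverages that R reports, and that \( \mathbf{H}\mathbf{Y} \) reproduces the fitted values:
require(faraway)
lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)   # the same model fit below
X <- model.matrix(lmod)                                # design matrix, including the intercept
H <- X %*% solve(t(X) %*% X) %*% t(X)                  # hat matrix H = X (X^T X)^{-1} X^T
all.equal(unname(diag(H)), unname(hatvalues(lmod)))    # diagonal entries are the leverages h_i
max(abs(H %*% savings$sr - fitted(lmod)))              # H Y matches the fitted values (up to rounding)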
Recall that, under the condition of \( \boldsymbol{\epsilon}\sim N\left(0,\sigma^2 \mathbf{I}\right) \)
\[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\epsilon}}\right) = \sigma^2\left(\mathbf{I} - \mathbf{H}\right) \end{align} \]
Denote the \( i \)-th self-sensitivity or leverage,
\[ \begin{align} h_i \triangleq \mathbf{H}_{ii} \end{align} \]
From the above, the variance of the \( i \)-th residual is given by,
\[ \begin{align} \sigma^2 \left(1 - h_i\right) \end{align} \]
This implies that the values of the leverages will have a strong impact on the variance of the residuals.
Specifically, if \( h_i \) is close to one, then \( \mathrm{var}\left(\hat{\boldsymbol{\epsilon}}_i\right) \) is close to zero and the fit is forced to pass very close to the observation \( \mathbf{Y}_i \); whereas if \( h_i \) is close to zero, the variance of the residual is close to \( \sigma^2 \).
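To see this effect numerically, here is a small simulation sketch (the design, sample size, and error variance below are arbitrary choices, not part of these notes): with the design held fixed over repeated simulations, the empirical variance of each residual should be close to \( \sigma^2\left(1 - h_i\right) \), and the high-leverage point should show a much smaller residual variance than the others.
set.seed(1)                                    # arbitrary seed for reproducibility
n <- 30; sigma <- 2
x <- c(rnorm(n - 1), 10)                       # the last point lies far from the others: high leverage
X <- cbind(1, x)
h <- diag(X %*% solve(t(X) %*% X) %*% t(X))    # leverages for this simple regression design
res <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(n, sd = sigma)        # simulate responses with the design held fixed
  resid(lm(y ~ x))
})
cbind(empirical = apply(res, 1, var),          # sample variance of each residual over the replicates
      theoretical = sigma^2 * (1 - h))[c(1, n), ]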
To understand large leverages, we can consider how leverage relates to a weighted distance function.
Suppose that \( \mathbf{x}\in \mathbb{R}^{p-1} \) is a vector of predictor values and let \( \mathrm{cov}\left(\mathbf{X}\right) \) denote the covariance matrix of the observed predictors, not including the intercept.
If \( \mathbf{a} = \mathbf{x} - \overline{\mathbf{x}} \) is the vector of anomalies from the mean of the observed predictor values, \[ \begin{align} \sqrt{\left( \mathbf{a}\right)^\mathrm{T} \mathrm{cov}\left(\mathbf{X}\right)^{-1} \left(\mathbf{a}\right)} \end{align} \]
is the Mahalanobis distance of \( \mathbf{x} \) from the center of the observed data.
This distance is nearly identical in form to the one appearing in the extrapolation error variance, except that there we used the normalized anomalies.
Again, if the data were mean zero and uncorrelated, with unit variance in every direction, this would reduce to the standard Euclidean distance,
\[ \begin{align} \sqrt{\left(\mathbf{x} - \mathbf{0} \right)^\mathrm{T} \mathbf{I} \left(\mathbf{x} - \mathbf{0} \right)} = \sqrt{\mathbf{x}^\mathrm{T}\mathbf{x}} \end{align} \]
The connection between leverage and the Mahalanobis distance is the following: for the \( i \)-th case, let \( \mathbf{A}^{(i)} \) be the vector of anomalies giving its deviation from the variable means, and let \( \tilde{\mathbf{X}}^{(i)} = \mathbf{A}^{(i)} / \sqrt{n-1} \) be the corresponding normalized anomaly vector; then
\[ \begin{align} h_i &\equiv \frac{\left[\left(\mathbf{A}^{(i)}\right)^\mathrm{T} \mathrm{cov}(\mathbf{X})^{-1} \left(\mathbf{A}^{(i)}\right)\right]}{ \left(n-1\right)} + \frac{1}{n}\\ &= \left(\tilde{\mathbf{X}}^{(i)}\right)^\mathrm{T}\mathrm{cov}(\mathbf{X})^{-1} \tilde{\mathbf{X}}^{(i)} + \frac{1}{n} \end{align} \]
Correspondingly, if an observed case is far from the center of the data according to the weighted distance, the fitted value is highly sensitive to this observed value.
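This identity can be checked directly (a sketch, reusing the lmod fit from the earlier sketch): R's mahalanobis() returns the squared distance, so dividing by \( n-1 \) and adding \( 1/n \) should reproduce the leverages.
X0 <- as.matrix(savings[, c("pop15", "pop75", "dpi", "ddpi")])   # predictors without the intercept
n <- nrow(X0)
d2 <- mahalanobis(X0, colMeans(X0), cov(X0))                     # squared Mahalanobis distances
all.equal(unname(d2 / (n - 1) + 1 / n), unname(hatvalues(lmod)))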
Leverage can be understood “mechanically”: a force applied to a lever far from the fulcrum produces a large effect.
Note: leverages only contain information from the explanatory variables \( \mathbf{X} \) and not the observations of the response \( \mathbf{Y} \), and therefore only contain partial information, in some sense…
Courtesy of Weiner et al. Handbook of Psychology, Research Methods in Psychology. John Wiley & Sons, 2012
We recall some properties about projection operators.
Suppose \( \mathbf{H} \) is an orthogonal projection operator into the space spanned by the columns of the matrix \( \mathbf{X} \).
Let the vectors \( \left\{\mathbf{x}_i\right\}_{i=1}^p \) form a basis for the column span of \( \mathbf{X} \).
Let the vectors \( \left\{\mathbf{v}_i\right\}_{i=p+1}^{n} \) form a basis for the orthogonal complement to the column span of \( \mathbf{X} \).
Then:
\( \mathbf{H} \mathbf{x}_i = \mathbf{x}_i \) for each \( i = 1, \dots, p \); while
\( \mathbf{H} \mathbf{v}_i = \boldsymbol{0} \) for each \( i = p+1, \dots, n \).
That is, \( \mathbf{H} \) has eigenvalue \( 1 \) with multiplicity \( p \) and eigenvalue \( 0 \) with multiplicity \( n-p \).
Let \( \mathbf{A} \) be any matrix. Recall that the trace of the matrix \( \mathbf{A} \) satisfies the following relationships:
\[ \begin{align} \mathrm{tr}\left(\mathbf{A}\right) &= \sum_{i=1}^n \mathbf{A}_{ii} \\ &= \sum_{i=1}^n \lambda_i \end{align} \]
where the \( \mathbf{A}_{ii} \) are the diagonal entries of the matrix \( \mathbf{A} \) and the \( \lambda_i \) are the eigenvalues of the matrix \( \mathbf{A} \).
Using the properties from the previous slide, we can conclude immediately that the sum of the leverages,
\[ \begin{align} \sum_{i=1}^n h_i = p \end{align} \]
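As a numerical sketch (continuing with the hat matrix H computed from the savings fit earlier), we can confirm that the eigenvalues of \( \mathbf{H} \) are \( p \) ones and \( n-p \) zeros, so that its trace, and hence the sum of the leverages, equals \( p \):
ev <- eigen(H, symmetric = TRUE)$values            # eigenvalues of the projection H
table(round(ev))                                   # p eigenvalues equal to 1, n - p equal to 0
c(sum(diag(H)), sum(ev), sum(hatvalues(lmod)))     # all three equal p = 5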
The average leverage is therefore \( \frac{p}{n} \).
A leverage equal to one means that the model fits that observation exactly, devoting an entire degree of freedom to it.
As a rough diagnostic, leverages greater than or equal to \( \frac{2p}{n} \) should be examined carefully.
We will again consider the savings data and look for points of high leverage.
In the code below, note that we include four predictors, so that \( p=5 \) including the intercept:
require(faraway)
lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
hatv <- hatvalues(lmod)
sum(hatv)
[1] 5
length(lmod$fitted.values)
[1] 50
hatv[hatv > (2 * sum(hatv) / length(lmod$fitted.values))]
Ireland Japan United States Libya
0.2122363 0.2233099 0.3336880 0.5314568
Based on the above criterion, we find four leverages whose behavior is worth investigating.
We can likewise make a visual inspection for outliers in the leverages with a Q-Q plot versus a half normal.
Here we plot the ordered values of the leverages (y-axis) versus the quantiles of the Half-normal distribution.
In particular, we plot the most extreme leverages with their name as corresponds to the previous analysis.
The criterion value for large leverages, \( y = \frac{2p}{n} \), is also plotted as a horizontal line.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
countries <- row.names(savings)
halfnorm(hatv,labs=countries,ylab="Leverages", cex=3, cex.lab=3, cex.axis=1.5)
abline(h = (2 * sum(hatv) / length(lmod$fitted.values)))
It is often the case, as in the difference between covariance and correlation, that we wish to standardize the variances to a unit quantity for comparison.
This is likewise the case with the residuals of our model, and this can be performed precisely with respect to the leverages.
Particularly, with
\[ \begin{align} \mathrm{var}\left(\hat{\boldsymbol{\epsilon}}_i\right) = \sigma^2 \left(1 - h_i \right), \end{align} \]
this suggests the standardization of the residuals as,
\[ \begin{align} r_i \triangleq \frac{\hat{\boldsymbol{\epsilon}}_i}{\hat{\sigma} \sqrt{1 - h_i}} \end{align} \]
When all of our assumptions hold, standardized residuals are to the raw residuals what correlation is to covariance: a rescaling to a common, unit scale.
The raw residuals \( \hat{\boldsymbol{\epsilon}} \) have non-constant variance even when the errors do not, so the standardized residuals can provide better diagnostic plots.
However, if there is non-constant variance of \( \boldsymbol{\epsilon} \), standardization fails to correct for this.
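To make the definition concrete, here is a short sketch with the savings fit: computing the standardized residuals by hand agrees with R's rstandard().
sig.hat <- summary(lmod)$sigma                            # residual standard error
r <- resid(lmod) / (sig.hat * sqrt(1 - hatvalues(lmod)))  # standardized residuals via the formula above
all.equal(r, rstandard(lmod))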
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(rstandard(lmod), main = "", cex=3, cex.lab=3, cex.axis=1.5)
abline(0,1)
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
As illustrated, identifying problematic outliers is important in determining the fit and the explanatory power of a linear model.
However, if the point is already included in the fit, examining the residuals won't necessarily reveal it as an outlier, because the offending observation pulls the fit toward itself, affecting its own residual as well as those of all the other observations.
To detect such points, we must go through a process of excluding observations. Define \( \hat{\boldsymbol{\beta}}_{(i)} \) and \( \hat{\sigma}_{(i)}^2 \) to be the least-squares estimates of \( \boldsymbol{\beta} \) and \( \sigma^2 \) for the model that excludes the \( i \)-th observation.
Consider that for the \( i \)-th observation, we can make a prediction for this observation with the model that excludes it in the fitting, via,
\[ \begin{align} \hat{Y}_{(i)} \triangleq \mathbf{x}_i^\mathrm{T} \hat{\boldsymbol{\beta}}_{(i)} \end{align} \]
If the predicted value from the model that excluded the \( i \)-th observation in the fitting, \( \hat{Y}_{(i)} \) is significantly different than the observed value, \( Y_i \), we can then consider \( Y_i \) to be an outlier.
In order to judge the size of such an outlier, we need to view it in the appropriate scale with respect to the rest of the data.
We can derive the relationship,
\[ \begin{align} \mathrm{var}\left(Y_i - \hat{Y}_{(i)}\right) = \hat{\sigma}_{(i)}^2 \left( 1 + \mathbf{x}_i^\mathrm{T}\left(\mathbf{X}_{(i)}^\mathrm{T} \mathbf{X}_{(i)} \right)^{-1} \mathbf{x}_i\right) \end{align} \]
so that the studentized residual is defined as,
\[ \begin{align} t_i \triangleq \frac{Y_i - \hat{Y}_{(i)}}{\hat{\sigma}_{(i)} \left( 1 + \mathbf{x}_i^\mathrm{T}\left(\mathbf{X}_{(i)}^\mathrm{T} \mathbf{X}_{(i)} \right)^{-1} \mathbf{x}_i\right)^\frac{1}{2}} \end{align} \]
Provided the usual assumptions hold and the observation is not an outlier, we find that \( t_i \sim t_{n-p-1} \).
To avoid fitting \( n \) separate regressions, we appeal to an equivalent form of the above test statistic,
\[ \begin{align} t_i &\equiv \frac{\hat{\boldsymbol{\epsilon}}_i}{\hat{\sigma}_{(i)} \sqrt{1- h_i}} \\ & =r_i \left(\frac{n-p-1}{n-p-r^2_i}\right)^{1/2} \end{align} \] where \( r_i \) is the standardized \( i \)-th residual.
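This identity is easy to verify numerically (a sketch with the savings fit, where \( n=50 \) and \( p=5 \)):
n <- nrow(savings); p <- length(coef(lmod))
r <- rstandard(lmod)                                # standardized residuals
t.manual <- r * sqrt((n - p - 1) / (n - p - r^2))   # convert to studentized residuals
all.equal(t.manual, rstudent(lmod))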
The previous discussion motivates the following hypothesis test,
\[ \begin{align} H_0: &Y_i \text{ }\textit{is not}\text{ an outlier for the full model}\\ H_1: &Y_i \text{ }\textit{is}\text{ an outlier for the full model} \end{align} \]
To examine a single pre-selected observation, this will boil down to the usual t-test where we examine the p-value of
\[ \begin{align} t_i = r_i \left(\frac{n-p-1}{n-p-r^2_i}\right)^{1/2} \end{align} \]
with respect to \( t_{n-p-1} \) to determine significance.
However, suppose we have \( n=100 \) observations that we wish to study.
Suppose we want to test for an outlier at \( \alpha \) significance.
The probability of accepting the null hypothesis in every case is bounded below via the complementary probability,
\[ \begin{align} P\left(\text{all tests accept}\right) &= 1 - P\left(\text{at least one test rejects}\right) \\ & \geq 1 - \sum_{i=1}^n P\left(\text{test $i$ rejects}\right)\\ &=1 - n\alpha \end{align} \]
To guarantee an overall significance level of \( \alpha \) across the tests for all \( n \) observations, we can select a corresponding individual significance level of \( \alpha / n \).
This adjustment is known as the Bonferroni correction (and is also used in other contexts).
The drawback of this adjusted significance level is that for large \( n \), it can become an extremely conservative criterion for flagging possible outliers.
lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
stud <- rstudent(lmod)
stud[which.max(abs(stud))]
Zambia
2.853558
To evaluate the p-value and significance of the largest studentized residual, we compute the Bonferroni-corrected critical value.
qt(.05/(50*2),44)
[1] -3.525801
In particular, this shows that a studentized residual is significant at the Bonferroni-corrected \( \alpha=5\% \) level only if it falls outside the range \( (-3.525801, 3.525801) \).
The studentized residual for Zambia lies within this range, and we therefore fail to reject the null hypothesis.
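Equivalently (a sketch using adjusted p-values rather than a critical value), we can Bonferroni-adjust the two-sided p-values of the studentized residuals; none should fall below \( \alpha = 0.05 \), consistent with failing to reject above.
pvals <- 2 * pt(-abs(stud), df = 44)                  # two-sided p-values, df = n - p - 1
sort(p.adjust(pvals, method = "bonferroni"))[1:3]     # smallest Bonferroni-adjusted p-values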
When searching for outliers:
When correcting for outliers:
Broadly speaking, influential observations are those that would dramatically change the fit of our model if it is included or excluded.
Influential observations may or may not be outliers or have large leverage, but typically one of the two applies.
Rather than looking at \( n \) separate models, we will compress this investigation of the impact of leaving out an observation into a simpler diagnostic, Cook's distance.
We define \( \hat{\boldsymbol{\beta}}_{(i)} \) and \( \hat{\mathbf{Y}}_{(i)} \) to be the least-squares estimate of the parameters \( \boldsymbol{\beta} \) and the vector of fitted values when the \( i \)-th observation is excluded.
Then, with the above definition, we define Cook's distance for observation \( i \) as,
\[ \begin{align} D_i &\triangleq \frac{\left(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)} \right)^\mathrm{T}\left(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)} \right)}{p \hat{\sigma}^2} \\ &= \frac{1}{p}r^2_i \frac{h_i}{1 -h_i} \end{align} \]
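The second expression lets us compute all of the distances without refitting \( n \) models; a quick sketch with the savings fit:
r <- rstandard(lmod); h <- hatvalues(lmod); p <- length(coef(lmod))
D <- r^2 * h / ((1 - h) * p)                  # Cook's distance from the standardized residuals and leverages
all.equal(D, cooks.distance(lmod))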
lmod <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
cook <- cooks.distance(lmod)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
halfnorm(cook,3,labs=row.names(savings),ylab="Cook's distances", cex=3, cex.lab=3, cex.axis=1.5)
lmodi <- lm(sr ~ pop15+pop75+dpi+ddpi,savings,subset=(cook < max(cook)))
sumary(lmodi)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.52404598 8.22402631 2.9820 0.004655
pop15 -0.39144013 0.15790949 -2.4789 0.017084
pop75 -1.28086692 1.14518206 -1.1185 0.269430
dpi -0.00031890 0.00092933 -0.3432 0.733119
ddpi 0.61027903 0.26877842 2.2706 0.028122
n = 49, p = 5, Residual SE = 3.79481, R-Squared = 0.36
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.56608654 7.35451611 3.8842 0.0003338
pop15 -0.46119315 0.14464222 -3.1885 0.0026030
pop75 -1.69149768 1.08359893 -1.5610 0.1255298
dpi -0.00033690 0.00093111 -0.3618 0.7191732
ddpi 0.40969493 0.19619713 2.0882 0.0424711
n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
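To see the influence more directly, here is a short sketch comparing the two sets of estimates (the case removed in lmodi is the observation with the largest Cook's distance):
names(which.max(cook))                                  # the case excluded from lmodi
round((coef(lmod) - coef(lmodi)) / coef(lmod), 3)       # relative change in each coefficient
From the two summaries above, the coefficient for ddpi moves from about 0.41 to 0.61, roughly a 50% change, when this single case is removed.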
Performing this analysis one observation at a time (or by inspecting the numbers alone) would be tedious.
We can do this over all observations with the “leave-one-out” observation deletion diagnostic function dfbeta:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(dfbeta(lmod)[,2],ylab="Change in pop15 coef", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)
Here, one particular observation (Japan) leads to an extreme change in the estimated coefficient for pop15.
Note: this type of plot should be evaluated for each of the parameters (when possible) to search for influential observations.
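A sketch of how one might examine all of the coefficients at once (the 2-by-2 layout and labels are illustrative choices):
db <- dfbeta(lmod)                                 # n x p matrix of coefficient changes under deletion
par(mfrow = c(2, 2), mai = c(1, 1, .5, .5))
for (j in 2:ncol(db)) {                            # skip the intercept in column 1
  plot(db[, j], ylab = paste("Change in", colnames(db)[j], "coef"))
  abline(h = 0)
}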
lmodj <- lm(sr ~ pop15+pop75+dpi+ddpi,savings,subset=(row.names(savings) != "Japan"))
sumary(lmodj)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.94017135 7.78399682 3.0756 0.003605
pop15 -0.36790149 0.15362956 -2.3947 0.020957
pop75 -0.97367427 1.15545021 -0.8427 0.403965
dpi -0.00047063 0.00091907 -0.5121 0.611162
ddpi 0.33474859 0.19844568 1.6869 0.098710
n = 49, p = 5, Residual SE = 3.73801, R-Squared = 0.28
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
In addition to the above model plot, the parameter sensitivity plot dfbeta is extremely useful for diagnosing the explanatory power of a model.
This gives a strong indication if the conclusions about the role of certain effects depend strongly on the inclusion or exclusion of a single observation.
If our conclusions about an effect depended on a single case, we should be very skeptical about the interpretation of the effect.
Again, the purpose of identifying unusual observations is not to exclude them automatically, but rather to see how they affect our inferences and conclusions.