10/19/2020
Estimation and inference with the linear model rely on several assumptions, e.g., the conditions of the Gauss-Markov theorem and, usually, Gaussianity of the errors.
Methods for checking and validating these assumptions are known as diagnostics.
Typically, we will start with one model as a best first guess.
Performing diagnostics will reveal issues in the model, and suggest ways for improvement.
Building a model is thus usually an interactive, iterative process, where we will create and perform diagnostics over a succession of models.
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition
Courtesy of Inductiveload via Wikimedia Commons
Given two probability measures \( P_1,P_2 \), and their respective CDFs \( C_1,C_2 \), we can define their theoretical Q-Q plot as the graph of the \( \mathbb{R}^2 \) valued function,
\[ \begin{align} G: [0,1] &\rightarrow \mathbb{R}^2 \\ p &\mapsto \left(C_1^{-1}(p), C_2^{-1}(p)\right) \end{align} \]
Q: in the above, what does the point \( (x_1,x_2) \) correspond to?
Q: What kind of shape do we expect for the plot when the two CDFs are equal?
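Suppose that \( P_1,P_2 \) belong to the same family of probability measures, such that their CDFs differ only by location and scale, i.e.,
\[ \begin{align} C_2(x) = C_1\left(\frac{x - \mu}{\sigma}\right), \quad \sigma > 0, \end{align} \]
so that the quantile functions satisfy \( C_2^{-1}(p) = \mu + \sigma C_1^{-1}(p) \).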
Therefore, if two distributions differ only in location and scale,
the Q-Q plot is just a straight line with slope \( \sigma \) and intercept \( \mu \).
For this reason, when making a Q-Q plot, the CDFs (and/or the data) are typically standardized to have mean zero and variance one.
By doing so, when comparing two distributions in the same family (as above), the Q-Q plot will just be the central diagonal of the plane.
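The straight-line behavior for a location-scale family can be checked numerically. The following is a minimal sketch, assuming the Gaussian family and illustrative values `mu = 2` and `sigma = 3` that are not part of the original example; fitting a line through the theoretical Q-Q points should recover the intercept and slope.

```r
# Probabilities strictly inside (0, 1) so that all quantiles are finite.
p <- seq(0.01, 0.99, by = 0.01)

# Assumed illustrative values for the location and scale.
mu <- 2
sigma <- 3

q1 <- qnorm(p)                         # quantiles of the standard member
q2 <- qnorm(p, mean = mu, sd = sigma)  # quantiles of the shifted, scaled member

# The theoretical Q-Q plot is the set of points (q1, q2).
plot(q1, q2, type = "l", xlab = "Quantiles of P1", ylab = "Quantiles of P2")

# A line fitted through these points has intercept mu and slope sigma.
line_fit <- coef(lm(q2 ~ q1))
print(line_fit)
```

Standardizing the second distribution (subtracting `mu` and dividing by `sigma`) would move this line onto the diagonal.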
With a sample of size \( n \), we can make \( n \) plotting positions that pair the inverse of the theoretical CDF with the inverse of the empirical CDF. Writing \( x_{(1)} \leq \cdots \leq x_{(n)} \) for the sorted observations, we can plot the points,
\[ \begin{align} \left\{\left(C_1^{-1}\left(\frac{i}{n+1}\right), x_{(i)}\right) \in \mathbb{R}^2: i = 1,\cdots,n \right\} \end{align} \]
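The plotting positions above can be built by hand. This is a minimal sketch, assuming a simulated Gaussian sample in place of real data and a standard normal as the theoretical distribution; the variable names are illustrative only.

```r
set.seed(1)
n <- 50
x <- rnorm(n)  # simulated sample standing in for real data

# Plotting positions i / (n + 1) and the matching theoretical quantiles.
p <- (1:n) / (n + 1)
theoretical <- qnorm(p)

# The i-th smallest observation is paired with the i-th theoretical quantile.
empirical <- sort(x)

plot(theoretical, empirical,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)  # reference diagonal for standardized data
```

The built-in `qqnorm()` draws the same kind of plot, but it chooses its plotting positions with `ppoints()`, so its points can differ slightly from the `i / (n + 1)` positions used here.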
If we consider the hypothesis test
\[ \begin{align} H_0: &C_1 = C_2 \\ H_1: &C_1 \neq C_2 \end{align} \]
then comparing the empirical and theoretical CDFs in the Q-Q plot is closely related to the Kolmogorov-Smirnov test.
The Kolmogorov-Smirnov test follows a similar principle and can be used generally to evaluate the divergence of the empirical CDF of a sample of observations from a hypothesized CDF.
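As a rough illustration of the Kolmogorov-Smirnov test, base R provides `ks.test()`. The sketch below assumes two simulated samples (one Gaussian, one exponential) and compares each against a fully specified standard normal CDF; none of these values come from the savings data.

```r
set.seed(42)
x_norm <- rnorm(200)  # sample that really is standard normal
x_exp <- rexp(200)    # sample that is clearly not standard normal

# Compare each empirical CDF with the hypothesized standard normal CDF.
ks_norm <- ks.test(x_norm, "pnorm")
ks_exp <- ks.test(x_exp, "pnorm")

# The statistic is the largest gap between the empirical and hypothesized CDFs.
print(c(normal = unname(ks_norm$statistic), exponential = unname(ks_exp$statistic)))
print(c(normal = ks_norm$p.value, exponential = ks_exp$p.value))
```

Note that the usual p-value presumes the hypothesized CDF is fixed in advance; if its mean and variance are estimated from the same sample, the reported p-value is no longer exact.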
library("faraway")
str(savings)
'data.frame': 50 obs. of 5 variables:
$ sr : num 11.43 12.07 13.17 5.75 12.88 ...
$ pop15: num 29.4 23.3 23.8 41.9 42.2 ...
$ pop75: num 2.87 4.41 4.43 1.67 0.83 2.85 1.34 0.67 1.06 1.14 ...
$ dpi : num 2330 1508 2108 189 728 ...
$ ddpi : num 2.87 3.93 3.82 0.22 4.56 2.43 2.67 6.51 3.08 2.8 ...
sr - is the savings rate, calculated as personal savings divided by disposable income;
pop15 - is the percent of the countries' populations under age 15;
pop75 - is the percent of the countries' populations over age 75;
dpi - is the per-capita disposable income in dollars;
ddpi - is the percent growth rate of dpi.
Recall that for the response variable we have 50 observations, each a sample average of a country's savings rate taken over the 1960s.
Q: Regarding a key theorem, what does this suggest about the distribution of the response variable?
A: Sample averages tend to be normal by the central limit theorem, so this supports the idea of Gaussian error distributions.
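The effect of averaging can be seen in a small simulation. This is only a sketch under assumed settings (exponential draws, averages over 30 values, a hand-written `skewness` helper); it is not derived from the savings data.

```r
set.seed(10)
n_means <- 1000  # number of simulated observations
m <- 30          # number of raw values behind each average

raw <- rexp(n_means)                           # strongly right-skewed draws
averages <- replicate(n_means, mean(rexp(m)))  # averages of m such draws

# Hypothetical helper: sample skewness, which is near zero for Gaussian data.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
print(c(raw = skewness(raw), averages = skewness(averages)))

# Q-Q plots of the standardized values, side by side.
par(mfrow = c(1, 2))
qqnorm(scale(raw), main = "Raw draws"); qqline(scale(raw))
qqnorm(scale(averages), main = "Sample averages"); qqline(scale(averages))
par(mfrow = c(1, 1))
```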
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(savings$sr),ylab="Savings rate", main="", cex=3, cex.lab=3, cex.axis=1.5)
qqline(scale(savings$sr))
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(savings$pop15),ylab="Percent population under 15", main="", cex=3, cex.lab=3, cex.axis=1.5)
qqline(scale(savings$pop15))
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(savings$pop75),ylab="Percent population over 75", main="", cex=3, cex.lab=3, cex.axis=1.5)
qqline(scale(savings$pop75))
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(savings$dpi),ylab="Per capita disposable income", main="", cex=3, cex.lab=3, cex.axis=1.5)
qqline(scale(savings$dpi))
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(savings$ddpi),ylab="Percent growth rate of dpi", main="", cex=3, cex.lab=3, cex.axis=1.5)
qqline(scale(savings$ddpi))
lmod <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
summary(lmod)
Call:
lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
Residuals:
Min 1Q Median 3Q Max
-8.2422 -2.6857 -0.2488 2.4280 9.7509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
pop15 -0.4611931 0.1446422 -3.189 0.002603 **
pop75 -1.6914977 1.0835989 -1.561 0.125530
dpi -0.0003369 0.0009311 -0.362 0.719173
ddpi 0.4096949 0.1961971 2.088 0.042471 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.803 on 45 degrees of freedom
Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
qqnorm(scale(residuals(lmod)),ylab="Residuals", main="", cex=3, cex.lab=3, cex.axis=1.5)
qqline(scale(residuals(lmod)))
Q: can you conjecture if we would accept the residuals as following a Gaussian distribution based on the above plot?
Here, the residuals appear to be sufficiently Gaussian.
We can numerically test the Gaussianity of the residuals with the Shapiro-Wilk test, validating our Q-Q plot.
Given a sample \( \left\{X_i\right\}_{i=1}^n \), the Shapiro-Wilk test computes a statistic \( W \) and a p-value, i.e., how likely it would be to observe a statistic at least this extreme if the null hypothesis were true, for the hypothesis test:
\[ \begin{align} H_0: &X_i \sim N\left(\mu_X,\sigma^2_X\right)\\ H_1: &X_i \hspace{2mm}\text{not Gaussian distributed} \end{align} \]
The derivation goes beyond the scope of this discussion, and we compute this in R simply as:
shapiro.test(residuals(lmod))
Shapiro-Wilk normality test
data: residuals(lmod)
W = 0.98698, p-value = 0.8524
Q: does this support the notion that the residuals are Gaussian?
Note: without examining a Q-Q plot beforehand, we can't diagnose what possible issues may exist from the p-value alone.
Generally, a Q-Q plot will give insight into the structure that might not otherwise be detected.
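To see why the p-value alone is not diagnostic, consider the following sketch with two simulated samples, one skewed and one long-tailed (both are assumed examples, not the savings residuals). The test typically rejects for both, but only the Q-Q plots show how each departs from the Gaussian.

```r
set.seed(7)
n <- 100
skewed <- rexp(n)             # right-skewed sample
long_tailed <- rt(n, df = 2)  # symmetric sample with long tails

# Shapiro-Wilk p-values: small values are evidence against Gaussianity,
# but they do not say what kind of departure is present.
p_skewed <- shapiro.test(skewed)$p.value
p_long <- shapiro.test(long_tailed)$p.value
print(c(skewed = p_skewed, long_tailed = p_long))

# The Q-Q plots separate the two cases: a curved pattern for skewness,
# versus both ends bending away from the line for long tails.
par(mfrow = c(1, 2))
qqnorm(scale(skewed), main = "Skewed"); qqline(scale(skewed))
qqnorm(scale(long_tailed), main = "Long-tailed"); qqline(scale(long_tailed))
par(mfrow = c(1, 1))
```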
We note that Gaussianity of the error wasn't required by the Gauss-Markov theorem; rather, this guaranteed that least-squares was the
Additionally, our confidence intervals and hypothesis tests utilized the Gaussian assumption on the error.
Without Gaussianity of the errors, least-squares is still the best linear unbiased estimator, but we may find that a linear model in itself is not appropriate or that a biased estimator may perform better.
However, we can sometimes make do “OK” with slightly inaccurate uncertainty quantification, if the sample sizes are sufficiently large…
Particularly, the hypothesis testing and confidence intervals we have developed can be understood as good approximations due to the Central Limit Theorem.
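One way to get a feel for this approximation is a coverage simulation. The sketch below assumes a simple regression with skewed, mean-zero errors (centered exponential), a sample size of 200, and a true slope of 2; all of these settings are illustrative choices. It estimates how often the usual 95% confidence interval for the slope contains the true value.

```r
set.seed(123)
n <- 200      # observations per simulated data set
reps <- 1000  # number of simulated data sets
beta1 <- 2    # assumed true slope

covered <- logical(reps)
for (r in seq_len(reps)) {
  x <- runif(n)
  eps <- rexp(n) - 1  # skewed errors with mean zero
  y <- 1 + beta1 * x + eps
  ci <- confint(lm(y ~ x))["x", ]  # usual t-based 95% interval for the slope
  covered[r] <- ci[1] <= beta1 && beta1 <= ci[2]
}

# Fraction of intervals containing the true slope; close to 0.95 indicates
# that the Gaussian-based interval remains a good approximation.
coverage <- mean(covered)
print(coverage)
```

Rerunning with a much smaller `n` is a simple way to explore how the approximation degrades.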
Generally, when facing a non-Gaussian error distribution the solution will depend on the types of issues detected.
For a large number of observations, we can usually ignore issues of non-Gaussianity for uncertainty quantification due to the Central Limit Theorem.
This is also often the case for short-tailed distributions.
For skewed distributions, a nonlinear transformation of the response may alleviate the issue.
For long-tailed distributions, we may need to use robust regression, which we will not cover in this class but is discussed in the course reference books.
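For the skewed case mentioned above, the following sketch illustrates how a transformation of the response can help. It assumes simulated data in which the response is log-normal around a linear trend, so that a log transformation is the appropriate choice; with real data the right transformation has to be found through diagnostics.

```r
set.seed(2020)
n <- 200
x <- runif(n, 0, 2)
# Assumed data-generating process: multiplicative, right-skewed noise.
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(y ~ x)       # model on the original scale
fit_log <- lm(log(y) ~ x)  # model on the log scale

# Shapiro-Wilk statistic W for each set of residuals; values nearer 1
# indicate residuals that look more Gaussian.
w_raw <- unname(shapiro.test(residuals(fit_raw))$statistic)
w_log <- unname(shapiro.test(residuals(fit_log))$statistic)
print(c(raw = w_raw, log = w_log))

# Q-Q plots of the standardized residuals before and after transforming.
par(mfrow = c(1, 2))
qqnorm(scale(residuals(fit_raw)), main = "Response y"); qqline(scale(residuals(fit_raw)))
qqnorm(scale(residuals(fit_log)), main = "Response log(y)"); qqline(scale(residuals(fit_log)))
par(mfrow = c(1, 1))
```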