10/05/2020
A quick refresher on the idea of uncertainty by example:
\[ \begin{align} P\left(\vert T - \theta \vert \geq k \sigma_T\right) \leq \frac{1}{k^2} \end{align} \]
\[ \begin{align} P\left(\vert T - \theta \vert < 2 \sigma_T\right) > \frac{3}{4} \end{align} \]
Suppose the estimate that \( T \) gives us based on some data is \( t=299852.4 \), and that \( \sigma_T = 100 \).
Taking \( k=2 \) above, we can say that \( \theta \in (299652.4, 300052.4) \) with confidence of at least 75%.
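A minimal sketch of this computation in R, using the values above:

t_hat   <- 299852.4   # the observed estimate t
sigma_T <- 100        # standard deviation of the estimator T, as assumed above
k       <- 2          # Chebyshev: P(|T - theta| < k*sigma_T) > 1 - 1/k^2 = 3/4
c(lower = t_hat - k*sigma_T, upper = t_hat + k*sigma_T)   # (299652.4, 300052.4)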
Similarly, we will want to construct confidence intervals, and moreover to use hypothesis testing to determine the significance of our model parameters versus the null hypothesis of random variation with no systematic structure.
Particularly, we will be concerned with two dual questions: whether the explanatory variables, taken together, explain anything beyond random variation, and whether an individual variable is significant in the presence of the others.
The process of hypothesis testing is always defined in terms of a null and an alternative hypothesis.
In regression, we will denote the hypothesis that there is a systematic, statistical relationship as the alternative hypothesis \( H_1 \).
The hypothesis that the observed structure can be explained by random variation will be denoted the null hypothesis \( H_0 \).
Hypothesis testing thus assumes that the null holds, and finds how surprising it would be to see the observed structure in the case it was just due to random variation.
We always choose a pre-set value of significance \( \alpha \) (typically \( 5\% \)) and determine if the probability of finding a value at least as extreme as our observed results is less than \( \alpha \).
If the p-value falls below \( \alpha \) we reject the null hypothesis in favor of the alternative.
Neither rejecting nor failing to reject the null indicates causality or the lack thereof, but both are good indications of points for further investigation.
Suppose we have a set of response variables and explanatory variables.
\[ \begin{align} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align} \]
In addition to the null model of random variation, we will be concerned with whether a particular variable, or combination of variables, is significant in the presence of other variables considered.
The significance level we assign to the test measures “how surprising” it would be to see the observed structure if there were in fact no systematic structure.
We should be careful about how much we read into the meaning of this, given the widespread overuse and misinterpretation of p-values.
To use standard methods for hypothesis testing and confidence intervals, we will now assume additionally Gaussianity, \( \boldsymbol{\epsilon} \sim N(0, \sigma^2 \mathbf{I}) \).
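As a small illustrative sketch (all variable names and parameter values here are hypothetical), we can simulate data satisfying these assumptions and recover the coefficients with lm:

set.seed(1)
n  <- 100
X1 <- runif(n); X2 <- runif(n)     # explanatory variables
beta <- c(1, 2, -3)                # true coefficients (intercept, X1, X2)
eps  <- rnorm(n, sd = 0.5)         # Gaussian errors, epsilon ~ N(0, sigma^2 I)
Y   <- beta[1] + beta[2]*X1 + beta[3]*X2 + eps
fit <- lm(Y ~ X1 + X2)
coef(fit)                          # estimates should lie close to beta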
In principle, we favor solutions to problems that are as simple as possible.
This makes the problem easier to interpret, and our models transparent in their predictions.
Suppose we have a large model \( \boldsymbol{\Omega} \), which abstractly refers to the set of all linear models possible by choices of \( \beta_i \), and their respective uncertainties, over certain variables \( X_1, X_2, \cdots, X_{p-1} \).
Let \( q < p-1 \), and suppose that abstractly \( \boldsymbol{\omega} \) represents a “smaller model”, as found by a strictly smaller set of explanatory variables, \( X_1, X_2, \cdots, X_q \).
We will say that we favor the model \( \boldsymbol{\omega} \) unless \( \boldsymbol{\Omega} \) provides appreciably better results.
A natural measure of the improvement offered by \( \boldsymbol{\Omega} \) is the relative reduction in the residual sum of squares,
\[ \begin{align} \frac{RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}}{RSS_\boldsymbol{\Omega}} \end{align} \]
This quantity is known as a test statistic, and it in fact arises from a ratio of likelihood functions.
Let's recall our Gaussian likelihood function
\[ \begin{align} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert \mathbf{Y} =Y_{1,\cdots,n} ) \end{align} \] representing the likelihood of the parameter vector \( \boldsymbol{\beta} \) and the associated uncertainties with respect to the observed outcomes of the response variable.
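Under the Gaussian assumption above, this likelihood has the explicit form
\[ \begin{align} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert \mathbf{Y} = Y_{1,\cdots,n} ) = \left(2 \pi \sigma^2\right)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \left(\mathbf{Y} - \mathbf{X} \boldsymbol{\beta}\right)^\mathrm{T} \left(\mathbf{Y} - \mathbf{X} \boldsymbol{\beta}\right) \right\} \end{align} \]
so that, for any fixed \( \sigma \), maximizing the likelihood in \( \boldsymbol{\beta} \) is the same as minimizing the residual sum of squares.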
The following,
\[ \begin{align} \max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\Omega}} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert\mathbf{Y} =Y_{1,\cdots,n} ) \end{align} \]
will represent the maximum likelihood attainable over all choices of \( \boldsymbol{\beta} \) and choices of \( \sigma \) in the large model space \( \boldsymbol{\Omega} \).
We can imagine intuitively that,
\[ \begin{align} \frac{ \max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\Omega}} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert \mathbf{Y} =Y_{1,\cdots,n} )}{\max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\omega}} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert\mathbf{Y} =Y_{1,\cdots,n} )} \end{align} \] is a reasonable measure of whether the model over the large model space (including more parameters) is more likely than the model over the smaller model space (with fewer parameters).
If the likelihood ratio statistic is sufficiently large, we can say that the larger model \( \boldsymbol{\Omega} \) explains the data appreciably better than the smaller model \( \boldsymbol{\omega} \).
In this situation, we reject the null hypothesis, i.e., we reject the small model \( \boldsymbol{\omega} \).
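Continuing the hypothetical simulation sketched earlier, this comparison can be made directly in R for two nested fits:

fit_small <- lm(Y ~ X1)            # the smaller model, omega
fit_large <- lm(Y ~ X1 + X2)       # the larger model, Omega

# log of the likelihood ratio, via the maximized log-likelihood of each fit
as.numeric(logLik(fit_large) - logLik(fit_small))

# the related RSS-based statistic introduced above
(deviance(fit_small) - deviance(fit_large)) / deviance(fit_large)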
Let us recall that the above intuition was formalized somewhat in our discussion of ANOVA.
Sums of squares of independent standard normal variables follow a \( \chi^2 \) distribution; indeed, it can be shown that \( RSS_\boldsymbol{\Omega} / \sigma^2 \) has a \( \chi^2_{n-p} \) distribution in \( n-p \) degrees of freedom.
The ratio of two \( \chi^2 \) variables, each scaled by its degrees of freedom, is also a commonly used construct in statistics for comparing the sample variances of different samples – this follows a well-known distribution called the F-distribution.
The random variable \( Z \) has the Fisher–Snedecor (F-distribution) distribution with \( n \) and \( m \) degrees of freedom if
\[ \begin{align} Z = \frac{\chi^2(n)/ n}{\chi^2(m)/m} \end{align} \]
where \( \chi^2(n) \sim \chi^2_n \) and \( \chi^2(m) \sim \chi^2_m \) are independent random variables.
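A quick simulation sketch of this definition (the degrees of freedom are chosen arbitrarily for illustration):

set.seed(2)
n_df <- 5; m_df <- 24
Z <- (rchisq(1e5, n_df)/n_df) / (rchisq(1e5, m_df)/m_df)   # ratio of scaled chi-squares

# empirical quantiles of Z should agree closely with the F quantiles from qf()
quantile(Z, c(0.5, 0.95))
qf(c(0.5, 0.95), n_df, m_df)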
The pdf and the cdf of the F-distribution are somewhat complicated, and we will suppress a direct statement of these.
However, when we explicitly utilize the degrees of freedom of each regression model \( \boldsymbol{\omega} \) and \( \boldsymbol{\Omega} \), we obtain a statistic whose distribution under the null hypothesis is known, and which we can therefore evaluate against its theoretical behavior.
Recall, \( \boldsymbol{\omega} \) uses \( q < p \) parameters, while \( \boldsymbol{\Omega} \) uses \( p \).
We find
\[ \begin{align} F &\triangleq \frac{ \left( RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}\right)/ (p-q)}{RSS_\boldsymbol{\Omega}/(n-p)} \end{align} \]
is an \( F \) statistic, with \( F \) distribution under the null hypothesis.
This is to say, “if the null hypothesis holds (such that the smaller model is favorable), then \( F \sim F_{(p-q),(n-p)} \).”
We will thus study how surprising this value is or not, based on the assumption that \( F \) is drawn from the \( F_{(p-q),(n-p)} \) distribution.
[Figure: the F-distribution. Courtesy of IkamusumeFan, CC BY-SA 4.0.]
[Figure courtesy of Faraway, J., Linear Models with R, 2nd Edition.]
Let's consider the null hypothesis that there is no structure whatsoever between the response variable and the explanatory variables.
That is, we suppose the relationship looks like \[ \begin{align} \mathbf{Y} = \mu_Y \mathbf{1} + \boldsymbol{\epsilon} \end{align} \]
The null hypothesis is thus, \( H_0 : \beta_i = 0 \) for \( i=1,\cdots,p-1 \).
Q: what is \( RSS_\boldsymbol{\omega} \) in this case?
A: this is the sum of squared differences between the predicted value (always the empirical sample-based mean) and the observed values, i.e.,
\[ \begin{align} \boldsymbol{\epsilon}_\boldsymbol{\omega}^\mathrm{T} \boldsymbol{\epsilon}_\boldsymbol{\omega} &= \left(\mathbf{Y} - \overline{\mathbf{Y}} \right)^\mathrm{T}\left(\mathbf{Y} - \overline{\mathbf{Y}}\right) \\ &=TSS \end{align} \]
Let's consider the gala data from the library('faraway') package.

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.3690 0.7153508
Area -0.023938 0.022422 -1.0676 0.2963180
Elevation 0.319465 0.053663 5.9532 3.823e-06
Nearest 0.009144 1.054136 0.0087 0.9931506
Scruz -0.240524 0.215402 -1.1166 0.2752082
Adjacent -0.074805 0.017700 -4.2262 0.0002971
n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
nullmod <- lm(Species ~ 1, gala)
sumary(nullmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.233 20.929 4.0725 0.0003285
n = 30, p = 1, Residual SE = 114.63305, R-Squared = 0
mean(gala$Species)
[1] 85.23333
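As a direct check of the earlier claim that \( RSS_\boldsymbol{\omega} = TSS \) for the null model, the residual sum of squares of nullmod is just the sum of squared deviations of Species from this mean:

sum((gala$Species - mean(gala$Species))^2)   # equals deviance(nullmod), i.e., rss0 below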
Explicitly, the F-test can be computed as follows
(rss0 <- deviance(nullmod))
[1] 381081.4
(rss <- deviance(lmod))
[1] 89231.37
(df0 <- df.residual(nullmod))
[1] 29
(df <- df.residual(lmod))
[1] 24
(fstat <- ((rss0-rss)/(df0-df))/(rss/df))
[1] 15.69941
The numerator and denominator degrees of freedom are df0 - df and df, respectively, so the p-value is computed as

1-pf(fstat, df0-df, df)
[1] 6.837893e-07
The function pf above evaluates the F-distribution's CDF at the value fstat with respect to the degrees of freedom parameters.
\[ \begin{align} 1 - P(Z \leq \text{fstat}) = P(Z > \text{fstat}) \end{align} \]
The probability of observing a value at least this large under the null is approximately zero, on the order of \( 10^{-7} \).
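Equivalently, the upper tail probability can be requested directly from pf, which avoids the subtraction from one:

pf(fstat, df0-df, df, lower.tail = FALSE)    # same p-value as above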
anova(nullmod, lmod)
Analysis of Variance Table
Model 1: Species ~ 1
Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
Res.Df RSS Df Sum of Sq F Pr(>F)
1 29 381081
2 24 89231 5 291850 15.699 6.838e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We tentatively reject the null hypothesis and conclude that, at \( 5\% \) significance, at least some subset of the explanatory variables has predictive power.
This does not say which one, or if a combination of the explanatory variables gives the predictive power.
summary(lmod)
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
Residuals:
Min 1Q Median 3Q Max
-111.679 -34.898 -7.862 33.460 182.584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.369 0.715351
Area -0.023938 0.022422 -1.068 0.296318
Elevation 0.319465 0.053663 5.953 3.82e-06 ***
Nearest 0.009144 1.054136 0.009 0.993151
Scruz -0.240524 0.215402 -1.117 0.275208
Adjacent -0.074805 0.017700 -4.226 0.000297 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
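As a sketch of testing a single variable in the presence of the others, we can drop, say, Area from the full model and compare the nested fits with the same F-test; for a one-parameter comparison this is equivalent to the corresponding t-test in the summary above, with \( F = t^2 \):

lmod_small <- lm(Species ~ Elevation + Nearest + Scruz + Adjacent, gala)
anova(lmod_small, lmod)   # partial F-test for Area given the other variables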