10/05/2020


FAIR USE ACT DISCLAIMER:

This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

- The following topics will be covered in this lecture:
  - A review of hypothesis testing
  - The F-test for model selection versus the null
  - Evaluating the F-test in R

A quick refresher on the idea of uncertainty by example:

- Suppose \( T \) is a linear unbiased estimator for the speed of light \( \theta \).
- Like our parameters \( \beta \), we will assume that \( \theta \) is a deterministic but unknown constant.
- For the sake of example, also suppose that \( T \) has standard deviation \( \sigma_T = 100 \) km/sec.
- Recall Chebyshev’s inequality,

\[ \begin{align} P\left(\vert T - \theta \vert \geq k \sigma_T\right) \leq \frac{1}{k^2} \end{align} \]

- We find

\[ \begin{align} P\left(\vert T - \theta \vert < 2 \sigma_T\right) > \frac{3}{4} \end{align} \]

- This tells us that there is a probability of at least 75% that \( T \) is within 200 km/sec of the speed of light \( \theta \).
- Equivalently, \( \theta \in (T-200, T+200) \) with probability at least 75%.

Suppose the estimate that \( T \) gives us, based on some data, is \( t=299852.4 \).

We can say that \( \theta \in (299652.4, 300052.4) \) with **confidence** 75%.

- Note that \( \theta \) is an unknown constant – it is either in the interval or not, and there is nothing random about the above statement.
- Therefore, we can't say that the probability of \( \theta \in (299652.4, 300052.4) \) is 75%, but we used the information available to guarantee that our procedure for estimation will work at least 75% of the time.
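As a rough numerical illustration of the argument above, the following R sketch evaluates the Chebyshev bound for \( k = 2 \) and builds the corresponding interval around the observed value. The helper name `chebyshev_coverage` is hypothetical. The simulation also assumes a Gaussian \( T \) centred on a stand-in value for \( \theta \); neither assumption is part of the derivation itself.

```r
# Chebyshev: P(|T - theta| < k * sigma_T) >= 1 - 1/k^2
chebyshev_coverage <- function(k) 1 - 1 / k^2   # hypothetical helper name

sigma_T <- 100        # km/sec, from the example
k       <- 2
t_obs   <- 299852.4   # observed estimate from the example

conf_level <- chebyshev_coverage(k)
interval   <- c(lower = t_obs - k * sigma_T, upper = t_obs + k * sigma_T)
conf_level
interval

# Simulation under an ASSUMED Gaussian estimator centred on a stand-in theta.
# Chebyshev only guarantees coverage of at least 0.75; a Gaussian does better.
set.seed(1)
theta_assumed <- 299792.458
T_draws  <- rnorm(10000, mean = theta_assumed, sd = sigma_T)
coverage <- mean(abs(T_draws - theta_assumed) < k * sigma_T)
coverage
```

The simulated coverage refers to the procedure over repeated samples, not to any single realized interval, which is the distinction drawn in the notes above.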

Similarly, we will want to fit confidence intervals, and moreover to use hypothesis testing to determine the significance of our model parameters versus the null hypothesis of random variation and no systematic structure.

Particularly, we will be concerned with dual questions:

- Does our confidence interval for a parameter \( \beta_i \) contain the value \( 0 \)? And
- how unlikely would it be for \( \beta_i \) to equal zero based on our observations?

The process of hypothesis testing is always defined in terms of a null and an alternative hypothesis.

In regression, we will call the hypothesis that there is a systematic, statistical relationship the alternative hypothesis \( H_1 \).

The hypothesis that the observed structure can be explained by random variation will be denoted the null hypothesis \( H_0 \).

Hypothesis testing thus assumes that the null holds, and finds how surprising it would be to see the observed structure in the case it was just due to random variation.

We always choose a pre-set value of significance \( \alpha \) (typically \( 5\% \)) and determine if the probability, under the null hypothesis, of finding a value at least as extreme as our observed results is less than \( \alpha \).

- This probability is known as the p-value.

If the p-value falls below \( \alpha \) we reject the null hypothesis in favor of the alternative.

- On the other hand, if the p-value is greater, then we fail to reject the null model.

Neither outcome indicates causality or the lack thereof, but both are good indications of points for further investigation.

- Examining p-values can also provide corroborating evidence of the explanatory or predictive power of a model when used in conjunction with more robust forms of model selection.

Suppose we have a set of response variables and explanatory variables.

- We want to see if the response variables have a systematic relationship with the explanatory variables, or if this can be reduced to random variation.
- Our null hypothesis \( H_0 \) is that there is no systematic structure, i.e., \[ \begin{align} \mathbf{Y} = \mu_Y \mathbf{1} + \boldsymbol{\epsilon} \end{align} \]
- Our alternative hypothesis is that there does exist some (unknown) set of values \( \boldsymbol{\beta} \) for which

\[ \begin{align} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align} \]

- Hypothesis testing is a systematic methodology to determine the **significance** of events.
- **Note:** this does not imply causation – rather, the significance can be thought of as

**“how surprised would we be if this was just random variation?”**

In addition to the null model of random variation, we will be concerned with whether a particular variable, or combination of variables, is significant in the presence of other variables considered.

The significance level we assign to the test is the measure of “how surprising” it would be to find out there was no structure.

- For example, a significance value of \( 5\% \) indicates that the observed outcomes would only be attributable to random variation in about \( \frac{1}{20} \) trials.
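To make the “about 1 in 20” reading concrete, here is a small simulation sketch in which the null hypothesis holds by construction, since the response is pure noise unrelated to the explanatory variable. The sample size, number of replicates and seed are arbitrary choices made for illustration only.

```r
set.seed(42)
alpha <- 0.05
n_obs <- 30     # observations per simulated data set (arbitrary)
n_rep <- 1000   # number of simulated data sets (arbitrary)

# Each replicate: y is generated independently of x, so the null is true.
p_values <- replicate(n_rep, {
  x   <- rnorm(n_obs)
  y   <- rnorm(n_obs)
  fit <- lm(y ~ x)
  f   <- summary(fit)$fstatistic   # named vector: value, numdf, dendf
  pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
})

# Fraction of data sets where we would (wrongly) reject the null at level alpha
false_rejection_rate <- mean(p_values < alpha)
false_rejection_rate
```

The rejection rate should land near \( \alpha \), up to simulation noise: even with no structure at all, roughly one test in twenty looks “significant” at the \( 5\% \) level.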

We should be careful about how much we can read into the meaning of this due to the saturation of p-values.

- We will see more robust metrics for model selection later in the course, but because of how widely used p-values are in the literature, we need to understand them.

To use standard methods for hypothesis testing and confidence intervals,

**we will now assume additionally Gaussianity**, \( \boldsymbol{\epsilon} \sim N(0, \sigma^2 \mathbf{I}) \).

In principle, we favor solutions to problems that are as simple as possible.

- Occam's Razor is the philosophical principle that,

**“When presented with competing hypotheses to solve a problem, one should select the solution with the fewest assumptions.”**

This makes the problem easier to interpret, and our models transparent in their predictions.

Suppose we have a large model \( \boldsymbol{\Omega} \), which abstractly refers to the set of all linear models possible by choices of \( \beta_i \), and their respective uncertainties, over certain variables \( X_1, X_2, \cdots, X_{p-1} \).

Let \( q < p-1 \), and suppose that abstractly \( \boldsymbol{\omega} \) represents a “smaller model”, as found by a strictly smaller set of explanatory variables, \( X_1, X_2, \cdots, X_q \).

We will say that we favor the model \( \boldsymbol{\omega} \) unless \( \boldsymbol{\Omega} \) provides appreciably better results.

- For example, we may consider that if \( RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega} \) is small, then we favor the former, small model.
- With an additional scaling factor, we can use this principle directly as a test statistic for the null hypothesis, i.e.,

\[ \begin{align} \frac{RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}}{RSS_\boldsymbol{\Omega}} \end{align} \]

The quantity defined,

\[ \begin{align} \frac{RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}}{RSS_\boldsymbol{\Omega}} \end{align} \]

is known as a test-statistic, which is actually defined in terms of the ratio of likelihood functions.

Let's recall our Gaussian likelihood function

\[ \begin{align} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert \mathbf{Y} =Y_{1,\cdots,n} ) \end{align} \] representing the likelihood of the parameter vector \( \beta \) and the associated uncertainties with respect to the observed outcomes of the response variable.

The following,

\[ \begin{align} \max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\Omega}} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert\mathbf{Y} =Y_{1,\cdots,n} ) \end{align} \]

will represent the **maximum likelihood attainable** over all choices of \( \boldsymbol{\beta} \) and choices of \( \sigma \) in the large model space \( \boldsymbol{\Omega} \).

We can imagine intuitively that,

\[ \begin{align} \frac{ \max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\Omega}} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert \mathbf{Y} =Y_{1,\cdots,n} )}{\max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\omega}} \mathcal{L} (\boldsymbol{\beta}, \sigma \vert\mathbf{Y} =Y_{1,\cdots,n} )} \end{align} \] is a reasonable measure of whether the model over the large model space (including more parameters) is more likely than the model over the smaller model space (with fewer parameters).
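To connect this ratio with the residual sums of squares, a short derivation may help. It is a sketch under the Gaussian assumption above, writing \( n \) for the number of observations and using the standard fact that, within each model, the likelihood is maximized at the least squares fit with variance estimate \( \hat{\sigma}^2 = RSS / n \). The maximized likelihood of a model is then

\[ \begin{align} \max_{\boldsymbol{\beta}, \sigma} \mathcal{L} = \left(2 \pi \hat{\sigma}^2\right)^{-n/2} \exp\left(-\frac{RSS}{2 \hat{\sigma}^2}\right) = \left(\frac{2 \pi \, RSS}{n}\right)^{-n/2} e^{-n/2}, \end{align} \]

so that the constants cancel in the ratio and

\[ \begin{align} \frac{ \max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\Omega}} \mathcal{L} }{ \max_{\boldsymbol{\beta}, \sigma \in \boldsymbol{\omega}} \mathcal{L} } = \left( \frac{RSS_\boldsymbol{\omega}}{RSS_\boldsymbol{\Omega}} \right)^{n/2} = \left( 1 + \frac{RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}}{RSS_\boldsymbol{\Omega}} \right)^{n/2}. \end{align} \]

The likelihood ratio is therefore large exactly when \( \left(RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}\right) / RSS_\boldsymbol{\Omega} \) is large, which is why the latter can serve as the test statistic.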

If the likelihood ratio statistic is sufficiently large, we can say that:

- “it would be very surprising that the high likelihood of the larger model versus the low likelihood of the small model is just due to random variation.”

In the above situation, we reject the null hypothesis, i.e., we reject the small model \( \boldsymbol{\omega} \).
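Under Gaussian errors, with the variance estimated by maximum likelihood, the log of the likelihood ratio equals \( \frac{n}{2} \log \left( RSS_\boldsymbol{\omega} / RSS_\boldsymbol{\Omega} \right) \). The sketch below checks this numerically. It uses the built-in `mtcars` data purely as a stand-in, and the choice of variables is an arbitrary assumption for illustration.

```r
# Built-in data used only for illustration (not the lecture's data set).
small <- lm(mpg ~ 1, data = mtcars)         # plays the role of omega
big   <- lm(mpg ~ wt + hp, data = mtcars)   # plays the role of Omega

n         <- nrow(mtcars)
rss_small <- deviance(small)   # residual sum of squares of the small model
rss_big   <- deviance(big)     # residual sum of squares of the large model

# Log likelihood ratio from the maximized log-likelihoods ...
log_lr_loglik <- as.numeric(logLik(big)) - as.numeric(logLik(small))
# ... and the same quantity from the residual sums of squares alone
log_lr_rss <- (n / 2) * log(rss_small / rss_big)

all.equal(log_lr_loglik, log_lr_rss)
```

Since the two quantities agree, comparing likelihoods and comparing residual sums of squares order the models in the same way.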

Let us recall, the above intuition was formalized somewhat in our discussion of ANOVA.

- Particularly, we saw the \( RSS \) divided by the number of degrees of freedom as a “mean-square”, which has a known expected value as \( \sigma^2 \).

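Sums of squares of independent standard normal variables follow a \( \chi^2 \) distribution. This suggests, and it can be shown, that under Gaussian errors the scaled residual sum of squares \( RSS / \sigma^2 \) has a \( \chi^2_{n-p} \) distribution with \( n-p \) degrees of freedom, so that the mean square above is a scaled \( \chi^2 \) variable.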
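A quick simulation can make the \( \chi^2_{n-p} \) claim plausible. In this sketch the design matrix, coefficients, error standard deviation and seed are all invented for illustration. It compares the simulated mean and variance of \( RSS / \sigma^2 \) with the theoretical values \( n-p \) and \( 2(n-p) \) of a \( \chi^2_{n-p} \) variable.

```r
set.seed(123)
n <- 20; p <- 3; sigma <- 2                           # hypothetical sizes
X    <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1)) # fixed design with intercept
beta <- c(1, -2, 0.5)                                 # hypothetical coefficients

# Repeatedly draw Gaussian errors, refit by least squares, and record RSS / sigma^2
scaled_rss <- replicate(2000, {
  y   <- drop(X %*% beta) + rnorm(n, sd = sigma)
  fit <- lm.fit(X, y)
  sum(fit$residuals^2) / sigma^2
})

mean(scaled_rss)   # theoretical value: n - p = 17
var(scaled_rss)    # theoretical value: 2 * (n - p) = 34
```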

The ratio of two independent \( \chi^2 \) variables, each divided by its degrees of freedom, is also a commonly used construct in statistics to compare the sample variances of different samples – this follows a well-known distribution called the F-distribution.

The random variable **\( Z \) has the Fisher–Snedecor distribution (F-distribution) with \( n \) and \( m \) degrees of freedom** if

\[ \begin{align} Z = \frac{\chi^2(n)/ n}{\chi^2(m)/m} \end{align} \]

where \( \chi^2(n) \sim \chi^2_n \) and \( \chi^2(m) \sim \chi^2_m \) are independent random variables.
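The definition can be checked by simulation: build \( Z \) from two independent \( \chi^2 \) draws and compare it with R's built-in F distribution functions. The degrees of freedom, sample size and seed below are arbitrary choices made for this sketch.

```r
set.seed(2020)
n_df  <- 5      # numerator degrees of freedom (arbitrary)
m_df  <- 24     # denominator degrees of freedom (arbitrary)
n_sim <- 1e5

# Z built directly from the definition: ratio of scaled, independent chi-squares
z <- (rchisq(n_sim, df = n_df) / n_df) / (rchisq(n_sim, df = m_df) / m_df)

# Theoretical 95% quantile of F with (n_df, m_df) degrees of freedom
q95 <- qf(0.95, df1 = n_df, df2 = m_df)

# Fraction of simulated Z values at or below that quantile; should be near 0.95
empirical_cdf_at_q95 <- mean(z <= q95)
empirical_cdf_at_q95
```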

The pdf and the cdf of the F-distribution become especially complicated and we will suppress a direct statement of these.

However, when we explicitly utilize the degrees of freedom for each regression model \( \omega \) and \( \Omega \), we get a statement we can evaluate based on the theoretical expected values.

Recall, \( \boldsymbol{\omega} \) uses \( q < p \) parameters, while \( \boldsymbol{\Omega} \) uses \( p \).

We find

\[ \begin{align} F &\triangleq \frac{ \left( RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}\right)/ (p-q)}{RSS_\boldsymbol{\Omega}/(n-p)} \end{align} \]

is an \( F \) statistic, with \( F \) distribution under the null hypothesis.

This is to say, “if the null hypothesis holds (such that the smaller model is favorable), then \( F \sim F_{(p-q),(n-p)} \).”

We will thus study how surprising the observed value is, based on the assumption that \( F \) is drawn from the \( F_{(p-q),(n-p)} \) distribution.

Courtesy of IkamusumeFan CC BY-SA 4.0

- The hypothesis testing procedure thus follows this idea:
- Let us say (for the sake of example) we want to choose a model with \( \alpha =5\% \) significance.
- We will look at the appropriate \( F \) distribution and find the critical value \( F^\alpha_{ (p-q), (n-p) } \) such that the probability of \( F \geq F^\alpha_{(p-q),(n-p)} \), **given** \( F \sim F_{(p-q),(n-p)} \), is equal to \( \alpha = 5 \% \).

- If our observed \[ \begin{align} F &\triangleq \frac{ \left( RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega}\right)/ (p-q)}{RSS_\boldsymbol{\Omega}/(n-p)} \geq F_{(p-q),(n-p)}^{\alpha}, \end{align} \] we find this to be an unlikely outcome under the null hypothesis (due to random variation).
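In R, the critical value \( F^\alpha_{(p-q),(n-p)} \) can be obtained from the quantile function `qf`. The sketch below borrows the sizes \( n = 30 \), \( p = 6 \), \( q = 1 \) and the observed statistic from the `gala` example worked later in this lecture. The variable names are illustrative only.

```r
alpha <- 0.05
n <- 30; p <- 6; q <- 1    # sizes from the gala example in this lecture
f_observed <- 15.69941     # F statistic reported for that example

# Critical value: the point with upper-tail probability alpha under F_{p-q, n-p}
f_critical <- qf(1 - alpha, df1 = p - q, df2 = n - p)
f_critical

# Decision rule: reject the smaller model if the observed F reaches the critical value
reject_small_model <- f_observed >= f_critical
reject_small_model
```

Comparing the statistic with the critical value is equivalent to comparing the p-value with \( \alpha \); the two formulations always lead to the same decision.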

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

- When we have **Gaussian errors, \( \boldsymbol{\epsilon}\sim N(0, \sigma^2 \mathbf{I}) \)**,
- if we have a “large model” and a smaller, simpler version of the model,
- where the smaller version of the model is defined as a sub-combination (subspace) of the larger model,
- we can express the difference in the models as \( RSS_\boldsymbol{\omega} - RSS_\boldsymbol{\Omega} \).
- When this difference is “small” relative to various considerations, we should reject the large model for simplicity.

- However, when there is an appreciable difference in the results, i.e.,
- with significance in the F-test (we would be very surprised that this result was due to random variation)
- we reject the smaller model.
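The summary above applies to any nested pair of models, not only to a comparison against the null model. As a minimal sketch, the hypothetical helper `f_test_nested` below computes the F statistic and p-value directly from the residual sums of squares and residual degrees of freedom, and the result is compared with R's `anova`. The built-in `mtcars` data and the chosen variables are stand-ins used only for illustration.

```r
# Hypothetical helper: F-test for a nested pair of lm fits, from RSS and df alone.
f_test_nested <- function(small, big) {
  rss_small <- deviance(small);    rss_big <- deviance(big)
  df_small  <- df.residual(small); df_big  <- df.residual(big)
  f_stat  <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
  p_value <- pf(f_stat, df_small - df_big, df_big, lower.tail = FALSE)
  c(F = f_stat, df1 = df_small - df_big, df2 = df_big, p_value = p_value)
}

# Illustrative nested pair: the small model uses a subset of the big model's terms.
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)

result    <- f_test_nested(small, big)
reference <- anova(small, big)   # R's built-in comparison of the same two fits
result
reference
```

The helper only makes sense when the small model really is nested in the large one and both are fit to the same observations. It does not check either condition.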

Let's consider the null hypothesis that there is no structure whatsoever between the response variables and the explanatory variables.

That is, we suppose the relationship looks like \[ \begin{align} \mathbf{Y} = \mu_Y \mathbf{1} + \boldsymbol{\epsilon} \end{align} \]

The null hypothesis is thus, \( H_0 : \beta_i = 0 \) for \( i=1,\cdots,p-1 \).

**Q:** what is \( RSS_\boldsymbol{\omega} \) in this case?

**A:** this is the sum of squared differences between the predicted value (always the empirical sample-based mean) and the observed values, i.e.,

\[ \begin{align} \boldsymbol{\epsilon}_\boldsymbol{\omega}^\mathrm{T} \boldsymbol{\epsilon}_\boldsymbol{\omega} &= \left(\mathbf{Y} - \overline{\mathbf{Y}} \right)^\mathrm{T}\left(\mathbf{Y} - \overline{\mathbf{Y}}\right) \\ &=TSS \end{align} \]
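This identity is easy to confirm numerically. The sketch below uses the built-in `mtcars` data as a stand-in response (an assumption made only so that the example runs without extra packages) and checks that the intercept-only fit reproduces the sample mean and that its residual sum of squares equals the total sum of squares.

```r
y <- mtcars$mpg          # stand-in response, for illustration only

null_fit <- lm(y ~ 1)    # intercept-only (null) model
rss_null <- deviance(null_fit)       # residual sum of squares of the null model
tss      <- sum((y - mean(y))^2)     # total sum of squares about the mean

all.equal(unname(coef(null_fit)), mean(y))   # fitted constant is the sample mean
all.equal(rss_null, tss)                     # RSS of the null model equals TSS
```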

- Let us consider again the `gala` data…

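```
library('faraway')  # provides the gala data and sumary()
# Full model: species count regressed on all five explanatory variables
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
           data = gala)
sumary(lmod)
```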

```
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  7.068221  19.154198  0.3690 0.7153508
Area        -0.023938   0.022422 -1.0676 0.2963180
Elevation    0.319465   0.053663  5.9532 3.823e-06
Nearest      0.009144   1.054136  0.0087 0.9931506
Scruz       -0.240524   0.215402 -1.1166 0.2752082
Adjacent    -0.074805   0.017700 -4.2262 0.0002971

n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
```

- Secondly, we will fit the null model as variation around a constant value, i.e.,

```
nullmod <- lm(Species ~ 1, gala)
sumary(nullmod)
```

```
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   85.233     20.929  4.0725 0.0003285

n = 30, p = 1, Residual SE = 114.63305, R-Squared = 0
```

```
mean(gala$Species)
```

```
[1] 85.23333
```

- where the fixed constant is indeed the empirical mean value of the response.

Explicitly, the F-test can be computed as follows

- The RSS of each of the models

`(rss0 <- deviance(nullmod))`

`[1] 381081.4`

`(rss <- deviance(lmod))`

`[1] 89231.37`

- The degrees of freedom of each model

`(df0 <- df.residual(nullmod))`

`[1] 29`

`(df <- df.residual(lmod))`

`[1] 24`

- Then, we compute the F-statistic, which arises from the ratio of likelihoods

```
(fstat <- ((rss0-rss)/(df0-df))/(rss/df))
```

```
[1] 15.69941
```

- Finally, we determine the probability of drawing a value at least this large from the F distribution with the two degrees-of-freedom parameters `df0 - df` and `df`.

```
1-pf(fstat, df0-df, df)
```

```
[1] 6.837893e-07
```

The function `pf` above evaluates the F distribution's CDF at the value `fstat` with respect to the degrees of freedom parameters.

- This gives precisely the calculation of

\[ \begin{align} 1 - P(Z \leq \text{fstat}) = P(Z > \text{fstat}) \end{align} \]

The probability of this value is approximately zero, on the order of \( 10^{-7} \).
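When the p-value is this small, it can be slightly more robust to ask `pf` for the upper tail directly through its `lower.tail` argument, instead of subtracting a number very close to one from one. The sketch below plugs in the statistic and degrees of freedom reported above, so that it runs without refitting the models.

```r
fstat <- 15.69941   # F statistic reported above
df1   <- 5          # df0 - df
df2   <- 24         # df

# Upper-tail probability as the complement of the CDF (as computed above) ...
p_complement <- 1 - pf(fstat, df1, df2)
# ... and requested directly from pf, avoiding the subtraction from 1
p_upper <- pf(fstat, df1, df2, lower.tail = FALSE)

c(p_complement = p_complement, p_upper = p_upper)
```

Both forms agree here. The difference only matters for p-values small enough that `1 - pf(...)` would round to zero in floating-point arithmetic.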

- More compactly, this is computed in an analysis of variance (ANOVA) table

```
anova(nullmod, lmod)
```

```
Analysis of Variance Table

Model 1: Species ~ 1
Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     29 381081
2     24  89231  5    291850 15.699 6.838e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We tentatively reject the null hypothesis and conclude that, at \( 5\% \) significance, **at least one subspace of the explanatory variables** has predictive power.

This does not say which one, or if a combination of the explanatory variables gives the predictive power.

- This only says, it is very unlikely that there is no relationship between the space of explanatory variables and the changes in the response.

- The same result can be observed in the F-statistic of the model summary,

```
summary(lmod)
```

```
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
    data = gala)

Residuals:
     Min       1Q   Median       3Q      Max
-111.679  -34.898   -7.862   33.460  182.584

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.068221  19.154198   0.369 0.715351
Area        -0.023938   0.022422  -1.068 0.296318
Elevation    0.319465   0.053663   5.953 3.82e-06 ***
Nearest      0.009144   1.054136   0.009 0.993151
Scruz       -0.240524   0.215402  -1.117 0.275208
Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared:  0.7658, Adjusted R-squared:  0.7171
F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07
```
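The F-statistic printed in the last line of the summary can also be extracted programmatically from the `fstatistic` component of the summary object. The sketch below does this on the built-in `mtcars` data, a stand-in chosen so that the example runs without the `faraway` package, and checks it against the ANOVA comparison with the null model.

```r
# Stand-in models on built-in data, for illustration only.
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Named numeric vector with entries "value", "numdf" and "dendf"
f_info <- summary(full_fit)$fstatistic
f_info

# The summary prints the p-value but does not store it, so recompute it here
p_from_summary <- pf(f_info[["value"]], f_info[["numdf"]], f_info[["dendf"]],
                     lower.tail = FALSE)

# Same test via the ANOVA table against the null model
tab <- anova(null_fit, full_fit)
all.equal(f_info[["value"]], tab$F[2])
all.equal(p_from_summary, tab$`Pr(>F)`[2])
```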