09/28/2020

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:

This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

- The following topics will be covered in this lecture:
- Multiple regression in R
- Identifiability
- Linear dependence

We have now covered the basic theory that will structure the course.

We will now look at a concrete example of multiple regression from the Faraway package “gala” data set.

```
library(faraway)
str(gala)
```

```
'data.frame': 30 obs. of 7 variables:
$ Species : num 58 31 3 25 2 18 24 10 8 2 ...
$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...
$ Area : num 25.09 1.24 0.21 0.1 0.05 ...
$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...
$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...
$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
```

**Q:**how many observations \( n \) do we have in this data set, and what is the largest \( p \) number of parameters can we put into a model?**A:**there are \( n=30 \) observations, and the largest number is \( p=6 +1 \) parameters (including an intercept and minus one for the response).

- In this case, there are 30 islands in the Galápagos with 7 variables – each observation corresponds to a particular island:

```
row.names(gala)
```

```
[1] "Baltra" "Bartolome" "Caldwell" "Champion" "Coamano"
[6] "Daphne.Major" "Daphne.Minor" "Darwin" "Eden" "Enderby"
[11] "Espanola" "Fernandina" "Gardner1" "Gardner2" "Genovesa"
[16] "Isabela" "Marchena" "Onslow" "Pinta" "Pinzon"
[21] "Las.Plazas" "Rabida" "SanCristobal" "SanSalvador" "SantaCruz"
[26] "SantaFe" "SantaMaria" "Seymour" "Tortuga" "Wolf"
```

- Recall, We can see some basic summary statistics with “summary”

```
summary(gala)
```

```
Species Endemics Area Elevation
Min. : 2.00 Min. : 0.00 Min. : 0.010 Min. : 25.00
1st Qu.: 13.00 1st Qu.: 7.25 1st Qu.: 0.258 1st Qu.: 97.75
Median : 42.00 Median :18.00 Median : 2.590 Median : 192.00
Mean : 85.23 Mean :26.10 Mean : 261.709 Mean : 368.03
3rd Qu.: 96.00 3rd Qu.:32.25 3rd Qu.: 59.237 3rd Qu.: 435.25
Max. :444.00 Max. :95.00 Max. :4669.320 Max. :1707.00
Nearest Scruz Adjacent
Min. : 0.20 Min. : 0.00 Min. : 0.03
1st Qu.: 0.80 1st Qu.: 11.03 1st Qu.: 0.52
Median : 3.05 Median : 46.65 Median : 2.59
Mean :10.06 Mean : 56.98 Mean : 261.10
3rd Qu.:10.03 3rd Qu.: 81.08 3rd Qu.: 59.24
Max. :47.40 Max. :290.20 Max. :4669.32
```

```
head(gala)
```

```
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
Champion 25 9 0.10 46 1.9 47.4 0.18
Coamano 2 1 0.05 77 1.9 1.9 903.82
Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
```

The meaning of different variables can be found in the help file for the “gala” data using the command

`?gala`

.Species – number of species found on the island

Endemics – number of endemic species

Area – the area of the island in km

^{2}Elevation – the highest elevation of the island

Nearest – the distance from the nearest island

Scruz — the distance from Santa Cruz island

Adjacent – the area of the adjacent island

- We will create a linear model now from scratch with the funciton
`lm()`

```
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,data=gala)
```

Notice the code syntax:

- The first argument is the “Species”, indicating that we have the number of species as our response variable \( \mathbf{Y} \)
- The ~ here separates the response variables from the explanatory variables.
- Finally, the keyword argument “data” indicates to the function to draw these variables all from the variable “gala”
- This is designed to resemble our mathematical form for the model, where we may write,

\[ \begin{align} \mathbf{Y}_{\mathrm{species}} = \beta_0 + \beta_{\mathrm{Area}}\mathbf{X}_{\mathrm{Area}} + \beta_{\mathrm{Elevation}}\mathbf{X}_{\mathrm{Elevation}} + \beta_{\mathrm{Nearest}}\mathbf{X}_{\mathrm{Nearest}} + \beta_{\mathrm{Scruz}}\mathbf{X}_{\mathrm{Scruz}} + \beta_{\mathrm{Adjacent}}\mathbf{X}_{\mathrm{Adjacent}} \end{align} \]

Like in the simple regression model, we can see exactly how (for all other variables held fixed) the increase in, e.g., \( \mathbf{X}_\mathrm{Area} \) by 1 \( \mathrm{km}^2 \) corresponds to an increase in the predicted mean function by \( \beta_{\mathrm{Area}} \);

- Thus the \( \beta \) values in multiple regression correspond again to the slope due to increase in a particular predictor.

- What does our linear model look like?

```
summary(lmod)
```

```
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
Residuals:
Min 1Q Median 3Q Max
-111.679 -34.898 -7.862 33.460 182.584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.369 0.715351
Area -0.023938 0.022422 -1.068 0.296318
Elevation 0.319465 0.053663 5.953 3.82e-06 ***
Nearest 0.009144 1.054136 0.009 0.993151
Scruz -0.240524 0.215402 -1.117 0.275208
Adjacent -0.074805 0.017700 -4.226 0.000297 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
```

- The model summary contains a great deal of information, which we will become familiar with.

- For instance, we can obtain quantities of interest from the “lmod”

```
names(lmod)
```

```
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
```

```
lmod$coefficients
```

```
(Intercept) Area Elevation Nearest Scruz Adjacent
7.068220709 -0.023938338 0.319464761 0.009143961 -0.240524230 -0.074804832
```

```
lmod$residuals
```

```
Baltra Bartolome Caldwell Champion Coamano Daphne.Major
-58.725946 38.273154 -26.330659 14.635734 38.383916 -25.087705
Daphne.Minor Darwin Eden Enderby Espanola Fernandina
-9.919668 19.018992 -20.314202 -28.785943 49.343513 -3.989598
Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
62.033276 -59.633796 40.497176 -39.403558 -37.694540 -2.037233
Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
-111.679486 -42.475375 -23.075807 -5.553122 73.048122 -40.676318
SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
182.583587 -23.376486 89.383371 -5.805095 -36.935732 -5.700573
```

- However, linear models can be far from perfect

```
predicted_Y <- lmod$fitted.values
print(predicted_Y)
```

```
Baltra Bartolome Caldwell Champion Coamano Daphne.Major
116.7259460 -7.2731544 29.3306594 10.3642660 -36.3839155 43.0877052
Daphne.Minor Darwin Eden Enderby Espanola Fernandina
33.9196678 -9.0189919 28.3142017 30.7859425 47.6564865 96.9895982
Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
-4.0332759 64.6337956 -0.4971756 386.4035578 88.6945404 4.0372328
Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
215.6794862 150.4753750 35.0758066 75.5531221 206.9518779 277.6763183
SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
261.4164131 85.3764857 195.6166286 49.8050946 52.9357316 26.7005735
```

**Q:**what appears to be at issue with our model?**A:**in this example, the predicted values are un-physical, including negative numbers of species.In practice, a great deal of the work will be identifying when the assupmtions of the Gauss-Markov theorem fail to hold;

- when they fail to hold in practice, we will look for transformations of variables and alternative models for which the new system is better conditioned.

- As within the context of the Gauss-Markov theorem, we will be interested in when functions are estimable or not.
- Consider the normal equations in matrix form, \[ \begin{align} \mathbf{X}^\mathrm{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathrm{T} \mathbf{Y} \end{align} \] for \( \mathbf{X} \in \mathbb{R}^{n \times p} \).
- If \( n>p \) and the columns of \( \mathbf{X} \) are linearly independent, then \( \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right) \) will have an inverse that is defined.
- Particularly, this allows us to construct the estimate of the true parameters by least squares, \[ \hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T}\mathbf{Y}. \]

- In the case when \( \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right) \) is not invertible, this has an important consequence for the normal equation, \[ \begin{align} \mathbf{X}^\mathrm{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathrm{T} \mathbf{Y}. \end{align} \]
- In this case there are infinitely many solutions for \( \hat{\boldsymbol{\beta}} \), as the system of equations is under-constrained.
- Importantly, this means that \( \hat{\boldsymbol{\beta}} \) is at least in part, “unidentifiable”, as there is no unique solution.
- Typically, this will occur in practice by an oversight in the data collection or in our modeling, e.g.:
- A person’s weight could be measured both in pounds and kilos and both variables are entered into the model. One variable is just a multiple of the other.
- For each individual in a data set we have:
- \( X_1= \) the number of years of pre-university education,
- \( X_2= \) the number of years of university education
- \( X_3= \) the total number of years of education and put all three variables into the model.
- In this case, there is an exact linear relation among the variables \( X_1 +X_2 =X_3 \).
- We will also have a dependence relationship when have more variables than cases, that is, the model is supersaturated \( p > n \).
- Such models are considered in large-scale screening experiments used in product design and manufacture and in bioinformatics;
- for example, there are more genes than individuals tested, but there is no hope of uniquely estimating all the parameters in such a model.

- We will add a new column to the Gala data that is a linear combination of the existing columns

```
library(faraway)
gala$Adiff <- gala$Area -gala$Adjacent
str(gala)
```

```
'data.frame': 30 obs. of 8 variables:
$ Species : num 58 31 3 25 2 18 24 10 8 2 ...
$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...
$ Area : num 25.09 1.24 0.21 0.1 0.05 ...
$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...
$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...
$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
$ Adiff : num 23.25 -571.09 -0.57 -0.08 -903.77 ...
```

- Now if we try to fit the model based on these variables

```
lmod <- lm(Species ~ Area+Elevation+Nearest+Scruz+Adjacent +Adiff, gala)
sumary(lmod)
```

```
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.3690 0.7153508
Area -0.023938 0.022422 -1.0676 0.2963180
Elevation 0.319465 0.053663 5.9532 3.823e-06
Nearest 0.009144 1.054136 0.0087 0.9931506
Scruz -0.240524 0.215402 -1.1166 0.2752082
Adjacent -0.074805 0.017700 -4.2262 0.0002971
n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
```

The default behavior in the R language is to neglect any variables in the design matrix that are clearly linearly dependent to others.

Here, the

`Adiff`

variable, which is included last, has been neglected from the model.

When there is actually linear dependence between the variables, it is possible to rectify this by methods of data compression, e.g. singular value decomposition, among other techniques.

What is more problematic is when the columns are

**very close to being dependent**, and it isn't clear if this is due to noise.

```
set.seed(123)
Adiffe <- gala$Adiff+0.001*(runif(30)-0.5)
lmod <- lm(Species ~ Area+Elevation+Nearest+Scruz +Adjacent+Adiffe, gala)
sumary(lmod)
```

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2964e+00 1.9434e+01 0.1696 0.8668
Area -4.5123e+04 4.2583e+04 -1.0596 0.3003
Elevation 3.1302e-01 5.3870e-02 5.8107 6.398e-06
Nearest 3.8273e-01 1.1090e+00 0.3451 0.7331
Scruz -2.6199e-01 2.1581e-01 -1.2140 0.2371
Adjacent 4.5123e+04 4.2583e+04 1.0596 0.3003
Adiffe 4.5123e+04 4.2583e+04 1.0596 0.3003
n = 30, p = 7, Residual SE = 60.81975, R-Squared = 0.78
```

- Here it is possible to fit a model, but the standard error is extremely large.