09/28/2020
Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.
We have now covered the basic theory that will structure the course.
We will now look at a concrete example of multiple regression from the Faraway package “gala” data set.
library(faraway)
str(gala)
'data.frame': 30 obs. of 7 variables:
$ Species : num 58 31 3 25 2 18 24 10 8 2 ...
$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...
$ Area : num 25.09 1.24 0.21 0.1 0.05 ...
$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...
$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...
$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
Q: how many observations \( n \) do we have in this data set, and what is the largest \( p \) number of parameters can we put into a model?
A: there are \( n=30 \) observations, and the largest number is \( p=6 +1 \) parameters (including an intercept and minus one for the response).
row.names(gala)
[1] "Baltra" "Bartolome" "Caldwell" "Champion" "Coamano"
[6] "Daphne.Major" "Daphne.Minor" "Darwin" "Eden" "Enderby"
[11] "Espanola" "Fernandina" "Gardner1" "Gardner2" "Genovesa"
[16] "Isabela" "Marchena" "Onslow" "Pinta" "Pinzon"
[21] "Las.Plazas" "Rabida" "SanCristobal" "SanSalvador" "SantaCruz"
[26] "SantaFe" "SantaMaria" "Seymour" "Tortuga" "Wolf"
summary(gala)
Species Endemics Area Elevation
Min. : 2.00 Min. : 0.00 Min. : 0.010 Min. : 25.00
1st Qu.: 13.00 1st Qu.: 7.25 1st Qu.: 0.258 1st Qu.: 97.75
Median : 42.00 Median :18.00 Median : 2.590 Median : 192.00
Mean : 85.23 Mean :26.10 Mean : 261.709 Mean : 368.03
3rd Qu.: 96.00 3rd Qu.:32.25 3rd Qu.: 59.237 3rd Qu.: 435.25
Max. :444.00 Max. :95.00 Max. :4669.320 Max. :1707.00
Nearest Scruz Adjacent
Min. : 0.20 Min. : 0.00 Min. : 0.03
1st Qu.: 0.80 1st Qu.: 11.03 1st Qu.: 0.52
Median : 3.05 Median : 46.65 Median : 2.59
Mean :10.06 Mean : 56.98 Mean : 261.10
3rd Qu.:10.03 3rd Qu.: 81.08 3rd Qu.: 59.24
Max. :47.40 Max. :290.20 Max. :4669.32
head(gala)
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
Champion 25 9 0.10 46 1.9 47.4 0.18
Coamano 2 1 0.05 77 1.9 1.9 903.82
Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
The meaning of different variables can be found in the help file for the “gala” data using the command ?gala
.
Species – number of species found on the island
Endemics – number of endemic species
Area – the area of the island in km2
Elevation – the highest elevation of the island
Nearest – the distance from the nearest island
Scruz — the distance from Santa Cruz island
Adjacent – the area of the adjacent island
lm()
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,data=gala)
Notice the code syntax:
\[ \begin{align} \mathbf{Y}_{\mathrm{species}} = \beta_0 + \beta_{\mathrm{Area}}\mathbf{X}_{\mathrm{Area}} + \beta_{\mathrm{Elevation}}\mathbf{X}_{\mathrm{Elevation}} + \beta_{\mathrm{Nearest}}\mathbf{X}_{\mathrm{Nearest}} + \beta_{\mathrm{Scruz}}\mathbf{X}_{\mathrm{Scruz}} + \beta_{\mathrm{Adjacent}}\mathbf{X}_{\mathrm{Adjacent}} \end{align} \]
Like in the simple regression model, we can see exactly how (for all other variables held fixed) the increase in, e.g., \( \mathbf{X}_\mathrm{Area} \) by 1 \( \mathrm{km}^2 \) corresponds to an increase in the predicted mean function by \( \beta_{\mathrm{Area}} \);
summary(lmod)
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
Residuals:
Min 1Q Median 3Q Max
-111.679 -34.898 -7.862 33.460 182.584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.369 0.715351
Area -0.023938 0.022422 -1.068 0.296318
Elevation 0.319465 0.053663 5.953 3.82e-06 ***
Nearest 0.009144 1.054136 0.009 0.993151
Scruz -0.240524 0.215402 -1.117 0.275208
Adjacent -0.074805 0.017700 -4.226 0.000297 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
names(lmod)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
lmod$coefficients
(Intercept) Area Elevation Nearest Scruz Adjacent
7.068220709 -0.023938338 0.319464761 0.009143961 -0.240524230 -0.074804832
lmod$residuals
Baltra Bartolome Caldwell Champion Coamano Daphne.Major
-58.725946 38.273154 -26.330659 14.635734 38.383916 -25.087705
Daphne.Minor Darwin Eden Enderby Espanola Fernandina
-9.919668 19.018992 -20.314202 -28.785943 49.343513 -3.989598
Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
62.033276 -59.633796 40.497176 -39.403558 -37.694540 -2.037233
Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
-111.679486 -42.475375 -23.075807 -5.553122 73.048122 -40.676318
SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
182.583587 -23.376486 89.383371 -5.805095 -36.935732 -5.700573
predicted_Y <- lmod$fitted.values
print(predicted_Y)
Baltra Bartolome Caldwell Champion Coamano Daphne.Major
116.7259460 -7.2731544 29.3306594 10.3642660 -36.3839155 43.0877052
Daphne.Minor Darwin Eden Enderby Espanola Fernandina
33.9196678 -9.0189919 28.3142017 30.7859425 47.6564865 96.9895982
Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
-4.0332759 64.6337956 -0.4971756 386.4035578 88.6945404 4.0372328
Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
215.6794862 150.4753750 35.0758066 75.5531221 206.9518779 277.6763183
SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
261.4164131 85.3764857 195.6166286 49.8050946 52.9357316 26.7005735
Q: what appears to be at issue with our model?
A: in this example, the predicted values are un-physical, including negative numbers of species.
In practice, a great deal of the work will be identifying when the assupmtions of the Gauss-Markov theorem fail to hold;
library(faraway)
gala$Adiff <- gala$Area -gala$Adjacent
str(gala)
'data.frame': 30 obs. of 8 variables:
$ Species : num 58 31 3 25 2 18 24 10 8 2 ...
$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...
$ Area : num 25.09 1.24 0.21 0.1 0.05 ...
$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...
$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...
$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
$ Adiff : num 23.25 -571.09 -0.57 -0.08 -903.77 ...
lmod <- lm(Species ~ Area+Elevation+Nearest+Scruz+Adjacent +Adiff, gala)
sumary(lmod)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.3690 0.7153508
Area -0.023938 0.022422 -1.0676 0.2963180
Elevation 0.319465 0.053663 5.9532 3.823e-06
Nearest 0.009144 1.054136 0.0087 0.9931506
Scruz -0.240524 0.215402 -1.1166 0.2752082
Adjacent -0.074805 0.017700 -4.2262 0.0002971
n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
The default behavior in the R language is to neglect any variables in the design matrix that are clearly linearly dependent to others.
Here, the Adiff
variable, which is included last, has been neglected from the model.
When there is actually linear dependence between the variables, it is possible to rectify this by methods of data compression, e.g. singular value decomposition, among other techniques.
What is more problematic is when the columns are very close to being dependent, and it isn't clear if this is due to noise.
set.seed(123)
Adiffe <- gala$Adiff+0.001*(runif(30)-0.5)
lmod <- lm(Species ~ Area+Elevation+Nearest+Scruz +Adjacent+Adiffe, gala)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2964e+00 1.9434e+01 0.1696 0.8668
Area -4.5123e+04 4.2583e+04 -1.0596 0.3003
Elevation 3.1302e-01 5.3870e-02 5.8107 6.398e-06
Nearest 3.8273e-01 1.1090e+00 0.3451 0.7331
Scruz -2.6199e-01 2.1581e-01 -1.2140 0.2371
Adjacent 4.5123e+04 4.2583e+04 1.0596 0.3003
Adiffe 4.5123e+04 4.2583e+04 1.0596 0.3003
n = 30, p = 7, Residual SE = 60.81975, R-Squared = 0.78