Identifiability and linear dependence

09/28/2020


Outline

  • The following topics will be covered in this lecture:
    • Multiple regression in R
    • Identifiability
    • Linear dependence

Multiple regression in R

  • We have now covered the basic theory that will structure the course.

  • We will now look at a concrete example of multiple regression, using the “gala” data set from the faraway package.

library(faraway)
str(gala)
'data.frame':   30 obs. of  7 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...
  • Q: how many observations n do we have in this data set, and what is the largest number of parameters p that we can put into a model?

  • A: there are n=30 observations, and the largest number of parameters is p=6+1=7 (one column serves as the response, leaving six predictors, plus an intercept).
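
  • As a quick check (a small sketch, not part of the original lecture code), we can query the dimensions directly:

dim(gala)        # 30 rows (observations) and 7 columns (variables)
ncol(gala) - 1   # 6 predictors remain once the response Species is set aside; with an intercept, p = 6 + 1 = 7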

Moving into multiple regression – an example

  • In this case, there are 30 islands in the Galápagos with 7 variables – each observation corresponds to a particular island:
row.names(gala)
 [1] "Baltra"       "Bartolome"    "Caldwell"     "Champion"     "Coamano"     
 [6] "Daphne.Major" "Daphne.Minor" "Darwin"       "Eden"         "Enderby"     
[11] "Espanola"     "Fernandina"   "Gardner1"     "Gardner2"     "Genovesa"    
[16] "Isabela"      "Marchena"     "Onslow"       "Pinta"        "Pinzon"      
[21] "Las.Plazas"   "Rabida"       "SanCristobal" "SanSalvador"  "SantaCruz"   
[26] "SantaFe"      "SantaMaria"   "Seymour"      "Tortuga"      "Wolf"        

A computational example – linear dependence

  • We will add a new column to the gala data that is a linear combination of the existing columns:
library(faraway)
gala$Adiff <- gala$Area - gala$Adjacent
str(gala)
'data.frame':   30 obs. of  8 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...
 $ Adiff    : num  23.25 -571.09 -0.57 -0.08 -903.77 ...
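
  • Before fitting, we can verify the exact dependence directly (a small sketch, assuming only the construction above):

all.equal(gala$Adiff, gala$Area - gala$Adjacent)   # TRUE by construction
X <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff, gala)
qr(X)$rank                                         # 6, one less than ncol(X) = 7, so X is rank deficient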

A computational example – linear dependence

  • Now suppose we try to fit the model based on these variables:
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff, gala)
sumary(lmod)

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  7.068221  19.154198  0.3690 0.7153508
Area        -0.023938   0.022422 -1.0676 0.2963180
Elevation    0.319465   0.053663  5.9532 3.823e-06
Nearest      0.009144   1.054136  0.0087 0.9931506
Scruz       -0.240524   0.215402 -1.1166 0.2752082
Adjacent    -0.074805   0.017700 -4.2262 0.0002971

n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
  • The default behavior in R is to drop any variables in the design matrix that are exactly linearly dependent on the others.

  • Here the Adiff variable, which enters the model formula last, has been dropped from the fit.
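
  • The dropped term can be inspected with base R's alias() function (a minimal sketch, not part of the original notes):

alias(lmod)   # reports the exact dependence: Adiff = Area - Adjacent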

A computational example – linear dependence

  • When there is exact linear dependence between the variables, it can be rectified by data-reduction methods such as the singular value decomposition, among other techniques.

  • What is more problematic is when the columns are very close to being linearly dependent, and it isn't clear whether this is simply due to noise in the data.

set.seed(123)
Adiffe <- gala$Adiff + 0.001 * (runif(30) - 0.5)
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe, gala)
sumary(lmod)
               Estimate  Std. Error t value  Pr(>|t|)
(Intercept)  3.2964e+00  1.9434e+01  0.1696    0.8668
Area        -4.5123e+04  4.2583e+04 -1.0596    0.3003
Elevation    3.1302e-01  5.3870e-02  5.8107 6.398e-06
Nearest      3.8273e-01  1.1090e+00  0.3451    0.7331
Scruz       -2.6199e-01  2.1581e-01 -1.2140    0.2371
Adjacent     4.5123e+04  4.2583e+04  1.0596    0.3003
Adiffe       4.5123e+04  4.2583e+04  1.0596    0.3003

n = 30, p = 7, Residual SE = 60.81975, R-Squared = 0.78
  • Here it is possible to fit the model, but the standard errors on the nearly dependent columns (Area, Adjacent and Adiffe) are extremely large, making those estimates highly unstable.
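
  • One way to see why (a small sketch, not from the original notes): the near-dependence appears as a singular value of the design matrix that is orders of magnitude smaller than the rest, which inflates the variance of the affected coefficient estimates.

X <- model.matrix(lmod)   # design matrix of the noisy fit above
svd(X)$d                  # the smallest singular value is tiny relative to the largest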