# Identifiability and linear dependence

09/28/2020



## Outline

• The following topics will be covered in this lecture:
• Multiple regression in R
• Identifiability
• Linear dependence

## Multiple regression in R

• We have now covered the basic theory that will structure the course.

• We will now look at a concrete example of multiple regression from the Faraway package “gala” data set.

library(faraway)
str(gala)

'data.frame':   30 obs. of  7 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...

• Q: how many observations $$n$$ do we have in this data set, and what is the largest number of parameters $$p$$ we can put into a model?

• A: there are $$n=30$$ observations, and the largest number is $$p=6+1$$ parameters (six predictors, after removing the response from the seven variables, plus an intercept).

### Moving into multiple regression – an example

• In this case, there are 30 islands in the Galápagos with 7 variables – each observation corresponds to a particular island:

row.names(gala)

 [1] "Baltra"       "Bartolome"    "Caldwell"     "Champion"     "Coamano"
 [6] "Daphne.Major" "Daphne.Minor" "Darwin"       "Eden"         "Enderby"
[11] "Espanola"     "Fernandina"   "Gardner1"     "Gardner2"     "Genovesa"
[16] "Isabela"      "Marchena"     "Onslow"       "Pinta"        "Pinzon"
[21] "Las.Plazas"   "Rabida"       "SanCristobal" "SanSalvador"  "SantaCruz"
[26] "SantaFe"      "SantaMaria"   "Seymour"      "Tortuga"      "Wolf"

### Multiple regression in R – an example

• Recall, we can see some basic summary statistics with “summary”

summary(gala)

    Species          Endemics          Area            Elevation
 Min.   :  2.00   Min.   : 0.00   Min.   :   0.010   Min.   :  25.00
 1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.258   1st Qu.:  97.75
 Median : 42.00   Median :18.00   Median :   2.590   Median : 192.00
 Mean   : 85.23   Mean   :26.10   Mean   : 261.709   Mean   : 368.03
 3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.237   3rd Qu.: 435.25
 Max.   :444.00   Max.   :95.00   Max.   :4669.320   Max.   :1707.00
    Nearest          Scruz           Adjacent
 Min.   : 0.20   Min.   :  0.00   Min.   :   0.03
 1st Qu.: 0.80   1st Qu.: 11.03   1st Qu.:   0.52
 Median : 3.05   Median : 46.65   Median :   2.59
 Mean   :10.06   Mean   : 56.98   Mean   : 261.10
 3rd Qu.:10.03   3rd Qu.: 81.08   3rd Qu.:  59.24
 Max.   :47.40   Max.   :290.20   Max.   :4669.32

### Multiple regression in R – an example

head(gala)

             Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra            58       23 25.09       346     0.6   0.6     1.84
Bartolome         31       21  1.24       109     0.6  26.3   572.33
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84

• The meaning of the different variables can be found in the help file for the “gala” data using the command ?gala.
• Species – number of species found on the island
• Endemics – number of endemic species
• Area – the area of the island in $$\mathrm{km}^2$$
• Elevation – the highest elevation of the island
• Nearest – the distance from the nearest island
• Scruz – the distance from Santa Cruz island
• Adjacent – the area of the adjacent island

### Creating a linear model in R

• We will now create a linear model from scratch with the function lm()

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)

• Notice the code syntax:
• The first argument is “Species”, indicating that we have the number of species as our response variable $$\mathbf{Y}$$
• The ~ here separates the response variable from the explanatory variables.
• Finally, the keyword argument “data” indicates to the function to draw these variables all from the variable “gala”
• This is designed to resemble our mathematical form for the model, where we may write,

\begin{align}
\mathbf{Y}_{\mathrm{species}} = \beta_0 + \beta_{\mathrm{Area}}\mathbf{X}_{\mathrm{Area}} + \beta_{\mathrm{Elevation}}\mathbf{X}_{\mathrm{Elevation}} + \beta_{\mathrm{Nearest}}\mathbf{X}_{\mathrm{Nearest}} + \beta_{\mathrm{Scruz}}\mathbf{X}_{\mathrm{Scruz}} + \beta_{\mathrm{Adjacent}}\mathbf{X}_{\mathrm{Adjacent}}
\end{align}

• Like in the simple regression model, we can see exactly how (for all other variables held fixed) an increase in, e.g., $$\mathbf{X}_\mathrm{Area}$$ by 1 $$\mathrm{km}^2$$ corresponds to an increase in the predicted mean function by $$\beta_{\mathrm{Area}}$$;
• thus the $$\beta$$ values in multiple regression correspond again to the slope due to an increase in a particular predictor.

### Examining a linear model in R

• What does our linear model look like?

summary(lmod)

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
    data = gala)

Residuals:
     Min       1Q   Median       3Q      Max
-111.679  -34.898   -7.862   33.460  182.584

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.068221  19.154198   0.369 0.715351
Area        -0.023938   0.022422  -1.068 0.296318
Elevation    0.319465   0.053663   5.953 3.82e-06 ***
Nearest      0.009144   1.054136   0.009 0.993151
Scruz       -0.240524   0.215402  -1.117 0.275208
Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared:  0.7658,	Adjusted R-squared:  0.7171
F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

• The model summary contains a great deal of information, which we will become familiar with.
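As a quick sanity check of the model dimensions discussed above, we can read $$n$$, $$p$$, and the residual degrees of freedom directly from the data and the fitted model; a minimal sketch, assuming the faraway package is installed:

```r
library(faraway)

# fit the same model as above
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)

# n = 30 observations and p = 6 estimated parameters (5 slopes + intercept),
# leaving n - p = 24 residual degrees of freedom, matching the summary output
c(n = nrow(gala), p = lmod$rank, df = lmod$df.residual)
```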
### Examining a linear model in R

• For instance, we can obtain quantities of interest from the “lmod” object

names(lmod)

 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

lmod$coefficients

 (Intercept)         Area    Elevation      Nearest        Scruz     Adjacent
7.068220709 -0.023938338  0.319464761  0.009143961 -0.240524230 -0.074804832

lmod$residuals

      Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major
  -58.725946    38.273154   -26.330659    14.635734    38.383916   -25.087705
Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina
   -9.919668    19.018992   -20.314202   -28.785943    49.343513    -3.989598
    Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow
   62.033276   -59.633796    40.497176   -39.403558   -37.694540    -2.037233
       Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador
 -111.679486   -42.475375   -23.075807    -5.553122    73.048122   -40.676318
   SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf
  182.583587   -23.376486    89.383371    -5.805095   -36.935732    -5.700573

### Examining a linear model in R

• However, linear models can be far from perfect

predicted_Y <- lmod$fitted.values
print(predicted_Y)

      Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major
116.7259460   -7.2731544   29.3306594   10.3642660  -36.3839155   43.0877052
Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina
33.9196678   -9.0189919   28.3142017   30.7859425   47.6564865   96.9895982
Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow
-4.0332759   64.6337956   -0.4971756  386.4035578   88.6945404    4.0372328
Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador
215.6794862  150.4753750   35.0758066   75.5531221  206.9518779  277.6763183
SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf
261.4164131   85.3764857  195.6166286   49.8050946   52.9357316   26.7005735

• Q: what appears to be at issue with our model?

• A: in this example, some of the predicted values are unphysical, including negative numbers of species.

• In practice, a great deal of the work will be identifying when the assumptions of the Gauss-Markov theorem fail to hold;

• when they fail to hold in practice, we will look for transformations of variables and alternative models for which the new system is better conditioned.
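One common transformation for a positive response like a species count is to model it on the log scale, so that back-transformed predictions are positive by construction; a minimal sketch (the log transform is an illustrative choice here, not the only remedy), assuming the faraway package is installed:

```r
library(faraway)

# model log(Species) instead of Species; every count in gala is at least 2,
# so the log is well defined for this data set
lmod_log <- lm(log(Species) ~ Area + Elevation + Nearest + Scruz + Adjacent,
               data = gala)

# back-transforming the fitted values yields strictly positive predictions
predicted_Y_log <- exp(lmod_log$fitted.values)
all(predicted_Y_log > 0)   # TRUE
```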

## Identifiability

• As in the context of the Gauss-Markov theorem, we will be interested in when functions of the parameters are estimable or not.
• Consider the normal equations in matrix form, \begin{align} \mathbf{X}^\mathrm{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathrm{T} \mathbf{Y} \end{align} for $$\mathbf{X} \in \mathbb{R}^{n \times p}$$.
• If $$n>p$$ and the columns of $$\mathbf{X}$$ are linearly independent, then $$\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)$$ is invertible.
• Particularly, this allows us to construct the estimate of the true parameters by least squares, $\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T}\mathbf{Y}.$
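We can check this formula numerically against lm() for the gala model; a minimal sketch, assuming the faraway package is installed (solving the normal equations directly is for illustration only — lm() itself uses a more numerically stable QR decomposition):

```r
library(faraway)

# build the design matrix X (including the intercept column) and response Y
X <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
Y <- gala$Species

# solve the normal equations (X^T X) beta_hat = X^T Y directly
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# this agrees with the coefficients from lm() up to numerical precision
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
max(abs(beta_hat - coef(lmod)))   # essentially zero
```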

### Identifiability

• In the case when $$\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)$$ is not invertible, this has an important consequence for the normal equation, \begin{align} \mathbf{X}^\mathrm{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathrm{T} \mathbf{Y}. \end{align}
• In this case there are infinitely many solutions for $$\hat{\boldsymbol{\beta}}$$, as the system of equations is under-constrained.
• Importantly, this means that $$\hat{\boldsymbol{\beta}}$$ is, at least in part, “unidentifiable”, as there is no unique solution.
• Typically, this will occur in practice by an oversight in the data collection or in our modeling, e.g.:
• A person’s weight could be measured both in pounds and kilos and both variables are entered into the model. One variable is just a multiple of the other.
• For each individual in a data set we have:
1. $$X_1=$$ the number of years of pre-university education,
2. $$X_2=$$ the number of years of university education
3. $$X_3=$$ the total number of years of education,
• and all three variables are put into the model.
• In this case, there is an exact linear relation among the variables $$X_1 +X_2 =X_3$$.
• We will also have a dependence relationship when we have more variables than cases, that is, when the model is supersaturated with $$p > n$$.
• Such models are considered in large-scale screening experiments used in product design and manufacture and in bioinformatics;
• for example, there are more genes than individuals tested, but there is no hope of uniquely estimating all the parameters in such a model.
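The education example above can be reproduced with synthetic data; a hypothetical sketch (the variable names and data are invented for illustration) showing how R reports the exactly dependent column:

```r
# synthetic data: x3 is exactly x1 + x2, so the design matrix is rank deficient
set.seed(1)
x1 <- sample(8:13, 20, replace = TRUE)   # years of pre-university education
x2 <- sample(0:6, 20, replace = TRUE)    # years of university education
x3 <- x1 + x2                            # total years of education
y  <- 2 + 0.5 * x3 + rnorm(20)

# R fits the model but drops the redundant (last) column,
# reporting NA for its coefficient
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)   # the x3 entry is NA
```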

### A computational example – linear dependence

• We will add a new column to the Gala data that is a linear combination of the existing columns
library(faraway)
gala$Adiff <- gala$Area - gala$Adjacent
str(gala)

'data.frame':   30 obs. of  8 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...
 $ Adiff    : num  23.25 -571.09 -0.57 -0.08 -903.77 ...

### A computational example – linear dependence

• Now if we try to fit the model based on these variables

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff, gala)
sumary(lmod)

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.068221  19.154198  0.3690 0.7153508
Area         -0.023938   0.022422 -1.0676 0.2963180
Elevation     0.319465   0.053663  5.9532 3.823e-06
Nearest       0.009144   1.054136  0.0087 0.9931506
Scruz        -0.240524   0.215402 -1.1166 0.2752082
Adjacent     -0.074805   0.017700 -4.2262 0.0002971

n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77

• The default behavior in the R language is to neglect any variables in the design matrix that are clearly linearly dependent on the others.
• Here, the Adiff variable, which is included last, has been neglected from the model.

### A computational example – linear dependence

• When there is actually linear dependence between the variables, it is possible to rectify this by methods of data compression, e.g. the singular value decomposition, among other techniques.
• What is more problematic is when the columns are very close to being dependent, and it isn't clear if this is due to noise.

set.seed(123)
Adiffe <- gala$Adiff + 0.001*(runif(30) - 0.5)
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe, gala)
sumary(lmod)

               Estimate  Std. Error t value  Pr(>|t|)
(Intercept)  3.2964e+00  1.9434e+01  0.1696    0.8668
Area        -4.5123e+04  4.2583e+04 -1.0596    0.3003
Elevation    3.1302e-01  5.3870e-02  5.8107 6.398e-06
Nearest      3.8273e-01  1.1090e+00  0.3451    0.7331
Scruz       -2.6199e-01  2.1581e-01 -1.2140    0.2371
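Near-dependence of the kind introduced by Adiffe can be diagnosed from the singular values of the design matrix; a minimal sketch, assuming the faraway package is installed and reconstructing the perturbed column as above:

```r
library(faraway)

# reconstruct the nearly dependent column, as in the slides
gala$Adiff <- gala$Area - gala$Adjacent
set.seed(123)
Adiffe <- gala$Adiff + 0.001 * (runif(30) - 0.5)

# the ratio of the largest to the smallest singular value of the design
# matrix (its condition number) is enormous when columns are nearly dependent
X <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe,
                  data = gala)
d <- svd(X)$d
d[1] / d[length(d)]   # very large, signaling near linear dependence
```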