# Identifiability and linear dependence

09/28/2020



## Outline

• The following topics will be covered in this lecture:
• Multiple regression in R
• Identifiability
• Linear dependence

## Multiple regression in R

• We have now covered the basic theory that will structure the course.

• We will now look at a concrete example of multiple regression from the Faraway package “gala” data set.

library(faraway)
str(gala)

'data.frame':   30 obs. of  7 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...

• Q: how many observations $$n$$ do we have in this data set, and what is the largest number of parameters $$p$$ we can put into a model?

• A: there are $$n=30$$ observations, and the largest number is $$p=6+1$$ parameters (six predictors, after removing the response from the seven variables, plus an intercept).

### Moving into multiple regression – an example

• In this case, there are 30 islands in the Galápagos with 7 variables – each observation corresponds to a particular island:

row.names(gala)

 [1] "Baltra"       "Bartolome"    "Caldwell"     "Champion"     "Coamano"
 [6] "Daphne.Major" "Daphne.Minor" "Darwin"       "Eden"         "Enderby"
[11] "Espanola"     "Fernandina"   "Gardner1"     "Gardner2"     "Genovesa"
[16] "Isabela"      "Marchena"     "Onslow"       "Pinta"        "Pinzon"
[21] "Las.Plazas"   "Rabida"       "SanCristobal" "SanSalvador"  "SantaCruz"
[26] "SantaFe"      "SantaMaria"   "Seymour"      "Tortuga"      "Wolf"

### Multiple regression in R – an example

• Recall, we can see some basic summary statistics with “summary”

summary(gala)

    Species          Endemics          Area            Elevation
 Min.   :  2.00   Min.   : 0.00   Min.   :   0.010   Min.   :  25.00
 1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.258   1st Qu.:  97.75
 Median : 42.00   Median :18.00   Median :   2.590   Median : 192.00
 Mean   : 85.23   Mean   :26.10   Mean   : 261.709   Mean   : 368.03
 3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.237   3rd Qu.: 435.25
 Max.   :444.00   Max.   :95.00   Max.   :4669.320   Max.   :1707.00
    Nearest          Scruz           Adjacent
 Min.   : 0.20   Min.   :  0.00   Min.   :   0.03
 1st Qu.: 0.80   1st Qu.: 11.03   1st Qu.:   0.52
 Median : 3.05   Median : 46.65   Median :   2.59
 Mean   :10.06   Mean   : 56.98   Mean   : 261.10
 3rd Qu.:10.03   3rd Qu.: 81.08   3rd Qu.:  59.24
 Max.   :47.40   Max.   :290.20   Max.   :4669.32

### Multiple regression in R – an example

head(gala)

             Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra            58       23 25.09       346     0.6   0.6     1.84
Bartolome         31       21  1.24       109     0.6  26.3   572.33
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84

• The meaning of the different variables can be found in the help file for the “gala” data using the command ?gala.
• Species – number of species found on the island
• Endemics – number of endemic species
• Area – the area of the island in $$\mathrm{km}^2$$
• Elevation – the highest elevation of the island
• Nearest – the distance from the nearest island
• Scruz – the distance from Santa Cruz island
• Adjacent – the area of the adjacent island

### Creating a linear model in R

• We will now create a linear model from scratch with the function lm()

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)

• Notice the code syntax:
• The first argument is “Species”, indicating that we have the number of species as our response variable $$\mathbf{Y}$$
• The ~ here separates the response variable from the explanatory variables.
• Finally, the keyword argument “data” indicates to the function to draw these variables all from the variable “gala”
• This is designed to resemble our mathematical form for the model, where we may write,

\begin{align}
\mathbf{Y}_{\mathrm{species}} = \beta_0 + \beta_{\mathrm{Area}}\mathbf{X}_{\mathrm{Area}} + \beta_{\mathrm{Elevation}}\mathbf{X}_{\mathrm{Elevation}} + \beta_{\mathrm{Nearest}}\mathbf{X}_{\mathrm{Nearest}} + \beta_{\mathrm{Scruz}}\mathbf{X}_{\mathrm{Scruz}} + \beta_{\mathrm{Adjacent}}\mathbf{X}_{\mathrm{Adjacent}}
\end{align}

• Like in the simple regression model, we can see exactly how (for all other variables held fixed) an increase in, e.g., $$\mathbf{X}_\mathrm{Area}$$ by 1 $$\mathrm{km}^2$$ corresponds to an increase in the predicted mean function by $$\beta_{\mathrm{Area}}$$;
• thus the $$\beta$$ values in multiple regression correspond again to the slope due to an increase in a particular predictor.

### Examining a linear model in R

• What does our linear model look like?

summary(lmod)

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
    data = gala)

Residuals:
     Min       1Q   Median       3Q      Max
-111.679  -34.898   -7.862   33.460  182.584

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.068221  19.154198   0.369 0.715351
Area        -0.023938   0.022422  -1.068 0.296318
Elevation    0.319465   0.053663   5.953 3.82e-06 ***
Nearest      0.009144   1.054136   0.009 0.993151
Scruz       -0.240524   0.215402  -1.117 0.275208
Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared:  0.7658,	Adjusted R-squared:  0.7171
F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

• The model summary contains a great deal of information, which we will become familiar with.
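As a quick sanity check of the model dimensions discussed above, we can read $$n$$, $$p$$, and the residual degrees of freedom directly from the data and the fitted model; a minimal sketch, assuming the faraway package is installed:

```r
library(faraway)

# fit the same model as above
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)

# n = 30 observations and p = 6 estimated parameters (5 slopes + intercept),
# leaving n - p = 24 residual degrees of freedom, matching the summary output
c(n = nrow(gala), p = lmod$rank, df = lmod$df.residual)
```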
### Examining a linear model in R

• For instance, we can obtain quantities of interest from the “lmod” object

names(lmod)

 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

lmod$coefficients

 (Intercept)         Area    Elevation      Nearest        Scruz     Adjacent
7.068220709 -0.023938338  0.319464761  0.009143961 -0.240524230 -0.074804832

lmod$residuals

      Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major
  -58.725946    38.273154   -26.330659    14.635734    38.383916   -25.087705
Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina
   -9.919668    19.018992   -20.314202   -28.785943    49.343513    -3.989598
    Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow
   62.033276   -59.633796    40.497176   -39.403558   -37.694540    -2.037233
       Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador
 -111.679486   -42.475375   -23.075807    -5.553122    73.048122   -40.676318
   SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf
  182.583587   -23.376486    89.383371    -5.805095   -36.935732    -5.700573

### Examining a linear model in R

• However, linear models can be far from perfect

predicted_Y <- lmod$fitted.values
print(predicted_Y)

      Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major
116.7259460   -7.2731544   29.3306594   10.3642660  -36.3839155   43.0877052
Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina
33.9196678   -9.0189919   28.3142017   30.7859425   47.6564865   96.9895982
Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow
-4.0332759   64.6337956   -0.4971756  386.4035578   88.6945404    4.0372328
Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador
215.6794862  150.4753750   35.0758066   75.5531221  206.9518779  277.6763183
SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf
261.4164131   85.3764857  195.6166286   49.8050946   52.9357316   26.7005735

• Q: what appears to be at issue with our model?

• A: in this example, some of the predicted values are unphysical, including negative numbers of species.

• In practice, a great deal of the work will be identifying when the assumptions of the Gauss-Markov theorem fail to hold;

• when they fail to hold in practice, we will look for transformations of variables and alternative models for which the new system is better conditioned.
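One common transformation for a positive response like a species count is to model it on the log scale, so that back-transformed predictions are positive by construction; a minimal sketch (the log transform is an illustrative choice here, not the only remedy), assuming the faraway package is installed:

```r
library(faraway)

# model log(Species) instead of Species; every count in gala is at least 2,
# so the log is well defined for this data set
lmod_log <- lm(log(Species) ~ Area + Elevation + Nearest + Scruz + Adjacent,
               data = gala)

# back-transforming the fitted values yields strictly positive predictions
predicted_Y_log <- exp(lmod_log$fitted.values)
all(predicted_Y_log > 0)   # TRUE
```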

## Identifiability

• As in the context of the Gauss-Markov theorem, we will be interested in when functions of the parameters are estimable or not.
• Consider the normal equations in matrix form, \begin{align} \mathbf{X}^\mathrm{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathrm{T} \mathbf{Y} \end{align} for $$\mathbf{X} \in \mathbb{R}^{n \times p}$$.
• If $$n>p$$ and the columns of $$\mathbf{X}$$ are linearly independent, then $$\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)$$ is invertible.
• Particularly, this allows us to construct the estimate of the true parameters by least squares, $\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T}\mathbf{Y}.$
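We can check this formula numerically against lm() for the gala model; a minimal sketch, assuming the faraway package is installed (solving the normal equations directly is for illustration only — lm() itself uses a more numerically stable QR decomposition):

```r
library(faraway)

# build the design matrix X (including the intercept column) and response Y
X <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
Y <- gala$Species

# solve the normal equations (X^T X) beta_hat = X^T Y directly
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# this agrees with the coefficients from lm() up to numerical precision
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
max(abs(beta_hat - coef(lmod)))   # essentially zero
```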

### Identifiability

• In the case when $$\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)$$ is not invertible, this has an important consequence for the normal equation, \begin{align} \mathbf{X}^\mathrm{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathrm{T} \mathbf{Y}. \end{align}
• In this case there are infinitely many solutions for $$\hat{\boldsymbol{\beta}}$$, as the system of equations is under-constrained.
• Importantly, this means that $$\hat{\boldsymbol{\beta}}$$ is, at least in part, “unidentifiable”, as there is no unique solution.
• Typically, this will occur in practice by an oversight in the data collection or in our modeling, e.g.:
• A person’s weight could be measured both in pounds and kilos and both variables are entered into the model. One variable is just a multiple of the other.
• For each individual in a data set we have:
1. $$X_1=$$ the number of years of pre-university education,
2. $$X_2=$$ the number of years of university education
3. $$X_3=$$ the total number of years of education,
• and all three variables are put into the model.
• In this case, there is an exact linear relation among the variables $$X_1 +X_2 =X_3$$.
• We will also have a dependence relationship when we have more variables than cases, that is, when the model is supersaturated with $$p > n$$.
• Such models are considered in large-scale screening experiments used in product design and manufacture and in bioinformatics;
• for example, there are more genes than individuals tested, but there is no hope of uniquely estimating all the parameters in such a model.
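The education example above can be reproduced with synthetic data; a hypothetical sketch (the variable names and data are invented for illustration) showing how R reports the exactly dependent column:

```r
# synthetic data: x3 is exactly x1 + x2, so the design matrix is rank deficient
set.seed(1)
x1 <- sample(8:13, 20, replace = TRUE)   # years of pre-university education
x2 <- sample(0:6, 20, replace = TRUE)    # years of university education
x3 <- x1 + x2                            # total years of education
y  <- 2 + 0.5 * x3 + rnorm(20)

# R fits the model but drops the redundant (last) column,
# reporting NA for its coefficient
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)   # the x3 entry is NA
```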

### A computational example – linear dependence

• We will add a new column to the Gala data that is a linear combination of the existing columns
library(faraway)
gala$Adiff <- gala$Area - gala$Adjacent
str(gala)

'data.frame':   30 obs. of  8 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...
 $ Adiff    : num  23.25 -571.09 -0.57 -0.08 -903.77 ...

### A computational example – linear dependence

• Now if we try to fit the model based on these variables

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff, gala)
sumary(lmod)

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.068221  19.154198  0.3690 0.7153508
Area         -0.023938   0.022422 -1.0676 0.2963180
Elevation     0.319465   0.053663  5.9532 3.823e-06
Nearest       0.009144   1.054136  0.0087 0.9931506
Scruz        -0.240524   0.215402 -1.1166 0.2752082
Adjacent     -0.074805   0.017700 -4.2262 0.0002971

n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77

• The default behavior in the R language is to neglect any variables in the design matrix that are clearly linearly dependent on the others.
• Here, the Adiff variable, which is included last, has been neglected from the model.

### A computational example – linear dependence

• When there is actually linear dependence between the variables, it is possible to rectify this by methods of data compression, e.g. the singular value decomposition, among other techniques.
• What is more problematic is when the columns are very close to being dependent, and it isn't clear if this is due to noise.

set.seed(123)
Adiffe <- gala$Adiff + 0.001*(runif(30) - 0.5)
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe, gala)
sumary(lmod)

               Estimate  Std. Error t value  Pr(>|t|)
(Intercept)  3.2964e+00  1.9434e+01  0.1696    0.8668
Area        -4.5123e+04  4.2583e+04 -1.0596    0.3003
Elevation    3.1302e-01  5.3870e-02  5.8107 6.398e-06
Nearest      3.8273e-01  1.1090e+00  0.3451    0.7331
Scruz       -2.6199e-01  2.1581e-01 -1.2140    0.2371
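Near-dependence of the kind introduced by Adiffe can be diagnosed from the singular values of the design matrix; a minimal sketch, assuming the faraway package is installed and reconstructing the perturbed column as above:

```r
library(faraway)

# reconstruct the nearly dependent column, as in the slides
gala$Adiff <- gala$Area - gala$Adjacent
set.seed(123)
Adiffe <- gala$Adiff + 0.001 * (runif(30) - 0.5)

# the ratio of the largest to the smallest singular value of the design
# matrix (its condition number) is enormous when columns are nearly dependent
X <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe,
                  data = gala)
d <- svd(X)$d
d[1] / d[length(d)]   # very large, signaling near linear dependence
```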