Orthogonality and correlation

09/30/2020


Outline

  • The following topics will be covered in this lecture:
    • A deeper look at orthogonality
    • Defining a model in terms of anomalies
    • Correlation and orthogonality

Orthogonality

  • We have largely discussed the case where there is some dependence (i.e., correlation) among the explanatory variables.

    • For super-saturated models (with \( p> n \)), there must be linear dependence among explanatory variables and there is no way to recover “best” values for \( \beta \) as there are infinitely many solutions.
    • For cases of linear dependence between variables (with \( p< n \)), there is redundancy between these variables that adds no useful information.
    • Linear dependence, or near-dependence (i.e., highly correlated variables), when \( p < n \) leads to singular or nearly singular systems – in particular, there may be numerical issues in solving the least squares problem, or the uncertainty of the predictions may be extremely large.
  • Q: Qualitatively, what occurs when the explanatory variables are totally statistically independent?

  • A: each variable contributes unique information to the model, which cannot be inferred from the values of the other variables.

  • Q: How does this aid our analysis?

  • A: in one sense, we maximize the value of each estimated parameter \( \beta_i \) as it corresponds to a statistically independent variable's contribution to the response.

  • This is closely related to the idea of orthogonality, in which the spaces spanned by the variables are mutually perpendicular.

Orthogonality

  • Orthogonality can be loosely read as “perpendicular”.

  • Recall an equivalent description of the vector inner product,

\[ \begin{align} \mathbf{a} \cdot \mathbf{b} & = \parallel \mathbf{a} \parallel \parallel \mathbf{b} \parallel \cos(\theta) \\ &= \text{"length of } \mathbf{a}\text{"} \times \text{"length of } \mathbf{b}\text{"} \times \cos( \text{"the angle between"}) \end{align} \]

  • Q: If there are 90 degrees between the two vectors \( \mathbf{a} \) and \( \mathbf{b} \), then what does the inner product \( \mathbf{a} \cdot \mathbf{b} \) equal?

  • A: \( \cos(90^\circ)=0 \), such that the inner product must vanish – therefore, orthogonal vectors have a zero inner product.
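  • As a quick numerical sketch in R (using two hand-picked perpendicular vectors), the inner product vanishes and the recovered angle is 90 degrees:
a <- c(1, 1, 0)
b <- c(1, -1, 0)                                         # chosen perpendicular to a
sum(a * b)                                               # inner product is 0
acos(sum(a * b) / sqrt(sum(a^2) * sum(b^2))) * 180 / pi  # angle between a and b: 90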

Orthogonal decomposition

  • This idea extends to matrices, when the columns of a matrix (or of two different matrices) are mutually orthogonal.

  • In particular, if \( \mathbf{A} \) is an orthogonal matrix then \( \mathbf{A}^\mathrm{T}\mathbf{A} = \mathbf{I} \).

    • The transpose product here represents the dot product of each column of \( \mathbf{A} \) with each other column.
    • Each column is orthogonal to each of the others, so off diagonal entries are zero.
    • We call the matrix \( \mathbf{A} \) orthogonal when each of the columns is also normalized to have length one, thus giving the ones on the diagonal.
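  • A minimal sketch in R (building an orthogonal matrix from the Q factor of a QR decomposition of a random matrix) verifying that the transpose product recovers the identity:
set.seed(1)
A <- qr.Q(qr(matrix(rnorm(25), 5, 5)))   # columns are orthonormal by construction
round(t(A) %*% A, 10)                    # the 5 x 5 identity, up to rounding error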

Orthogonal decomposition

  • Suppose we can decompose the explanatory variables into two groups \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) which are orthogonal to each other,

    • we will not assume, however, that each column is of norm 1:

    \[ \begin{align} \mathbf{X} &\triangleq \begin{pmatrix} \mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \end{align} \]

  • Notice that (regardless of orthogonality) we have the equality:

\[ \begin{align} \mathbf{X}\beta &= \begin{pmatrix} \mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \\ &= \mathbf{X}_1 \beta_1 + \mathbf{X}_2 \beta_2 \end{align} \]

so that we split the explanatory variables and parameters into two groups via the definition of the matrix product.
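  • A small sketch in R (with arbitrary simulated blocks and parameters) confirming that the matrix product splits this way:
set.seed(2)
X1 <- matrix(rnorm(20), 10, 2)
X2 <- matrix(rnorm(30), 10, 3)
beta1 <- c(1, -1)
beta2 <- c(2, 0, 3)
all.equal(cbind(X1, X2) %*% c(beta1, beta2), X1 %*% beta1 + X2 %*% beta2)  # TRUE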

Orthogonal decomposition

  • Q: Assume that \( \mathbf{X} \triangleq \begin{pmatrix}\mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \) and \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) are orthogonal to each other. What does \( \mathbf{X}^\mathrm{T} \mathbf{X} \) equal block-wise?

  • Solution: we find that the matrix product yields,

    \[ \begin{align} \mathbf{X}^\mathrm{T} \mathbf{X} &= \begin{pmatrix} \mathbf{X}^\mathrm{T}_1 \mathbf{X}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{X}^\mathrm{T}_2 \mathbf{X}_2 \end{pmatrix} \end{align} \] due to the orthogonality of the two blocks of columns.

  • Q: using the above fact, can you derive how the orthogonal projection operator \[ \mathbf{H}= \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \] decomposes in terms of \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \)?

  • A: From this fact, we can now write the product

    \[ \begin{align} \mathbf{H} &\triangleq \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \\ &= \begin{pmatrix} \mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \begin{pmatrix} \left(\mathbf{X}^\mathrm{T}_1 \mathbf{X}_1\right)^{-1} & \mathbf{0} \\ \mathbf{0} & \left(\mathbf{X}^\mathrm{T}_2 \mathbf{X}_2\right)^{-1} \end{pmatrix} \begin{pmatrix} \mathbf{X}_1^\mathrm{T} \\ \mathbf{X}_2^\mathrm{T} \end{pmatrix} \\ &= \mathbf{X}_1\left(\mathbf{X}^\mathrm{T}_1 \mathbf{X}_1\right)^{-1}\mathbf{X}_1^\mathrm{T} + \mathbf{X}_2\left(\mathbf{X}^\mathrm{T}_2 \mathbf{X}_2\right)^{-1}\mathbf{X}_2^\mathrm{T}, \end{align} \] i.e., \( \mathbf{H} \) is the sum of the orthogonal projections onto the column spaces of \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) separately.
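  • A short sketch in R (building the two blocks from orthogonal columns of a QR factor and rescaling them, so the columns are orthogonal but not of norm one) verifying the decomposition:
set.seed(3)
Q <- qr.Q(qr(matrix(rnorm(50), 10, 5)))
X1 <- Q[, 1:2] * 3                           # orthogonal blocks,
X2 <- Q[, 3:5] * 5                           # columns not normalized
round(crossprod(X1, X2), 10)                 # cross-block inner products are zero
X  <- cbind(X1, X2)
H  <- X %*% solve(crossprod(X)) %*% t(X)
H1 <- X1 %*% solve(crossprod(X1)) %*% t(X1)
H2 <- X2 %*% solve(crossprod(X2)) %*% t(X2)
all.equal(H, H1 + H2)                        # TRUE: the projection is the sum of the two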

Orthogonal decomposition

  • In the previous question, we see that the prediction of the fitted values decomposes entirely along the two subsets of variables.

  • Likewise, we will find that \( \hat{\boldsymbol{\beta}} \) decomposes into two sets of parameters \( \hat{\boldsymbol{\beta}}_1 \) and \( \hat{\boldsymbol{\beta}}_2 \).

  • Q: recall that the estimated covariance of the parameter values is given as, \[ \begin{align} cov\left(\hat{\boldsymbol{\beta}}\right) &=\sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} . \end{align} \] What does orthogonality of the columns of \( \mathbf{X} \) imply about the covariance of the parameters \( \hat{\boldsymbol{\beta}} \)?

  • A: in particular, if the columns of \( \mathbf{X} \) are orthogonal to each other, we find that the estimated parameters \( \hat{\boldsymbol{\beta}} \) are uncorrelated.

  • Qualitatively, we should understand that the value of one parameter estimate \( \hat{\beta}_i \) does not inform the value of the estimate \( \hat{\beta}_j \) for \( i\neq j \).
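  • A minimal simulation in R (with two centered, mutually orthogonal predictors, so that they are also orthogonal to the intercept column of ones) illustrating the uncorrelated estimates:
set.seed(4)
x1 <- rep(c(-1, 1), each = 10)               # mean zero, orthogonal to x2
x2 <- rep(c(-1, 1), times = 10)              # mean zero
y  <- 1 + 2 * x1 - x2 + rnorm(20)
fit <- lm(y ~ x1 + x2)
round(cov2cor(vcov(fit)), 10)                # off-diagonal correlations are all zero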

Orthogonality and correlation

  • We want to relate the notion of orthogonality more directly to the correlation between variables in the statistical sense.

  • Let \( \overline{\mathbf{X}}^{(i)} \) be the mean of column \( i \) of the design matrix, i.e., \[ \overline{\mathbf{X}}^{(i)} \triangleq \frac{1}{n} \sum_{k=1}^n X_{k,i}, \] summing over the matrix entries \( X_{k,i} \) along the rows \( k=1,\cdots,n \).

  • We will then define the \( (k,i) \)-th anomaly as \[ a_{(k,i)} = X_{k,i} - \overline{\mathbf{X}}^{(i)}, \] such that the matrix \( \mathbf{A} \) is defined column-wise as \[ \begin{align} \mathbf{A}^{(i)} &\triangleq \mathbf{X}^{(i)} - \frac{1}{n} \boldsymbol{1}\boldsymbol{1}^\mathrm{T}\mathbf{X}^{(i)} \end{align} \] where \( \boldsymbol{1} \) is the vector of ones, \[ \begin{align} \boldsymbol{1}^\mathrm{T} \triangleq \begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix}. \end{align} \]
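  • A brief sketch in R (on a small simulated design matrix) computing the anomalies both by subtracting column means and with the ones-vector formula above:
set.seed(5)
X <- matrix(rnorm(12), 4, 3)
ones <- rep(1, nrow(X))
A1 <- sweep(X, 2, colMeans(X))                   # subtract each column mean
A2 <- X - (ones %*% t(ones) %*% X) / nrow(X)     # the (1/n) 1 1^T X form
all.equal(A1, A2)                                # TRUE
round(colMeans(A1), 10)                          # each anomaly column has mean zero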

Orthogonality and correlation

  • If we consider, as is standard, that the entries of \( \mathbf{X} \) are deterministic constants, we may then discuss the sample-based correlation of the predictors.

  • The sample-based correlation coefficient of the variables \( X_i \) and \( X_j \) can be written, \[ \begin{align} cor(X_i,X_j)\triangleq \frac{\left(\mathbf{A}^{(i)}\right)^\mathrm{T} \mathbf{A}^{(j)}}{\sqrt{\left[\left(\mathbf{A}^{(i)}\right)^\mathrm{T}\mathbf{A}^{(i)}\right] \left[\left(\mathbf{A}^{(j)}\right)^\mathrm{T}\mathbf{A}^{(j)}\right]}}. \end{align} \]

  • Q: if the variables \( X_i \) and \( X_j \) are uncorrelated, what does this say about their anomalies?

  • A: they must be orthogonal.
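  • A quick check in R (with arbitrary simulated columns) that the built-in cor matches this normalized inner product of anomalies:
set.seed(6)
x <- rnorm(30)
y <- rnorm(30)
ax <- x - mean(x)                                # anomalies of x
ay <- y - mean(y)                                # anomalies of y
sum(ax * ay) / sqrt(sum(ax^2) * sum(ay^2))       # same value as cor(x, y)
cor(x, y)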

Orthogonality and correlation

  • We can thus consider a change of variables for our standard model in the case that our variables are uncorrelated.

  • Let us suppose the form of the model is

    \[ \begin{align} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} \\ &= \beta_0 + \sum_{i=1}^{p-1} \beta_i X_i\\ &= \beta_0 + \sum_{i=1}^{p-1} \beta_i \left(X_i- \overline{\mathbf{X}}^{(i)} + \overline{\mathbf{X}}^{(i)}\right)\\ &= \left[\beta_0 + \sum_{i=1}^{p-1} \beta_i\overline{\mathbf{X}}^{(i)}\right] + \sum_{i=1}^{p-1} \beta_i A_i \end{align} \] i.e., where we re-write the model in terms of the anomalies as the predictors.

  • Recall, by our assumptions, \( \left[\beta_0 + \sum_{i=1}^{p-1} \beta_i\overline{\mathbf{X}}^{(i)}\right] \) is just a constant which can be re-named as \( \tilde{\beta}_0 \).

  • Q: for the above model, are the parameter estimates for \( \tilde{\beta}_0, \beta_1, \cdots, \beta_{p-1} \) correlated?

Orthogonality and correlation

  • A: no; by the orthogonality of the anomalies, the cross-product matrix \( \mathbf{A}^\mathrm{T}\mathbf{A} \) is diagonal, so that the covariance of the slope estimates in terms of the anomalies,

    \[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\beta}}\right) &= \sigma^2 \left(\mathbf{A}^\mathrm{T}\mathbf{A}\right)^{-1}, \end{align} \] is the inverse of a diagonal matrix and is therefore diagonal itself.

    • Moreover, each anomaly column sums to zero, so the columns of \( \mathbf{A} \) are also orthogonal to the intercept column of ones, and the estimate of \( \tilde{\beta}_0 \) is uncorrelated with the slope estimates as well.
  • Thus, in the case that the predictors are uncorrelated, the parameter estimates for the model in terms of the anomalies are mutually uncorrelated (though their variances equal \( \sigma^2 \) only if each anomaly column is additionally normalized to length one).
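  • A short sketch in R (using poly() to generate predictors whose anomaly columns are exactly orthogonal, with a simulated response) confirming the uncorrelated estimates:
set.seed(7)
P <- poly(1:20, degree = 3)                  # columns have mean zero and are mutually orthogonal
y <- drop(2 + P %*% c(1, -1, 0.5) + rnorm(20))
fit <- lm(y ~ P)
round(cov2cor(vcov(fit)), 10)                # the identity matrix: uncorrelated estimates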

Orthogonality and correlation

  • We can thus see that having uncorrelated predictors allows us to construct a model (possibly in terms of the anomalies) in which the estimated parameters are also uncorrelated.

  • In this case, we can view the parameter estimates loosely as “close-to-independent”.

    • However, remember that uncorrelated is only close-to-independent but not equivalent.
  • It tells us that we cannot infer information about one parameter from the value of another;

    • therefore, this also tells us that the estimated values for a particular parameter will not change based on the other predictors included in the model.
  • This is an extremely useful property that typically is only a product of good experimental design – in situ data from observations often have more complicated correlation structures.

    • Particularly, the estimated value of one parameter will often depend on the other variables included in our model.

An example of orthogonal predictors

  • An experiment was run to determine the effects of column temperature, gas/liquid ratio, and packing height on reducing the unpleasant odor of a chemical product that was sold for household use.
require("faraway")
odor
   odor temp gas pack
1    66   -1  -1    0
2    39    1  -1    0
3    43   -1   1    0
4    49    1   1    0
5    58   -1   0   -1
6    17    1   0   -1
7    -5   -1   0    1
8   -40    1   0    1
9    65    0  -1   -1
10    7    0   1   -1
11   43    0  -1    1
12  -22    0   1    1
13  -31    0   0    0
14  -35    0   0    0
15  -26    0   0    0
  • Note: Temperature has been rescaled by the transformation \[ \text{temp} = \frac{\text{Fahrenheit} - 80}{40} \]

An example of orthogonal predictors

  • If we reverse the transformation for temp, we get

    \[ \text{Fahrenheit} = \text{temp}\times 40 + 80 \]

  • Therefore,

fahrenheit <- odor$temp * 40 + 80
fahrenheit
 [1]  40 120  40 120  40 120  40 120  80  80  80  80  80  80  80
  • Notice that
mean(fahrenheit)
[1] 80
  • so that the original model is actually a rescaled anomaly model:
mean(odor$temp)
[1] 0
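  • Indeed, temp is exactly the Fahrenheit anomaly divided by the scale factor of 40 (a quick check):
all.equal(odor$temp, (fahrenheit - mean(fahrenheit)) / 40)  # TRUE
  • Rescaling a predictor by a constant in this way rescales its coefficient estimate and standard error by the reciprocal factor, but leaves the corresponding t statistic unchanged.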

An example of orthogonal predictors

  • We will compute the covariance of the explanatory variables, temp, gas and pack:
cov(odor[,-1])
          temp       gas      pack
temp 0.5714286 0.0000000 0.0000000
gas  0.0000000 0.5714286 0.0000000
pack 0.0000000 0.0000000 0.5714286
  • With the zero values in the off-diagonal elements, we know that the predictors for this model are uncorrelated; since each predictor also has mean zero, the anomalies coincide with the raw columns, and so the columns of the design matrix are orthogonal as well.
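  • A direct check of this orthogonality (a quick sketch; the cross-product matrix is diagonal, with diagonal entries equal to each column's sum of squares):
crossprod(as.matrix(odor[, -1]))  # zero off-diagonal entries; each diagonal entry is 8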

An example of orthogonal predictors

  • We will fit a linear model for the response variable odor with the explanatory variables temp, gas and pack.
lmod <- lm(odor ~ temp + gas + pack, odor)
summary(lmod,cor=T)

Call:
lm(formula = odor ~ temp + gas + pack, data = odor)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.200 -17.137   1.175  20.300  62.925 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   15.200      9.298   1.635    0.130
temp         -12.125     12.732  -0.952    0.361
gas          -17.000     12.732  -1.335    0.209
pack         -21.375     12.732  -1.679    0.121

Residual standard error: 36.01 on 11 degrees of freedom
Multiple R-squared:  0.3337,    Adjusted R-squared:  0.1519 
F-statistic: 1.836 on 3 and 11 DF,  p-value: 0.1989

Correlation of Coefficients:
     (Intercept) temp gas 
temp 0.00                 
gas  0.00        0.00     
pack 0.00        0.00 0.00

An example of orthogonal predictors

  • If we repeat fitting the linear model but leave out the variable pack, we find the same estimates for the remaining coefficients; only the standard errors and the residual degrees of freedom change. This is not guaranteed for all models, but follows here from the orthogonality of the predictors.
lmod <- lm(odor ~ temp + gas, odor)
summary(lmod,cor=T)

Call:
lm(formula = odor ~ temp + gas, data = odor)

Residuals:
   Min     1Q Median     3Q    Max 
-50.20 -36.76  10.80  26.18  62.92 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   15.200      9.978   1.523    0.154
temp         -12.125     13.663  -0.887    0.392
gas          -17.000     13.663  -1.244    0.237

Residual standard error: 38.64 on 12 degrees of freedom
Multiple R-squared:  0.1629,    Adjusted R-squared:  0.02342 
F-statistic: 1.168 on 2 and 12 DF,  p-value: 0.344

Correlation of Coefficients:
     (Intercept) temp
temp 0.00            
gas  0.00        0.00