Orthogonality and correlation

09/30/2020


Outline

  • The following topics will be covered in this lecture:
    • A deeper look at orthogonality
    • Defining a model in terms of anomalies
    • Correlation and orthogonality

Orthogonality

  • We have largely discussed the case where there is some dependence (i.e., correlation) among the explanatory variables.

    • For super-saturated models (with \( p> n \)), there must be linear dependence among explanatory variables and there is no way to recover “best” values for \( \beta \) as there are infinitely many solutions.
    • For cases of linear dependence between variables (with \( p< n \)), there is redundancy between these variables that adds no useful information.
    • Linear dependence, or near-dependence (i.e., highly correlated variables), when \( p < n \) leads to singular or nearly singular systems – in particular, there may be numerical issues in solving the least squares problem, or the uncertainty of the predictions may be extremely large.
  • Q: Qualitatively, what occurs when the explanatory variables are totally statistically independent?

  • A: each variable contributes unique information to the model, which cannot be inferred from the values of the other variables.

  • Q: How does this aid our analysis?

  • A: in one sense, we maximize the value of each estimated parameter \( \beta_i \) as it corresponds to a statistically independent variable's contribution to the response.

  • This is closely related to the idea of orthogonality, in which the spaces spanned by the variables are mutually perpendicular.

Orthogonality

  • Orthogonality can be loosely read as “perpendicular”.

  • Recall an equivalent description of the vector inner product,

\[ \begin{align} \mathbf{a} \cdot \mathbf{b} & = \parallel \mathbf{a} \parallel \parallel \mathbf{b} \parallel \cos(\theta) \\ &= \text{"length of } \mathbf{a}\text{"} \times \text{"length of } \mathbf{b}\text{"} \times \cos( \text{"the angle between"}) \end{align} \]

  • Q: If there are 90 degrees between the two vectors \( \mathbf{a} \) and \( \mathbf{b} \), then what does the inner product \( \mathbf{a} \cdot \mathbf{b} \) equal?

  • A: \( \cos(90^\circ)=0 \), such that the inner product must vanish – therefore, orthogonal vectors have a zero inner product.
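  • As a quick numerical sketch in R (using two hand-picked perpendicular vectors), the inner product vanishes and the recovered angle is 90 degrees:
a <- c(1, 1, 0)
b <- c(1, -1, 0)                                         # chosen perpendicular to a
sum(a * b)                                               # inner product is 0
acos(sum(a * b) / sqrt(sum(a^2) * sum(b^2))) * 180 / pi  # angle between a and b: 90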

Orthogonal decomposition

  • This idea extends to matrices, when the columns of a matrix (or of two different matrices) are mutually orthogonal.

  • In particular, if \( \mathbf{A} \) is an orthogonal matrix then \( \mathbf{A}^\mathrm{T}\mathbf{A} = \mathbf{I} \).

    • The transpose product here represents the dot product of each column of \( \mathbf{A} \) with each other column.
    • Each column is orthogonal to each of the others, so off diagonal entries are zero.
    • We call the matrix \( \mathbf{A} \) orthogonal when each of the columns is also normalized to have length one, thus giving the ones on the diagonal.
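  • A minimal sketch in R (building an orthogonal matrix from the Q factor of a QR decomposition of a random matrix) verifying that the transpose product recovers the identity:
set.seed(1)
A <- qr.Q(qr(matrix(rnorm(25), 5, 5)))   # columns are orthonormal by construction
round(t(A) %*% A, 10)                    # the 5 x 5 identity, up to rounding error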

Orthogonal decomposition

  • Suppose we can decompose the explanatory variables into two groups \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) which are orthogonal to each other,

    • we will not assume, however, that each column is of norm 1:

    \[ \begin{align} \mathbf{X} &\triangleq \begin{pmatrix} \mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \end{align} \]

  • Notice that (regardless of orthogonality) we have the equality:

\[ \begin{align} \mathbf{X}\beta &= \begin{pmatrix} \mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \\ &= \mathbf{X}_1 \beta_1 + \mathbf{X}_2 \beta_2 \end{align} \]

so that we split the explanatory variables and parameters into two groups via the definition of the matrix product.
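  • A small sketch in R (with arbitrary simulated blocks and parameters) confirming that the matrix product splits this way:
set.seed(2)
X1 <- matrix(rnorm(20), 10, 2)
X2 <- matrix(rnorm(30), 10, 3)
beta1 <- c(1, -1)
beta2 <- c(2, 0, 3)
all.equal(cbind(X1, X2) %*% c(beta1, beta2), X1 %*% beta1 + X2 %*% beta2)  # TRUE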

Orthogonal decomposition

  • Q: Assume that \( \mathbf{X} \triangleq \begin{pmatrix}\mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \) and \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) are orthogonal to each other. What does \( \mathbf{X}^\mathrm{T} \mathbf{X} \) equal block-wise?

  • Solution: we find that the matrix product yields,

    \[ \begin{align} \mathbf{X}^\mathrm{T} \mathbf{X} &= \begin{pmatrix} \mathbf{X}^\mathrm{T}_1 \mathbf{X}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{X}^\mathrm{T}_2 \mathbf{X}_2 \end{pmatrix} \end{align} \] due to the orthogonality of the two blocks of columns.

  • Q: using the above fact, can you derive how the orthogonal projection operator \[ \mathbf{H}= \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \] decomposes in terms of \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \)?

  • A: From this fact, we can now write the product

    \[ \begin{align} \mathbf{H} &\triangleq \mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \\ &= \begin{pmatrix} \mathbf{X}_1 \vert \mathbf{X}_2 \end{pmatrix} \begin{pmatrix} \left(\mathbf{X}^\mathrm{T}_1 \mathbf{X}_1\right)^{-1} & \mathbf{0} \\ \mathbf{0} & \left(\mathbf{X}^\mathrm{T}_2 \mathbf{X}_2\right)^{-1} \end{pmatrix} \begin{pmatrix} \mathbf{X}_1^\mathrm{T} \\ \mathbf{X}_2^\mathrm{T} \end{pmatrix} \\ &= \mathbf{X}_1\left(\mathbf{X}^\mathrm{T}_1 \mathbf{X}_1\right)^{-1}\mathbf{X}_1^\mathrm{T} + \mathbf{X}_2\left(\mathbf{X}^\mathrm{T}_2 \mathbf{X}_2\right)^{-1}\mathbf{X}_2^\mathrm{T}, \end{align} \] i.e., \( \mathbf{H} \) is the sum of the orthogonal projections onto the column spaces of \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) separately.
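  • A short sketch in R (building the two blocks from orthogonal columns of a QR factor and rescaling them, so the columns are orthogonal but not of norm one) verifying the decomposition:
set.seed(3)
Q <- qr.Q(qr(matrix(rnorm(50), 10, 5)))
X1 <- Q[, 1:2] * 3                           # orthogonal blocks,
X2 <- Q[, 3:5] * 5                           # columns not normalized
round(crossprod(X1, X2), 10)                 # cross-block inner products are zero
X  <- cbind(X1, X2)
H  <- X %*% solve(crossprod(X)) %*% t(X)
H1 <- X1 %*% solve(crossprod(X1)) %*% t(X1)
H2 <- X2 %*% solve(crossprod(X2)) %*% t(X2)
all.equal(H, H1 + H2)                        # TRUE: the projection is the sum of the two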

Orthogonal decomposition

  • In the previous question, we see that the prediction of the fitted values decomposes entirely along the two subsets of variables.

  • Likewise, we will find that \( \hat{\boldsymbol{\beta}} \) decomposes into two sets of parameters \( \hat{\boldsymbol{\beta}}_1 \) and \( \hat{\boldsymbol{\beta}}_2 \).

  • Q: recall that the estimated covariance of the parameter values is given as, \[ \begin{align} cov\left(\hat{\boldsymbol{\beta}}\right) &=\sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} . \end{align} \] What does orthogonality of the columns of \( \mathbf{X} \) imply about the covariance of the parameters \( \hat{\boldsymbol{\beta}} \)?

  • A: in particular, if the columns of \( \mathbf{X} \) are orthogonal to each other, we find that the estimated parameters \( \hat{\boldsymbol{\beta}} \) are uncorrelated.

  • Qualitatively, we should understand that the value of one parameter estimate \( \hat{\beta}_i \) does not inform the value of the estimate \( \hat{\beta}_j \) for \( i\neq j \).
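  • A minimal simulation in R (with two centered, mutually orthogonal predictors, so that they are also orthogonal to the intercept column of ones) illustrating the uncorrelated estimates:
set.seed(4)
x1 <- rep(c(-1, 1), each = 10)               # mean zero, orthogonal to x2
x2 <- rep(c(-1, 1), times = 10)              # mean zero
y  <- 1 + 2 * x1 - x2 + rnorm(20)
fit <- lm(y ~ x1 + x2)
round(cov2cor(vcov(fit)), 10)                # off-diagonal correlations are all zero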

Orthogonality and correlation

  • We want to relate the notion of orthogonality more directly to the correlation between variables in the statistical sense.

  • Let \( \overline{\mathbf{X}}^{(i)} \) be the mean of column \( i \) of the design matrix, i.e., \[ \overline{\mathbf{X}}^{(i)} \triangleq \frac{1}{n} \sum_{k=1}^n X_{k,i}, \] summing over the matrix entries \( X_{k,i} \) along the rows \( k=1,\cdots,n \).

  • We will then define the \( (k,i) \)-th anomaly as \[ a_{(k,i)} = X_{k,i} - \overline{\mathbf{X}}^{(i)}, \] such that the matrix \( \mathbf{A} \) is defined column-wise as \[ \begin{align} \mathbf{A}^{(i)} &\triangleq \mathbf{X}^{(i)} - \frac{1}{n} \boldsymbol{1}\boldsymbol{1}^\mathrm{T}\mathbf{X}^{(i)} \end{align} \] where \( \boldsymbol{1} \) is the vector of ones, \[ \begin{align} \boldsymbol{1}^\mathrm{T} \triangleq \begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix}. \end{align} \]
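  • A brief sketch in R (on a small simulated design matrix) computing the anomalies both by subtracting column means and with the ones-vector formula above:
set.seed(5)
X <- matrix(rnorm(12), 4, 3)
ones <- rep(1, nrow(X))
A1 <- sweep(X, 2, colMeans(X))                   # subtract each column mean
A2 <- X - (ones %*% t(ones) %*% X) / nrow(X)     # the (1/n) 1 1^T X form
all.equal(A1, A2)                                # TRUE
round(colMeans(A1), 10)                          # each anomaly column has mean zero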

Orthogonality and correlation

  • If we consider, as is standard, that the entries of \( \mathbf{X} \) are deterministic constants, we may then discuss the sample-based correlation of the predictors.

  • The sample-based correlation coefficient of the variables \( X_i \) and \( X_j \) can be written, \[ \begin{align} cor(X_i,X_j)\triangleq \frac{\left(\mathbf{A}^{(i)}\right)^\mathrm{T} \mathbf{A}^{(j)}}{\sqrt{\left[\left(\mathbf{A}^{(i)}\right)^\mathrm{T}\mathbf{A}^{(i)}\right] \left[\left(\mathbf{A}^{(j)}\right)^\mathrm{T}\mathbf{A}^{(j)}\right]}}. \end{align} \]

  • Q: if the variables \( X_i \) and \( X_j \) are uncorrelated, what does this say about their anomalies?

  • A: they must be orthogonal.
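  • A quick check in R (with arbitrary simulated columns) that the built-in cor matches this normalized inner product of anomalies:
set.seed(6)
x <- rnorm(30)
y <- rnorm(30)
ax <- x - mean(x)                                # anomalies of x
ay <- y - mean(y)                                # anomalies of y
sum(ax * ay) / sqrt(sum(ax^2) * sum(ay^2))       # same value as cor(x, y)
cor(x, y)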

Orthogonality and correlation

  • We can thus consider a change of variables for our standard model in the case that our variables are uncorrelated.

  • Let us suppose the form of the model is

    \[ \begin{align} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} \\ &= \beta_0 + \sum_{i=1}^{p-1} \beta_i X_i\\ &= \beta_0 + \sum_{i=1}^{p-1} \beta_i \left(X_i- \overline{\mathbf{X}}^{(i)} + \overline{\mathbf{X}}^{(i)}\right)\\ &= \left[\beta_0 + \sum_{i=1}^{p-1} \beta_i\overline{\mathbf{X}}^{(i)}\right] + \sum_{i=1}^{p-1} \beta_i A_i \end{align} \] i.e., where we re-write the model in terms of the anomalies as the predictors.

  • Recall, by our assumptions, \( \left[\beta_0 + \sum_{i=1}^{p-1} \beta_i\overline{\mathbf{X}}^{(i)}\right] \) is just a constant which can be re-named as \( \tilde{\beta}_0 \).

  • Q: for the above model, are the parameter estimates for \( \tilde{\beta}_0, \beta_1, \cdots, \beta_{p-1} \) correlated?

Orthogonality and correlation

  • A: no; by the orthogonality of the anomalies, the cross-product matrix \( \mathbf{A}^\mathrm{T}\mathbf{A} \) is diagonal, so that the covariance of the slope estimates in terms of the anomalies,

    \[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\beta}}\right) &= \sigma^2 \left(\mathbf{A}^\mathrm{T}\mathbf{A}\right)^{-1}, \end{align} \] is the inverse of a diagonal matrix and is therefore diagonal itself.

    • Moreover, each anomaly column sums to zero, so the columns of \( \mathbf{A} \) are also orthogonal to the intercept column of ones, and the estimate of \( \tilde{\beta}_0 \) is uncorrelated with the slope estimates as well.
  • Thus, in the case that the predictors are uncorrelated, the parameter estimates for the model in terms of the anomalies are mutually uncorrelated (though their variances equal \( \sigma^2 \) only if each anomaly column is additionally normalized to length one).
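  • A short sketch in R (using poly() to generate predictors whose anomaly columns are exactly orthogonal, with a simulated response) confirming the uncorrelated estimates:
set.seed(7)
P <- poly(1:20, degree = 3)                  # columns have mean zero and are mutually orthogonal
y <- drop(2 + P %*% c(1, -1, 0.5) + rnorm(20))
fit <- lm(y ~ P)
round(cov2cor(vcov(fit)), 10)                # the identity matrix: uncorrelated estimates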

Orthogonality and correlation

  • We can thus see that having uncorrelated predictors allows us to construct a model (possibly in terms of the anomalies) in which the estimated parameters are also uncorrelated.

  • In this case, we can view the parameter estimates loosely as “close-to-independent”.

    • However, remember that uncorrelated is only close-to-independent but not equivalent.
  • It tells us that we cannot infer information about one parameter from the value of another;

    • therefore, this also tells us that the estimated values for a particular parameter will not change based on the other predictors included in the model.
  • This is an extremely useful property that typically is only a product of good experimental design – in situ data from observations often have more complicated correlation structures.

    • Particularly, the estimated value of one parameter will often depend on the other variables included in our model.

An example of orthogonal predictors

  • An experiment was run to determine the effects of column temperature, gas/liquid ratio, and packing height on reducing the unpleasant odor of a chemical product that was sold for household use.
require("faraway")
odor
   odor temp gas pack
1    66   -1  -1    0
2    39    1  -1    0
3    43   -1   1    0
4    49    1   1    0
5    58   -1   0   -1
6    17    1   0   -1
7    -5   -1   0    1
8   -40    1   0    1
9    65    0  -1   -1
10    7    0   1   -1
11   43    0  -1    1
12  -22    0   1    1
13  -31    0   0    0
14  -35    0   0    0
15  -26    0   0    0
  • Note: Temperature has been rescaled by the transformation \[ \text{temp} = \frac{\text{Fahrenheit} - 80}{40} \]

An example of orthogonal predictors

  • If we reverse the transformation for temp, we get

    \[ \text{Fahrenheit} = \text{temp}\times 40 + 80 \]

  • Therefore,

fahrenheit <- odor$temp * 40 + 80
fahrenheit
 [1]  40 120  40 120  40 120  40 120  80  80  80  80  80  80  80
  • Notice that
mean(fahrenheit)
[1] 80
  • so that the original model is actually a rescaled anomaly model:
mean(odor$temp)
[1] 0
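  • Indeed, temp is exactly the Fahrenheit anomaly divided by the scale factor of 40 (a quick check):
all.equal(odor$temp, (fahrenheit - mean(fahrenheit)) / 40)  # TRUE
  • Rescaling a predictor by a constant in this way rescales its coefficient estimate and standard error by the reciprocal factor, but leaves the corresponding t statistic unchanged.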

An example of orthogonal predictors

  • We will compute the covariance of the explanatory variables, temp, gas and pack:
cov(odor[,-1])
          temp       gas      pack
temp 0.5714286 0.0000000 0.0000000
gas  0.0000000 0.5714286 0.0000000
pack 0.0000000 0.0000000 0.5714286
  • With the zero values in the off-diagonal elements, we know that the predictors for this model are uncorrelated; since each predictor also has mean zero, the anomalies coincide with the raw columns, and so the columns of the design matrix are orthogonal as well.
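  • A direct check of this orthogonality (a quick sketch; the cross-product matrix is diagonal, with diagonal entries equal to each column's sum of squares):
crossprod(as.matrix(odor[, -1]))  # zero off-diagonal entries; each diagonal entry is 8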

An example of orthogonal predictors

  • We will fit a linear model for the response variable odor with the explanatory variables temp, gas and pack.
lmod <- lm(odor ~ temp + gas + pack, odor)
summary(lmod,cor=T)

Call:
lm(formula = odor ~ temp + gas + pack, data = odor)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.200 -17.137   1.175  20.300  62.925 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   15.200      9.298   1.635    0.130
temp         -12.125     12.732  -0.952    0.361
gas          -17.000     12.732  -1.335    0.209
pack         -21.375     12.732  -1.679    0.121

Residual standard error: 36.01 on 11 degrees of freedom
Multiple R-squared:  0.3337,    Adjusted R-squared:  0.1519 
F-statistic: 1.836 on 3 and 11 DF,  p-value: 0.1989

Correlation of Coefficients:
     (Intercept) temp gas 
temp 0.00                 
gas  0.00        0.00     
pack 0.00        0.00 0.00

An example of orthogonal predictors

  • If we repeat fitting the linear model but leave out the variable pack, we find the same estimates for the remaining coefficients; only the standard errors and the residual degrees of freedom change. This is not guaranteed for all models, but follows here from the orthogonality of the predictors.
lmod <- lm(odor ~ temp + gas, odor)
summary(lmod,cor=T)

Call:
lm(formula = odor ~ temp + gas, data = odor)

Residuals:
   Min     1Q Median     3Q    Max 
-50.20 -36.76  10.80  26.18  62.92 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   15.200      9.978   1.523    0.154
temp         -12.125     13.663  -0.887    0.392
gas          -17.000     13.663  -1.244    0.237

Residual standard error: 38.64 on 12 degrees of freedom
Multiple R-squared:  0.1629,    Adjusted R-squared:  0.02342 
F-statistic: 1.168 on 2 and 12 DF,  p-value: 0.344

Correlation of Coefficients:
     (Intercept) temp
temp 0.00            
gas  0.00        0.00