
The following topics will be covered in this lecture:
- The covariance and the correlation between two random variables
- Examining sample correlation between variables
- The covariance and the correlation matrix of a random vector
- The sample covariance matrix

We have now introduced the **expected value for a random vector \( \boldsymbol{\xi} \)** as the analog of the center of mass in multiple variables.

In one dimension, the **variance \( \mathrm{var}\left(X\right)=\sigma_X^2 \) and the standard deviation \( \sigma_X \) give us measures of the spread** of the random variable and of the data derived from observations of it.

We define the **variance of \( X \)** once again as \[ \mathrm{var}\left(X\right) = \sigma^2_X = \mathbb{E}\left[\left(X - \mu_X\right)^2\right], \] which can be seen as the average deviation of the random variable \( X \) from its mean, in the square sense.
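As a quick numerical sketch of this definition (the simulated sample and seed here are arbitrary), the scaled sum of squared deviations from the sample mean matches base R's `var`, which uses the \( n-1 \) divisor:

```r
# Sketch: the sample variance as the (n - 1)-scaled sum of squared
# deviations from the sample mean, compared against base R's var().
set.seed(1)
x <- rnorm(1000, mean = 2, sd = 3)
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(manual_var, var(x))  # TRUE
```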

When we have two random variables \( X \) and \( Y \), we will need to take additional consideration of how these variables **co-vary together or oppositely in their conditional probability**.

- This will be in the same sense of how they **vary from their centers of mass, but simultaneously in space**.

Consider that for the univariate expectation, with the **two random variables \( X \) and \( Y \)**, we have \[ \begin{align} \mathbb{E}\left[ X + Y \right] &= \mathbb{E}\left[X \right] + \mathbb{E}\left[ Y\right] \\ &=\mu_X + \mu_Y. \end{align} \]

However, the **same does not apply when we take the variance** of the sum of the variables: \[ \begin{align} \mathrm{var}\left( X+Y\right) &= \mathbb{E}\left[ \left(X + Y - \mu_X - \mu_Y\right)^2\right] \\ &=\mathbb{E}\left[\left\{ \left( X - \mu_X \right) +\left( Y - \mu_Y \right) \right\}^2\right]\\ & = \mathbb{E}\left[ \left( X - \mu_X \right)^2 + \left( Y - \mu_Y \right)^2 + 2 \left(X - \mu_X \right)\left(Y - \mu_Y\right)\right] \end{align} \]

**Q:** using the linearity of the expectation and the definition of the variance, how can the above be simplified?

**A:** using that \( \mathrm{var}\left(X\right) =\mathbb{E}\left[\left(X - \mu_X \right)^2 \right] \) and similarly in \( Y \),

\[ \begin{align} \mathrm{var}\left( X+Y\right) &= \mathrm{var}\left(X\right) + \mathrm{var}\left(Y\right) + 2 \mathbb{E}\left[\left(X - \mu_X \right)\left(Y - \mu_Y\right)\right] \end{align} \]

Therefore, the combination of the random variables has a variance that is equal to the

**sum of the variances plus the newly identified cross terms**.
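This identity holds exactly for the sample statistics as well, not just in expectation; a minimal sketch in R (simulated data, arbitrary seed):

```r
# Sketch: var(X + Y) = var(X) + var(Y) + 2 cov(X, Y), checked on a sample.
set.seed(2)
x <- rnorm(500)
y <- 0.5 * x + rnorm(500)  # y is constructed to co-vary with x
lhs <- var(x + y)
rhs <- var(x) + var(y) + 2 * cov(x, y)
all.equal(lhs, rhs)  # TRUE
```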

We note that **if \( X \) and \( Y \) are independent**, i.e., \[ \begin{align} P(X\vert Y) = P(X) & & P(Y \vert X) = P(Y); \end{align} \]

then we have \[ \begin{align} \mathbb{E}\left[\left(X - \mu_X\right) \left(Y - \mu_Y \right)\right] = \mathbb{E}\left[X - \mu_X\right] \mathbb{E}\left[Y - \mu_Y \right] = 0. \end{align} \]

Therefore, we can consider the **covariance**, \[ \mathrm{cov}\left(X,Y\right) = \sigma_{X,Y} = \mathbb{E}\left[\left(X - \mu_X \right)\left(Y - \mu_Y\right)\right], \] to be a measure of **how the variables \( X \) and \( Y \) co-vary together in their conditional probabilities**.

We should note that while \( \mathrm{cov}\left(X,Y\right)=0 \) for any pair of independent variables, **this condition is not the same as independence**.

Particularly, we will define \[ \begin{align} \mathrm{cor}(X,Y) =\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}\left(X\right) \mathrm{var}\left(Y\right)}}=\frac{\sigma_{X,Y}}{\sqrt{\sigma_{X}^2 \sigma_{Y}^2}} \end{align} \] to be the correlation between the variables \( X \) and \( Y \).
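As a sketch of this definition with simulated data (the particular sample is arbitrary), the sample correlation is exactly the sample covariance rescaled by the two standard deviations:

```r
# Sketch: cor(x, y) is cov(x, y) divided by the product of the
# standard deviations, matching the definition of rho_{X,Y}.
set.seed(3)
x <- rnorm(300)
y <- x + rnorm(300)
rho <- cov(x, y) / (sd(x) * sd(y))
all.equal(rho, cor(x, y))  # TRUE
```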

If the correlation / covariance of the two variables \( X \) and \( Y \) is equal to zero, then \[ \mathrm{var}\left( X+Y\right) = \mathrm{var}\left(X\right) + \mathrm{var}\left(Y\right), \] but this **does not imply that they are independent**, just that we cannot detect the dependence structure with this measure.

**Q:** how can you use the above definition of the correlation to show that \( X \) always has correlation \( 1 \) with itself?

**A:** notice that the variance of \( X \), \( \sigma_X^2 \), and the standard deviation \( \sigma_X \) can be substituted into the above to obtain,

\[ \mathrm{cor}(X,X) =\rho_{X,X} = \frac{\mathrm{cov}(X,X)}{\sqrt{\mathrm{var}\left(X\right) \mathrm{var}\left(X\right)}}=\frac{\sigma_{X}^2}{\sqrt{\sigma_{X}^2 \sigma_{X}^2}}= 1 \]

More generally, we can say that for any two random variables \( X \) and \( Y \),

\[ -1 \leq \mathrm{cor}\left(X,Y\right)\leq 1. \]

This can be shown as follows, where

\[ \begin{align} 0 & \leq \mathrm{var}\left( \frac{X}{\sigma_X} + \frac{Y}{\sigma_Y} \right) \\ &=\mathrm{var}\left(\frac{X}{\sigma_X}\right) + \mathrm{var}\left(\frac{Y}{\sigma_Y}\right) + 2\mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) \end{align} \] using the relationship we have just shown.

We note that when we divide a random variable by its standard deviation, the variance becomes one;

- therefore,

\[ \begin{align} & 0 \leq 1 + 1 +2 \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) \\ \Leftrightarrow & -1\leq \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) \end{align} \]

Let's recall that we just showed,

\[ -1\leq \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) . \]

Let's note that, \( \mathbb{E}\left[ \frac{X}{\sigma_X} \right] = \frac{\mu_X}{\sigma_X} \) so that \[ \begin{align} \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) &= \mathbb{E}\left[\left(\frac{X -\mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right] \\ &= \frac{\mathbb{E}\left[\left(X -\mu_X\right)\left(Y - \mu_Y \right) \right]}{\sigma_X \sigma_Y}\\ &= \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \\ &= \mathrm{cor}(X,Y) \end{align} \]

Using the two statements above, we have \[ \begin{align} \Leftrightarrow & -1 \leq \mathrm{cor}\left(X,Y\right) \end{align} \]

If we repeat the above argument with \( -X \) in the place of \( X \), we will get the statement \( \mathrm{cor}\left(X,Y\right) \leq 1 \) to complete the argument.

In the last slide we showed how we can identify,

\[ -1 \leq \mathrm{cor}\left(X,Y\right)\leq 1 \] for any pair of random variables \( X \) and \( Y \).
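A quick empirical sketch of this bound (the simulated pairs below are arbitrary):

```r
# Sketch: the sample correlation always lands in [-1, 1],
# whatever the dependence between the two variables.
set.seed(5)
for (i in 1:10) {
  x <- rnorm(50)
  y <- rnorm(50) * x + runif(50)
  stopifnot(cor(x, y) >= -1, cor(x, y) <= 1)
}
```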

With the above range in mind, we say that a **correlation of “close-to-one”** means that the **variables \( X \) and \( Y \) vary together almost identically**;

- an increase in \( X \) corresponds almost identically to a proportional increase in \( Y \).

Conversely, a **correlation of “close-to-negative-one”** means that the **variables \( X \) and \( Y \) vary together almost identically oppositely**;

- an increase in \( X \) corresponds almost identically to a proportional decrease in \( Y \).

This can be understood similarly by taking the \( \mathrm{cov}\left(-X,X\right) \);

- notice that

\[ \begin{align} \mathrm{cov}\left(-X, X\right) &= \mathbb{E}\left[\left(-X - (-\mu_X) \right)\left( X - \mu_X\right) \right] \\ &= - \mathbb{E}\left[\left( X - \mu_X\right)^2\right]\\ &= - \mathrm{cov}(X,X) \end{align} \]
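Numerically (a sketch with an arbitrary simulated sample), this gives a correlation of exactly \( -1 \) between a variable and its negation:

```r
# Sketch: cov(-X, X) = -var(X), so X is perfectly
# anti-correlated with its own negation.
set.seed(4)
x <- rnorm(200)
all.equal(cov(-x, x), -var(x))  # TRUE
all.equal(cor(-x, x), -1)       # TRUE
```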

- Now, as an example, we will consider the tabular data `women` in base R, with the recorded average heights and weights of American women aged 30 - 39.

```
summary(women)
```

```
height weight
Min. :58.0 Min. :115.0
1st Qu.:61.5 1st Qu.:124.5
Median :65.0 Median :135.0
Mean :65.0 Mean :136.7
3rd Qu.:68.5 3rd Qu.:148.0
Max. :72.0 Max. :164.0
```

```
head(women)
```

```
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
```

- The sample covariance and correlation can be computed for the above data as,

```
cov(women)
```

```
height weight
height 20 69.0000
weight 69 240.2095
```

```
cor(women)
```

```
height weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
```

- It is often more informative to visualize the correlation graphically, as is done with the library `corrplot`:

```
require(corrplot)
corrplot(cor(women))
```

This allows us once again to identify the key inter-relationship where height and weight tend to vary together proportionately for women, with correlation close to one.

The **covariance matrix and the correlation matrix** pictured before, numerically and visually, actually have very specific meanings for theoretical reasons.

Likewise, their efficient computation over larger data sets is an important problem;

- the remainder of this viewing assignment is to discuss how we arrive at the covariance matrix and its computation in practice.

We suppose now that we have a random vector, \( \boldsymbol{\xi} \) defined by some distribution \( F_\boldsymbol{\xi}(\mathbf{x}) \) where each component is a random variable,

\[ \begin{align} \boldsymbol{\xi} = \begin{pmatrix} \xi_1 \\ \vdots \\ \xi_p \end{pmatrix} \in \mathbb{R}^p. \end{align} \]

For each component random variable, we may similarly define,

\[ \begin{align} \mathrm{var}\left(\xi_i\right) &= \mathbb{E}\left[ \left(\xi_i - \mu_i\right)^2 \right] \\ \mathrm{cov}\left(\xi_i, \xi_j \right) &= \mathbb{E}\left[ \left(\xi_i - \mu_i\right) \left(\xi_j - \mu_j\right) \right] \end{align} \] as we did for \( X \) and \( Y \).

The component-wise definition above is convenient in how it extends from the simple discussion before;

- however, algebraically and computationally, this becomes much simpler to define in terms of the vector outer product.

Our Euclidean norm

\[ \parallel \mathbf{v}\parallel = \sqrt{\mathbf{v}^\mathrm{T} \mathbf{v}} \] gives the general form for a distance in arbitrarily large dimensions.

Notice it is defined in terms of the vector inner product, where

\[ \mathbf{v}^\mathrm{T}\mathbf{v} =\begin{pmatrix}v_1 & \cdots & v_p \end{pmatrix} \begin{pmatrix}v_1 \\ \vdots \\ v_p\end{pmatrix} = \sum_{i=1}^p v_i^2 \]

If we instead change the order of the transpose, we obtain the outer product as

\[ \begin{align} \mathbf{v}\mathbf{v}^\mathrm{T}& = \begin{pmatrix} v_1 \\ \vdots \\ v_p \end{pmatrix} \begin{pmatrix}v_1 & \cdots & v_p\end{pmatrix} \\ &= \begin{pmatrix} v_1 v_1 & v_1 v_2 & \cdots & v_1 v_p \\ v_2 v_1 & v_2 v_2 & \cdots & v_2 v_p \\ \vdots & \vdots & \ddots & \vdots \\ v_p v_1 & v_p v_2 & \cdots & v_p v_p \end{pmatrix}, \end{align} \] which is instead matrix valued in the output.
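In R, both products can be sketched directly with the `%*%` operator (the vector `v` below is arbitrary); note that a bare vector is coerced to a column or row matrix as needed:

```r
# Sketch: inner product (a 1 x 1 scalar) versus outer product (a p x p matrix).
v <- c(1, 2, 3)
inner <- t(v) %*% v     # 1 x 1: the sum of squares, here 14
outer_vv <- v %*% t(v)  # 3 x 3: entry (i, j) is v_i * v_j
inner[1, 1]             # 14
outer_vv[1, 3]          # 3, i.e. v_1 * v_3
```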

When we extend the notion of the covariance to a random vector \( \boldsymbol{\xi} \), finding the variances and the covariances of all of its entries, we arrive at the notion of covariance using the outer product.

Particularly, suppose that \( \mathbb{E}\left[\boldsymbol{\xi}\right] = \boldsymbol{\mu} \); then we write

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\xi}\right) = \boldsymbol{\Sigma} = \mathbb{E}\left[\left(\boldsymbol{\xi}-\boldsymbol{\mu}\right) \left(\boldsymbol{\xi} - \boldsymbol{\mu} \right)^\mathrm{T} \right] \end{align} \]

Recall, we define

\[ \boldsymbol{\xi} - \boldsymbol{\mu} = \begin{pmatrix} \xi_1 - \mu_1 \\ \vdots \\ \xi_p - \mu_p \end{pmatrix}. \]

Using the previous outer product formula, we obtain the product

\[ \begin{align} \left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)\left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)^\mathrm{T} &= \begin{pmatrix} \left(\xi_1 - \mu_1\right)\left(\xi_1 - \mu_1 \right) & \cdots & \left(\xi_1 - \mu_1 \right) \left(\xi_p - \mu_p \right) \\ \vdots & \ddots & \vdots \\ \left(\xi_p - \mu_p \right)\left(\xi_1 - \mu_1 \right)& \cdots & \left(\xi_p - \mu_p \right)\left(\xi_p - \mu_p \right) \end{pmatrix} \end{align} \]

Using the previous argument, we can easily show that the **entry in the \( i \)-th row and the \( j \)-th column** is given by \[ \boldsymbol{\Sigma}_{ij} = \begin{cases} \mathrm{var}\left( \xi_i\right) & & \text{when }i=j \\ \mathrm{cov}\left(\xi_i,\xi_j\right) & & \text{when } i \neq j \end{cases} \]

The above covariances and variances are to be understood in the same sense as in the univariate discussion, but for the component random variables \( \xi_i \) and \( \xi_j \).

Note, the covariance \( \mathrm{cov}\left(\xi_i, \xi_j\right) = \mathrm{cov}\left(\xi_j, \xi_i\right) \) is symmetric;

- therefore, \( \boldsymbol{\Sigma} \) enjoys all of the properties of the spectral theorem.

Furthermore, the **eigenvalues of \( \boldsymbol{\Sigma} \) are always non-negative**.

If the component random variables \( \xi_i,\xi_j \) are uncorrelated, \( \boldsymbol{\Sigma} \) is also diagonal,

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \mathrm{var}(\xi_1) & 0 & \cdots & 0 \\ 0 & \mathrm{var}(\xi_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \cdots & \mathrm{var}(\xi_p) \end{pmatrix} \] and the eigenvalues are identically the variances.
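As a sketch, a diagonal \( \boldsymbol{\Sigma} \) with hypothetical variances 4, 2, 1 has exactly those eigenvalues, and the eigenvalues of any sample covariance matrix are non-negative:

```r
# Sketch: for uncorrelated components the covariance is diagonal and
# its eigenvalues are exactly the component variances.
Sigma <- diag(c(4, 2, 1))
eigen(Sigma, symmetric = TRUE)$values  # 4 2 1

# Any sample covariance matrix is positive semi-definite:
set.seed(6)
X <- matrix(rnorm(300), ncol = 3)
all(eigen(cov(X), symmetric = TRUE)$values >= 0)  # TRUE
```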

This is strongly related to the concept we introduced with the **curvature of the Hessian**, which, as we will see, has important consequences for maximum likelihood estimation.

Some basic properties of the covariance follow immediately from the linearity of the expectation over sums.

Suppose that \( \mathbf{A} \) is a constant valued matrix, \( \mathbf{b} \) is a constant valued vector and \( \boldsymbol{\xi} \) is a random vector with expected value \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \).

Then notice that,

\[ \begin{align} \mathbb{E}\left[ \boldsymbol{\xi} + \mathbf{b} \right] &= \mathbb{E}\left[\boldsymbol{\xi} \right] + \mathbf{b}\\ &= \boldsymbol{\mu} + \mathbf{b} \end{align} \]

Therefore, we have that,

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\xi} + \mathbf{b}\right) &= \mathbb{E}\left[\left(\boldsymbol{\xi} + \mathbf{b} - \boldsymbol{\mu} - \mathbf{b}\right)\left(\boldsymbol{\xi} + \mathbf{b} - \boldsymbol{\mu} - \mathbf{b}\right)^\mathrm{T} \right]\\ &= \mathbb{E}\left[\left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)\left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)^\mathrm{T} \right]\\ &= \mathrm{cov}\left(\boldsymbol{\xi}\right) \end{align} \]

We have also discussed that

\[ \begin{align} \mathbb{E}\left[ \mathbf{A} \boldsymbol{\xi} \right] &= \mathbf{A}\mathbb{E}\left[ \boldsymbol{\xi}\right] \\ &= \mathbf{A} \boldsymbol{\mu} \end{align} \]

It follows as a direct consequence that,

\[ \begin{align} \mathrm{cov}\left(\mathbf{A}\boldsymbol{\xi}\right)&= \mathbb{E}\left[\left(\mathbf{A}\boldsymbol{\xi} - \mathbf{A}\boldsymbol{\mu} \right)\left(\mathbf{A}\boldsymbol{\xi} - \mathbf{A}\boldsymbol{\mu} \right)^\mathrm{T} \right]\\ &=\mathbb{E}\left[\left\{ \mathbf{A} \left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)\right\} \left\{ \mathbf{A} \left(\boldsymbol{\xi} - \boldsymbol{\mu} \right) \right\}^\mathrm{T} \right] \\ &= \mathbf{A}\mathbb{E}\left[\left(\boldsymbol{\xi} - \boldsymbol{\mu} \right)\left(\boldsymbol{\xi} - \boldsymbol{\mu} \right)^\mathrm{T}\right] \mathbf{A}^\mathrm{T} \\ &=\mathbf{A}\mathrm{cov}\left(\boldsymbol{\xi}\right)\mathbf{A}^\mathrm{T} \end{align} \]
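These two properties hold exactly for the sample covariance as well; the following sketch checks them with an arbitrary matrix \( \mathbf{A} \) and shift \( \mathbf{b} \):

```r
# Sketch: cov(A xi + b) = A cov(xi) A^T; the constant shift b drops out.
set.seed(7)
X <- matrix(rnorm(1000 * 3), ncol = 3)    # rows are draws of a 3-vector
A <- matrix(c(1, 2, 0,
              0, 1, 1), nrow = 2, byrow = TRUE)
b <- c(5, -5)
Y <- t(A %*% t(X) + b)                    # apply xi -> A xi + b to each row
all.equal(cov(Y), A %*% cov(X) %*% t(A))  # TRUE
```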

Let us recall our construction of the random sample of vectors, drawn from some parent distribution \( F_\boldsymbol{\xi} \).

We will consider a sample size \( n \), with random vectors \( \{\boldsymbol{\xi}_i\}_{i=1}^n \) of length \( p \), to all be drawn from the same parent distribution for \( \boldsymbol{\xi} \sim F_\boldsymbol{\xi} \).

- I.e., we will say that this is an i.i.d. sample of the distribution / population for all of the jointly distributed component variables of \( \boldsymbol{\xi} \),

\[ \boldsymbol{\xi} = \begin{pmatrix} \xi_1 \\ \vdots \\ \xi_p\end{pmatrix}. \]

For each copy of the component random variable \( \xi_j \) we denote the \( i \)-th member in the sample \( \xi_{i,j} \) and we assemble these copies into the matrix \( \boldsymbol{\Xi} \) column-wise;

- that is, copies of \( \xi_j \) lie in the same column \( j \) and have rows indexed by \( i \)

\[ \begin{align} \boldsymbol{\Xi} = \begin{pmatrix} \xi_{1,1} & \cdots & \xi_{1,p} \\ \xi_{2,1} & \cdots & \xi_{2,p} \\ \vdots & \cdots & \vdots \\ \xi_{n,1}& \cdots & \xi_{n,p} \end{pmatrix} \end{align} \]

Let us recall that we again define the sample-based mean of the random variable \( \xi_j \), corresponding to the \( j \)-th entry of each of the above random vectors, as

\[ \begin{align} \overline{\xi}_j = \frac{1}{n} \sum_{i=1}^n \xi_{ij}, \end{align} \] which is itself a random variable.

This is equivalent to taking a column average of the random matrix over all rows \( i=1,\cdots, n \)

\[ \begin{align} \boldsymbol{\Xi} & = \begin{pmatrix} \xi_{11} & \cdots & \xi_{1p} \\ \vdots & \ddots & \vdots \\ \xi_{n1} & \cdots & \xi_{np} \end{pmatrix} \end{align} \]

Particularly, for the vector with all entries equal to one \( \boldsymbol{1}_n \), we have,

\[ \begin{align} \overline{\boldsymbol{\xi}} = \boldsymbol{\Xi}^\mathrm{T}\frac{1}{n} \mathbf{1}_n = \begin{pmatrix} \overline{\xi}_1 \\ \vdots \\ \overline{\xi}_p \end{pmatrix} \end{align} \] as defined above.
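This matrix form of the sample mean can be sketched in R directly (arbitrary simulated data):

```r
# Sketch: the sample mean vector as Xi^T (1/n) 1_n, i.e. the column means.
set.seed(8)
Xi <- matrix(rnorm(10 * 4), nrow = 10)  # n = 10 draws of a 4-vector
n <- nrow(Xi)
xbar <- t(Xi) %*% rep(1 / n, n)         # 4 x 1 matrix of column means
all.equal(drop(xbar), colMeans(Xi))     # TRUE
```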

We can thus define the sample covariance matrix in a way analogous to how we define the sample mean – we will only sketch this here.

Particularly, if we follow the above matrix multiplication with \( \mathbf{1}_n^\mathrm{T} \), we find that

\[ \begin{align} \boldsymbol{\Xi}^\mathrm{T}\frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\mathrm{T} = \begin{pmatrix} \overline{\xi}_1 & \cdots & \overline{\xi}_1 \\ \vdots & \ddots & \vdots \\ \overline{\xi}_p & \cdots &\overline{\xi}_p \end{pmatrix}\in\mathbb{R}^{p \times n} \end{align} \]

- Using the element-wise subtraction, this says that,

\[ \begin{align} \boldsymbol{\Xi}^\mathrm{T} - \boldsymbol{\Xi}^\mathrm{T}\frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\mathrm{T} = \begin{pmatrix} \xi_{1,1} - \overline{\xi}_1 & \cdots &\xi_{n,1}- \overline{\xi}_1 \\ \vdots & \ddots & \vdots \\ \xi_{1,p} - \overline{\xi}_p & \cdots & \xi_{n,p} - \overline{\xi}_p \end{pmatrix} \end{align} \]

Now recall, the sample variance of a sample \( \{X_i\}_{i=1}^n \) can simply be written as

\[ \begin{align} S_X^2 = \frac{1}{n-1} \sum_{i=1}^n \left(X_i - \overline{X}\right)^2 \end{align} \]

We can use the previous matrix to derive the analogous statement for the random sample of the vectors as,

\[ \begin{align} \hat{\boldsymbol{\Sigma}} = \frac{1}{n-1} \boldsymbol{\Xi}^\mathrm{T}\left(\mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}^\mathrm{T}_n\right)\boldsymbol{\Xi} \end{align} \] where:

- \( \mathbf{I}_n \) is the identity matrix in \( n \) dimensions;
- \( \mathcal{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}^\mathrm{T}_n \) is known as the centering matrix;
- the resulting product above computes the sample covariance of the component random variables of the vectors in each entry of the matrix;
- for observations of a random sample \( \mathbf{X} \), we can use the above formula with \( \mathbf{X} \) in place of \( \boldsymbol{\Xi} \) to compute a point estimate of the true \( \mathrm{cov}\left(\boldsymbol{\xi}\right) \).
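Putting the pieces together, a sketch that evaluates this formula on the `women` data from earlier and compares it against base R's `cov`:

```r
# Sketch: the sample covariance via the centering matrix I_n - (1/n) 1 1^T,
# compared against cov() on the women data in base R.
X <- as.matrix(women)
n <- nrow(X)
H <- diag(n) - matrix(1 / n, n, n)       # centering matrix
Sigma_hat <- t(X) %*% H %*% X / (n - 1)
all.equal(Sigma_hat, cov(X))  # TRUE
```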

When we have a matrix of tabular data or a dataframe in R, this is actually the exact formula that is used to compute the covariance matrix for the sample.

Numerically this provides an efficient implementation, as well as an algebraically compact means to represent this in terms of matrix multiplication.