
The following topics will be covered in this lecture:
- The covariance and the correlation between two random variables
- Examining sample correlation between variables
- The covariance and the correlation matrix of a random vector
- The sample covariance matrix

We have now introduced the **expected value for a random vector \( \boldsymbol{\xi} \)** as the analog of the center of mass in multiple variables.

In one dimension, the **variance \( \mathrm{var}\left(X\right)=\sigma_X^2 \) and the standard deviation \( \sigma_X \) give us measures of the spread** of the random variable and of the data derived from observations of it.

We define the **variance of \( X \)** once again as \[ \mathrm{var}\left(X\right) = \sigma^2_X = \mathbb{E}\left[\left(X - \mu_X\right)^2\right], \] which can be seen as the average deviation of the random variable \( X \) from its mean, in the square sense.
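As a quick numerical sketch of this definition (the simulated sample and seed here are arbitrary), the scaled sum of squared deviations from the sample mean matches base R's `var`, which uses the \( n-1 \) divisor:

```r
# Sketch: the sample variance as the (n - 1)-scaled sum of squared
# deviations from the sample mean, compared against base R's var().
set.seed(1)
x <- rnorm(1000, mean = 2, sd = 3)
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(manual_var, var(x))  # TRUE
```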

When we have two random variables \( X \) and \( Y \), we will need to take additional consideration of how these variables **co-vary together or oppositely in their conditional probability**.

- This will be in the same sense of how they **vary from their centers of mass, but simultaneously in space**.

Consider that for the univariate expectation, with the **two random variables \( X \) and \( Y \)**, we have \[ \begin{align} \mathbb{E}\left[ X + Y \right] &= \mathbb{E}\left[X \right] + \mathbb{E}\left[ Y\right] \\ &=\mu_X + \mu_Y. \end{align} \]

However, the **same does not apply when we take the variance** of the sum of the variables: \[ \begin{align} \mathrm{var}\left( X+Y\right) &= \mathbb{E}\left[ \left(X + Y - \mu_X - \mu_Y\right)^2\right] \\ &=\mathbb{E}\left[\left\{ \left( X - \mu_X \right) +\left( Y - \mu_Y \right) \right\}^2\right]\\ & = \mathbb{E}\left[ \left( X - \mu_X \right)^2 + \left( Y - \mu_Y \right)^2 + 2 \left(X - \mu_X \right)\left(Y - \mu_Y\right)\right] \end{align} \]

**Q:** using the linearity of the expectation and the definition of the variance, how can the above be simplified?

**A:** using that \( \mathrm{var}\left(X\right) =\mathbb{E}\left[\left(X - \mu_X \right)^2 \right] \) and similarly in \( Y \),

\[ \begin{align} \mathrm{var}\left( X+Y\right) &= \mathrm{var}\left(X\right) + \mathrm{var}\left(Y\right) + 2 \mathbb{E}\left[\left(X - \mu_X \right)\left(Y - \mu_Y\right)\right] \end{align} \]

Therefore, the combination of the random variables has a variance that is equal to the

**sum of the variances plus the newly identified cross terms**.
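This identity holds exactly for the sample statistics as well, not just in expectation; a minimal sketch in R (simulated data, arbitrary seed):

```r
# Sketch: var(X + Y) = var(X) + var(Y) + 2 cov(X, Y), checked on a sample.
set.seed(2)
x <- rnorm(500)
y <- 0.5 * x + rnorm(500)  # y is constructed to co-vary with x
lhs <- var(x + y)
rhs <- var(x) + var(y) + 2 * cov(x, y)
all.equal(lhs, rhs)  # TRUE
```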

We note that **if \( X \) and \( Y \) are independent**, i.e., \[ \begin{align} P(X\vert Y) = P(X) & & P(Y \vert X) = P(Y); \end{align} \]

then we have \[ \begin{align} \mathbb{E}\left[\left(X - \mu_X\right) \left(Y - \mu_Y \right)\right] = \mathbb{E}\left[X - \mu_X\right] \mathbb{E}\left[Y - \mu_Y \right] = 0. \end{align} \]

Therefore, we can consider the **covariance**, \[ \mathrm{cov}\left(X,Y\right) = \sigma_{X,Y} = \mathbb{E}\left[\left(X - \mu_X \right)\left(Y - \mu_Y\right)\right], \] to be a measure of **how the variables \( X \) and \( Y \) co-vary together in their conditional probabilities**.

We should note that while \( \mathrm{cov}\left(X,Y\right)=0 \) for any pair of independent variables, **this condition is not the same as independence**.

Particularly, we will define \[ \begin{align} \mathrm{cor}(X,Y) =\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}\left(X\right) \mathrm{var}\left(Y\right)}}=\frac{\sigma_{X,Y}}{\sqrt{\sigma_{X}^2 \sigma_{Y}^2}} \end{align} \] to be the correlation between the variables \( X \) and \( Y \).
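As a sketch of this definition with simulated data (the particular sample is arbitrary), the sample correlation is exactly the sample covariance rescaled by the two standard deviations:

```r
# Sketch: cor(x, y) is cov(x, y) divided by the product of the
# standard deviations, matching the definition of rho_{X,Y}.
set.seed(3)
x <- rnorm(300)
y <- x + rnorm(300)
rho <- cov(x, y) / (sd(x) * sd(y))
all.equal(rho, cor(x, y))  # TRUE
```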

If the correlation / covariance of the two variables \( X \) and \( Y \) is equal to zero, then \[ \mathrm{var}\left( X+Y\right) = \mathrm{var}\left(X\right) + \mathrm{var}\left(Y\right), \] but this **does not imply that they are independent**, just that we cannot detect the dependence structure with this measure.

**Q:** how can you use the above definition of the correlation to show that \( X \) always has correlation \( 1 \) with itself?

**A:** notice that the variance of \( X \), \( \sigma_X^2 \), and the standard deviation \( \sigma_X \) can be substituted into the above to obtain,

\[ \mathrm{cor}(X,X) =\rho_{X,X} = \frac{\mathrm{cov}(X,X)}{\sqrt{\mathrm{var}\left(X\right) \mathrm{var}\left(X\right)}}=\frac{\sigma_{X}^2}{\sqrt{\sigma_{X}^2 \sigma_{X}^2}}= 1 \]

More generally, we can say that for any two random variables \( X \) and \( Y \),

\[ -1 \leq \mathrm{cor}\left(X,Y\right)\leq 1. \]

This can be shown as follows, where

\[ \begin{align} 0 & \leq \mathrm{var}\left( \frac{X}{\sigma_X} + \frac{Y}{\sigma_Y} \right) \\ &=\mathrm{var}\left(\frac{X}{\sigma_X}\right) + \mathrm{var}\left(\frac{Y}{\sigma_Y}\right) + 2\mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) \end{align} \] using the relationship we have just shown.

We note that when we divide a random variable by its standard deviation, the variance becomes one;

- therefore,

\[ \begin{align} & 0 \leq 1 + 1 +2 \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) \\ \Leftrightarrow & -1\leq \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) \end{align} \]

Let's recall that we just showed,

\[ -1\leq \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) . \]

Let's note that, \( \mathbb{E}\left[ \frac{X}{\sigma_X} \right] = \frac{\mu_X}{\sigma_X} \) so that \[ \begin{align} \mathrm{cov}\left(\frac{X}{\sigma_X},\frac{Y}{\sigma_Y}\right) &= \mathbb{E}\left[\left(\frac{X -\mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right] \\ &= \frac{\mathbb{E}\left[\left(X -\mu_X\right)\left(Y - \mu_Y \right) \right]}{\sigma_X \sigma_Y}\\ &= \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \\ &= \mathrm{cor}(X,Y) \end{align} \]

Using the two statements above, we have \[ \begin{align} \Leftrightarrow & -1 \leq \mathrm{cor}\left(X,Y\right) \end{align} \]

If we repeat the above argument with \( -X \) in the place of \( X \), we will get the statement \( \mathrm{cor}\left(X,Y\right) \leq 1 \) to complete the argument.

In the last slide we showed how we can identify,

\[ -1 \leq \mathrm{cor}\left(X,Y\right)\leq 1 \] for any pair of random variables \( X \) and \( Y \).
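A quick empirical sketch of this bound (the simulated pairs below are arbitrary):

```r
# Sketch: the sample correlation always lands in [-1, 1],
# whatever the dependence between the two variables.
set.seed(5)
for (i in 1:10) {
  x <- rnorm(50)
  y <- rnorm(50) * x + runif(50)
  stopifnot(cor(x, y) >= -1, cor(x, y) <= 1)
}
```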

With the above range in mind, we say that a **correlation of “close-to-one”** means that the **variables \( X \) and \( Y \) vary together almost identically**;

- an increase in \( X \) corresponds almost identically to a proportional increase in \( Y \).

Conversely, a **correlation of “close-to-negative-one”** means that the **variables \( X \) and \( Y \) vary together almost identically oppositely**;

- an increase in \( X \) corresponds almost identically to a proportional decrease in \( Y \).

This can be understood similarly by taking the \( \mathrm{cov}\left(-X,X\right) \);

- notice that

\[ \begin{align} \mathrm{cov}\left(-X, X\right) &= \mathbb{E}\left[\left(-X - (-\mu_X) \right)\left( X - \mu_X\right) \right] \\ &= - \mathbb{E}\left[\left( X - \mu_X\right)^2\right]\\ &= - \mathrm{cov}(X,X) \end{align} \]
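Numerically (a sketch with an arbitrary simulated sample), this gives a correlation of exactly \( -1 \) between a variable and its negation:

```r
# Sketch: cov(-X, X) = -var(X), so X is perfectly
# anti-correlated with its own negation.
set.seed(4)
x <- rnorm(200)
all.equal(cov(-x, x), -var(x))  # TRUE
all.equal(cor(-x, x), -1)       # TRUE
```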

- Now, as an example, we will consider the tabular data `women` in base R, with the recorded average heights and weights of American women aged 30 - 39.

```
summary(women)
```

```
height weight
Min. :58.0 Min. :115.0
1st Qu.:61.5 1st Qu.:124.5
Median :65.0 Median :135.0
Mean :65.0 Mean :136.7
3rd Qu.:68.5 3rd Qu.:148.0
Max. :72.0 Max. :164.0
```

```
head(women)
```

```
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
```

- The sample covariance and correlation can be computed for the above data as,

```
cov(women)
```

```
height weight
height 20 69.0000
weight 69 240.2095
```

```
cor(women)
```

```
height weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
```

- It is often more informative to visualize the correlation graphically, as is done with the library `corrplot`:

```
require(corrplot)
corrplot(cor(women))
```

This allows us once again to identify the key inter-relationship where height and weight tend to vary together proportionately for women, with correlation close to one.

The **covariance matrix and the correlation matrix** pictured before, numerically and visually, actually have very specific meanings for theoretical reasons.

Likewise, their efficient computation over larger data sets is an important problem;

- the remainder of this viewing assignment is to discuss how we arrive at the covariance matrix and its computation in practice.

We suppose now that we have a random vector, \( \boldsymbol{\xi} \) defined by some distribution \( F_\boldsymbol{\xi}(\mathbf{x}) \) where each component is a random variable,

\[ \begin{align} \boldsymbol{\xi} = \begin{pmatrix} \xi_1 \\ \vdots \\ \xi_p \end{pmatrix} \in \mathbb{R}^p. \end{align} \]

For each component random variable, we may similarly define,

\[ \begin{align} \mathrm{var}\left(\xi_i\right) &= \mathbb{E}\left[ \left(\xi_i - \mu_i\right)^2 \right] \\ \mathrm{cov}\left(\xi_i, \xi_j \right) &= \mathbb{E}\left[ \left(\xi_i - \mu_i\right) \left(\xi_j - \mu_j\right) \right] \end{align} \] as we did for \( X \) and \( Y \).

The component-wise definition above is convenient in how it extends from the simple discussion before;

- however, algebraically and computationally, this becomes much simpler to define in terms of the vector outer product.

Our Euclidean norm

\[ \parallel \mathbf{v}\parallel = \sqrt{\mathbf{v}^\mathrm{T} \mathbf{v}} \] gives the general form for a distance in arbitrarily large dimensions.

Notice it is defined in terms of the vector inner product, where

\[ \mathbf{v}^\mathrm{T}\mathbf{v} =\begin{pmatrix}v_1 & \cdots & v_p \end{pmatrix} \begin{pmatrix}v_1 \\ \vdots \\ v_p\end{pmatrix} = \sum_{i=1}^p v_i^2 \]

If we instead change the order of the transpose, we obtain the outer product as

\[ \begin{align} \mathbf{v}\mathbf{v}^\mathrm{T}& = \begin{pmatrix} v_1 \\ \vdots \\ v_p \end{pmatrix} \begin{pmatrix}v_1 & \cdots & v_p\end{pmatrix} \\ &= \begin{pmatrix} v_1 v_1 & v_1 v_2 & \cdots & v_1 v_p \\ v_2 v_1 & v_2 v_2 & \cdots & v_2 v_p \\ \vdots & \vdots & \ddots & \vdots \\ v_p v_1 & v_p v_2 & \cdots & v_p v_p \end{pmatrix}, \end{align} \] which is instead matrix valued in the output.
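In R, both products can be sketched directly with the `%*%` operator (the vector `v` below is arbitrary); note that a bare vector is coerced to a column or row matrix as needed:

```r
# Sketch: inner product (a 1 x 1 scalar) versus outer product (a p x p matrix).
v <- c(1, 2, 3)
inner <- t(v) %*% v     # 1 x 1: the sum of squares, here 14
outer_vv <- v %*% t(v)  # 3 x 3: entry (i, j) is v_i * v_j
inner[1, 1]             # 14
outer_vv[1, 3]          # 3, i.e. v_1 * v_3
```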

When we extend the notion of the covariance to a random vector \( \boldsymbol{\xi} \), finding the variances and the covariances of all of its entries, we arrive at the notion of covariance using the outer product.

Particularly, suppose that \( \mathbb{E}\left[\boldsymbol{\xi}\right] = \boldsymbol{\mu} \); then we write

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\xi}\right) = \boldsymbol{\Sigma} = \mathbb{E}\left[\left(\boldsymbol{\xi}-\boldsymbol{\mu}\right) \left(\boldsymbol{\xi} - \boldsymbol{\mu} \right)^\mathrm{T} \right] \end{align} \]

Recall, we define

\[ \boldsymbol{\xi} - \boldsymbol{\mu} = \begin{pmatrix} \xi_1 - \mu_1 \\ \vdots \\ \xi_p - \mu_p \end{pmatrix}. \]

Using the previous outer product formula, we obtain the product

\[ \begin{align} \left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)\left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)^\mathrm{T} &= \begin{pmatrix} \left(\xi_1 - \mu_1\right)\left(\xi_1 - \mu_1 \right) & \cdots & \left(\xi_1 - \mu_1 \right) \left(\xi_p - \mu_p \right) \\ \vdots & \ddots & \vdots \\ \left(\xi_p - \mu_p \right)\left(\xi_1 - \mu_1 \right)& \cdots & \left(\xi_p - \mu_p \right)\left(\xi_p - \mu_p \right) \end{pmatrix} \end{align} \]

Using the previous argument, we can easily show that the **entry in the \( i \)-th row and the \( j \)-th column** is given by \[ \boldsymbol{\Sigma}_{ij} = \begin{cases} \mathrm{var}\left( \xi_i\right) & & \text{when }i=j \\ \mathrm{cov}\left(\xi_i,\xi_j\right) & & \text{when } i \neq j \end{cases} \]

The above covariances and variances are to be understood in the same sense as in the univariate discussion, but for the component random variables \( \xi_i \) and \( \xi_j \).

Note, the covariance \( \mathrm{cov}\left(\xi_i, \xi_j\right) = \mathrm{cov}\left(\xi_j, \xi_i\right) \) is symmetric;

- therefore, \( \boldsymbol{\Sigma} \) enjoys all of the properties of the spectral theorem.

Furthermore, the **eigenvalues of \( \boldsymbol{\Sigma} \) are always non-negative**.

If the component random variables \( \xi_i,\xi_j \) are uncorrelated, \( \boldsymbol{\Sigma} \) is also diagonal,

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \mathrm{var}(\xi_1) & 0 & \cdots & 0 \\ 0 & \mathrm{var}(\xi_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \cdots & \mathrm{var}(\xi_p) \end{pmatrix} \] and the eigenvalues are identically the variances.
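As a sketch, a diagonal \( \boldsymbol{\Sigma} \) with hypothetical variances 4, 2, 1 has exactly those eigenvalues, and the eigenvalues of any sample covariance matrix are non-negative:

```r
# Sketch: for uncorrelated components the covariance is diagonal and
# its eigenvalues are exactly the component variances.
Sigma <- diag(c(4, 2, 1))
eigen(Sigma, symmetric = TRUE)$values  # 4 2 1

# Any sample covariance matrix is positive semi-definite:
set.seed(6)
X <- matrix(rnorm(300), ncol = 3)
all(eigen(cov(X), symmetric = TRUE)$values >= 0)  # TRUE
```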

This is strongly related to the concept we introduced with the **curvature of the Hessian**, which, as we will see, has important consequences for maximum likelihood estimation.

Some basic properties of the covariance follow immediately from the linearity of the expectation over sums.

Suppose that \( \mathbf{A} \) is a constant valued matrix, \( \mathbf{b} \) is a constant valued vector and \( \boldsymbol{\xi} \) is a random vector with expected value \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \).

Then notice that,

\[ \begin{align} \mathbb{E}\left[ \boldsymbol{\xi} + \mathbf{b} \right] &= \mathbb{E}\left[\boldsymbol{\xi} \right] + \mathbf{b}\\ &= \boldsymbol{\mu} + \mathbf{b} \end{align} \]

Therefore, we have that,

\[ \begin{align} \mathrm{cov}\left(\boldsymbol{\xi} + \mathbf{b}\right) &= \mathbb{E}\left[\left(\boldsymbol{\xi} + \mathbf{b} - \boldsymbol{\mu} - \mathbf{b}\right)\left(\boldsymbol{\xi} + \mathbf{b} - \boldsymbol{\mu} - \mathbf{b}\right)^\mathrm{T} \right]\\ &= \mathbb{E}\left[\left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)\left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)^\mathrm{T} \right]\\ &= \mathrm{cov}\left(\boldsymbol{\xi}\right) \end{align} \]

We have also discussed that

\[ \begin{align} \mathbb{E}\left[ \mathbf{A} \boldsymbol{\xi} \right] &= \mathbf{A}\mathbb{E}\left[ \boldsymbol{\xi}\right] \\ &= \mathbf{A} \boldsymbol{\mu} \end{align} \]

It follows as a direct consequence that,

\[ \begin{align} \mathrm{cov}\left(\mathbf{A}\boldsymbol{\xi}\right)&= \mathbb{E}\left[\left(\mathbf{A}\boldsymbol{\xi} - \mathbf{A}\boldsymbol{\mu} \right)\left(\mathbf{A}\boldsymbol{\xi} - \mathbf{A}\boldsymbol{\mu} \right)^\mathrm{T} \right]\\ &=\mathbb{E}\left[\left\{ \mathbf{A} \left(\boldsymbol{\xi} - \boldsymbol{\mu}\right)\right\} \left\{ \mathbf{A} \left(\boldsymbol{\xi} - \boldsymbol{\mu} \right) \right\}^\mathrm{T} \right] \\ &= \mathbf{A}\mathbb{E}\left[\left(\boldsymbol{\xi} - \boldsymbol{\mu} \right)\left(\boldsymbol{\xi} - \boldsymbol{\mu} \right)^\mathrm{T}\right] \mathbf{A}^\mathrm{T} \\ &=\mathbf{A}\mathrm{cov}\left(\boldsymbol{\xi}\right)\mathbf{A}^\mathrm{T} \end{align} \]
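These two properties hold exactly for the sample covariance as well; the following sketch checks them with an arbitrary matrix \( \mathbf{A} \) and shift \( \mathbf{b} \):

```r
# Sketch: cov(A xi + b) = A cov(xi) A^T; the constant shift b drops out.
set.seed(7)
X <- matrix(rnorm(1000 * 3), ncol = 3)    # rows are draws of a 3-vector
A <- matrix(c(1, 2, 0,
              0, 1, 1), nrow = 2, byrow = TRUE)
b <- c(5, -5)
Y <- t(A %*% t(X) + b)                    # apply xi -> A xi + b to each row
all.equal(cov(Y), A %*% cov(X) %*% t(A))  # TRUE
```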

Let us recall our construction of the random sample of vectors, drawn from some parent distribution \( F_\boldsymbol{\xi} \).

We will consider a sample size \( n \), with random vectors \( \{\boldsymbol{\xi}_i\}_{i=1}^n \) of length \( p \), to all be drawn from the same parent distribution for \( \boldsymbol{\xi} \sim F_\boldsymbol{\xi} \).

- I.e., we will say that this is an i.i.d. sample of the distribution / population for all of the jointly distributed component variables of \( \boldsymbol{\xi} \),

\[ \boldsymbol{\xi} = \begin{pmatrix} \xi_1 \\ \vdots \\ \xi_p\end{pmatrix}. \]

For each copy of the component random variable \( \xi_j \) we denote the \( i \)-th member in the sample \( \xi_{i,j} \) and we assemble these copies into the matrix \( \boldsymbol{\Xi} \) column-wise;

- that is, copies of \( \xi_j \) lie in the same column \( j \) and have rows indexed by \( i \)

\[ \begin{align} \boldsymbol{\Xi} = \begin{pmatrix} \xi_{1,1} & \cdots & \xi_{1,p} \\ \xi_{2,1} & \cdots & \xi_{2,p} \\ \vdots & \cdots & \vdots \\ \xi_{n,1}& \cdots & \xi_{n,p} \end{pmatrix} \end{align} \]

Let us recall that we again define the sample-based mean of the random variable \( \xi_j \), corresponding to the \( j \)-th entry of each of the above random vectors, as

\[ \begin{align} \overline{\xi}_j = \frac{1}{n} \sum_{i=1}^n \xi_{ij}, \end{align} \] which is itself a random variable.

This is equivalent to taking a column average of the random matrix over all rows \( i=1,\cdots, n \)

\[ \begin{align} \boldsymbol{\Xi} & = \begin{pmatrix} \xi_{11} & \cdots & \xi_{1p} \\ \vdots & \ddots & \vdots \\ \xi_{n1} & \cdots & \xi_{np} \end{pmatrix} \end{align} \]

Particularly, for the vector with all entries equal to one \( \boldsymbol{1}_n \), we have,

\[ \begin{align} \overline{\boldsymbol{\xi}} = \boldsymbol{\Xi}^\mathrm{T}\frac{1}{n} \mathbf{1}_n = \begin{pmatrix} \overline{\xi}_1 \\ \vdots \\ \overline{\xi}_p \end{pmatrix} \end{align} \] as defined above.
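This matrix form of the sample mean can be sketched in R directly (arbitrary simulated data):

```r
# Sketch: the sample mean vector as Xi^T (1/n) 1_n, i.e. the column means.
set.seed(8)
Xi <- matrix(rnorm(10 * 4), nrow = 10)  # n = 10 draws of a 4-vector
n <- nrow(Xi)
xbar <- t(Xi) %*% rep(1 / n, n)         # 4 x 1 matrix of column means
all.equal(drop(xbar), colMeans(Xi))     # TRUE
```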

We can thus define the sample covariance matrix in a way analogous to how we define the sample mean – we will only sketch this here.

Particularly, if we follow the above matrix multiplication with \( \mathbf{1}_n^\mathrm{T} \), we find that

\[ \begin{align} \boldsymbol{\Xi}^\mathrm{T}\frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\mathrm{T} = \begin{pmatrix} \overline{\xi}_1 & \cdots & \overline{\xi}_1 \\ \vdots & \ddots & \vdots \\ \overline{\xi}_p & \cdots &\overline{\xi}_p \end{pmatrix}\in\mathbb{R}^{p \times n} \end{align} \]

- Using the element-wise subtraction, this says that,

\[ \begin{align} \boldsymbol{\Xi}^\mathrm{T} - \boldsymbol{\Xi}^\mathrm{T}\frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\mathrm{T} = \begin{pmatrix} \xi_{1,1} - \overline{\xi}_1 & \cdots &\xi_{n,1}- \overline{\xi}_1 \\ \vdots & \ddots & \vdots \\ \xi_{1,p} - \overline{\xi}_p & \cdots & \xi_{n,p} - \overline{\xi}_p \end{pmatrix} \end{align} \]

Now recall, the sample variance of a sample \( \{X_i\}_{i=1}^n \) can simply be written as

\[ \begin{align} S_X^2 = \frac{1}{n-1} \sum_{i=1}^n \left(X_i - \overline{X}\right)^2 \end{align} \]

We can use the previous matrix to derive the analogous statement for the random sample of the vectors as,

\[ \begin{align} \hat{\boldsymbol{\Sigma}} = \frac{1}{n-1} \boldsymbol{\Xi}^\mathrm{T}\left(\mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}^\mathrm{T}_n\right)\boldsymbol{\Xi} \end{align} \] where:

- \( \mathbf{I}_n \) is the identity matrix in \( n \) dimensions;
- \( \mathcal{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}^\mathrm{T}_n \) is known as the centering matrix;
- the resulting product above computes the sample covariance of the component random variables of the vectors in each entry of the matrix;
- for observations of a random sample \( \mathbf{X} \), we can use the above formula with \( \mathbf{X} \) in place of \( \boldsymbol{\Xi} \) to compute a point estimate of the true \( \mathrm{cov}\left(\boldsymbol{\xi}\right) \).
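Putting the pieces together, a sketch that evaluates this formula on the `women` data from earlier and compares it against base R's `cov`:

```r
# Sketch: the sample covariance via the centering matrix I_n - (1/n) 1 1^T,
# compared against cov() on the women data in base R.
X <- as.matrix(women)
n <- nrow(X)
H <- diag(n) - matrix(1 / n, n, n)       # centering matrix
Sigma_hat <- t(X) %*% H %*% X / (n - 1)
all.equal(Sigma_hat, cov(X))  # TRUE
```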

When we have a matrix of tabular data or a dataframe in R, this is actually the exact formula that is used to compute the covariance matrix for the sample.

Numerically this provides an efficient implementation, as well as an algebraically compact means to represent this in terms of matrix multiplication.