Conditional expectations and Bayesian inference



Outline

The following topics will be covered in this lecture:
- Conditional expectations and Bayesian inference
- The conditional Gaussian distribution
- Correlation / Independence for the Gaussian
- Affine closure of the Gaussian

Conditional expectations and Bayesian inference

Now that we have developed some necessary theoretical tools, we will begin to consider the primary problem of this course: Bayesian inference.
Note, this bears similarity to the (linear / nonlinear) inverse problem we have seen already.
- In particular, we will be focused on how to determine the inputs of a relationship given an observable output.
However, inverse problems are, in some sense, a less realistic approach to our investigation.
That is, suppose we want to find all the random physical states of the real atmosphere given satellite observations over a sparse grid of the earth.
Philosophically, an inverse problem is problematic;
- the true atmosphere does not live in our numerical representation, which is a coarse, unrealistic representation of reality.
On the other hand, discussing which numerical model states are most likely, given our data and our prior knowledge of the physical process, is a well-posed problem.
- We do not need to model reality exactly, but we can consider which of our representations are best suited given our present knowledge.
This follows the old statistical adage,

“All models are wrong, but some are useful.”

Conditional expectations

We briefly introduced conditional probabilities as part of our first look at probability.
In doing so, we purposefully went for an intuitive approach over a mathematical one.
In truth, there is more to conditional probabilities than one might suspect.
First of all, they are actually special cases of conditional expectations.
Also, they are random variables, not scalar values like regular, or unconditional, expectations.
We will not belabor the details of conditional expectations, which require a measure-theoretic approach to derive rigorously.
However, we will introduce some intuition about this object more formally, before introducing some important properties of the conditional Gaussian.

Conditional expectations

To illustrate, let us consider two random variables, \( X \) and \( Y \), both of which are defined over a probability space \( (\Omega, \mathcal{A}, \mathcal{P} ) \).
- In the above, \( \mathcal{A} \) represents the collection of all events generated by simple events in the probability space.
We will assume that \( \mathcal{A} \) is generated by observable outcomes of the random variable \( X \);
- However, it is important to note that \( \mathcal{A} \) is not the only possible collection of observable events of the probability space.
We consider instead the collection of events associated to the second random variable when \( Y=y \) for an arbitrary \( y \), i.e., let the simple event of \( Y=y \) be given as

\[ \begin{align} B_y = \{ \omega: Y (\omega) = y\} \subset \Omega; \end{align} \]
We define the complete collection of all events generated from these simple events, varying \( y \), to be \( \mathcal{B} \).
We have implicitly assumed in this construction that \( \mathcal{B}\subset\mathcal{A} \) such that \( \mathcal{B} \) represents a coarser collection of outcomes than those generated by \( X \).
- This is to say that, observing such an outcome \( y \) of \( Y \) actually puts a restriction on the possible outcomes of \( X \).
- This follows the earlier analogy with the restriction of the sample space in the Venn diagram.

Conditional expectations

If we restrict ourselves to the simple event \( B_y \) associated to \( Y=y \), we can define a random variable, \( \mathbb{E}\left[X |B_y \right] \), via

\[ \begin{align} \int_{B_y} \mathbb{E}\left[ X | B_y \right] \mathrm{d}\mathcal{P}(\omega) := \int_{B_y} X(\omega) \mathrm{d}\mathcal{P}(\omega). \end{align} \]
In the above, we are writing the conditional expectation \( \mathbb{E}\left[X |B_y \right] \) as the expectation of the random variable \( X \), but restricted to the event associated to \( Y=y \), where \( y \) is a free variable.
- If the event associated to \( Y=y \) is all of \( \Omega \), this is simply the regular expectation of \( X \).
However, taking \( y \) as a free variable, the above represents a random variable dependent on this outcome \( Y=y \).
Note, the conditional expectation is constant over this collection, because \( B_y \) is a simple event, such that

\[ \begin{align} & \mathbb{E}\left[X| B_y\right] \mathcal{P}\left(B_y\right) = \int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega)\\ \Leftrightarrow & \mathbb{E}\left[X| B_y\right] = \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega) \mathrm{d}\mathcal{P}(\omega), \end{align} \] provided \( \mathcal{P}(B_y)\neq 0 \).
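The formula above can be checked by hand on a finite sample space, where the integral becomes a sum. The following is a minimal sketch under assumed choices that are not part of the lecture: a single roll of a fair die, with \( X \) the face value and \( Y \) the parity of the face, so that the events of \( Y \) are coarser than those of \( X \).

```python
from fractions import Fraction

# Hypothetical finite probability space: one roll of a fair die.
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}

def X(w):
    """X is the face value, so its outcomes generate every event."""
    return w

def Y(w):
    """Y is the parity of the face (1 = odd, 0 = even), a coarser observation."""
    return w % 2

def conditional_expectation(y):
    """Return E[X | B_y] = (1 / P(B_y)) * sum over w in B_y of X(w) P(w)."""
    B_y = [w for w in omega if Y(w) == y]
    p_B = sum(prob[w] for w in B_y)
    if p_B == 0:
        raise ValueError("P(B_y) = 0, so the formula does not apply")
    return sum(X(w) * prob[w] for w in B_y) / p_B

for y in (0, 1):
    print(f"E[X | Y={y}] =", conditional_expectation(y))
```

The result depends on which value \( y \) was observed (4 for even faces, 3 for odd faces), which is the sense in which the conditional expectation is itself a random variable.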

Conditional expectations

From the last slide we define the following.

Conditional expectations
Let \( y \) be some observable outcome of \( Y \), with the simple event \( B_y\subset \Omega \) associated to this value \( y \). The conditional expectation for \( X \) given \( Y=y \) is given as \[ \begin{align} \mathbb{E}\left[X| B_y\right] = \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega). \end{align} \]

This gives a mathematical sketch of what we mean by a conditional expectation.
This strongly resembles our intuitive notion of conditional probability, where we say that
- the probability of an event \( A \) given some event \( B \) is given by
- the total number of observable outcomes in \( A \), given the event \( B \),
- relative to the total number of outcomes in the collection \( B \).
This connection is made explicit in our next definition.

Conditional expectations

Consider the special case where \( X \) is actually just the indicator function of an event \( A \in \mathcal{A} \), i.e.,

\[ \begin{align} X(\omega) := \begin{cases} 1 & \text{if }\omega\in A \\ 0 & \text{else} \end{cases}. \end{align} \]
In this special case, we define the following.

Conditional probability (advanced version)
Let \( (\Omega, \mathcal{A},\mathcal{P}) \) be a probability space, generated by the simple outcomes of the indicator function random variable \( X \) above. For a simple event \( B_y \) associated to the outcome \( Y=y \), and an event \( A\in \mathcal{A} \), the conditional probability is defined as \[ \begin{align} \mathcal{P}\left(A | B_y\right) &:= \mathbb{E}\left[X| B_y\right] \\ &= \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega). \end{align} \]

This tells us that the conditional probability of \( A \) given \( B_y \) is a special case of the conditional expectation, not the other way around.

Conditional expectations

If we continue this special case, we can write

\[ \begin{align} \int_{B_y}\mathbb{E}\left[ X | B_y\right] \mathrm{d}\mathcal{P}(\omega) &:= \int_{B_y} \mathcal{P}\left(A | B_y\right) \mathrm{d}\mathcal{P}\\ &= \mathcal{P}\left(A | B_y\right) \mathcal{P}\left(B_y\right) \end{align} \] as the above is constant over the simple events of \( Y \).
On the other hand, we can also write,

\[ \begin{align} \int_{B_y}\mathbb{E}\left[ X | B_y\right] \mathrm{d}\mathcal{P}(\omega) &:= \int_{B_y} X(\omega)\mathrm{d}\mathcal{P}(\omega)\\ &= \int_{A \cap B_y} \mathrm{d}\mathcal{P}(\omega) = \mathcal{P}\left(A \cap B_y\right). \end{align} \]
Putting the above equivalence together, we have that

\[ \begin{align} \mathcal{P}\left(A | B_y\right) = \frac{\mathcal{P}\left(A \cap B_y\right)}{\mathcal{P}\left(B_y\right)}, \end{align} \] recovering our original notion of conditional probability.
The other properties of the marginal, conditional and joint densities already seen are similarly recovered by following similar arguments.
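The equivalence derived on this slide can be verified on a small discrete example. This is a sketch only, assuming a fair die, the parity events \( B_y \), and a hypothetical event \( A \) (a roll of at least four); it computes the conditional probability once as the conditional expectation of the indicator and once as the ratio \( \mathcal{P}(A\cap B_y) / \mathcal{P}(B_y) \).

```python
from fractions import Fraction

# Hypothetical finite probability space: one roll of a fair die.
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}
A = {4, 5, 6}  # assumed event for illustration: roll at least four

def indicator_A(w):
    """The random variable X: 1 if the outcome lies in A, 0 otherwise."""
    return 1 if w in A else 0

def measure(event):
    """Probability of an event, given as a collection of outcomes."""
    return sum(prob[w] for w in event)

def cond_prob_via_expectation(y):
    """P(A | B_y) computed as E[X | B_y] with X the indicator of A."""
    B_y = [w for w in omega if w % 2 == y]
    return sum(indicator_A(w) * prob[w] for w in B_y) / measure(B_y)

def cond_prob_via_ratio(y):
    """P(A | B_y) computed as P(A and B_y) / P(B_y)."""
    B_y = {w for w in omega if w % 2 == y}
    return measure(A & B_y) / measure(B_y)

for y in (0, 1):
    print(y, cond_prob_via_expectation(y), cond_prob_via_ratio(y))
```

Both routes agree for each parity, as the derivation predicts.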

The conditional Gaussian

We will now consider explicitly our main distribution for our approximations, the multivariate Gaussian.
We will start with the bi-variate Gaussian, as nearly all aspects generalize directly for arbitrary dimensions.
Suppose now that we have a random vector

\[ \begin{align} \pmb{x}:= \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{x}_1 \\ \overline{x}_2\end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} \right). \end{align} \]
- In the above, \( \rho \) refers to the background (theoretical) correlation coefficient between \( X_1 \) and \( X_2 \), i.e.,
\[ \begin{align} \rho := \frac{\sigma_{12}}{\sigma_1 \sigma_2}, \end{align} \] where \( \sigma_{12} \) is the covariance of \( X_1 \) and \( X_2 \), giving the standard form of the covariance by equivalence.
In this case, the conditional random variable \( X_1 | X_2 =a \) is distributed as

\[ \begin{align} X_1 | X_2 =a \sim N\left(\overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right), \left(1 - \rho^2 \right)\sigma_1^2\right). \end{align} \]
For those familiar already with regression, you may note that the above term

\[ \begin{align} \overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right) \end{align} \] is the simple regression for the mean of \( X_1 \), given \( X_2 = a \).
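To make the conditional formula concrete, the sketch below evaluates it for assumed parameter values and compares the result with a crude Monte Carlo estimate, obtained by sampling the joint distribution and keeping only draws where \( X_2 \) falls in a narrow window around \( a \). The function name, the parameter values, and the window width are illustrative assumptions.

```python
import numpy as np

def conditional_bivariate(mean1, mean2, sigma1, sigma2, rho, a):
    """Mean and variance of X_1 | X_2 = a for a bivariate Gaussian."""
    cond_mean = mean1 + rho * (sigma1 / sigma2) * (a - mean2)
    cond_var = (1.0 - rho**2) * sigma1**2
    return cond_mean, cond_var

# Illustrative (assumed) parameters.
mean1, mean2, sigma1, sigma2, rho, a = 1.0, -1.0, 2.0, 0.5, 0.8, -0.5
cond_mean, cond_var = conditional_bivariate(mean1, mean2, sigma1, sigma2, rho, a)

# Monte Carlo check: sample the joint distribution, keep draws with X_2 near a.
rng = np.random.default_rng(0)
cov = np.array([[sigma1**2, rho * sigma1 * sigma2],
                [rho * sigma1 * sigma2, sigma2**2]])
draws = rng.multivariate_normal([mean1, mean2], cov, size=400_000)
near_a = np.abs(draws[:, 1] - a) < 0.02
x1_given = draws[near_a, 0]

print("formula:", cond_mean, cond_var)
print("sampled:", x1_given.mean(), x1_given.var())
```

The sampled values should match the formula only up to sampling noise and the small bias introduced by the finite window.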

The conditional Gaussian

Recall the formula from the last slide,

\[ \begin{align} X_1 | X_2 =a \sim N\left(\overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right), \left(1 - \rho^2 \right)\sigma_1^2\right). \end{align} \]
Similarly, \( \left(1 - \rho^2 \right)\sigma_1^2 \) is the variance of the simple regression around the mean function.
Without assuming a specific outcome for \( X_2=a \), we find the conditional expectation given as

\[ \begin{align} \mathbb{E}\left[X_1 | X_2 \right]:= \overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(X_2 - \overline{x}_2\right), \end{align} \] where this again refers to the expected value of \( X_1 \) (its mean) given the outcome of \( X_2 \) (as a random variable).
Notice that the conditional variance is given as,

\[ \begin{align} \mathrm{var}\left(X_1 | X_2 \right):= \left(1 - \rho^2 \right)\sigma_1^2 \end{align} \] where this does not depend on the particular outcome of \( X_2 \), as in the original formula.
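The regression interpretation can be illustrated numerically. In this sketch, with assumed parameter values, a least-squares line is fit to samples of \( (X_2, X_1) \); its slope should be close to \( \rho \sigma_1 / \sigma_2 \), and the residual variance should be close to \( (1-\rho^2)\sigma_1^2 \) regardless of the range of \( X_2 \) considered.

```python
import numpy as np

# Illustrative (assumed) parameters of the bivariate Gaussian.
mean1, mean2, sigma1, sigma2, rho = 1.0, -1.0, 2.0, 0.5, 0.8
cov = np.array([[sigma1**2, rho * sigma1 * sigma2],
                [rho * sigma1 * sigma2, sigma2**2]])

rng = np.random.default_rng(1)
draws = rng.multivariate_normal([mean1, mean2], cov, size=200_000)
x1, x2 = draws[:, 0], draws[:, 1]

# Least-squares line x1 ~ intercept + slope * x2 (polyfit returns slope first).
slope, intercept = np.polyfit(x2, x1, deg=1)
residuals = x1 - (intercept + slope * x2)

# Residual variance for small and large outcomes of X_2 separately.
low = x2 < mean2
var_low = residuals[low].var()
var_high = residuals[~low].var()

print("slope:", slope, "theory:", rho * sigma1 / sigma2)
print("residual variance (low, high):", var_low, var_high,
      "theory:", (1 - rho**2) * sigma1**2)
```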

The conditional Gaussian

More generally, let's suppose that \( \pmb{x}\in \mathbb{R}^{N_x} \) is an arbitrary Gaussian random vector, partitioned as

\[ \begin{align} \pmb{x}:= \begin{pmatrix} \pmb{x}_1 \\ \pmb{x}_2 \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{\pmb{x}}_1 \\ \overline{\pmb{x}}_2\end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_1 & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_2 \end{pmatrix} \right). \end{align} \]
We suppose that the dimensions are given then as,

\[ \begin{align} \pmb{x}_1,\overline{\pmb{x}}_1 \in \mathbb{R}^{n}, \quad \pmb{x}_2,\overline{\pmb{x}}_2 \in \mathbb{R}^{N_x -n}, \quad \boldsymbol{\Sigma}_{1} \in \mathbb{R}^{n\times n}, \quad \boldsymbol{\Sigma}_{12}=\boldsymbol{\Sigma}_{21}^\top \in \mathbb{R}^{n \times (N_x - n)}, \quad \boldsymbol{\Sigma}_{2} \in \mathbb{R}^{(N_x - n) \times (N_x - n)}. \end{align} \]

General conditional Gaussian
Let \( \pmb{x}_1,\pmb{x}_2 \) be given as above, then the general form of the conditional distribution for \( \pmb{x}_1 | \pmb{x}_2 =\pmb{a} \) is given by the Gaussian \[ \begin{align} \pmb{x}_1 | \pmb{x}_2 = \pmb{a} \sim N\left(\overline{\pmb{x}}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{2}^{-1}\left(\pmb{a} - \overline{\pmb{x}}_2\right), \boldsymbol{\Sigma}_{1} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{2}^{-1} \boldsymbol{\Sigma}_{21}\right). \end{align} \]

The term \( \overline{\pmb{x}}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{2}^{-1}\left(\pmb{a} - \overline{\pmb{x}}_2\right) \) again represents the conditional mean of \( \pmb{x}_1 \) given the observed value \( \pmb{x}_2=\pmb{a} \).
Likewise, \( \boldsymbol{\Sigma}_{1} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{2}^{-1} \boldsymbol{\Sigma}_{21} \) is the covariance of \( \pmb{x}_1 \) given the observed value \( \pmb{x}_2 = \pmb{a} \).
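As a quick consistency check, consider the scalar case \( n = 1 \), \( N_x = 2 \), with \( \boldsymbol{\Sigma}_1 = \sigma_1^2 \), \( \boldsymbol{\Sigma}_2 = \sigma_2^2 \) and \( \boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21} = \rho\sigma_1\sigma_2 \). Substituting into the conditional mean and covariance gives

\[ \begin{align} \overline{x}_1 + \rho\sigma_1\sigma_2 \frac{1}{\sigma_2^2}\left(a - \overline{x}_2\right) &= \overline{x}_1 + \rho\frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right), \\ \sigma_1^2 - \rho\sigma_1\sigma_2 \frac{1}{\sigma_2^2} \rho\sigma_1\sigma_2 &= \left(1 - \rho^2\right)\sigma_1^2, \end{align} \] which recovers the bi-variate formulas.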
Similar definitions apply for the random vector / matrix \( \mathbb{E}\left[\pmb{x}_1 | \pmb{x}_2 \right] \) and \( \mathrm{cov}\left(\pmb{x}_1 | \pmb{x}_2\right) \).
For those familiar, you may recognize these as the classical Kalman filter equations in disguise – we'll return to this idea shortly in the course.
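The general conditional mean and covariance translate directly into a few lines of linear algebra. The following is a minimal sketch, not a reference implementation: the function name `conditional_gaussian` and the example numbers are assumptions, and a linear solve is used in place of an explicit inverse of \( \boldsymbol{\Sigma}_2 \) as a common numerical habit.

```python
import numpy as np

def conditional_gaussian(mean, cov, n, a):
    """Mean and covariance of x_1 | x_2 = a, where x_1 is the first n components."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    a = np.asarray(a, dtype=float)
    mean1, mean2 = mean[:n], mean[n:]
    S1, S12 = cov[:n, :n], cov[:n, n:]
    S21, S2 = cov[n:, :n], cov[n:, n:]
    # gain = S12 S2^{-1}, computed by a solve; uses the symmetry of S2.
    gain = np.linalg.solve(S2, S21).T
    cond_mean = mean1 + gain @ (a - mean2)
    cond_cov = S1 - gain @ S21
    return cond_mean, cond_cov

# Illustrative (assumed) three-dimensional example, conditioning on the last two.
mean = [1.0, 0.0, -1.0]
cov = [[2.0, 0.6, 0.3],
       [0.6, 1.0, 0.2],
       [0.3, 0.2, 0.5]]
cond_mean, cond_cov = conditional_gaussian(mean, cov, n=1, a=[0.5, -0.5])
print(cond_mean, cond_cov)
```

Passing a covariance with a zero cross-block returns the unconditional mean and covariance of \( \pmb{x}_1 \), since the gain vanishes.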

Correlation and independence for the conditional Gaussian

Let's suppose now that the sub-vectors \( \pmb{x}_1 \) and \( \pmb{x}_2 \) of \( \pmb{x} \) are uncorrelated, i.e.,

\[ \begin{align} \pmb{x} \sim N\left( \overline{\pmb{x}} , \begin{pmatrix} \boldsymbol{\Sigma}_1 & \pmb{0} \\ \pmb{0}^\top & \boldsymbol{\Sigma}_2 \end{pmatrix}\right). \end{align} \]
From the form of the conditional distribution for \( \pmb{x}_1|\pmb{x}_2=\pmb{a} \) we note that

\[ \begin{align} \pmb{x}_1 | \pmb{x}_2 = \pmb{a} \sim N\left(\overline{\pmb{x}}_1, \boldsymbol{\Sigma}_{1}\right), \end{align} \] given the cancellation due to the zero matrices \( \pmb{0} = \boldsymbol{\Sigma}_{12}= \boldsymbol{\Sigma}_{21}^\top \).
- Furthermore, we can use the symmetry in the indices to derive the same property for \( \pmb{x}_2 | \pmb{x}_1 \).
This simple property reveals an important consequence of the conditional Gaussian.

Correlation and independence for the Gaussian
Suppose that \( \pmb{x}_1, \pmb{x}_2 \) are jointly Gaussian distributed, uncorrelated as above. Then \( \mathcal{P}(\pmb{x}_1 | \pmb{x}_2 = \pmb{a}) = \mathcal{P}(\pmb{x}_1) \) for all \( \pmb{a} \) and \( \mathcal{P}(\pmb{x}_2 | \pmb{x}_1 = \pmb{b}) = \mathcal{P}(\pmb{x}_2) \) for all \( \pmb{b} \). Therefore, uncorrelated, jointly Gaussian distributed random variables are independent.

Note that, in general, de-correlation is not equivalent to independence;
- this is a special property of the Gaussian, but one that we can utilize to simplify approximations with the Gaussian.
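A standard counterexample shows why the jointly Gaussian assumption matters. In this sketch (an illustration chosen here, not taken from the lecture), \( X \) is standard Gaussian and \( Y = X^2 \); the pair is uncorrelated because \( \mathbb{E}[X^3] = 0 \), yet \( Y \) is completely determined by \( X \), and the pair is not jointly Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(200_000)
y = x**2  # fully determined by x, hence strongly dependent on it

# The sample correlation is close to zero ...
corr = np.corrcoef(x, y)[0, 1]

# ... but conditioning on X changes the distribution of Y.
mean_y = y.mean()
mean_y_given_large_x = y[np.abs(x) > 1.0].mean()

print("correlation:", corr)
print("E[Y]:", mean_y, " E[Y | |X| > 1]:", mean_y_given_large_x)
```

The conditional mean of \( Y \) differs markedly from its unconditional mean even though the correlation vanishes, so de-correlation alone does not give independence.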

Affine closure of the Gaussian

Recall, we are principally interested in time-varying systems, modeling random states.
A highly useful property of the Gaussian approximation is that Gaussians are closed under a general extension of linear transformations.
We will make this slightly more formal as follows.

Affine transformations
A mapping \( \pmb{f}:\mathbb{R}^{N}\rightarrow \mathbb{R}^{N} \) is called an affine transformation if it is composed as vector addition and a linear transformation as \[ \begin{align} \pmb{f}(\pmb{x}) = \mathbf{A}\pmb{x} + \pmb{b}. \end{align} \]

Note that the above is a linear transformation only when \( \pmb{b}=\pmb{0} \).
When \( \pmb{b}\neq \pmb{0} \), it can be interpreted as a generalization: a linear transformation followed by a translation by the vector \( \pmb{b} \).
This bears striking similarity to the (linear / nonlinear) inverse problem, and the first order approximation of a nonlinear function;
- we will return to this in a moment.

Affine closure of the Gaussian

A critical property of the multivariate Gaussian is that a Gaussian random variable, under an affine transformation, remains Gaussian.

Affine closure of the Gaussian
Let \( \pmb{x} \) be distributed as \[ \begin{align} \pmb{x} \sim N\left( \overline{\pmb{x}}, \mathbf{B}\right). \end{align} \] Then the random variable \( \pmb{y} := \pmb{b} + \mathbf{A}\pmb{x} \) is distributed as \[ \begin{align} \pmb{y} \sim N \left(\pmb{b}+\mathbf{A}\overline{\pmb{x}}, \mathbf{A}\mathbf{B}\mathbf{A}^\top \right). \end{align} \]
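The affine closure property can be checked empirically by transforming samples and comparing their moments with the stated mean and covariance. This is a sketch with assumed values for \( \mathbf{A} \), \( \pmb{b} \), \( \overline{\pmb{x}} \) and \( \mathbf{B} \); it checks the first two moments only, not Gaussianity itself.

```python
import numpy as np

# Illustrative (assumed) inputs.
x_mean = np.array([1.0, -1.0])
B = np.array([[1.0, 0.3],
              [0.3, 0.5]])
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
b = np.array([0.5, 0.0])

# Moments predicted by the affine closure property.
y_mean_theory = b + A @ x_mean
y_cov_theory = A @ B @ A.T

# Empirical moments of y = b + A x from samples of x (one sample per row).
rng = np.random.default_rng(3)
x = rng.multivariate_normal(x_mean, B, size=200_000)
y = x @ A.T + b
y_mean_sample = y.mean(axis=0)
y_cov_sample = np.cov(y, rowvar=False)

print(y_mean_theory, y_mean_sample)
print(y_cov_theory)
print(y_cov_sample)
```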

Suppose we model a Gaussian random vector as a perturbation from its mean state, i.e.,

\[ \begin{align} \pmb{x} = \overline{\pmb{x}} + \pmb{\delta}, \end{align} \] where \( \pmb{\delta} \sim N(\pmb{0}, \mathbf{B}) \).
Consider then the first order approximation of a nonlinear function \( \pmb{f}:\mathbb{R}^N \rightarrow \mathbb{R}^N \)

\[ \begin{align} \pmb{f}(\pmb{x}) \approx \pmb{f}(\overline{\pmb{x}}) + \nabla\pmb{f}(\overline{\pmb{x}}) \pmb{\delta}, \end{align} \] which is an affine transformation of the Gaussian random variable \( \pmb{\delta} \).

Tangent, linear-Gaussian approximation
Suppose \( \pmb{x}:= \overline{\pmb{x}} + \pmb{\delta} \) is a perturbation of the mean as defined above. Provided the tangent approximation is valid (small perturbations and small errors), then \( \pmb{f}(\pmb{x}) \) is approximately distributed under the linear-Gaussian approximation as \[ \begin{align} \pmb{f}(\pmb{x}) \sim N\left( \pmb{f}(\overline{\pmb{x}}),\left[ \nabla\pmb{f}(\overline{\pmb{x}})\right]\mathbf{B}\left[\nabla\pmb{f}(\overline{\pmb{x}})\right]^\top\right), \end{align} \] since the perturbation \( \pmb{\delta} \) has mean zero.
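To see the tangent approximation at work, the sketch below pushes small Gaussian perturbations through a nonlinear map and compares the sample moments with those given by the affine closure property applied to \( \pmb{f}(\overline{\pmb{x}}) + \nabla\pmb{f}(\overline{\pmb{x}})\pmb{\delta} \). Because \( \pmb{\delta} \) has mean zero, the approximate mean is \( \pmb{f}(\overline{\pmb{x}}) \). The map `f`, its hand-written Jacobian, and all numerical values are hypothetical choices for illustration.

```python
import numpy as np

def f(x):
    """Hypothetical nonlinear map R^2 -> R^2, applied row-wise to samples."""
    x = np.atleast_2d(x)
    return np.column_stack([x[:, 0] * x[:, 1], np.sin(x[:, 0]) + x[:, 1]])

def jacobian_f(x):
    """Analytic Jacobian of f at a single point x."""
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 1.0]])

x_mean = np.array([1.0, 2.0])
B = 1e-4 * np.array([[1.0, 0.2],
                     [0.2, 2.0]])  # small perturbations, so the tangent holds

# Linear-Gaussian approximation of the mean and covariance of f(x).
J = jacobian_f(x_mean)
approx_mean = f(x_mean)[0]
approx_cov = J @ B @ J.T

# Sample-based moments of f(x_mean + delta) for comparison.
rng = np.random.default_rng(4)
delta = rng.multivariate_normal(np.zeros(2), B, size=200_000)
fx = f(x_mean + delta)
sample_mean = fx.mean(axis=0)
sample_cov = np.cov(fx, rowvar=False)

print(approx_mean, sample_mean)
print(approx_cov)
print(sample_cov)
```

Increasing the scale of `B` makes the two sets of moments drift apart, which is one way to probe where the tangent approximation stops being valid.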