

- The following topics will be covered in this lecture:
- Conditional expectations and Bayesian inference
- The conditional Gaussian distribution
- Correlation / Independence for the Gaussian
- Affine closure of the Gaussian

Now that we have developed some necessary theoretical tools, we will begin to consider the **primary problem of this course**: **Bayesian inference**.

Note, this bears **similarity to the (linear / nonlinear) inverse problem** we have seen already.

- In particular, we will be focused on how to determine the inputs of a relationship given an observable output.

However, inverse problems are, in some sense, a less realistic approach to our investigation.

That is, suppose we want to find all the random physical states of the real atmosphere given satellite observations over a sparse grid of the earth.

Philosophically, an inverse problem is problematic;

- the **true atmosphere does not live in our numerical representation**, which is a coarse, unrealistic representation of reality.

On the other hand, discussing **which numerical model states are most likely**, **given our data and our prior knowledge of the physical process**, **is a well-posed problem**.

- We do not need to model reality exactly, but we can consider which of our representations are best suited given our present knowledge.

This follows the old statistical adage,

“All models are wrong, but some are useful.”

We briefly introduced conditional probabilities as part of our first look at probability.

In doing so, we purposefully went for an intuitive approach over a mathematical one.

In truth, there is more to conditional probabilities than one might suspect.

First of all, they are actually special cases of **conditional expectations**.

Also, **they are random variables**, not scalar values like regular, or unconditional, expectations.

We will not belabor the details of conditional expectations, which require a measure-theoretic approach to derive rigorously.

However, we will introduce some intuition about this object more formally, before introducing some important properties of the conditional Gaussian.

To illustrate, let us consider two random variables, \( X \) and \( Y \), both of which are defined over a probability space \( (\Omega, \mathcal{A}, \mathcal{P} ) \).

- In the above, \( \mathcal{A} \) **represents the collection of all events generated by simple events in the probability space**.

We will assume that \( \mathcal{A} \) is generated by observable outcomes of the random variable \( X \);

- However, it is important to note that \( \mathcal{A} \) is not the only possible collection of observable events of the probability space.

We consider instead the **collection of events associated to the second random variable** when \( Y=y \) for an arbitrary \( y \), i.e., let the **simple event** of \( Y=y \) be given as

\[ \begin{align} B_y = \{ \omega: Y (\omega) = y\} \subset \Omega. \end{align} \]

We define the **complete collection of all events** generated from these simple events, varying \( y \), to be \( \mathcal{B} \).

We have implicitly assumed in this construction that \( \mathcal{B}\subset\mathcal{A} \), such that \( \mathcal{B} \) **represents a coarser collection of outcomes** than those generated by \( X \).

- This is to say that **observing such an outcome** \( y \) of \( Y \) actually **puts a restriction on the possible outcomes** of \( X \).
- This follows the earlier analogy with the restriction of the sample space in the Venn diagram.

Let's consider, then: if we **restrict ourselves to the simple event** associated to \( Y=y \), \( B_y \), we can define a random variable \( \mathbb{E}\left[X |B_y \right] \) via

\[ \begin{align} \int_{B_y} \mathbb{E}\left[ X | B_y \right] \mathrm{d}\mathcal{P}(\omega) := \int_{B_y} X(\omega) \mathrm{d}\mathcal{P}(\omega). \end{align} \]

In the above, we are writing the conditional expectation \( \mathbb{E}\left[X |B_y \right] \) as the **expectation of the random variable** \( X \), but **restricted to the events associated to** \( Y=y \), where \( y \) is a free variable.

- If the event associated to \( Y=y \) is all of \( \Omega \), this is simply the regular expectation of \( X \).

However, **taking \( y \) as a free variable**, the above represents a **random variable dependent on this outcome** \( Y=y \).

Note, the **conditional expectation is constant over this collection**, because \( B_y \) is a **simple event**, such that

\[ \begin{align} & \mathbb{E}\left[X| B_y\right] \mathcal{P}\left(B_y\right) = \int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega)\\ \Leftrightarrow & \mathbb{E}\left[X| B_y\right] = \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega) \mathrm{d}\mathcal{P}(\omega), \end{align} \]

provided \( \mathcal{P}(B_y)\neq 0 \).

- From the last slide we define the following.

Conditional expectations

Let \( y \) be some observable outcome of \( Y \), with the simple event \( B_y\subset \Omega \) associated to this value \( y \). The conditional expectation for \( X \) given \( Y=y \) is given as \[ \begin{align} \mathbb{E}\left[X| B_y\right] = \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega). \end{align} \]
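As a concrete sketch, we can approximate a conditional expectation empirically by averaging \( X \) over only those samples lying in the simple event \( B_y \). The two-dice setup and all values below are illustrative, not part of the formal development:

```python
import numpy as np

# Illustrative example: roll two fair dice; let X be their sum and Y the
# value of the first die.  E[X | Y = y] is the average of X restricted
# to the simple event B_y = {omega : Y(omega) = y}.
rng = np.random.default_rng(0)
d1 = rng.integers(1, 7, size=200_000)
d2 = rng.integers(1, 7, size=200_000)
X = d1 + d2
Y = d1

y = 3
in_B_y = (Y == y)               # samples lying in the event B_y
cond_exp = X[in_B_y].mean()     # (1 / P(B_y)) * integral of X over B_y

# Exact value: E[X | Y = 3] = 3 + E[second die] = 3 + 3.5 = 6.5
print(cond_exp)
```

Varying \( y \) from 1 to 6 traces out the random variable \( \mathbb{E}[X | Y] \): one constant value per simple event.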

This gives a mathematical sketch of what we mean by a conditional expectation.

This strongly resembles our intuitive axiom of probability, where we say that

- the probability of an event \( A \) given some event \( B \) is given by
- the total number of observable outcomes in \( A \), given the event \( B \),
- relative to the total number of outcomes in the collection \( B \).

This connection is made explicit in our next definition.

Consider the special case where \( X \) is actually just an indicator function on an event \( A\in\mathcal{A} \), i.e.,

\[ \begin{align} X(\omega) := \begin{cases} 1 & \text{if }\omega\in A \\ 0 & \text{else} \end{cases}. \end{align} \]

In this special case, we thus define the following.

Conditional probability (advanced version)

Let \( (\Omega, \mathcal{A},\mathcal{P}) \) be a probability space, generated by the simple outcomes of the indicator function random variable \( X \) above. For a simple event \( B_y \) associated to the outcome \( Y=y \), and an event \( A\in \mathcal{A} \), the conditional probability is defined as \[ \begin{align} \mathcal{P}\left(A | B_y\right) &:= \mathbb{E}\left[X| B_y\right] \\ &= \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega). \end{align} \]

- This tells us that the conditional probability of \( A \) given \( B_y \) is a special case of the conditional expectation, not the other way around.

If we continue this special case, we can write

\[ \begin{align} \int_{B_y}\mathbb{E}\left[ X | B_y\right] \mathrm{d}\mathcal{P}(\omega) &:= \int_{B_y} \mathcal{P}\left(A | B_y\right) \mathrm{d}\mathcal{P}\\ &= \mathcal{P}\left(A | B_y\right) \mathcal{P}\left(B_y\right) \end{align} \] as the above is constant over the simple events of \( Y \).

On the other hand, we can also write,

\[ \begin{align} \int_{B_y}\mathbb{E}\left[ X | B_y\right] \mathrm{d}\mathcal{P}(\omega) &:= \int_{B_y} X(\omega)\mathrm{d}\mathcal{P}(\omega)\\ &= \int_{A \cap B_y} \mathrm{d}\mathcal{P}(\omega) = \mathcal{P}\left(A \cap B_y\right). \end{align} \]

Putting the above equivalence together, we have that

\[ \begin{align} \mathcal{P}\left(A | B_y\right) = \frac{\mathcal{P}\left(A \cap B_y\right)}{\mathcal{P}\left(B_y\right)}, \end{align} \] recovering our original notion of conditional probability.
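This ratio identity is easy to check on a finite sample space; the two-dice events below are a hypothetical illustration:

```python
import numpy as np

# All 36 equally likely outcomes of rolling two fair dice.
omega = np.array([(i, j) for i in range(1, 7) for j in range(1, 7)])

A = (omega.sum(axis=1) >= 10)   # event: the sum is at least 10
B = (omega[:, 0] == 6)          # simple event: the first die shows 6

p_A_given_B = (A & B).mean() / B.mean()   # P(A and B) / P(B)

# Direct restriction of the sample space gives the same answer:
# among the 6 outcomes with first die 6, the sum is >= 10 for 3 of them.
print(p_A_given_B)              # 0.5
```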

The other properties of the **marginal, conditional and joint densities** already seen are **recovered by similar arguments**.

We will now consider explicitly our **main distribution for our approximations**, the **multivariate Gaussian**.

We will start with the bi-variate Gaussian, as nearly all aspects generalize directly for arbitrary dimensions.

Suppose now that we have a random vector

\[ \begin{align} \pmb{x}:= \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{x}_1 \\ \overline{x}_2\end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} \right). \end{align} \]

- In the above, \( \rho \) refers to the **background (theoretical) correlation coefficient** between \( X_1 \) and \( X_2 \), i.e.,

\[ \begin{align} \rho := \frac{\sigma_{12}}{\sigma_1 \sigma_2}, \end{align} \] giving the standard form of the covariance by equivalence.

In this case, the **conditional random variable** for \( X_1 | X_2 =a \) is defined as

\[ \begin{align} X_1 | X_2 =a \sim N\left(\overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right), \left(1 - \rho^2 \right)\sigma_1^2\right). \end{align} \]

For those familiar already with regression, you may note that the above term

\[ \begin{align} \overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right) \end{align} \]

is the **simple regression for the mean** of \( X_1 \), given \( X_2 = a \).

Recall the formula from the last slide,

\[ \begin{align} X_1 | X_2 =a \sim N\left(\overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right), \left(1 - \rho^2 \right)\sigma_1^2\right). \end{align} \]

Similarly, \( \left(1 - \rho^2 \right)\sigma_1^2 \) is the **variance of the simple regression around the mean function**.

Without assuming a specific outcome for \( X_2=a \), we find the **conditional expectation** given as

\[ \begin{align} \mathbb{E}\left[X_1 | X_2 \right]:= \overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(X_2 - \overline{x}_2\right), \end{align} \]

where this again refers to the expected value of \( X_1 \) (its mean) given the outcome of \( X_2 \) (as a random variable).

Notice that the conditional variance is given as

\[ \begin{align} \mathrm{var}\left(X_1 | X_2 \right):= \left(1 - \rho^2 \right)\sigma_1^2, \end{align} \]

where this again **does not depend on the particular outcome** of \( X_2 \), just as in the original formula.
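A small Monte Carlo sketch can confirm both formulas; the parameter values below are our own illustrative choices. Conditioning on \( X_2 = a \) is approximated by keeping samples with \( X_2 \) in a narrow window around \( a \):

```python
import numpy as np

# Illustrative parameters: mean (1, -2), sigma1 = 2, sigma2 = 1, rho = 0.8.
rng = np.random.default_rng(1)
x1bar, x2bar, s1, s2, rho = 1.0, -2.0, 2.0, 1.0, 0.8
cov = np.array([[s1**2,         rho * s1 * s2],
                [rho * s1 * s2, s2**2        ]])
X = rng.multivariate_normal([x1bar, x2bar], cov, size=2_000_000)

# Approximate conditioning on X2 = a by a narrow window around a.
a = -1.5
sel = X[np.abs(X[:, 1] - a) < 0.01, 0]

cond_mean = x1bar + rho * (s1 / s2) * (a - x2bar)   # = 1.8
cond_var = (1 - rho**2) * s1**2                     # = 1.44
print(sel.mean(), cond_mean)    # empirical vs formula
print(sel.var(), cond_var)
```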

More generally, let's suppose that \( \pmb{x}\in \mathbb{R}^{N_x} \) is an **arbitrary Gaussian random vector**, partitioned as

\[ \begin{align} \pmb{x}:= \begin{pmatrix} \pmb{x}_1 \\ \pmb{x}_2 \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{\pmb{x}}_1 \\ \overline{\pmb{x}}_2\end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_1 & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_2 \end{pmatrix} \right). \end{align} \]

We suppose that the dimensions are then given as

\[ \begin{align} \pmb{x}_1,\overline{\pmb{x}}_1 \in \mathbb{R}^{n}, \quad \pmb{x}_2,\overline{\pmb{x}}_2 \in \mathbb{R}^{N_x -n}, \quad \boldsymbol{\Sigma}_{1} \in \mathbb{R}^{n\times n}, \quad \boldsymbol{\Sigma}_{12}=\boldsymbol{\Sigma}_{21}^\top \in \mathbb{R}^{n \times (N_x - n) }, \quad \boldsymbol{\Sigma}_{2} \in \mathbb{R}^{(N_x - n) \times (N_x - n)}. \end{align} \]

General conditional Gaussian

Let \( \pmb{x}_1,\pmb{x}_2 \) be given as above; then the general form of the conditional distribution for \( \pmb{x}_1 | \pmb{x}_2 =\pmb{a} \) is given by the Gaussian \[ \begin{align} \pmb{x}_1 | \pmb{x}_2 = \pmb{a} \sim N\left(\overline{\pmb{x}}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{2}^{-1}\left(\pmb{a} - \overline{\pmb{x}}_2\right), \boldsymbol{\Sigma}_{1} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{2}^{-1} \boldsymbol{\Sigma}_{21}\right). \end{align} \]

The term \( \overline{\pmb{x}}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{2}^{-1}\left(\pmb{a} - \overline{\pmb{x}}_2\right) \) again represents the conditional mean of \( \pmb{x}_1 \) given the observed value \( \pmb{x}_2=\pmb{a} \).

Likewise, \( \boldsymbol{\Sigma}_{1} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{2}^{-1} \boldsymbol{\Sigma}_{21} \) is the covariance of \( \pmb{x}_1 \) given the observed value \( \pmb{x}_2 = \pmb{a} \).

Similar definitions apply for the random vector / matrix \( \mathbb{E}\left[\pmb{x}_1 | \pmb{x}_2 \right] \) and \( \mathrm{cov}\left(\pmb{x}_1 | \pmb{x}_2\right) \).

For those familiar, you may recognize these as the **classical Kalman filter equations** in disguise – we'll return to this idea shortly in the course.
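The block formulas above translate into a few lines of linear algebra. The helper below is an illustrative sketch (the function name and the scalar check are our own):

```python
import numpy as np

def conditional_gaussian(x1bar, x2bar, S1, S12, S2, a):
    """Mean and covariance of x1 | x2 = a for a partitioned Gaussian."""
    K = S12 @ np.linalg.inv(S2)        # Sigma_12 Sigma_2^{-1}
    mean = x1bar + K @ (a - x2bar)     # conditional mean
    cov = S1 - K @ S12.T               # Sigma_1 - Sigma_12 Sigma_2^{-1} Sigma_21
    return mean, cov

# Scalar sanity check against the bivariate case with
# sigma1 = 2, sigma2 = 1, rho = 0.8 (so Sigma_12 = 1.6), and a = -1.5:
m, C = conditional_gaussian(np.array([1.0]), np.array([-2.0]),
                            np.array([[4.0]]), np.array([[1.6]]),
                            np.array([[1.0]]), np.array([-1.5]))
print(m, C)   # conditional mean 1.8, conditional variance 1.44
```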

Let's suppose now that the components of the vector \( \pmb{x} \) are **not correlated**, i.e.,

\[ \begin{align} \pmb{x} \sim N\left( \overline{\pmb{x}} , \begin{pmatrix} \boldsymbol{\Sigma}_1 & \pmb{0} \\ \pmb{0}^\top & \boldsymbol{\Sigma}_2 \end{pmatrix}\right). \end{align} \]

From the form of the conditional distribution for \( \pmb{x}_1|\pmb{x}_2=\pmb{a} \) we note that

\[ \begin{align} \pmb{x}_1 | \pmb{x}_2 = \pmb{a} \sim N\left(\overline{\pmb{x}}_1, \boldsymbol{\Sigma}_{1}\right), \end{align} \] given the cancellation due to the zero matrices \( \pmb{0} = \boldsymbol{\Sigma}_{12}= \boldsymbol{\Sigma}_{21}^\top \).

- Furthermore, we can use the symmetry in the indices to derive the same property for \( \pmb{x}_2 | \pmb{x}_1 \).

This simple property reveals an important consequence of the conditional Gaussian.

Correlation and independence for the Gaussian

Suppose that \( \pmb{x}_1, \pmb{x}_2 \) are jointly Gaussian distributed, uncorrelated as above. Then \( \mathcal{P}(\pmb{x}_1 | \pmb{x}_2 = \pmb{a}) = \mathcal{P}(\pmb{x}_1) \) for all \( \pmb{a} \) and \( \mathcal{P}(\pmb{x}_2 | \pmb{x}_1 = \pmb{b}) = \mathcal{P}(\pmb{x}_2) \) for all \( \pmb{b} \). Therefore, uncorrelated, jointly Gaussian distributed random variables are independent.

Note that, in general, de-correlation is not equivalent to independence;

- this is a special property of the Gaussian, but one that we can utilize to simplify approximations with the Gaussian.
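A quick empirical illustration (our own construction): for a diagonal covariance, conditioning on information about the second component should leave the distribution of the first unchanged:

```python
import numpy as np

# Uncorrelated jointly Gaussian components: Sigma_12 = 0.
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], np.diag([4.0, 1.0]), size=1_000_000)

p_marginal = (X[:, 0] > 0).mean()               # P(X1 > 0)
p_conditional = (X[X[:, 1] > 0, 0] > 0).mean()  # P(X1 > 0 | X2 > 0)
print(p_marginal, p_conditional)                # both approximately 0.5
```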

Recall, we are principally interested in **time-varying systems, modeling random states**.

A highly useful property of the Gaussian approximation is that **Gaussians are closed** under a general extension of linear transformations.

We will make this slightly more formal as follows.

Affine transformations

A mapping \( \pmb{f}:\mathbb{R}^{N}\rightarrow \mathbb{R}^{N} \) is called an affine transformation if it is composed of a linear transformation and a vector addition, as \[ \begin{align} \pmb{f}(\pmb{x}) = \mathbf{A}\pmb{x} + \pmb{b}. \end{align} \]

Note that in the above, this is only a linear transformation when \( \pmb{b}=\pmb{0} \).

Rather, this can be interpreted as a **generalization of a linear transformation**, but **translated to a point in space \( \pmb{b} \)** when \( \pmb{b}\neq \pmb{0} \).

This bears striking **similarity to the (linear / nonlinear) inverse problem**, and the first order approximation of a nonlinear function;

- we will return to this in a moment.

- A critical property of the multivariate Gaussian is that a Gaussian random variable, under an affine transformation,
**remains Gaussian**.

Affine closure of the Gaussian

Let \( \pmb{x} \) be distributed as \[ \begin{align} \pmb{x} \sim N\left( \overline{\pmb{x}}, \mathbf{B}\right). \end{align} \] Then the random variable \( \pmb{y} := \pmb{b} + \mathbf{A}\pmb{x} \) is distributed as \[ \begin{align} \pmb{y} \sim N \left(\pmb{b}+\mathbf{A}\overline{\pmb{x}}, \mathbf{A}\mathbf{B}\mathbf{A}^\top \right). \end{align} \]
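We can check the affine closure statement by direct simulation; the matrices below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
xbar = np.array([0.0, 1.0])
B = np.array([[2.0, 0.5],
              [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([-1.0, 4.0])

# Push samples of x ~ N(xbar, B) through the affine map y = b + A x.
x = rng.multivariate_normal(xbar, B, size=1_000_000)
y = x @ A.T + b

print(y.mean(axis=0), b + A @ xbar)   # empirical vs b + A xbar
print(np.cov(y.T), A @ B @ A.T)       # empirical vs A B A^T
```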

Suppose we **model a Gaussian random vector as a perturbation from its mean state**, i.e.,

\[ \begin{align} \pmb{x} = \overline{\pmb{x}} + \pmb{\delta}, \end{align} \]

where \( \pmb{\delta} \sim N(\pmb{0}, \mathbf{B}) \).

Consider then the **first order approximation** of a nonlinear function \( \pmb{f}:\mathbb{R}^N \rightarrow \mathbb{R}^N \),

\[ \begin{align} \pmb{f}(\pmb{x}) \approx \pmb{f}(\overline{\pmb{x}}) + \nabla\pmb{f}(\overline{\pmb{x}}) \pmb{\delta}, \end{align} \]

which is an **affine transformation of the Gaussian random variable** \( \pmb{\delta} \).

Tangent, linear-Gaussian approximation

Suppose \( \pmb{x}:= \overline{\pmb{x}} + \pmb{\delta} \) is a perturbation of the mean as defined above. Provided the tangent approximation is valid (small perturbations and small errors), then \( \pmb{f}(\pmb{x}) \) is approximately distributed under the linear-Gaussian approximation as \[ \begin{align} \pmb{f}(\pmb{x}) \sim N\left( \pmb{f}(\overline{\pmb{x}}),\left[ \nabla\pmb{f}(\overline{\pmb{x}})\right]\mathbf{B}\left[\nabla\pmb{f}(\overline{\pmb{x}})\right]^\top\right), \end{align} \] by the affine closure of the Gaussian applied to \( \pmb{\delta} \sim N(\pmb{0},\mathbf{B}) \).
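A short sketch of the approximation in practice, with an arbitrary toy map \( \pmb{f} \) and a deliberately small \( \mathbf{B} \) so that the tangent approximation holds; all choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
xbar = np.array([1.0, 0.5])
B = 0.001 * np.eye(2)                  # small perturbations

def f(x):
    # Toy nonlinear map from R^2 to R^2.
    return np.array([np.sin(x[0]), x[0] * x[1]])

def jacobian(x):
    # Jacobian (gradient) of f, evaluated analytically.
    return np.array([[np.cos(x[0]), 0.0],
                     [x[1],         x[0]]])

# Linear-Gaussian approximation: N(f(xbar), J B J^T), with J = grad f(xbar).
J = jacobian(xbar)
approx_mean = f(xbar)
approx_cov = J @ B @ J.T

# Monte Carlo "truth": push Gaussian samples through the nonlinear f.
samples = np.array([f(x) for x in
                    rng.multivariate_normal(xbar, B, size=200_000)])
print(approx_mean, samples.mean(axis=0))
print(approx_cov, np.cov(samples.T))
```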