Now that we have developed some necessary theoretical tools, we will begin to consider the primary problem of this course: Bayesian inference.
Note, this bears similarity to the (linear / nonlinear) inverse problem we have seen already.
However, inverse problems are, in some sense, a less realistic framing of our investigation.
That is, suppose we want to find all the random physical states of the real atmosphere given satellite observations over a sparse grid of the earth.
Philosophically, an inverse problem is problematic; we cannot hope to exactly recover the true physical states of the real atmosphere from our imperfect models and sparse observations.
On the other hand, discussing which numerical model states are most likely, given our data and our prior knowledge of the physical process, is a well-posed problem.
This follows the old statistical adage,
“All models are wrong, but some are useful.”
We briefly introduced conditional probabilities as part of our first look at probability.
In doing so, we purposefully went for an intuitive approach over a mathematical one.
In truth, there is more to conditional probabilities than one might suspect.
First of all, they are actually special cases of conditional expectations.
Also, they are random variables, not scalar values like regular, or unconditional, expectations.
We will not belabor the details of conditional expectations, which require a measure-theoretic approach to derive rigorously.
However, we will introduce some intuition about this object more formally, before introducing some important properties of the conditional Gaussian.
To illustrate, let us consider two random variables, \( X \) and \( Y \), both of which are defined over a probability space \( (\Omega, \mathcal{A}, \mathcal{P} ) \).
We will assume that \( \mathcal{A} \) is generated by the observable outcomes of the random variable \( X \); this will represent the finest collection of events under consideration.
We consider instead the collection of events associated to the second random variable \( Y \); for an arbitrary value \( y \), let the simple event associated to \( Y=y \) be given as
\[ \begin{align} B_y = \{ \omega: Y (\omega) = y\} \subset \Omega; \end{align} \]
We define the complete collection of all events generated from these simple events, varying \( y \), to be \( \mathcal{B} \).
We have implicitly assumed in this construction that \( \mathcal{B}\subset\mathcal{A} \) such that \( \mathcal{B} \) represents a coarser collection of outcomes than those generated by \( X \).
Let's consider then: if we restrict ourselves to the simple event \( B_y \) associated to \( Y=y \), we can define a random variable, \( \mathbb{E}\left[X |B_y \right] \), via
\[ \begin{align} \int_{B_y} \mathbb{E}\left[ X | B_y \right] \mathrm{d}\mathcal{P}(\omega) := \int_{B_y} X(\omega) \mathrm{d}\mathcal{P}(\omega). \end{align} \]
In the above, we are writing the conditional expectation \( \mathbb{E}\left[X |B_y \right] \) as the expectation of the random variable \( X \), but restricted to the events associated to \( Y=y \), where \( y \) is a free variable.
However, taking \( y \) as a free variable, the above represents a random variable dependent on this outcome \( Y=y \).
Note, the conditional expectation is constant over this collection, because \( B_y \) is a simple event, such that
\[ \begin{align} & \mathbb{E}\left[X| B_y\right] \mathcal{P}\left(B_y\right) = \int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega)\\ \Leftrightarrow & \mathbb{E}\left[X| B_y\right] = \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega) \mathrm{d}\mathcal{P}(\omega), \end{align} \] provided \( \mathcal{P}(B_y)\neq 0 \).
Conditional expectations
Let \( y \) be some observable outcome of \( Y \), with the simple event \( B_y\subset \Omega \) associated to this value \( y \). The conditional expectation for \( X \) given \( Y=y \) is given as \[ \begin{align} \mathbb{E}\left[X| B_y\right] = \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega). \end{align} \]
This gives a mathematical sketch of what we mean by a conditional expectation.
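To make the restricted-average reading concrete, here is a minimal Monte Carlo sketch in Python (the joint experiment below, with \( X = 2Y + \text{noise} \), is a hypothetical choice for illustration): the conditional expectation is estimated by averaging \( X \) over only those samples lying in the simple event \( B_y \).
```python
import numpy as np

rng = np.random.default_rng(42)

# A simple joint experiment: Y is a discrete label, X depends on Y plus noise.
n_samples = 100_000
y_samples = rng.integers(0, 3, size=n_samples)            # Y takes values 0, 1, 2
x_samples = 2.0 * y_samples + rng.normal(size=n_samples)  # X = 2 Y + noise

# E[X | B_y] = (1 / P(B_y)) * integral of X over B_y, estimated here by the
# sample average of X restricted to the simple event {Y = y}.
for y in range(3):
    in_event = (y_samples == y)
    cond_mean = x_samples[in_event].mean()
    print(f"E[X | Y={y}] ~ {cond_mean:.3f}  (exact value {2.0 * y})")
```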
The definition above strongly resembles our intuitive definition of conditional probability, where we say that \( \mathcal{P}\left(A | B\right) = \frac{\mathcal{P}\left(A \cap B\right)}{\mathcal{P}\left(B\right)} \).
This connection is made explicit in our next definition.
Consider the special case where \( X \) is actually just an indicator function on \( \mathcal{A} \), i.e.,
\[ \begin{align} X(\omega) := \begin{cases} 1 & \text{if }\omega\in A \\ 0 & \text{else} \end{cases}. \end{align} \]
In this special case, we thus define the following.
Conditional probability (advanced version)
Let \( (\Omega, \mathcal{A},\mathcal{P}) \) be a probability space, generated by the simple outcomes of the indicator function random variable \( X \) above. For a simple event \( B_y \) associated to the outcome \( Y=y \), and an event \( A\in \mathcal{A} \), the conditional probability is defined as \[ \begin{align} \mathcal{P}\left(A | B_y\right) &:= \mathbb{E}\left[X| B_y\right] \\ &= \frac{1}{\mathcal{P}\left(B_y\right)}\int_{B_y} X(\omega)\mathrm{d} \mathcal{P}(\omega). \end{align} \]
If we continue this special case, we can write
\[ \begin{align} \int_{B_y}\mathbb{E}\left[ X | B_y\right] \mathrm{d}\mathcal{P}(\omega) &:= \int_{B_y} \mathcal{P}\left(A | B_y\right) \mathrm{d}\mathcal{P}\\ &= \mathcal{P}\left(A | B_y\right) \mathcal{P}\left(B_y\right) \end{align} \] as the above is constant over the simple events of \( Y \).
On the other hand, we can also write,
\[ \begin{align} \int_{B_y}\mathbb{E}\left[ X | B_y\right] \mathrm{d}\mathcal{P}(\omega) &:= \int_{B_y} X(\omega)\mathrm{d}\mathcal{P}(\omega)\\ &= \int_{A \cap B_y} \mathrm{d}\mathcal{P}(\omega) = \mathcal{P}\left(A \cap B_y\right). \end{align} \]
Putting the above equivalence together, we have that
\[ \begin{align} \mathcal{P}\left(A | B_y\right) = \frac{\mathcal{P}\left(A \cap B_y\right)}{\mathcal{P}\left(B_y\right)}, \end{align} \] recovering our original notion of conditional probability.
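We can check this identity numerically with indicator variables (a minimal sketch; the sample space and the events \( A \) and \( B_y \) below are hypothetical choices):
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample space: omega uniform on [0, 1), with two events.
omega = rng.uniform(size=200_000)
A = omega < 0.5                      # the event A, via its indicator X = 1_A
B_y = (omega > 0.3) & (omega < 0.8)  # the simple event B_y

# Left-hand side: E[1_A | B_y], the average of the indicator restricted to B_y.
lhs = A[B_y].mean()

# Right-hand side: P(A and B_y) / P(B_y).
rhs = (A & B_y).mean() / B_y.mean()

# Both estimate P(A | B_y), which is exactly 0.2 / 0.5 = 0.4 for these events.
print(f"E[1_A | B_y] ~ {lhs:.4f}    P(A & B_y) / P(B_y) ~ {rhs:.4f}")
```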
The other properties of the marginal, conditional, and joint densities seen earlier can be recovered by analogous arguments.
We will now consider explicitly the main distribution used in our approximations, the multivariate Gaussian.
We will start with the bivariate Gaussian, as nearly all aspects generalize directly to arbitrary dimensions.
Suppose now that we have a random vector
\[ \begin{align} \pmb{x}:= \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{x}_1 \\ \overline{x}_2\end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} \right). \end{align} \]
Here \( \rho \) denotes the correlation coefficient, \[ \begin{align} \rho := \frac{\sigma_{12}}{\sigma_1 \sigma_2}, \end{align} \] so that the off-diagonal entries \( \rho \sigma_1 \sigma_2 = \sigma_{12} \) recover the standard form of the covariance.
In this case, the conditional random variable \( X_1 | X_2 =a \) is distributed as
\[ \begin{align} X_1 | X_2 =a \sim N\left(\overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right), \left(1 - \rho^2 \right)\sigma_1^2\right). \end{align} \]
For those familiar already with regression, you may note that the above term
\[ \begin{align} \overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right) \end{align} \] is the simple regression for the mean of \( X_1 \), given \( X_2 = a \).
Recall the formula from the last slide,
\[ \begin{align} X_1 | X_2 =a \sim N\left(\overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(a - \overline{x}_2\right), \left(1 - \rho^2 \right)\sigma_1^2\right). \end{align} \]
Similarly, \( \left(1 - \rho^2 \right)\sigma_1^2 \) is the variance of the simple regression around the mean function.
Without assuming a specific outcome for \( X_2=a \), we find the conditional expectation given as
\[ \begin{align} \mathbb{E}\left[X_1 | X_2 \right]:= \overline{x}_1 + \rho \frac{\sigma_1}{\sigma_2}\left(X_2 - \overline{x}_2\right), \end{align} \] where this again refers to the expected value of \( X_1 \) (its mean) given the outcome of \( X_2 \) (as a random variable).
Notice that the conditional variance is given as,
\[ \begin{align} \mathrm{var}\left(X_1 | X_2 \right):= \left(1 - \rho^2 \right)\sigma_1^2, \end{align} \] where, as in the original formula, this does not depend on the particular outcome of \( X_2 \).
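As a quick sanity check of these moment formulas, consider the following Monte Carlo sketch (the parameter values are arbitrary illustrative choices, and conditioning on \( X_2 = a \) is approximated by keeping samples with \( X_2 \) in a thin slab around \( a \)):
```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (arbitrary choices).
x1_bar, x2_bar = 1.0, -2.0
sigma1, sigma2, rho = 2.0, 0.5, 0.7
cov = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                [rho * sigma1 * sigma2, sigma2**2]])

samples = rng.multivariate_normal([x1_bar, x2_bar], cov, size=500_000)
x1, x2 = samples[:, 0], samples[:, 1]

# Approximate conditioning on X2 = a with a thin slab around a.
a = -1.5
mask = np.abs(x2 - a) < 0.02

# Closed-form conditional moments from the slide.
theo_mean = x1_bar + rho * (sigma1 / sigma2) * (a - x2_bar)
theo_var = (1.0 - rho**2) * sigma1**2

print(f"mean: empirical {x1[mask].mean():.3f} vs theoretical {theo_mean:.3f}")
print(f"var : empirical {x1[mask].var():.3f} vs theoretical {theo_var:.3f}")
```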
More generally, let's suppose that \( \pmb{x}\in \mathbb{R}^{N_x} \) is an arbitrary Gaussian random vector, partitioned as
\[ \begin{align} \pmb{x}:= \begin{pmatrix} \pmb{x}_1 \\ \pmb{x}_2 \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{\pmb{x}}_1 \\ \overline{\pmb{x}}_2\end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_1 & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_2 \end{pmatrix} \right). \end{align} \]
We suppose that the dimensions are given then as,
\[ \begin{align} \pmb{x}_1,\overline{\pmb{x}}_1 \in \mathbb{R}^{n}, \quad \pmb{x}_2,\overline{\pmb{x}}_2 \in \mathbb{R}^{N_x -n}, \quad \boldsymbol{\Sigma}_{1} \in \mathbb{R}^{n\times n}, \quad \boldsymbol{\Sigma}_{12}=\boldsymbol{\Sigma}_{21}^\top \in \mathbb{R}^{n \times (N_x - n) }, \quad \boldsymbol{\Sigma}_{2} \in \mathbb{R}^{(N_x - n) \times (N_x - n)}. \end{align} \]
General conditional Gaussian
Let \( \pmb{x}_1,\pmb{x}_2 \) be given as above, then the general form of the conditional distribution for \( \pmb{x}_1 | \pmb{x}_2 =\pmb{a} \) is given by the Gaussian \[ \begin{align} \pmb{x}_1 | \pmb{x}_2 = \pmb{a} \sim N\left(\overline{\pmb{x}}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{2}^{-1}\left(\pmb{a} - \overline{\pmb{x}}_2\right), \boldsymbol{\Sigma}_{1} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{2}^{-1} \boldsymbol{\Sigma}_{21}\right). \end{align} \]
The term \( \overline{\pmb{x}}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{2}^{-1}\left(\pmb{a} - \overline{\pmb{x}}_2\right) \) again represents the conditional mean of \( \pmb{x}_1 \) given the observed value \( \pmb{x}_2=\pmb{a} \).
Likewise, \( \boldsymbol{\Sigma}_{1} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{2}^{-1} \boldsymbol{\Sigma}_{21} \) is the covariance of \( \pmb{x}_1 \) given the observed value \( \pmb{x}_2 = \pmb{a} \).
Similar definitions apply for the random vector / matrix \( \mathbb{E}\left[\pmb{x}_1 | \pmb{x}_2 \right] \) and \( \mathrm{cov}\left(\pmb{x}_1 | \pmb{x}_2\right) \).
For those familiar, you may recognize these as the classical Kalman filter equations in disguise – we'll return to this idea shortly in the course.
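The block formulas above translate directly into a few lines of numpy. The following is a minimal sketch, not a library routine; the partition size and the matrices in the example are hypothetical:
```python
import numpy as np

def conditional_gaussian(x_bar, cov, n, a):
    """Moments of x1 | x2 = a for a Gaussian partitioned after index n.

    Returns the conditional mean x1_bar + S12 S2^{-1} (a - x2_bar) and the
    conditional covariance S1 - S12 S2^{-1} S21, as on the slide above.
    """
    x1_bar, x2_bar = x_bar[:n], x_bar[n:]
    S1, S12 = cov[:n, :n], cov[:n, n:]
    S21, S2 = cov[n:, :n], cov[n:, n:]
    gain = S12 @ np.linalg.inv(S2)   # for large systems, prefer a linear solve
    cond_mean = x1_bar + gain @ (a - x2_bar)
    cond_cov = S1 - gain @ S21
    return cond_mean, cond_cov

# Hypothetical example with N_x = 3, partitioned as n = 2 and N_x - n = 1.
x_bar = np.array([0.0, 1.0, -1.0])
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
cov = L @ L.T                        # a valid (symmetric positive definite) covariance
mean, covariance = conditional_gaussian(x_bar, cov, n=2, a=np.array([0.5]))
print(mean)
print(covariance)
```
The factor \( \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{2}^{-1} \), computed as gain above, is precisely the quantity that re-appears as the gain in the Kalman filter equations mentioned above.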
Let's suppose now that the components of the vector \( \pmb{x} \) are uncorrelated, i.e.,
\[ \begin{align} \pmb{x} \sim N\left( \overline{\pmb{x}} , \begin{pmatrix} \boldsymbol{\Sigma}_1 & \pmb{0} \\ \pmb{0}^\top & \boldsymbol{\Sigma}_2 \end{pmatrix}\right). \end{align} \]
From the form of the conditional distribution for \( \pmb{x}_1|\pmb{x}_2=\pmb{a} \) we note that
\[ \begin{align} \pmb{x}_1 | \pmb{x}_2 = \pmb{a} \sim N\left(\overline{\pmb{x}}_1, \boldsymbol{\Sigma}_{1}\right), \end{align} \] given the cancellation due to the zero matrices \( \pmb{0} = \boldsymbol{\Sigma}_{12}= \boldsymbol{\Sigma}_{21}^\top \).
This simple property reveals an important consequence of the conditional Gaussian.
Correlation and independence for the Gaussian
Suppose that \( \pmb{x}_1, \pmb{x}_2 \) are jointly Gaussian distributed, uncorrelated as above. Then \( \mathcal{P}(\pmb{x}_1 | \pmb{x}_2 = \pmb{a}) = \mathcal{P}(\pmb{x}_1) \) for all \( \pmb{a} \) and \( \mathcal{P}(\pmb{x}_2 | \pmb{x}_1 = \pmb{b}) = \mathcal{P}(\pmb{x}_2) \) for all \( \pmb{b} \). Therefore, uncorrelated, jointly Gaussian distributed random variables are independent.
Note that, in general, being uncorrelated is not equivalent to independence; this equivalence is a special property of jointly Gaussian distributed random variables.
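A classic counterexample, sketched numerically below: if \( X \sim N(0,1) \) and \( Y = X^2 \), then \( X \) and \( Y \) are uncorrelated (since \( \mathbb{E}[X^3] = 0 \)) yet completely dependent.
```python
import numpy as np

rng = np.random.default_rng(2)

# X standard normal and Y = X^2: uncorrelated but fully dependent.
x = rng.normal(size=1_000_000)
y = x**2

print(f"correlation(X, Y) ~ {np.corrcoef(x, y)[0, 1]:.4f}")   # close to 0

# Dependence is visible in the conditional behavior: E[Y] = 1, but
# conditioning on |X| > 2 pushes the mean of Y well above 4.
print(f"E[Y | |X| > 2] ~ {y[np.abs(x) > 2].mean():.3f}")
```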
Recall, we are principally interested in time-varying systems, modeling random states.
A highly useful property of the Gaussian approximation is that Gaussians are closed under affine transformations, a general extension of linear transformations.
We will make this slightly more formal as follows.
Affine transformations
A mapping \( \pmb{f}:\mathbb{R}^{N}\rightarrow \mathbb{R}^{N} \) is called an affine transformation if it is the composition of a linear transformation with a vector addition, i.e., \[ \begin{align} \pmb{f}(\pmb{x}) = \mathbf{A}\pmb{x} + \pmb{b}. \end{align} \]
Note that in the above, this is only a linear transformation when \( \pmb{b}=\pmb{0} \).
Rather, when \( \pmb{b}\neq \pmb{0} \), it can be interpreted as a generalization of a linear transformation, composed with a translation by the vector \( \pmb{b} \).
This bears striking similarity to the (linear / nonlinear) inverse problem and to the first-order approximation of a nonlinear function; we will return to this connection below.
Affine closure of the Gaussian
Let \( \pmb{x} \) be distributed as \[ \begin{align} \pmb{x} \sim N\left( \overline{\pmb{x}}, \mathbf{B}\right). \end{align} \] Then the random variable \( \pmb{y} := \pmb{b} + \mathbf{A}\pmb{x} \) is distributed as \[ \begin{align} \pmb{y} \sim N \left(\pmb{b}+\mathbf{A}\overline{\pmb{x}}, \mathbf{A}\mathbf{B}\mathbf{A}^\top \right). \end{align} \]
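A brief Monte Carlo check of this closure property (the mean, covariance, and affine map below are arbitrary illustrative choices):
```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary illustrative parameters.
x_bar = np.array([1.0, -1.0])
B = np.array([[2.0, 0.3],
              [0.3, 0.5]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([0.5, -2.0])

# Sample x ~ N(x_bar, B) and push the samples through the affine map y = b + A x.
x = rng.multivariate_normal(x_bar, B, size=200_000)
y = x @ A.T + b

print("sample mean of y :", y.mean(axis=0))
print("predicted mean   :", b + A @ x_bar)
print("sample cov of y  :\n", np.cov(y, rowvar=False))
print("predicted A B A^T:\n", A @ B @ A.T)
```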
Suppose we model a Gaussian random vector as a perturbation from its mean state, i.e.,
\[ \begin{align} \pmb{x} = \overline{\pmb{x}} + \pmb{\delta}, \end{align} \] where \( \pmb{\delta} \sim N(\pmb{0}, \mathbf{B}) \).
Consider then the first order approximation of a nonlinear function \( \pmb{f}:\mathbb{R}^N \rightarrow \mathbb{R}^N \)
\[ \begin{align} \pmb{f}(\pmb{x}) \approx \pmb{f}(\overline{\pmb{x}}) + \nabla\pmb{f}(\overline{\pmb{x}}) \pmb{\delta}, \end{align} \] which is an affine transformation of the Gaussian random variable \( \pmb{\delta} \).
Tangent, linear-Gaussian approximation
Suppose \( \pmb{x}:= \overline{\pmb{x}} + \pmb{\delta} \) is a perturbation of the mean as defined above. Provided the tangent approximation is valid (small perturbations and small errors), then \( \pmb{f}(\pmb{x}) \) is approximately distributed under the linear-Gaussian approximation as \[ \begin{align} \pmb{f}(\pmb{x}) \sim N\left( \pmb{f}(\overline{\pmb{x}}),\left[ \nabla\pmb{f}(\overline{\pmb{x}})\right]\mathbf{B}\left[\nabla\pmb{f}(\overline{\pmb{x}})\right]^\top\right). \end{align} \]
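To close, here is a sketch comparing the tangent, linear-Gaussian approximation with a brute-force Monte Carlo push of the perturbed state through the nonlinearity (the map \( \pmb{f} \), its Jacobian, and the covariance \( \mathbf{B} \) below are hypothetical choices, with \( \mathbf{B} \) small so the approximation is valid):
```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical nonlinear map f: R^2 -> R^2 and its Jacobian (gradient) matrix.
def f(x):
    return np.array([np.sin(x[0]) + x[1], x[0] * x[1]])

def jacobian_f(x):
    return np.array([[np.cos(x[0]), 1.0],
                     [x[1],         x[0]]])

x_bar = np.array([0.3, 1.2])
B = 1e-2 * np.array([[1.0, 0.2],
                     [0.2, 0.5]])     # small perturbation covariance

# Tangent, linear-Gaussian approximation: mean f(x_bar), covariance (Df) B (Df)^T.
Df = jacobian_f(x_bar)
approx_mean = f(x_bar)
approx_cov = Df @ B @ Df.T

# Brute-force Monte Carlo: push x_bar + delta through f, with delta ~ N(0, B).
deltas = rng.multivariate_normal(np.zeros(2), B, size=50_000)
pushed = np.array([f(x_bar + d) for d in deltas])

print("approx mean:", approx_mean, "  MC mean:", pushed.mean(axis=0))
print("approx cov:\n", approx_cov)
print("MC cov:\n", np.cov(pushed, rowvar=False))
```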