In the last two lectures, we saw some classical means of statistical estimation:
In general, these two estimation schemes lead to different formulations and results in real problems;
Furthermore, the conditional mean and the covariance of this estimator around the modeled state fully parameterize the posterior density.
Given a Gauss-Markov model, with sequential, indirect and noisy observations of the modeled state,
\[ \begin{align} \pmb{x}_k &= \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k, \\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k, \end{align} \] this suggests an immediate path forward on how one can efficiently update this estimate…
Assume the Gauss-Markov model,
\[ \begin{align} \pmb{x}_k &= \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k, \\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k, \end{align} \]
where each of the following are satisfied:
Then we know that the free forecast of the initial state / density is Gaussian at all times and that it can be written recursively in terms of
\[ \begin{align} \pmb{x}_k \sim N\left(\overline{\pmb{x}}_{k}, \mathbf{B}_{k}\right), & & \overline{\pmb{x}}_{k} = \mathbf{M}_k \overline{\pmb{x}}_{k-1}, & & \mathbf{B}_{k} = \mathbf{M}_k\mathbf{B}_{k-1}\mathbf{M}_k^\top + \mathbf{Q}_k. \end{align} \]
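As a quick illustration, the following is a minimal sketch of this forecast recursion in numpy; the function and variable names are hypothetical, not from any lecture code.

```python
import numpy as np

# A minimal sketch of the Gaussian forecast recursion (illustrative names).
def forecast_step(x_bar_prev, B_prev, M_k, Q_k):
    """Propagate the forecast mean and covariance one step of the Gauss-Markov model."""
    x_bar = M_k @ x_bar_prev                 # mean: M_k xbar_{k-1}
    B = M_k @ B_prev @ M_k.T + Q_k           # covariance: M_k B_{k-1} M_k^T + Q_k
    return x_bar, B
```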
Furthermore, given the form of the observations,
\[ \begin{align} \pmb{y}_k - \mathbf{H}_k \pmb{x}_k = \pmb{v}_k \sim N(\pmb{0}, \mathbf{R}_k); \end{align} \]
\[ \begin{align} p(\pmb{y}_k | \pmb{x}_k) &= \left(2 \pi\right)^{-\frac{N_y}{2}} \vert\mathbf{R}_k\vert^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left(\pmb{y}_k - \mathbf{H}_k\pmb{x}_{k}\right)^\top \mathbf{R}^{-1}_k\left(\pmb{y}_k - \mathbf{H}_k\pmb{x}_{k}\right)\right\} \propto \exp\left\{-\frac{1}{2} \parallel\pmb{y}_k - \mathbf{H}_k\pmb{x}_{k} \parallel_{\mathbf{R}_k}^2\right\} \end{align} \]
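As a sketch, the log of this likelihood can be evaluated as follows, assuming \( \mathbf{R}_k \) is symmetric positive definite; the names are illustrative.

```python
import numpy as np

# Evaluate log p(y_k | x_k) for the linear-Gaussian observation model (sketch).
def log_likelihood(y_k, x_k, H_k, R_k):
    innovation = y_k - H_k @ x_k
    N_y = y_k.shape[0]
    _, logdet_R = np.linalg.slogdet(R_k)                    # log |R_k|
    quad = innovation @ np.linalg.solve(R_k, innovation)    # || y - H x ||^2_{R_k}
    return -0.5 * (N_y * np.log(2.0 * np.pi) + logdet_R + quad)
```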
Combining the forecast and the likelihood in Bayes' law, one can derive that
\[ \begin{align} p(\pmb{x}_k | \pmb{y}_k) = \frac{p(\pmb{y}_k | \pmb{x}_k) p(\pmb{x}_k)}{p(\pmb{y}_k)}, \end{align} \] which can also be shown to be Gaussian.
There are a variety of ways to find this posterior, but we will start with the classical version in which we derive the recursion on the mean and the covariance to parameterize the full posterior.
This becomes a two-stage problem then, with:
This implicitly assumes knowledge of the model / observation error covariances, as well as the mean and covariance of the first prior.
However, in reality, we will also want to consider when \( \mathbf{B}_0,\mathbf{Q}_k,\mathbf{R}_k \) themselves are unknown values, and what guarantees we have about obtaining a “good” estimate for \( p(\pmb{x}_k| \pmb{y}_{k:1}) \) when we have uncertainty in these parameters.
Recall we assume that \( \pmb{v}_k \) is independent of \( \pmb{x}_k \), which implies that \( \pmb{x}_k \) and \( \pmb{y}_k \) are jointly Gaussian.
Consider then the mean of \( \pmb{y}_k \),
\[ \begin{align} \mathbb{E}\left[\pmb{y}_k\right] &= \mathbb{E}\left[\mathbf{H}_k \pmb{x}_k + \pmb{v}_k \right]\\ &= \mathbf{H}_k \overline{\pmb{x}}_k \end{align} \]
and its covariance, writing \( \pmb{\delta}_{\pmb{x}_k} := \pmb{x}_k - \overline{\pmb{x}}_k \) for the anomaly of the state from its mean,
\[ \begin{align} \mathrm{cov}(\pmb{y}_k) &= \mathbb{E}\left[\left(\mathbf{H}_k\pmb{\delta}_{\pmb{x}_k} + \pmb{v}_k \right)\left(\mathbf{H}_k\pmb{\delta}_{\pmb{x}_k} + \pmb{v}_k \right)^\top \right]\\ &= \mathbf{H}_k \mathbf{B}_k \mathbf{H}_k^\top + \mathbf{R}_k, \end{align} \]
and its cross-covariance with \( \pmb{x}_k \),
\[ \begin{align} \mathrm{cov}\left(\pmb{x}_k,\pmb{y}_k\right) &= \mathbb{E}\left[\left(\pmb{\delta}_{\pmb{x}_k}\right)\left(\mathbf{H}_k \pmb{\delta}_{\pmb{x}_k} + \pmb{v}_k \right)^\top \right]\\ &= \mathbf{B}_k \mathbf{H}_k^\top. \end{align} \]
Consider thus the jointly distributed Gaussian vector
\[ \begin{align} \begin{pmatrix} \pmb{x}_k \\ \pmb{y}_k \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{\pmb{x}}_k \\ \mathbf{H}_k \overline{\pmb{x}}_k\end{pmatrix}, \begin{pmatrix} \mathbf{B}_k & \mathbf{B}_k \mathbf{H}_k^\top \\ \mathbf{H}_k\mathbf{B}_k & \mathbf{H}_k \mathbf{B}_k \mathbf{H}_k^\top + \mathbf{R}_k \end{pmatrix}\right)... \end{align} \]
writing the conditional Gaussian from the joint Gaussian with knowledge of some observation \( \pmb{y}_k \) gives
\[ \begin{align} \pmb{x}_k | \pmb{y}_k \sim N \big(& \overline{\pmb{x}}_k + \mathbf{B}_{k}\mathbf{H}_k^\top \left(\mathbf{H}_k \mathbf{B}_k\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\left(\pmb{y}_k - \mathbf{H}_k \overline{\pmb{x}}_k\right),\\ &\mathbf{B}_k - \mathbf{B}_k\mathbf{H}_k^\top \left(\mathbf{H}_k \mathbf{B}_k\mathbf{H}_k^\top +\mathbf{R}_k\right)^{-1} \mathbf{H}_k \mathbf{B}_k \big). \end{align} \] These expressions are formally equivalent to the classic Kalman filter update.
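A minimal numpy sketch of this conditioning step is given below; the names are illustrative, and a direct inverse is used only for readability.

```python
import numpy as np

# Condition the joint Gaussian of (x_k, y_k) on an observed y_k (sketch).
def condition_on_observation(x_bar, B, H, R, y):
    """Return the conditional mean and covariance of x_k given y_k."""
    innovation_cov = H @ B @ H.T + R                  # cov(y_k)
    cross_cov = B @ H.T                               # cov(x_k, y_k)
    gain = cross_cov @ np.linalg.inv(innovation_cov)  # Kalman gain
    mean_post = x_bar + gain @ (y - H @ x_bar)
    cov_post = B - gain @ H @ B
    return mean_post, cov_post
```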
Particularly, the classic Kalman filter simply applies the recursion for the mean and covariance directly where:
Put together, we often write for short
\[ \begin{align} \overline{\pmb{x}}_{k|k} &:= \overline{\pmb{x}}_{k|k-1} + \mathbf{K}_k \pmb{\delta}_{k|k-1} \\ \mathbf{B}_{k|k} &:= \left(\mathbf{I} - \mathbf{K}_k \mathbf{H}_k\right) \mathbf{B}_{k|k-1} \end{align} \] where the \( i|j \) refers to the estimate at time \( t_i \), while conditioning on observed information up to time \( t_j \).
Along with the forecast recursion, this actually completely describes the posterior \( p(\pmb{x}_k | \pmb{y}_{k:1}) \).
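Putting the forecast and the update together, a minimal sketch of the full recursion over a sequence of observations might look as follows; the sequences Ms, Qs, Hs, Rs, ys and the function name are hypothetical.

```python
import numpy as np

# A sketch of the full forecast / analysis recursion (illustrative names).
def kalman_filter(x_bar, B, Ms, Qs, Hs, Rs, ys):
    """Return the means / covariances parameterizing p(x_k | y_{k:1}) at each time."""
    means, covs = [], []
    for M, Q, H, R, y in zip(Ms, Qs, Hs, Rs, ys):
        # forecast: propagate the last posterior estimate through the model
        x_bar = M @ x_bar
        B = M @ B @ M.T + Q
        # analysis: condition the forecast on the newly observed y_k
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        x_bar = x_bar + K @ (y - H @ x_bar)
        B = (np.eye(len(x_bar)) - K @ H) @ B
        means.append(x_bar)
        covs.append(B)
    return means, covs
```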
Particularly, if we apply the independence of the observation error / model error / Markov hypothesis, we note that
\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1}) &= \frac{p(\pmb{x}_k,\pmb{y}_{k:1})}{p(\pmb{y}_{k:1})} \\ &=\frac{p(\pmb{y}_{k}|\pmb{x}_k,\pmb{y}_{k-1:1})p(\pmb{x}_k,\pmb{y}_{k-1:1})}{p(\pmb{y}_k,\pmb{y}_{k-1:1})}\\ &=\frac{p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1})p(\pmb{y}_{k-1:1})}{p(\pmb{y}_k|\pmb{y}_{k-1:1})p(\pmb{y}_{k-1:1})}\\ &=\frac{p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1})}{p(\pmb{y}_k|\pmb{y}_{k-1:1})}\\ \end{align} \]
Recall that \( p\left(\pmb{y}_k \vert \pmb{y}_{k-1:1}\right) \) is the marginal of the joint density \( p( \pmb{y}_k, \pmb{x}_k \vert \pmb{y}_{k-1:1}) \), obtained by integrating out the model state,
\[ \begin{align} p\left(\pmb{y}_k \vert \pmb{y}_{k-1:1}\right) = \int p(\pmb{y}_k \vert \pmb{x}_k) p(\pmb{x}_k \vert \pmb{y}_{k-1:1})\mathrm{d}\pmb{x}_{k}. \end{align} \]
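Though it is not needed for the recursion, note that in the linear-Gaussian setting this marginal has a closed form consistent with the joint distribution of the state and observation derived earlier,
\[ \begin{align} \pmb{y}_k \vert \pmb{y}_{k-1:1} \sim N\left(\mathbf{H}_k \overline{\pmb{x}}_{k|k-1}, \mathbf{H}_k \mathbf{B}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k\right). \end{align} \]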
Therefore, we write the posterior up to proportionality as
\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1})\propto p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}), \end{align} \] up to a normalizing constant that does not depend on the model state.
From the last slide, we had that
\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1})\propto p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}), \end{align} \]
so that we interpret:
\[ \begin{align} p(\pmb{x}_{k}|\pmb{y}_{k-1:1}) = \int p(\pmb{x}_k|\pmb{x}_{k-1}) p(\pmb{x}_{k-1}|\pmb{y}_{k-1:1})\mathrm{d}\pmb{x}_{k-1} \end{align} \] to obtain the forecast density for \( \pmb{x}_{k|k-1} \).
\[ \begin{align} p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}); \end{align} \] and
All of these steps are implicitly encoded in the Kalman filter equations for the recursive conditional mean and covariance.
Therefore,
Sequentially updating the forecast with the likelihood of the new data, given our best current estimate, is equivalent to conditioning the current state estimate on all of the past data at once.
Put another way, the Kalman filter equations sequentially parameterize the marginal posterior density,
\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1}) = \int p(\pmb{x}_{k:0}|\pmb{y}_{k:1}) \mathrm{d}\pmb{x}_{k-1:0} \end{align} \]
This is a profound statement and is at the core of the efficiency of the Kalman filter approach, and is worth re-iterating.
In particular,
However, the optimality of this approach relies on a variety of assumptions which may not be realistic at all in practice.
Therefore, we should qualify that the sequential filter estimate is a reasonable approach for many realistic problems, but a superior estimate may be obtained by estimating
\[ p(\pmb{x}_k|\pmb{y}_{k:1}) \] directly instead.
A common modification of the Kalman filter equations is to change this to a square root filter, both for numerical stability and for dimensional reduction purposes.
In order to demonstrate this, we need a highly useful matrix algebra lemma known as the Sherman-Morrison-Woodbury matrix inversion lemma.
Sherman-Morrison-Woodbury matrix inversion
Let \( \mathbf{A}\in \mathbb{R}^{n\times n} \), \( \mathbf{U}\in \mathbb{R}^{n\times m} \), \( \mathbf{V}\in\mathbb{R}^{m\times n} \) and \( \mathbf{C}\in\mathbb{R}^{m\times m} \). Then the equality holds, \[ \begin{align} \left(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V}\right)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\left(\mathbf{C}^{-1}+\mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}\mathbf{A}^{-1} \end{align} \] provided \( \mathbf{A}^{-1} \) and \( \mathbf{C}^{-1} \) exist.
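The identity is easy to check numerically; the following is a small sketch with randomly generated, well-conditioned matrices (the sizes and names are arbitrary).

```python
import numpy as np

# Numerical check of the Sherman-Morrison-Woodbury identity (sketch).
rng = np.random.default_rng(0)
n, m = 5, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # assumed invertible
U = rng.standard_normal((n, m))
V = rng.standard_normal((m, n))
C = np.eye(m) + 0.1 * rng.standard_normal((m, m))   # assumed invertible

lhs = np.linalg.inv(A + U @ C @ V)
A_inv = np.linalg.inv(A)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
print(np.allclose(lhs, rhs))   # expect True up to floating point error
```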
At first glance, it may seem that the identity only makes the inverse more complicated by introducing several additional inverses.
At the moment, the power of this relationship will not be entirely obvious as it requires a reduced-rank approximation of the covariance to deliver a major dimensional reduction.
Recall the form of the Kalman covariance update
\[ \begin{align} \mathbf{B}_{k|k} := \mathbf{B}_{k|k-1} - \mathbf{B}_{k|k-1}\mathbf{H}_k^\top \left(\mathbf{H}_k \mathbf{B}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\mathbf{H}_k\mathbf{B}_{k|k-1} \end{align} \]
Suppose we factorize \( \mathbf{B}_{i|j}= \boldsymbol{\Sigma}_{i|j}\boldsymbol{\Sigma}_{i|j}^\top \) where this could represent, e.g., a Cholesky factor or a symmetric SVD factor.
Substituting this factorization into the update gives
\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left( \mathbf{I} - \boldsymbol{\Sigma}^\top_{k|k-1}\mathbf{H}_k^\top \left(\mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\mathbf{H}_k\boldsymbol{\Sigma}_{k|k-1}\right) \boldsymbol{\Sigma}_{k|k-1}^\top . \end{align} \]
Let us make the substitutions
\[ \begin{align} \mathbf{A}=\mathbf{I} & & \mathbf{U}=\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top & & \mathbf{V}= \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} & & \mathbf{C}^{-1}=\mathbf{R}_k \end{align} \]
where applying the Sherman-Morrison-Woodbury matrix inversion we obtain
\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left(\mathbf{I} +\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top \mathbf{R}^{-1}_k \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} \right)^{-1} \boldsymbol{\Sigma}_{k|k-1}^\top \end{align} \]
The inversion is always numerically well-conditioned, because we add a symmetric positive semi-definite matrix to the identity, so that every eigenvalue of the matrix being inverted is at least one.
If \( \mathbf{R}_k \) has eigenvalues close to zero, the corresponding observations can instead be treated as perfect (without error), and this data can be inserted directly.
Let's consider now the meaning of the square root Kalman filter equations
\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left(\mathbf{I} +\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top \mathbf{R}^{-1}_k \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} \right)^{-1} \boldsymbol{\Sigma}_{k|k-1}^\top \end{align} \]
In the above, let us define
\[ \begin{align} \mathbf{S}_{k|k-1} := \mathbf{R}^{-\frac{1}{2}}_k \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} \end{align} \] where this loosely represents the standard deviation of the background state, transmitted into the observed variables, relative to the standard deviation of the observation error.
Re-writing the above,
\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left(\mathbf{I} +\mathbf{S}_{k|k-1}^\top \mathbf{S}_{k|k-1} \right)^{-1} \boldsymbol{\Sigma}_{k|k-1}^\top, \end{align} \]
let \( \mathbf{T}_k := \left(\mathbf{I} +\mathbf{S}_{k|k-1}^\top \mathbf{S}_{k|k-1} \right)^{-\frac{1}{2}} \).
Thus, up to any mean-preserving orthogonal transformation \( \mathbf{U} \), i.e., any orthogonal matrix satisfying
\[ \begin{align} \mathbf{U}\pmb{1} = \pmb{1}, \end{align} \] the update of the forecast covariance to the posterior covariance can be written implicitly as
\[ \begin{align} \boldsymbol{\Sigma}_{k|k} = \boldsymbol{\Sigma}_{k|k-1}\mathbf{T}_k \mathbf{U}. \end{align} \]
The update equation for the covariance
\[ \begin{align} \boldsymbol{\Sigma}_{k|k} = \boldsymbol{\Sigma}_{k|k-1}\mathbf{T}_k \mathbf{U}. \end{align} \] is known as the right transform method.
This allows us to implicitly represent the covariance in terms of the square root without ever needing to explicitly calculate it.
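A minimal numpy sketch of this right-transform update might look as follows, assuming Sigma is a square root of the forecast covariance (e.g., a Cholesky factor) and R is symmetric positive definite; the names are illustrative.

```python
import numpy as np

# Right-transform square root update (sketch with illustrative names).
def sym_inv_sqrt(A):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def right_transform_update(Sigma, H, R):
    S = sym_inv_sqrt(R) @ H @ Sigma                  # S_{k|k-1} = R^{-1/2} H_k Sigma_{k|k-1}
    T = sym_inv_sqrt(np.eye(S.shape[1]) + S.T @ S)   # T_k = (I + S^T S)^{-1/2}
    return Sigma @ T                                 # Sigma_{k|k} = Sigma_{k|k-1} T_k
```

As a sanity check, the product of the returned factor with its own transpose should match the covariance produced by the standard update \( \mathbf{B}_{k|k} = \left(\mathbf{I} - \mathbf{K}_k\mathbf{H}_k\right)\mathbf{B}_{k|k-1} \), up to floating point error.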
The Kalman gain is thus computed from the above in terms of
\[ \begin{align} \mathbf{K}_k &:= \boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top \left(\mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\\ &=\boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top\mathbf{R}^{-\frac{1}{2}}_k\left(\mathbf{I} + \mathbf{S}_{k|k-1}\mathbf{S}_{k|k-1}^\top \right)^{-1}\mathbf{R}^{-\frac{1}{2}}_k\\ &= \boldsymbol{\Sigma}_{k|k-1}\mathbf{T}^2_k\mathbf{S}_{k|k-1}^\top \mathbf{R}_k^{-\frac{1}{2}}, \end{align} \] where the last equality uses the push-through identity \( \mathbf{S}_{k|k-1}^\top\left(\mathbf{I} + \mathbf{S}_{k|k-1}\mathbf{S}_{k|k-1}^\top\right)^{-1} = \left(\mathbf{I} + \mathbf{S}_{k|k-1}^\top\mathbf{S}_{k|k-1}\right)^{-1}\mathbf{S}_{k|k-1}^\top \).
Therefore, all conditioning operations for the new information are closed in the square root form.
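The gain can therefore be assembled directly from the square root quantities; the following is a self-contained sketch with illustrative names, using the same inverse symmetric square root construction as above.

```python
import numpy as np

# Kalman gain in square root form, K_k = Sigma T_k^2 S^T R^{-1/2} (sketch).
def sqrt_kalman_gain(Sigma, H, R):
    vals, vecs = np.linalg.eigh(R)
    R_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # R^{-1/2}
    S = R_inv_sqrt @ H @ Sigma                           # S_{k|k-1}
    T2 = np.linalg.inv(np.eye(S.shape[1]) + S.T @ S)     # T_k^2 = (I + S^T S)^{-1}
    return Sigma @ T2 @ S.T @ R_inv_sqrt
```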
The forecast of the covariance can be generated in the square root form when we assume that there is no model noise by
\[ \begin{align} \boldsymbol{\Sigma}_{k+1|k}=\mathbf{M}_{k+1} \boldsymbol{\Sigma}_{k|k}, \end{align} \] or by a sampling approach with SDEs under the random flow map, though the square root approach will not be fully closed when there is a nontrivial noise term \( \pmb{w}_k \).