The Kalman filter part I

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- Extending the Gauss-Markov model with observations
- The classic Kalman filter equations
- The square root Kalman filter equations

Motivation

In the last two lectures, we saw some classical means of statistical estimation:
1. Minimum variance estimation – finding the linear unbiased estimator that has the least spread about the true solution;
2. Maximum likelihood (a posterori) estimation – finding the value that maximizes the target density;
In general, these two estimation schemes lead to different formulations and results in real problems;
- however, for the multivariate the Gaussian, we see how the conditional mean actually satisfies both of these criteria simultaneously.
Furthermore, the conditional mean and the covariance of this estimator around the modeled state fully parameterizes the posterior density.
Given a Gauss-Markov model, with sequential, indirect and noisy observations of the modeled state,

\[ \begin{align} \pmb{x}_k &= \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k, \\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k, \end{align} \] this suggests an immediate path forward on how one can efficiently update this estimate…

Motivation

Assume the Gauss-Markov model,

\[ \begin{align} \pmb{x}_k &= \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k, \\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k, \end{align} \]
where each of the following are satisfied:
1. $ p(\pmb{x}_0) = N(\overline{\pmb{x}}_0, \mathbf{B}_0) $; and
2. $ \pmb{w}_k \sim N(\pmb{0}, \mathbf{Q}_k) $.
Then we know that the free forecast of the initial state / density is Gaussian at all times and that it can be written recursively in terms of

\[ \begin{align} \pmb{x}_k | \pmb{x}_{k-1} \sim N\left(\mathbf{M}_k \overline{\pmb{x}}_{k-1}, \mathbf{M}_k\mathbf{B}_{k-1}\mathbf{M}_k^\top + \mathbf{Q}_k\right). \end{align} \]
Furthermore, given the form of the observations,

\[ \begin{align} \pmb{y}_k - \mathbf{H}_k \pmb{x}_k = \pmb{v}_k \sim N(\pmb{0}, \mathbf{R}_k); \end{align} \]
- we see that the likelihood of the data is given as
\[ \begin{align} p(\pmb{y}_k | \pmb{x}_k) &= \left(2 \pi\right)^{-\frac{N_y}{2}} \vert\mathbf{R}_k\vert^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left(\pmb{y}_k - \mathbf{H}_k\pmb{x}_{k}\right)^\top \mathbf{R}^{-1}_k\left(\pmb{y}_k - \mathbf{H}_k\pmb{x}_{k}\right)\right\} \propto \exp\left\{-\frac{1}{2} \parallel\pmb{y}_k - \mathbf{H}_k\pmb{x}_{k} \parallel_{\mathbf{R}_k}^2\right\} \end{align} \]

Motivation

Combining the forecast and the likelihood in Bayes' law, one can derive that

\[ \begin{align} \pmb{p}(\pmb{x}_k | \pmb{y}_k) = \frac{p(\pmb{y}_k | \pmb{x}_k) p(\pmb{x}_k)}{p(\pmb{y}_k)}, \end{align} \] which can be shown also to be Gaussian.
There are a variety of ways to find this posterior, but we will start with the classical version in which we derive the recursion on the mean and the covariance to parameterize the full posterior.
This becomes a two-stage problem then, with:
1. a forecast step which we have already derivied the recursion for;
2. an “analysis” step in which we update this conditional density with the information in $ \pmb{y}_k $.
This assumes implicitly the knowledge of the model / observation error covariances, as well as the covariance and mean of the first prior.
- None of these are typically known in practice, so this initial demonstration simply lays the foundation for the “optimal” solution to such a model.
However, in reality, we will also want to consider when $ \mathbf{B}_0,\mathbf{Q}_k,\mathbf{R}_k $ themselves are unknown values, and what guarantees we have about obtaining a “good” estimate for $ p(\pmb{x}_k| \pmb{y}_{k:1}) $ when we have uncertainty in these parameters.
- This will be the focus of part two.

The classic Kalman filter equations

Recall we assume that $ \pmb{v}_k $ is independent from $ \pmb{x}_k $, which implies that $ \pmb{x}_k $ and $ \pmb{y}_k $ are jointly Gaussian.
Consider then the mean of $ \pmb{y}_k $,

\[ \begin{align} \mathbb{E}\left[\pmb{y}_k\right] &= \mathbb{E}\left[\mathbf{H}_k \pmb{x}_k + \pmb{v}_k \right]\\ &= \mathbf{H}_k \overline{\pmb{x}}_k \end{align} \]
and its covariance,

\[ \begin{align} \mathrm{cov}(\pmb{y}_k) &= \mathbb{E}\left[\left(\mathbf{H}_k\pmb{\delta}_{\pmb{x}_k} + \pmb{v}_k \right)\left(\mathbf{H}_k\pmb{\delta}_{\pmb{x}_k} + \pmb{v}_k \right)^\top \right]\\ &= \mathbf{H}_k \mathbf{B}_k \mathbf{H}_k^\top + \mathbf{R}_k, \end{align} \]
and its cross-covariance with $ \pmb{x}_k $,

\[ \begin{align} \mathrm{cov}\left(\pmb{x}_k,\pmb{y}_k\right) &= \mathbb{E}\left[\left(\pmb{\delta}_{\pmb{x}_k}\right)\left(\mathbf{H}_k \pmb{\delta}_{\pmb{x}_k} + \pmb{v}_k \right)^\top \right]\\ &= \mathbf{B}_k \mathbf{H}_k^\top. \end{align} \]
Consider thus the jointly distributed Gaussian vector

\[ \begin{align} \begin{pmatrix} \pmb{x}_k \\ \pmb{y}_k \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{\pmb{x}}_k \\ \mathbf{H}_k \overline{\pmb{x}}_k\end{pmatrix}, \begin{pmatrix} \mathbf{B}_k & \mathbf{B}_k \mathbf{H}_k^\top \\ \mathbf{H}_k\mathbf{B}_k & \mathbf{H}_k \mathbf{B}_k \mathbf{H}_k^\top + \mathbf{R}_k \end{pmatrix}\right)... \end{align} \]

The classic Kalman filter equations

writing the conditional Gaussian from the joint Gaussian with knowledge of some observation $ \pmb{y}_k $ gives

\[ \begin{align} \pmb{x}_k | \pmb{y}_k \sim N \big(& \overline{\pmb{x}}_k + \mathbf{B}_{k}\mathbf{H}_k^\top \left(\mathbf{H}_k \mathbf{B}_k\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\left(\pmb{y}_k - \mathbf{H}_k \overline{\pmb{x}}_k\right),\\ &\mathbf{B}_k - \mathbf{B}_k\mathbf{H}_k^\top \left(\mathbf{H}_k \mathbf{B}_k\mathbf{H}_k^\top +\mathbf{R}_k\right)^{-1} \mathbf{H}_k \mathbf{B}_k \big). \end{align} \] which are formally equivalent to the classic Kalman filter.
Particularly, the classic Kalman filter simply applies the recursion for the mean and covariance directly where:
1. the Kalman gain is defined $$\begin{align} \mathbf{K}_k := \mathbf{B}_{k}\mathbf{H}_k^\top \left(\mathbf{H}_k \mathbf{B}_k\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1} \end{align}$$ which is the optimal correction for the forecast in the sense of the minimum variance and maximum a posteriori estimate; and
2. the innovation vector is defined as $$\begin{align} \pmb{\delta}_k :=\pmb{y}_k-\mathbf{H}_k \overline{\pmb{x}}_k, \end{align}$$ i.e., the discrepancy between our optimal prediction (the forecast mean) and the observed data
Put together, we often write for short

\[ \begin{align} \overline{\pmb{x}}_{k|k} &:= \overline{\pmb{x}}_{k|k-1} + \mathbf{K}_k \pmb{\delta}_{k|k-1} \\ \mathbf{B}_{k|k} &:= \left(\mathbf{I} - \mathbf{K}_k \mathbf{H}_k\right) \mathbf{B}_{k|k-1} \end{align} \] where the $ i|j $ refers to the estimate at time $ t_i $, while conditioning on observed information up to time $ t_j $.
Along with the forecast recursion, this actually completely describes the posterior $ p(\pmb{x}_k | \pmb{y}_{k:1}) $.

The classic Kalman filter equations

Particularly, if we apply the independence of the observation error / model error / Markov hypothesis, we note that

\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1}) &= \frac{p(\pmb{x}_k,\pmb{y}_{k:1})}{p(\pmb{y}_{k:1})} \\ &=\frac{p(\pmb{y}_{k}|\pmb{x}_k,\pmb{y}_{k-1:1})p(\pmb{x}_k,\pmb{y}_{k-1:1})}{p(\pmb{y}_k,\pmb{y}_{k-1:1})}\\ &=\frac{p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1})p(\pmb{y}_{k-1:1})}{p(\pmb{y}_k|\pmb{y}_{k-1:1})p(\pmb{y}_{k-1:1})}\\ &=\frac{p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1})}{p(\pmb{y}_k|\pmb{y}_{k-1:1})}\\ \end{align} \]
Recall that $ p\left(\pmb{y}_k \vert \pmb{y}_{k-1:1}\right) $ is the marginal of joint density $ p( \pmb{y}_k, \pmb{x}_k \vert \pmb{y}_{k-1:1}) $ integrating out the model state

\[ \begin{align} p\left(\pmb{y}_k \vert \pmb{y}_{k-1:1}\right) = \int p(\pmb{y}_k \vert \pmb{x}_k) p(\pmb{x}_k \vert \pmb{y}_{k-1:1})\mathrm{d}\pmb{x}_{k}. \end{align} \]
Therefore, we write that the posterior is proportional as

\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1})\propto p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}), \end{align} \] up to a normalizing constant that does not depend on the model state.

The classic Kalman filter equations

From the last slide, we had that

\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1})\propto p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}), \end{align} \]
so that we interpret:
- Given the last posterior density $ p(\pmb{x}_{k-1}|\pmb{y}_{k-1:1}) $, we can apply the Markov transition kernel
\[ \begin{align} p(\pmb{x}_{k}|\pmb{y}_{k-1:1}) = \int p(\pmb{x}_k|\pmb{x}_{k-1}) p(\pmb{x}_{k-1}|\pmb{y}_{k-1:1})\mathrm{d}\pmb{x}_{k-1} \end{align} \] to obtain the forecast density for $ \pmb{x}_{k|k-1} $.
- We condition based on the likelihood of the observed data, $ \pmb{y}_k $ by multiplication
\[ \begin{align} p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}); \end{align} \] and
- after re-normalization by a constant, we obtain the posterior $ p(\pmb{x}_k | \pmb{y}_{k:1}) $.
All of these steps are implicitly encoded in the Kalman filter equations for the recursive conditional mean and covariance.
- By applying the fact that all of the above densities are Gaussian for the Gauss-Markov model, this directly applies the likelihood step and the normalization step seen above.

The classic Kalman filter equations

Therefore,

Sequentially updating the forecast with the likelihood of the new data given our best current estimate is equivalent to the posterior conditioning the current state estimate on all past data.
Put another way, the Kalman filter equations sequentially parameterize the marginal posterior density,

\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1}) = \int p(\pmb{x}_{k:0}|\pmb{y}_{k:1}) \mathrm{d}\pmb{x}_{k-1:0} \end{align} \]
This is a profound statement and is at the core of the efficiency of the Kalman filter approach, and is worth re-iterating.
In particular,
- the sequential estimate as above by forward forecasting the model state with the Gauss-Markov model;
- and recursively updating the forecast with new data;
- is the same optimal posterior estimate of the current state as with a global analysis over all past data.
However, the optimality of this approach relies on a variety of assumptions which may not be realistic at all in practice.
Therefore, we should qualify that the sequential filter estimate is a reasonable approach for many realistic problems, but a superior estimate may be obtained by estimating

\[ p(\pmb{x}_k|\pmb{y}_{k:1}) \] directly instead.

The square root Kalman filter equations

A common modification of the Kalman filter equations is to change this to a square root filter, both for numerical stability and for dimensional reduction purposes.
In order to demonstrate this, we need a highly useful matrix algebra lemma known as the Sherman-Morrison-Woodbury matrix inversion

Sherman-Morrsion-Woodbury matrix inversion
Let $ \mathbf{A}\in \mathbb{R}^{n\times n} $, $ \mathbf{U}\in \mathbb{R}^{n\times m} $, $ \mathbf{V}\in\mathbb{R}^{m\times n} $ and $ \mathbf{C}\in\mathbb{R}^{m\times m} $. Then the equality holds, \[ \begin{align} \left(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V}\right)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\left(\mathbf{C}^{-1}+\mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}\mathbf{A}^{-1} \end{align} \] provided $ \mathbf{A}^{-1} $ and $ \mathbf{C}^{-1} $ exist.

In the above, it may seem that we simply make the inverse more complicated by lots of inverses.
- However, matrix inversion is an expensive numerical operation for large matrices;
- therefore, if $ m \ll n $ and $ \mathbf{A} $ is simple to invert (e.g., diagonal);
- this reduces the primary numerical operation from $ n\times n $ to $ m\times m $.
At the moment, the power of this relationship will not be entirely obvious as it requires a reduced-rank approximation of the covariance to deliver a major dimensional reduction.
- However, this can still be used to derive the square root Kalman filter equations which are inherently more numerically stable than the traditional approach.

The square root Kalman filter equations

Recall the form of the Kalman covariance update

\[ \begin{align} \mathbf{B}_{k|k} := \mathbf{B}_{k|k-1} - \mathbf{B}_{k|k-1}\mathbf{H}_k^\top \left(\mathbf{H}_k \mathbf{B}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\mathbf{H}_k\mathbf{B}_{k|k-1} \end{align} \]
Suppose we factorize $ \mathbf{B}_{i|j}= \boldsymbol{\Sigma}_{i|j}\boldsymbol{\Sigma}_{i|j}^\top $ where this could represent, e.g., a Cholesky factor or a symmetric SVD factor.
Applying this as such becomes

\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left( \mathbf{I} - \boldsymbol{\Sigma}^\top_{k|k-1}\mathbf{H}_k^\top \left(\mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\mathbf{H}_k\boldsymbol{\Sigma}_{k|k-1}\right) \boldsymbol{\Sigma}_{k|k-1}^\top . \end{align} \]
Let us make the substitutions

\[ \begin{align} \mathbf{A}=\mathbf{I} & & \mathbf{U}=\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top & & \mathbf{V}= \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} & & \mathbf{C}^{-1}=\mathbf{R}_k \end{align} \]
where applying the Sherman-Morrison-Woodbury matrix inversion we obtain

\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left(\mathbf{I} +\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top \mathbf{R}^{-1}_k \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} \right)^{-1} \boldsymbol{\Sigma}_{k|k-1}^\top \end{align} \]
The inversion is always numerically stable, because we are adding a symmetric positive definite matrix to the identity.
If $ \mathbf{R}_k $ has eigenvalues close to zero, we can simply take these observations to be perfect (without error) and simply input this data anyway.

The square root Kalman filter equations

Let's consider now the meaning of the square root Kalman filter equations

\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left(\mathbf{I} +\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top \mathbf{R}^{-1}_k \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} \right)^{-1} \boldsymbol{\Sigma}_{k|k-1}^\top \end{align} \]
In the above, let us define

\[ \begin{align} \mathbf{S}_{k|k-1} := \mathbf{R}^{-\frac{1}{2}}_k \mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1} \end{align} \] where this loosely represents the standard deviation of the background state, transmitted into the observed variables, relative to the standard deviation of the observation error.
Re-writing the above,

\[ \begin{align} \boldsymbol{\Sigma}_{k|k}\boldsymbol{\Sigma}_{k|k}^\top = \boldsymbol{\Sigma}_{k|k-1}\left(\mathbf{I} +\mathbf{S}_{k|k-1}^\top \mathbf{S}_{k|k-1} \right)^{-1} \boldsymbol{\Sigma}_{k|k-1}^\top, \end{align} \]
let $ \mathbf{T}_k := \left(\mathbf{I} +\mathbf{S}_{k|k-1}^\top \mathbf{S}_{k|k-1} \right)^{-\frac{1}{2}} $.
We thus say, up to any mean-preserving orthogonal transformation, i.e.,

\[ \begin{align} \mathbf{U}\pmb{1} = \pmb{1}, \end{align} \] the update of the forecast covariance to the posterior covariance can be written implictly by

\[ \begin{align} \boldsymbol{\Sigma}_{k|k} = \boldsymbol{\Sigma}_{k|k-1}\mathbf{T}_k \mathbf{U}. \end{align} \]

The square root Kalman filter equations

The update equation for the covariance

\[ \begin{align} \boldsymbol{\Sigma}_{k|k} = \boldsymbol{\Sigma}_{k|k-1}\mathbf{T}_k \mathbf{U}. \end{align} \] is known as the right transform method.
This allows us to implicitly represent the covariance in terms of the square root without ever needing to explicitly calculate it.
The Kalman gain is thus computed from the above in terms of

\[ \begin{align} \mathbf{K}_k &:= \boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top \left(\mathbf{H}_k \boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1}\\ &=\boldsymbol{\Sigma}_{k|k-1}\boldsymbol{\Sigma}_{k|k-1}^\top\mathbf{H}_k^\top\mathbf{R}^{-\frac{1}{2}}\left(\mathbf{I} + \mathbf{S}_{k|k-1}\mathbf{S}_{k|k-1}^\top \right)^{-1}\mathbf{R}^{-\frac{1}{2}}\\ &= \boldsymbol{\Sigma}_{k|k-1}\mathbf{S}_{k|k-1}^\top \mathbf{T}^2_k \mathbf{R}_k^{-\frac{1}{2}}. \end{align} \]
Therefore, all conditioning operations for the new information are closed in the square root form.
The forecast of the covariance can be generated in the square root form when we assume that there is no model noise by

\[ \begin{align} \boldsymbol{\Sigma}_{k+1|k}=\mathbf{M}_{k+1} \boldsymbol{\Sigma}_{k|k}, \end{align} \] or by a sampling approach with SDEs under the random flow map, though the square root approach will not be fully closed when there is a nontrivial noise term $ \pmb{w}_k $.