The Kalman filter part II



Outline

  • The following topics will be covered in this lecture:
    • Observability and controllability
    • Filter boundedness and stability
    • Innovation and residual statistics
    • Estimating \( \mathbf{R}_k \)
    • Estimating \( \mathbf{Q}_k \)
    • Biased priors

Motivation

  • In the last lecture, we saw a general derivation of the Kalman filter equations for a discrete Gauss-Markov model.

    • This includes both the classic approach and the more numerically stable square root covariance update equations.
  • We also have a number of guarantees on the optimality of the state estimate:

    • for a linear-Gaussian system, the conditional mean is the minimum variance linear unbiased estimator; and
    • it is likewise the maximum a posteriori estimate; and
    • the mean and covariance directly parameterize the Bayesian marginal posterior, which is known to be Gaussian, derived as

    \[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1}) = \int p(\pmb{x}_{k:0}|\pmb{y}_{k:1})\mathrm{d}\pmb{x}_{k-1:0} \end{align} \] having averaged out all the past states from the joint posterior in time.

  • When the error distributions are non-Gaussian, this estimate remains the best linear unbiased estimator (BLUE), but it may neither parameterize the posterior nor coincide with the maximum a posteriori estimate.

  • However, even if the governing mechanistic laws \( \mathbf{M}_k \) and the observation operator \( \mathbf{H}_k \) are linear, with Gaussian error distributions

    \[ \begin{align} \pmb{x}_0 \sim N(\overline{\pmb{x}}_0 , \mathbf{B}_0), & & \pmb{w}_k \sim N(\pmb{0}, \mathbf{Q}_k), & & \pmb{v}_k \sim N(\pmb{0}, \mathbf{R}_k), \end{align} \]

    • we generally do not actually know any of the above parameters \( \overline{\pmb{x}}_0, \mathbf{B}_0, \mathbf{Q}_k,\mathbf{R}_k \) in practice…

Motivation

  • Two important, related questions then emerge:

    • The question of how to guarantee that the background error covariance \( \mathbf{B}_k \) does not grow to infinite variances is known as filter boundedness.
    • The question of how to guarantee “optimal” performance of a linear Kalman filter with uncertain parameters is known as filter stability.
  • In the case that \( \mathbf{Q}_k \) and \( \mathbf{R}_k \) are known,

    • and they satisfy “observability” and “controllability” conditions,
  • it turns out that the initialization of the prior covariance does not imperil long-term performance, in the sense of either boundedness or stability.

  • When these parameters are unknown, a variety of techniques have been developed to estimate them;

    • we will consider some classical results based on “innovation” and “residual” statistics, though more modern approaches may consider, e.g., Bayesian hierarchical models.
  • Additionally, we will consider the issue of a biased first prior and empirical means of handling this.

Observability and controllability

  • Recall the discrete Gauss-Markov model,

    \[ \begin{align} \pmb{x}_k &= \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k, \\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k. \end{align} \]
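
  • As a concrete reference, below is a minimal NumPy sketch of this model together with one full Kalman filter cycle per observation time; the matrices and dimensions are illustrative assumptions, not values from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    M = np.array([[1.0, 0.1], [0.0, 0.9]])   # mechanistic model matrix M_k
    H = np.array([[1.0, 0.0]])               # observe the first component only
    Q = 0.01 * np.eye(2)                     # model error covariance Q_k
    R = np.array([[0.25]])                   # observation error covariance R_k

    x_true = rng.standard_normal(2)          # a draw for the true initial state
    x_hat, B = np.zeros(2), np.eye(2)        # prior mean and covariance

    for _ in range(50):
        # simulate the Gauss-Markov model: x_k = M x_{k-1} + w_k, y_k = H x_k + v_k
        x_true = M @ x_true + rng.multivariate_normal(np.zeros(2), Q)
        y = H @ x_true + rng.multivariate_normal(np.zeros(1), R)

        # forecast step
        x_hat = M @ x_hat
        B = M @ B @ M.T + Q

        # analysis step with the Kalman gain
        S = H @ B @ H.T + R                  # innovation covariance
        K = B @ H.T @ np.linalg.inv(S)
        x_hat = x_hat + K @ (y - H @ x_hat)
        B = (np.eye(2) - K @ H) @ B

    print("final analysis mean:", x_hat, "true state:", x_true)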

  • To introduce the fundamental boundedness / stability result of the linear Kalman filter, we need to introduce the following definitions.

The information matrix
For the model defined above, the time-varying information matrix is defined as, \[ \begin{align} \boldsymbol{\Phi}_{k:j} := \sum_{l=j}^k \mathbf{M}_{k:l}^{-\top} \mathbf{H}_l^\top \mathbf{R}_l^{-1} \mathbf{H}_l\mathbf{M}_{k:l}^{-1} \end{align} \]
  • The information matrix above can be considered a representation of how much information is transmitted backward-in-time from time \( t_k \) to time \( t_j \) through the observations over this window.
The controllability matrix
For the model defined above, the time-varying controllability matrix is defined as, \[ \begin{align} \boldsymbol{\Upsilon}_{k:j}:= \sum_{l=j}^k \mathbf{M}_{k:l}\mathbf{Q}_l \mathbf{M}_{k:l}^\top \end{align} \]
  • The controllability matrix above represents how an arbitrary initial state can be driven to another state by the sequence of noise realizations combined with the mechanistic laws.
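
  • As a sketch of these two definitions, assuming for simplicity a time-invariant toy system so that \( \mathbf{M}_{k:l} \) reduces to a matrix power, one can evaluate both matrices over a window and inspect their eigenvalues:

    import numpy as np

    M = np.array([[1.0, 0.1], [0.0, 0.9]])   # time-invariant model matrix
    H = np.array([[1.0, 0.0]])
    Q = 0.01 * np.eye(2)
    R_inv = np.linalg.inv(np.array([[0.25]]))
    N = 10                                   # window length

    Phi = np.zeros((2, 2))                   # information matrix Phi_{k:k-N}
    Ups = np.zeros((2, 2))                   # controllability matrix Upsilon_{k:k-N}
    for steps in range(N + 1):
        # M_prop plays the role of M_{k:l}, propagating over k - l = steps steps
        M_prop = np.linalg.matrix_power(M, steps)
        M_inv = np.linalg.inv(M_prop)
        Phi += M_inv.T @ H.T @ R_inv @ H @ M_inv
        Ups += M_prop @ Q @ M_prop.T

    # uniform complete observability / controllability (defined shortly) asks
    # for bounds 0 < a <= eigenvalues <= b, uniformly in the window's end time
    print("eigenvalues of Phi:", np.linalg.eigvalsh(Phi))
    print("eigenvalues of Upsilon:", np.linalg.eigvalsh(Ups))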

Observability and controllability

  • Two key concepts about the observation model and the mechanistic dynamic model then determine the boundedness and stability properties of the filter.

  • In order to understand this, we need to introduce the partial ordering on symmetric, positive semi-definite matrices.

Partial ordering on symmetric, positive semi-definite matrices
Let \( \mathbf{A} \) and \( \mathbf{B} \) be symmetric, positive semi-definite matrices. Then we can declare \[ \begin{align} \mathbf{A} \leq \mathbf{B} \end{align} \] if and only if \( \mathbf{B} - \mathbf{A} \) is itself positive semi-definite, i.e., all of the eigenvalues of \( \mathbf{B} - \mathbf{A} \) are greater than or equal to zero.
  • The above ordering allows us to consider a variety of properties about the covariance of the estimator, including how we mean to bound the covariance.

    • Similarly, this allows us to place lower and upper bounds on the information and controllability matrices.
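
  • As a small numerical sketch of this ordering (the matrices below are arbitrary illustrative choices), \( \mathbf{A} \leq \mathbf{B} \) can be tested by checking that the eigenvalues of \( \mathbf{B} - \mathbf{A} \) are nonnegative:

    import numpy as np

    def psd_leq(A, B, tol=1e-12):
        """Return True when A <= B in the ordering, i.e., B - A is positive semi-definite."""
        return bool(np.all(np.linalg.eigvalsh(B - A) >= -tol))

    A = np.eye(2)
    B = np.array([[2.0, 0.5], [0.5, 2.0]])
    print(psd_leq(A, B))   # True: B - A has eigenvalues 0.5 and 1.5
    print(psd_leq(B, A))   # False
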
Uniform complete observability / controllability
We say that the system is uniformly completely observable (respectively, uniformly completely controllable) if and only if there exist constants \( 0 < a < b < \infty \) independent of \( k \), and some \( N\geq 1 \), such that for all sufficiently large \( k \) \[ \begin{align} a\mathbf{I} \leq \boldsymbol{\Phi}_{k:k-N} \leq b \mathbf{I}, \\ a \mathbf{I} \leq \boldsymbol{\Upsilon}_{k:k-N} \leq b \mathbf{I}. \end{align} \]

Filter boundedness and stability

  • The previous uniform complete observability and controllability conditions respectively guarantee that:

    • given finitely many observations, the initial state of the system (\( N \) steps back in time) can be reconstructed from this information as a linear combination;
    • respectively, the controllability condition describes the ability to move the system from any initial state to a desired state given a finite sequence of control actions—in our case the moves are the realizations of model error.
  • The model error controllability condition thus describes a kind of memoryless condition similar to ergodicity;

    • particularly, no state of the system remains completely time-invariant with respect to the dynamics, and the model is free to explore the entire state space.
  • Put together, this gives the fundamental result of the classical Kalman filter,

Filter boundedness and stability
Let \( \mathbf{B}_0 > 0\mathbf{I} \) be any initialization of the prior covariance satisfying this lower bound in the partial ordering. There exist constants \( 0 < a < b < \infty \) and a universal sequence \( \overline{\mathbf{B}}_k \) for which, if \( \mathbf{B}_k \) is generated by the Kalman filtering equations with \( \mathbf{B}_0 \) as the initialization, \[ \begin{align} \parallel \mathbf{B}_k - \overline{\mathbf{B}}_k \parallel \rightarrow 0 \end{align} \] exponentially fast in \( k \), and \( a\mathbf{I} < \overline{\mathbf{B}}_k < b\mathbf{I} \) for all \( k \).
  • The above means that, for any first prior covariance (background uncertainty), the filter exponentially forgets the prior and converges to a unique, bounded-variance, optimal sequence of posterior estimates, as in the sketch below.
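
  • The following sketch illustrates this exponential forgetting numerically, iterating the filter's covariance recursion from two very different priors; the system matrices are illustrative assumptions:

    import numpy as np

    M = np.array([[1.0, 0.1], [0.0, 0.9]])
    H = np.array([[1.0, 0.0]])
    Q = 0.01 * np.eye(2)
    R = np.array([[0.25]])

    def covariance_cycle(B):
        """One forecast / analysis update of the error covariance (a Riccati step)."""
        B = M @ B @ M.T + Q
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        return (np.eye(2) - K @ H) @ B

    B1, B2 = 1e-3 * np.eye(2), 1e3 * np.eye(2)   # two very different priors
    for k in range(1, 31):
        B1, B2 = covariance_cycle(B1), covariance_cycle(B2)
        if k % 5 == 0:
            print(k, np.linalg.norm(B1 - B2))    # decays exponentially in k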

Filter boundedness and stability

  • We remark that it is also possible to derive filter boundedness and stability results in the case where the system is sufficiently observed but noiseless.

  • This type of system is sometimes denoted a “perfect model”, as the mechanistic process \( \mathbf{M}_k \) completely describes the evolution of the uncertain initial data.

  • This again corresponds to, e.g., an initial value problem with a linear system of ODEs, or with a nonlinear system of ODEs in the space of perturbations (the tangent space), when the tangent-linear model is sufficiently accurate.

  • Under a generic ergodicity assumption (that holds almost surely for the tangent-linear model);

    • and an assumption of the uniform complete observability of the system's dynamical instabilities;
    • with a sufficient rank of the initial covariance;
  • all covariances converge to a universal sequence \( \overline{\mathbf{B}}_k \) which has a column span identical to the unstable and neutral covariant / backward Lyapunov vectors for the system.

  • This is to say that the system's predictive uncertainty is asymptotically low-rank, and the only non-zero variances are in directions of the dynamic instability of the mechanistic model sequence \( \mathbf{M}_k \).

  • This is a modern result that provides some additional extensions to the classical filter boundedness / stability analysis for systems defined by a “perfect model” as above.

Innovation and residual statistics

  • Consider how we earlier defined the Kalman filter innovation and the Kalman filter residual, but let us now treat the conditional mean as a conditional expectation, which we will denote \( \hat{\pmb{x}}_{k|j} \):

    \[ \begin{align} \pmb{\delta}_{k|k-1} &:= \pmb{y}_k - \mathbf{H}_k \hat{\pmb{x}}_{k|k-1},\\ \pmb{\epsilon}_{k|k} &:= \pmb{x}_{k} - \hat{\pmb{x}}_{k|k}. \end{align} \]

  • In the above, we thus view the conditional mean as a random variable, depending on the outcomes of \( \pmb{y}_{k:1} \).

  • The important properties of these variables are their orthogonality and independence properties, which we discuss as follows.

Properties of the innovations / residuals
The innovations and residuals defined above satisfy the following general properties of least-squares estimators: \[ \begin{align} \mathbb{E}\left[\pmb{\epsilon}_{k|k} \hat{\pmb{x}}_{k|k}^\top \right] &= \pmb{0} & & \mathbb{E}\left[\pmb{\delta}_{k|k-1} \pmb{\delta}_{j|j-1}^\top\right] = \delta_{k,j} \left( \mathbf{H}_k \mathbf{B}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k\right) \\ \mathbb{E}\left[\pmb{\epsilon}_{k|k} \pmb{y}_k^\top \right]&= \pmb{0} & & \mathbb{E}\left[\pmb{\delta}_{k|k} \pmb{\delta}_{j|j}^\top\right] = \delta_{k,j} \mathbf{R}_k \end{align} \] where \( \delta_{k,j} \) above is the Kronecker delta.
  • Particularly, the estimator and its error, and the error and the observations, are uncorrelated.

  • Moreover, the innovation and residual sequences are white-in-time, with the known non-zero covariances given above arising only for matching time indices.
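
  • These statistics are easy to check empirically; the following sketch, for a toy scalar system with assumed parameters, compares the sample innovation variance against \( \mathbf{H}_k \mathbf{B}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k \) and checks whiteness with a lag-1 correlation:

    import numpy as np

    rng = np.random.default_rng(1)
    M, H = np.array([[0.95]]), np.array([[1.0]])
    Q, R = np.array([[0.05]]), np.array([[0.2]])

    x_true, x_hat, B = np.zeros(1), np.zeros(1), np.eye(1)
    innovations = []
    for _ in range(5000):
        x_true = M @ x_true + rng.normal(0.0, np.sqrt(Q[0, 0]), 1)
        y = H @ x_true + rng.normal(0.0, np.sqrt(R[0, 0]), 1)
        x_hat, B = M @ x_hat, M @ B @ M.T + Q          # forecast
        S = H @ B @ H.T + R                            # predicted innovation covariance
        d = y - H @ x_hat                              # innovation
        innovations.append(d.item())
        K = B @ H.T @ np.linalg.inv(S)
        x_hat, B = x_hat + K @ d, (np.eye(1) - K @ H) @ B  # analysis

    d = np.array(innovations)
    print("sample innovation variance:", d.var())      # ~ S at steady state
    print("predicted value H B H^T + R:", S.item())
    print("lag-1 correlation:", np.corrcoef(d[:-1], d[1:])[0, 1])  # ~ 0 (white)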

Estimating \( \mathbf{R}_k \)

  • The importance of the last properties lies in the fact that they give a criterion for the accurate specification of the error statistics in the algorithm.

  • Particularly, if we suppose that \( \mathbf{R}_k \) is time-invariant, or slowly varying, we can use the innovation statistics to estimate \( \mathbf{R}_k \).

  • For simplicity, suppose \( \mathbf{R}_k\equiv \mathbf{R} \) is constant;

    • then, with an unbiased initial prior, and supposing that the mechanistic model, the model error, and \( \mathbf{R} \) are all specified correctly,

    \[ \begin{align} \hat{\mathbf{R}} := \frac{1}{L} \sum_{k=1}^L \left[\pmb{y}_k - \mathbf{H}_k \hat{\pmb{x}}_{k|k} \right]\left[\pmb{y}_k - \mathbf{H}_k \hat{\pmb{x}}_{k|k} \right]^\top \end{align} \] can be shown to be an unbiased estimator for \( \mathbf{R} \), though it will be rank-deficient when the number of lagged residuals \( L < N_y \).

  • A mismatch between this estimate and the specified \( \mathbf{R} \) used in the Kalman filter equations evidences an incorrectly specified \( \mathbf{R} \).

    • This can thus be used to tune \( \mathbf{R} \) to find a “correct” observation error covariance.
  • Alternatively, various techniques can then be used to specify the observation error covariance adaptively, such as expectation maximization using the above relationship.
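
  • As a sketch of such tuning, the following uses the innovation form of the statistics above, a common variant of the residual-based estimator on this slide, with an assumed toy model: since \( \mathbb{E}\left[\pmb{\delta}_{k|k-1}\pmb{\delta}_{k|k-1}^\top\right] = \mathbf{H}_k\mathbf{B}_{k|k-1}\mathbf{H}_k^\top + \mathbf{R}_k \), subtracting the background term from the sample innovation covariance estimates \( \mathbf{R} \):

    import numpy as np

    rng = np.random.default_rng(2)
    M, H = np.array([[0.95]]), np.array([[1.0]])
    Q, R_true = np.array([[0.05]]), np.array([[0.2]])

    x_true, x_hat, B = np.zeros(1), np.zeros(1), np.eye(1)
    L, R_hat = 10000, np.zeros((1, 1))
    for _ in range(L):
        x_true = M @ x_true + rng.normal(0.0, np.sqrt(Q[0, 0]), 1)
        y = H @ x_true + rng.normal(0.0, np.sqrt(R_true[0, 0]), 1)
        x_hat, B = M @ x_hat, M @ B @ M.T + Q              # forecast
        d = y - H @ x_hat                                  # innovation
        R_hat += (np.outer(d, d) - H @ B @ H.T) / L        # E[d d^T] - H B H^T -> R
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R_true)
        x_hat, B = x_hat + K @ d, (np.eye(1) - K @ H) @ B  # analysis

    print("estimated R:", R_hat.item(), "true R:", R_true.item())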

Estimating \( \mathbf{Q}_k \)

  • As with the observation error covariance, we can similarly estimate the model error covariance in the case in which \( \mathbf{Q}_k \) is time-invariant or slowly varies in time.

  • For simplicity, suppose that \( \mathbf{Q}_k \equiv \mathbf{Q} \) is fixed in time.

    • Similarly, with an unbiased initial prior, and supposing that the mechanistic model, the model error, and \( \mathbf{R} \) are all specified correctly,

    \[ \begin{align} \hat{\mathbf{Q} } := \frac{1}{L} \sum_{k=1}^L \left[\hat{\pmb{x}}_{k|k} - \mathbf{M}_k \hat{\pmb{x}}_{k-1|k-1} \right]\left[\hat{\pmb{x}}_{k|k} - \mathbf{M}_k \hat{\pmb{x}}_{k-1|k-1} \right]^\top \end{align} \] can be shown to be an unbiased estimator for \( \mathbf{Q} \).

  • As with the last estimator, \( \hat{\mathbf{Q}} \) will be rank-deficient if the number of lagged states \( L < N_x \).

  • This similarly gives a criterion to check whether the model error is specified correctly in the simulations;

    • alternatively, adaptive error estimation is a rich area and has likewise been performed in classical settings with expectation maximization.
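
  • The following sketch accumulates the lagged-state statistic \( \hat{\mathbf{Q}} \) defined above from the filter's analysis means, for a toy scalar model with assumed parameters; in practice the sample statistic is compared against its value under the specified \( \mathbf{Q} \) as a diagnostic of misspecification:

    import numpy as np

    rng = np.random.default_rng(3)
    M, H = np.array([[0.95]]), np.array([[1.0]])
    Q_true, R = np.array([[0.05]]), np.array([[0.2]])

    x_true, x_hat, B = np.zeros(1), np.zeros(1), np.eye(1)
    L, Q_hat = 10000, np.zeros((1, 1))
    for _ in range(L):
        x_true = M @ x_true + rng.normal(0.0, np.sqrt(Q_true[0, 0]), 1)
        y = H @ x_true + rng.normal(0.0, np.sqrt(R[0, 0]), 1)
        x_prev = x_hat                                     # analysis mean at time k-1
        x_hat, B = M @ x_hat, M @ B @ M.T + Q_true         # forecast
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        x_hat, B = x_hat + K @ (y - H @ x_hat), (np.eye(1) - K @ H) @ B
        dx = x_hat - M @ x_prev                            # lagged analysis increment
        Q_hat += np.outer(dx, dx) / L

    print("sample statistic:", Q_hat.item(), "specified Q:", Q_true.item())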

Biased priors

  • You may note that the various results we have given rely on the critical assumption that the prior

    \[ \begin{align} N(\overline{\pmb{x}}_0 ,\mathbf{B}_0) \end{align} \] is actually unbiased, i.e., \( \mathbb{E}\left[\pmb{x}_0\right] = \overline{\pmb{x}}_0 \).

  • This is actually a non-trivial criterion to satisfy, and it isn't easily dealt with in practice.

  • In principle, if we gather enough data, we may be able to find an unbiased estimate for the initialization of a simulation.

    • However, the reality of this is actually quite challenging, and we may in general initialize with a biased prior.
  • Unlike the general convergence of the background covariances, biased priors are not generally guaranteed to lose their initial bias, and the bias may have long-term effects in the prediction cycle.

  • Various techniques are used in practice, including estimating the biases of predictions;

    • we may also consider that the effect of a biased prior is reduced by having a larger background uncertainty.
  • If we inflate our background uncertainty (increase the variances), we put less importance on our prior knowledge and the algorithm is more receptive to the data.

  • Particularly, this reflects the trade-off in the optimal weights between the relative uncertainty of the observations and that of the background state.

  • As a general rule, it is better to overestimate our background uncertainty than to underestimate it; the latter can often lead to what is known as filter divergence in real problems (see the inflation sketch below).
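
  • As a final sketch, the following illustrates multiplicative covariance inflation (the inflation factors and matrices are illustrative assumptions): scaling up the background covariance pushes the Kalman gain toward weighting the observations more heavily.

    import numpy as np

    H = np.array([[1.0, 0.0]])
    R = np.array([[0.25]])
    B = np.array([[0.5, 0.1], [0.1, 0.3]])   # background covariance

    for infl in (1.0, 1.5, 2.0):
        B_infl = infl**2 * B                 # inflate the background variances
        K = B_infl @ H.T @ np.linalg.inv(H @ B_infl @ H.T + R)
        # the gain on the observed component rises with inflation,
        # i.e., the analysis puts more weight on the data
        print(f"inflation {infl}: gain = {K[0, 0]:.3f}")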