Joint state-parameter estimation

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- Estimating process parameters
- Extended state formalism
- Iterative nonlinear least-squares estimation
- Noise models for the extended state
- Sample-based estimation

Motivation – estimating process parameters

We have now seen a variety of techniques for linear estimation, and some that bridge to nonlinear estimation such as Gauss-Newton.
- Likewise, this development describes the equivalence of the statistical and the optimization approach in terms of:
- the square root Kalman filter; and
- recursive least-squares and Newton's descent.
The statistical and optimization approaches to DA share commonality in nonlinear estimation as well,
- however, we will focus first on their distinct approaches.
After we have developed the fundamentals of their nonlinear application, we will consider how these can be hybridzed in practice.
To begin with a bridge to the nonlinear estimation problem, we will consider when we want to estimate a process model parameter simultaneously with the unknown state \( \pmb{x} \).

Motivation – estimating process parameters

Particularly, we will suppose that we have a hidden Markov model of the form,

\[ \begin{align} \pmb{x}_{k} &= \mathbf{M}_{\pmb{\lambda},k} \pmb{x}_{k-1} + \pmb{w}_k\\ \pmb{y}_{k} &= \mathbf{H}_k\pmb{x}_{k} + \pmb{v}_k \end{align} \]
where \( \pmb{\lambda} \in \mathbb{R}^{N_p} \) is a vector of parameters that the model evolution depends on;
- i.e., if \( \pmb{\lambda} \) is known, the model reduces to \( \mathbf{M}_k \) alone.
However, if \( \pmb{\lambda} \) represents a vector of free variables for the system,
- which themselves are unknown, or known only in an uncertain prior knowledge sense,
the issue arises of how these parameters can or cannot be estimated simultaneously with the evolving state \( \pmb{x}_k \).
In particular, if we were to suppose that there are some true values for which process evolution depends on \( \pmb{\lambda}^\ast \), where

\[ \begin{align} \pmb{x}^\mathrm{true}_k &= \mathbf{M}_{\pmb{\lambda}^\ast, k} \pmb{x}_{k-1}^\mathrm{true} + \pmb{w}_k, \\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_{k}^\mathrm{true} + \pmb{v}_k, \end{align} \]
our simulation with respect to a different choice of parameters \( \pmb{\lambda} \) will generically produce a biased forecast.
In this case, we can frame the joint estimation in a view similar to our discussion of nonlinear least-squares.

Extended state formalism

It is both a classical dynamical systems trick / Bayesian trick to turn the analysis of joint estimation into an “extended state”.
By extended state, we say suppose we have the parameter-dependent hidden Markov Model,

\[ \begin{align} \pmb{x}_{k} &= \mathbf{M}_{\pmb{\lambda},k} \pmb{x}_{k-1} + \pmb{w}_k;\\ \pmb{y}_{k} &= \mathbf{H}_k\pmb{x}_{k} + \pmb{v}_k; \end{align} \]
then, let us write the extended state vector, and noise vector, as

\[ \begin{align} \tilde{\pmb{x}}:= \begin{pmatrix} \pmb{x} \\ \pmb{\lambda} \end{pmatrix}, & & \tilde{\pmb{w}}_k := \begin{pmatrix} \pmb{w}_k \\ \pmb{0} \end{pmatrix}, & & \tilde{\pmb{w}}_k \sim N\left(\pmb{0},\begin{pmatrix}\mathbf{Q}_k & \pmb{0} \\ \pmb{0} & \pmb{0} \end{pmatrix} \right). \end{align} \]
Similarly, let us define the following mechanistic model, model noise and observation operator

\[ \begin{align} \widetilde{\mathcal{M}}_k &:= \begin{pmatrix}\mathbf{M}_{\pmb{\lambda},k} & \pmb{0} \\ \pmb{0} & \mathbf{I}_{N_p} \end{pmatrix} & & \widetilde{\mathbf{H}}_k := \begin{pmatrix} \mathbf{H}_k & \pmb{0} \\ \pmb{0} & \pmb{0} \end{pmatrix}. \end{align} \]
Then an extended state model is defined by

\[ \begin{align} \tilde{\pmb{x}}_{k} &= \widetilde{\mathcal{M}}_{k} \left(\tilde{\pmb{x}}_{k-1}\right) + \tilde{\pmb{w}}_k;\\ \pmb{y}_{k} &= \widetilde{\mathbf{H}}_k\tilde{\pmb{x}}_{k} + \pmb{v}_k; \end{align} \] where \( \widetilde{\mathcal{M}}_k \) is actually a nonlinear function of the extended state.

Iterative nonlinear least-squares estimation

From the extended state model on the last slide, we now have a formal version of a hidden Markov model more generally.
- We may suppose that we initialize with some prior,
\[ \begin{align} \tilde{\pmb{x}}_0 \sim N\left(\begin{pmatrix}\overline{\pmb{x}}_0 \\ \overline{\pmb{\lambda}}_0 \end{pmatrix},\begin{pmatrix}\mathbf{B}_{\pmb{x}} & \mathbf{B}_{\pmb{x},\pmb{\lambda}}\\ \mathbf{B}_{\pmb{\lambda},\pmb{x}} & \mathbf{B}_{\pmb{\lambda}} \end{pmatrix} \right) \equiv N\left( \overline{\tilde{\pmb{x}}}, \widetilde{\mathbf{B}}_0\right); \end{align} \]
however, even with a Gaussian prior, the subsequent evolution will be non-Gaussian generically due to the lack of closure under a nonlinear evolution.
Let's suppose for the moment that we have prior cross covariances between the state and the parameter values, \( \mathbf{B}_{\pmb{x},\pmb{\lambda}} \).
In this case, an iterative approach in terms of the state one-step-back-in-time can be relatively simply formulated as an optimization problem,

\[ \begin{align} \mathcal{J}(\tilde{\pmb{x}}) := \frac{1}{2} \parallel \overline{\tilde{\pmb{x}}}_0 - \tilde{\pmb{x}} \parallel^2_{\widetilde{\mathbf{B}}_0} + \frac{1}{2} \parallel \pmb{y}_1 - \widetilde{\mathbf{H}}_1 \widetilde{\mathcal{M}}_{1}\left( \tilde{\pmb{x}}\right) \parallel^2_{\mathbf{R}_1} . \end{align} \]
That is to say, we wish to minimize the discrepancy between the observed data and the output from input parameters in the extended state,
- in terms of both the time-evolving \( \pmb{x} \) and the fixed \( \pmb{\lambda} \),
where this is now a nonlinear least-squares optimization, which is not necessarily convex.

Iterative nonlinear least-squares estimation

As was discussed with nonlinear least-squares, this type of estimation is often approached via Gauss-Newton, due to the efficient approximation of the Hessian.
Particularly, this requires a calculation of the derivative of \( \nabla_{\pmb{\lambda}} \mathbf{M}_{\pmb{\lambda},k} \);
- even when there is no underlying dependence of \( \mathbf{M}_k \) on a time-evolving state \( \pmb{x} \), this is a complicated procedure, and the objective function Hessian is often not directly computable.
A Gauss-Newton estimation, therefore, iteratively minimizes the cost with the approximate Newton procedure

\[ \begin{align} \tilde{\pmb{x}}_{0}^i := \tilde{\pmb{x}}_0^{i-1} - \widetilde{\mathbf{H}}_{\mathcal{J}}^{-1} \nabla_{\tilde{\pmb{x}}} \mathcal{J}|_{\tilde{\pmb{x}}^{i-1}}, \end{align} \] where:
- \( \tilde{\pmb{x}}^{i}_0 \) is the \( i \)-th iteration for the minimizer for the best estimate of the initial condition in state and parameter space; and
- \( \widetilde{\mathbf{H}}_{\mathcal{J}} \) is the Gauss-Newton approximation to the Hessian.
Such an optimization can be run until the discrepancy between the output and the data reaches a tolerated error level.
- Likewise, this extends naturally over multiple observation times, like the extended cost function seen with the adjoint approximation.
To guarantee a descent, line-search methods modify this technique to efficiently look for a correct step size within the approximate Newton direction.
This has a parallel development in what is known as a “trust region” optimization, which can be derived in a slightly different classical optimization scheme, Levenberg-Marquardt.

Noise models for the extended state

One of the issues about the direct optimization approach seen for the extended state is the susceptibility to local minima for the parameter value \( \pmb{\lambda} \).
In particular, nonlinear cost functions for joint state-parameter estimation tend to be highly sensitive, due to the many possible parameter values that could produce different behaviors in the time-evolution process.
One commonly used rectification for the local minima is to instead view \( \pmb{\lambda} \) as a time-evolving random state, governed by, e.g., a random walk in parameter space,

\[ \begin{align} \tilde{\pmb{x}}:= \begin{pmatrix} \pmb{x} \\ \pmb{\lambda} \end{pmatrix}, & & \tilde{\pmb{w}}_k := \begin{pmatrix} \pmb{w}_k \\ \pmb{p}_k \end{pmatrix}, & & \tilde{\pmb{w}}_k \sim N\left(\pmb{0},\begin{pmatrix}\mathbf{Q}_k & \pmb{0} \\ \pmb{0} & \mathbf{Q}_{\pmb{p}} \end{pmatrix} \right), \\ \tilde{\pmb{x}}_{k} = \widetilde{\mathcal{M}}_{k} \left(\tilde{\pmb{x}}_{k-1}\right) + \tilde{\pmb{w}}_k, & & \pmb{y}_{k} = \widetilde{\mathbf{H}}_k\tilde{\pmb{x}}_{k} + \pmb{v}_k. \end{align} \]
This simple adjustment then allows one to explore the parameter space with a noise model;
- with such a model, this prevents getting trapped in local minima for the optimization by jittering the parameter values at each time.
However, the jittering may also knock the estimate away from the global minima;
- thus tuning the covariance of the random walk \( \mathbf{Q}_{\pmb{p}} \) is of considerable importance, and the “optimality” of the covariance random walk is not in general guaranteed.
The rule-of-thumb is to try such an estimator over multiple walk settings and see for what noise levels one gets consistent results in replication.

Sample-based estimation

When estimating process parameters as above, the quality of the estimates also depends strongly on the level of correlation between the parameter values and the model state evolution.
If the time-evolution of the states is strongly connected to the choice of parameter settings, the parameters can be easily identified by the cross correlation between the observed data and the simulation outputs.
However, these correlations are poorly known in most cases, and specifying \( \mathbf{B}_{\pmb{x},\pmb{\lambda}} \) can rarely be done accurately.
Suppose we do not have any knowledge about the cross covariances, but we do have prior knowledge of their respective backgrounds.
If we applied a Kalman-like estimator directly without specified cross correlation \( \mathbf{B}_{\pmb{x},\pmb{\lambda}} = \pmb{0} \),
- this treats the outcomes in state space as independent from the parameter values in an approximate conditional Gaussian model, which is not what is wanted.
However, these correlations can often be constructed empirically with a sample-based approach from Bayesian analysis…

Sample-based estimation

Instead, we can take a random draw from the prior on the extended state,

\[ \begin{align} \tilde{\pmb{x}}_0^j \sim N\left(\begin{pmatrix}\overline{\pmb{x}}_0 \\ \overline{\pmb{\lambda}}_0 \end{pmatrix},\begin{pmatrix}\mathbf{B}_{\pmb{x}} &\pmb{0}\\ \pmb{0} & \mathbf{B}_{\pmb{\lambda}} \end{pmatrix} \right) & & j=1,\cdots, N_e. \end{align} \]
Although the sample will lack correlation between the two components, by this initialization, the evolution through the process builds correlations between the states and parameters, i.e., let

\[ \begin{align} \tilde{\pmb{x}}^j_1:= \begin{pmatrix} \mathbf{M}_{\pmb{\lambda}^j,1} \pmb{x}_0^j + \pmb{w}_k^j \\ \pmb{\lambda}^j_0 + \pmb{p}^j_1\end{pmatrix} \end{align} \] where:
- \( \pmb{x}_0^j \) refers to the \( j \)-th replicate of the state variable;
- \( \pmb{\lambda}^j_0 \) refers to the \( j \)-th replicate of the initial parameter vector;
- \( \mathbf{M}_{\pmb{\lambda}^j,k} \) refers to the model setting with \( \pmb{\lambda} \) specified as \( \pmb{\lambda}^j \), the \( j \)-th parameter replicate;
- \( \tilde{\pmb{x}}^j_1 \) refers to the \( j \)-th replicate of the extended state at the next time.

Sample-based estimation

Let us denote the ensemble matrix once again, with replicates of the extended state forming the columns

\[ \begin{align} \mathbf{E}_1 := \begin{pmatrix} \tilde{\pmb{x}}_1^1 & \cdots & \tilde{\pmb{x}}_1^{N_e}\end{pmatrix}, \end{align} \] and respectively for the anomaly matrix,

\[ \begin{align} \mathbf{X}_1 := \begin{pmatrix} \tilde{\pmb{x}}_1^1 - \hat{\tilde{\pmb{x}}}_1 & \cdots & \tilde{\pmb{x}}_1^{N_e} -\hat{\tilde{\pmb{x}}}_1 \end{pmatrix}/\sqrt{N_e -1} \end{align} \]
Generically, the ensemble covariance, from the replicates

\[ \begin{align} \mathbf{P}_1 := \mathbf{X}_1 \mathbf{X}^\top_1 = \begin{pmatrix} \mathbf{P}_{\pmb{x}} & \mathbf{P}_{\pmb{x},\pmb{\lambda}} \\ \mathbf{P}_{\pmb{\lambda},\pmb{x}} & \mathbf{P}_{\pmb{\lambda}}\end{pmatrix}, \end{align} \] will not have zero cross covariances.
However, we have yet-to-discuss, how do we use future information like this to condition estimates of past states.
This is strongly related to the extended-in-time cost function in optimization,
- but this also has an important interpretation in terms of a retrospective statistical analysis.
Using future information to update the estimate of a past state is what is broadly known in stochastic filtering as the smoothing problem.
- That is to say, we will apply a retrospective analysis, smoothing all estimates with all future information available.
The differences between the filtering, smoothing and sequential smoothing problems will be discussed in the next lecture.