In the last lecture, we saw the many correspondences between the statistical estimation approach of the Kalman filter and the objective function minimization approach from optimization analysis.
In particular, we saw how the square root Kalman filter, with right transform analysis, can be interpreted as a recursive least-squares problem solved with Newton's algorithm.
Similarly, the direct, penalized optimization approach with the state-space cost function revealed that the posterior covariance is the inverse of the Hessian of the cost function.
In this section, we will discuss two important extensions of the optimization approach to estimation:
We recall that, for the Kalman filter, we stated that the approach by recursive estimation,
\[ \begin{align} p(\pmb{x}_k | \pmb{y}_{k:1}) \propto p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k|\pmb{y}_{k-1:1}) \end{align} \] is equivalent to directly estimating the posterior on the left-hand side.
A similar idea to estimating \( \pmb{x}_k \) through the direct approach comes from optimization analysis, when assuming that the Gauss-Markov model is “perfect”, i.e., that there are no shocks to the model evolution
\[ \begin{align} \pmb{x}_k &= \mathbf{M}_k \pmb{x}_{k-1},\\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k. \end{align} \]
This scenario is covered by the classical results for initial value problems, in which finding an “optimal” initial condition is sufficient to determine a unique, optimal solution over the entire time window.
Particularly, we can formally extend the previous penalized objective function as,
\[ \begin{align} \mathcal{J}(\pmb{x}_0) = \frac{1}{2}\parallel \overline{\pmb{x}}_0 -\pmb{x}_0 \parallel^2_{\mathbf{B}_0} + \frac{1}{2}\sum_{k=1}^L \parallel \pmb{y}_k - \mathbf{H}_k \pmb{x}_k\parallel_{\mathbf{R}_k}^2 , \end{align} \]
where we are measuring the deviation of the entire model trajectory \( \pmb{x}_{L:0} \) from the observation time series \( \pmb{y}_{L:1} \) and the background prior mean \( \overline{\pmb{x}}_0 \) simultaneously.
Under the perfect model assumption, we recall that existence and uniqueness guarantee
\[ \begin{align} \pmb{x}_k := \mathbf{M}_{k:1} \pmb{x}_0 \end{align} \] for any given \( k \), though we again have uncertain initial data of the form
\[ \begin{align} \pmb{x}_0 \sim N(\overline{\pmb{x}}_0 , \mathbf{B}_0). \end{align} \]
Using this relationship, we can then write the objective function directly in terms of the uncertain initial condition as follows,
\[ \begin{align} \mathcal{J}(\pmb{x}_0) = \frac{1}{2}\parallel \overline{\pmb{x}}_0 - \pmb{x}_0 \parallel_{\mathbf{B}_0}^2 + \frac{1}{2}\sum_{k=1}^L \parallel \pmb{y}_k - \mathbf{H}_k\mathbf{M}_{k:1} \pmb{x}_0 \parallel^2_{\mathbf{R}_k}. \end{align} \]
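To make this concrete, the following is a minimal sketch (not the course's code) of evaluating \( \mathcal{J}(\pmb{x}_0) \) for the linear, perfect model; the function and argument names (e.g., fourdvar_cost, Ms, Hs) are hypothetical placeholders, and the weighted norms are taken as \( \parallel \pmb{v} \parallel^2_{\mathbf{A}} = \pmb{v}^\top \mathbf{A}^{-1} \pmb{v} \).

```python
import numpy as np

def fourdvar_cost(x0, x0_bar, B0, Ms, Hs, Rs, ys):
    """Evaluate J(x0) for the linear, perfect model.

    Ms[k-1], Hs[k-1], Rs[k-1], ys[k-1] hold M_k, H_k, R_k, y_k for k = 1, ..., L.
    """
    # background penalty:  0.5 * || x0_bar - x0 ||^2_{B_0}
    db = x0_bar - x0
    cost = 0.5 * db @ np.linalg.solve(B0, db)

    # observation penalties along the freely evolved (perfect model) trajectory
    xk = np.array(x0, dtype=float)
    for Mk, Hk, Rk, yk in zip(Ms, Hs, Rs, ys):
        xk = Mk @ xk                       # x_k = M_k x_{k-1}
        dk = yk - Hk @ xk                  # innovation at time k
        cost += 0.5 * dk @ np.linalg.solve(Rk, dk)
    return cost
```

The same forward loop structure reappears below when the gradient is computed with the adjoint model.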
In this simplistic, perfect model setting, it is easy to show that a global analysis can be obtained, analogously to the square root Kalman filter, by defining
\[ \begin{align} \mathbf{B}_0 = \boldsymbol{\Sigma}_{0}\boldsymbol{\Sigma}_0^\top & & \mathbf{M}_{k:1}\boldsymbol{\Sigma}_{0} = \boldsymbol{\Sigma}_k & & \pmb{x}_0 := \overline{\pmb{x}}_0 + \boldsymbol{\Sigma}_0 \pmb{w} \\ \overline{\pmb{x}}_k = \mathbf{M}_{k:1}\overline{\pmb{x}}_0 & & \boldsymbol{\Gamma}_k := \mathbf{R}^{-\frac{1}{2}}_k \mathbf{H}_k\boldsymbol{\Sigma}_k & & \overline{\pmb{\delta}}_k := \mathbf{R}_k^{-\frac{1}{2}}\left(\pmb{y}_k - \mathbf{H}_k \overline{\pmb{x}}_k\right) \end{align} \]
rendering the global cost function
\[ \begin{align} \mathcal{J}_w(\pmb{w}) = \frac{1}{2} \parallel \pmb{w}\parallel^2 + \frac{1}{2} \sum_{k=1}^L \parallel \overline{\pmb{\delta}}_k - \boldsymbol{\Gamma}_k \pmb{w} \parallel^2. \end{align} \]
Particularly, following the last slide, we can perform the same analysis as in the case \( L=1 \) and show that an optimal initial condition is generated via
\[ \begin{align} \nabla_{\pmb{w}} \mathcal{J}_w &= \pmb{w} - \sum_{k=1}^L \boldsymbol{\Gamma}_k^\top\left(\overline{\pmb{\delta}}_k - \boldsymbol{\Gamma}_k\pmb{w} \right)\\ \mathbf{H}_{\mathcal{J}_w} &= \mathbf{I} + \sum_{k=1}^L \boldsymbol{\Gamma}_k^\top \boldsymbol{\Gamma}_k\\ \hat{\pmb{w}}& = \pmb{0} - \mathbf{H}_{\mathcal{J}_w}^{-1} \nabla_{\pmb{w}}\mathcal{J}_w |_{\pmb{w}=\pmb{0}}, \end{align} \] i.e., with a single Newton iteration.
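As a minimal sketch (hypothetical names, assuming the \( \boldsymbol{\Gamma}_k \) and the weighted innovations \( \overline{\pmb{\delta}}_k \) above have been precomputed), this single Newton iteration can be written as follows.

```python
import numpy as np

def newton_weight_analysis(Gammas, deltas):
    """Single Newton iteration for J_w(w), started from w = 0.

    Gammas[k-1] holds Gamma_k = R_k^{-1/2} H_k Sigma_k and deltas[k-1] holds
    the weighted innovation for k = 1, ..., L.
    """
    n = Gammas[0].shape[1]
    grad = np.zeros(n)                      # gradient of J_w evaluated at w = 0
    hess = np.eye(n)                        # Hessian  I + sum_k Gamma_k^T Gamma_k
    for Gk, dk in zip(Gammas, deltas):
        grad -= Gk.T @ dk
        hess += Gk.T @ Gk
    return -np.linalg.solve(hess, grad)     # w_hat = 0 - H^{-1} grad
```

The analyzed initial condition is then recovered as \( \hat{\pmb{x}}_0 = \overline{\pmb{x}}_0 + \boldsymbol{\Sigma}_0 \hat{\pmb{w}} \).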
However, while the above approach works well for the linear-Gaussian perfect model, there are several challenges in applying this technique to a nonlinear system, even with the tangent-linear approximation.
Particularly, if we imagine that \( \mathbf{M}_k \) is actually a fundamental matrix solution for the tangent-linear model,
then computing \( \nabla_{\pmb{x}_0} \mathcal{J} \) requires differentiating the model equations themselves, with their dependence on the underlying nonlinear solution.
In principle, one can use a finite-difference approximation of the gradient of the objective function, though this requires on the order of one cost evaluation per component of the state and quickly becomes impractical for high-dimensional models.
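For instance, a forward finite-difference sketch (with a hypothetical callable cost, such as the cost function sketched earlier) looks as follows.

```python
import numpy as np

def fd_gradient(cost, x0, eps=1e-6):
    """Forward finite-difference approximation of the gradient of cost at x0."""
    g = np.zeros_like(x0, dtype=float)
    j0 = cost(x0)
    for i in range(x0.size):
        x_pert = np.array(x0, dtype=float)
        x_pert[i] += eps                   # perturb one component at a time
        g[i] = (cost(x_pert) - j0) / eps
    return g
```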
We will present here the simplified form of the adjoint solution for the gradient in the case of linear, perfect dynamics, as in the motivation above.
Hybrid approaches (combining variational analysis, statistical analysis, and machine learning) are at the current frontier, and therefore a basic understanding of this approach is useful for understanding the state-of-the-art.
Similarly, the adjoint approximation of the gradient is widely used in machine learning for training neural networks, where it is known as “back propagation”.
This technique has longer roots in functional analysis and the “calculus of variations”, though due to the complexity of these tools we will suppress many of the details in this course.
For now, let us suppress the background term in the objective function, and only focus on an optimization with respect to the observations,
\[ \begin{align} \mathcal{J}(\pmb{x}_0) = \frac{1}{2}\sum_{k=1}^L \parallel \pmb{y}_k - \mathbf{H}_k \mathbf{M}_{k:1}\pmb{x}_0 \parallel^2_{\mathbf{R}_k}. \end{align} \]
Including the background term as well follows in a straightforward way from the analysis with the observations alone.
Let us denote the following,
\[ \begin{align} \pmb{d}_k := \pmb{y}_k - \mathbf{H}_k\mathbf{M}_{k:1}\pmb{x}_0, & & \Delta_k := \mathbf{R}_k^{-1}\pmb{d}_k. \end{align} \]
From this, we re-write the objective function as,
\[ \begin{align} \mathcal{J}(\pmb{x}_0) = \frac{1}{2}\sum_{k=1}^L \pmb{d}_k^\top\mathbf{R}_k^{-1}\pmb{d}_k. \end{align} \]
Informally, we will denote a variation \( \delta \) on a function as a “perturbation in the function space”; we will not develop this notion rigorously here.
Nonetheless, we can loosely interpret the variation through the usual chain rule and product rule as follows, where we take a variation with respect to the function of time generated by the initial data \( \pmb{x}_0 \),
\[ \begin{align} \delta\mathcal{J}(\pmb{x}_0) &= \frac{1}{2}\sum_{k=1}^L \left[ \delta\pmb{d}_k^\top\mathbf{R}_k^{-1}\pmb{d}_k + \pmb{d}_k^\top \mathbf{R}_k^{-1}\delta \pmb{d}_k\right]\\ &= \sum_{k=1}^L \delta\pmb{d}_k^\top \mathbf{R}^{-1}_k \pmb{d}_k \\ &= -\sum_{k=1}^L \left(\mathbf{H}_k\left[\mathbf{M}_{k:1}\delta \pmb{x}_0 \right]\right)^\top\Delta_k \\ &=- \sum_{k=1}^L \delta \pmb{x}_0^\top \left[\mathbf{M}_1^\top \cdots \mathbf{M}_k^\top \right]\mathbf{H}_k^\top \Delta_k. \end{align} \]
The variation of the objective function can be interpreted as a directional derivative in the function perturbation \( \delta \pmb{x}_0 \), i.e., as the inner product of the gradient with this perturbation vector, so that,
\[ \begin{align} \nabla_{\pmb{x}_0} \mathcal{J}& = - \sum_{k=1}^L \left[\mathbf{M}_1^\top \cdots \mathbf{M}_k^\top \right]\mathbf{H}_k^\top \Delta_k \\ &=-\mathbf{M}_1^\top\left[\mathbf{H}_1^\top\Delta_1 + \mathbf{M}_2^\top \left[\mathbf{H}_2^\top \Delta_2 + \mathbf{M}_3^\top \left[\mathbf{H}_3^\top \Delta_3 + \cdots + \left[\mathbf{M}_L^\top \mathbf{H}_L^\top \Delta_L \right] \right] \right] \right] \end{align} \] where the above recursive calculation is known as a Horner factorization.
The benefit of the above approach is that it gives an extremely efficient calculation of the gradient, one that translates well to nonlinear dynamics and the tangent-linear approximation.
Specifically, the gradient can be computed by a freely evolved forward solution of the model, followed by a back-propagation of the sensitivities via the adjoint model.
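A minimal sketch of this forward/backward computation for the linear, perfect model (with the same hypothetical names as above) is the following.

```python
import numpy as np

def adjoint_gradient(x0, Ms, Hs, Rs, ys):
    """Gradient of the observation-only cost via one forward sweep of the model
    and one backward (adjoint) sweep, i.e. the Horner factorization."""
    # forward sweep: evolve the state freely and store the weighted innovations
    xk = np.array(x0, dtype=float)
    Deltas = []
    for Mk, Hk, Rk, yk in zip(Ms, Hs, Rs, ys):
        xk = Mk @ xk
        Deltas.append(np.linalg.solve(Rk, yk - Hk @ xk))   # Delta_k = R_k^{-1} d_k

    # backward sweep: propagate the sensitivities with the adjoint model M_k^T
    adj = np.zeros_like(xk)
    for Mk, Hk, Dk in zip(reversed(Ms), reversed(Hs), reversed(Deltas)):
        adj = Mk.T @ (Hk.T @ Dk + adj)
    return -adj
```

Note that the cost is one forward run of the model and one backward run of the adjoint, regardless of the dimension of \( \pmb{x}_0 \), in contrast to the finite-difference approximation above.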
With the linear derivation above, the form of the solution may seem trivial, as it simply required defining the adjoint model as,
\[ \begin{align} \tilde{\pmb{x}}_k = \mathbf{M}_k^\top \tilde{\pmb{x}}_{k+1}. \end{align} \]
The trick is that, when this is actually defined as a linear approximation of a nonlinear model, it refers to a backward-in-time solution of the linear equation
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \tilde{\pmb{\delta}} = -\left(\nabla_{\pmb{x}} \pmb{f}(\pmb{x}(t))\right)^\top \tilde{\pmb{\delta}}, \end{align} \] with the underlying dependence on the nonlinear solution over the time interval.
Therefore, in a realistic nonlinear setting, one constructs the gradient of the objective function, differentiating the nonlinear model, via a forward solution of the nonlinear dynamics followed by a backward solution of the adjoint equation along the stored trajectory.
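The following is a hedged sketch of this construction for a nonlinear model, using an explicit Euler discretization and, for brevity, direct observation of the full state (\( \mathbf{H}_k = \mathbf{I} \)); the callables f and jac_f (the model vector field and its Jacobian) and the container obs are hypothetical placeholders, not the course's code.

```python
import numpy as np

def adjoint_gradient_nonlinear(x0, f, jac_f, dt, n_steps, obs):
    """Gradient of an observation-only cost for dx/dt = f(x).

    obs maps a step index k to a pair (y_k, R_k), with the full state observed
    directly at those steps (H_k = I).
    """
    # forward sweep: integrate the nonlinear model and store its trajectory
    traj = [np.array(x0, dtype=float)]
    for _ in range(n_steps):
        traj.append(traj[-1] + dt * f(traj[-1]))        # explicit Euler step

    # backward sweep: the discrete adjoint of one Euler step is
    # M_k^T = (I + dt * grad f(x_{k-1}))^T, evaluated on the stored trajectory
    adj = np.zeros_like(traj[0])
    for k in range(n_steps, 0, -1):
        if k in obs:                                     # inject weighted innovation
            yk, Rk = obs[k]
            adj = adj + np.linalg.solve(Rk, yk - traj[k])
        adj = adj + dt * (jac_f(traj[k - 1]).T @ adj)
    return -adj
```

The explicit Euler step is only for illustration; the same forward/backward structure holds for any discretization, with the transpose of that scheme's tangent-linear step used in the backward sweep.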
This is a very effective and efficient solution, but it relies on the differentiability of the function \( \pmb{f} \) (and the ability to compute the derivative);
We can imagine that, for a function \( \frac{\mathrm{d}}{\mathrm{d}t}\pmb{x}=\pmb{f} \) that governs the dynamics, this derivative may be costly or impractical to obtain.
It is useful then to consider a slightly more general model for Newton's descent.
Let us take a general optimization notation, where we will denote the least-squares cost in terms of residuals as
\[ \begin{align} \mathcal{J}(\pmb{x}) &:= \frac{1}{2} \parallel \pmb{y} - \pmb{\phi}(\pmb{x})\parallel^2 \\ &= \frac{1}{2} \pmb{r}^\top \pmb{r} \end{align} \] where \( \pmb{r} := \pmb{y} - \pmb{\phi}(\pmb{x}) \) is the residual vector.
This phrases the optimization as the minimization of a generalized residual sum of squares (RSS).
The linear case occurs as before where \( \pmb{\phi} \) is a linear function, e.g., \( \pmb{\phi}(\pmb{x}) = \mathbf{H} \pmb{x} \).
We've seen how Newton's descent is an effective means of minimizing such a cost function when \( \pmb{\phi} \) is a linear function; when \( \pmb{\phi} \) is nonlinear, however, the exact Hessian involves second derivatives of the residuals, which can be costly to compute.
Nonetheless, it is a special feature of least-squares problems that this can often be well-approximated without a direct calculation of the Hessian.
Specifically, consider that the gradient is given by
\[ \begin{align} \nabla_{\pmb{x}} \mathcal{J} =\left( \nabla_{\pmb{x}} \pmb{r}\right)^\top \pmb{r}, \end{align} \]
where we interpret
\[ \begin{align} \pmb{r} = \begin{pmatrix} r_1(\pmb{x}) \\ \vdots \\ r_{N_y}(\pmb{x}) \end{pmatrix} & & \nabla_{\pmb{x}} \pmb{r} = \begin{pmatrix} \partial_{x_1} r_1(\pmb{x}) & \cdots & \partial_{x_{N_x}} r_1(\pmb{x}) \\ \vdots & \ddots & \vdots \\ \partial_{x_1} r_{N_y}(\pmb{x}) & \cdots & \partial_{x_{N_x}} r_{N_y}(\pmb{x}) \end{pmatrix}, \end{align} \] in terms of the Jacobian of the combined residual vector.
Using this form, we will expand with Taylor's theorem to derive a form for the Hessian of \( \mathcal{J} \).
Specifically, we can write
\[ \begin{align} \nabla_{\pmb{x}} \mathcal{J} = \left(\nabla_{\pmb{x}}\pmb{r}\right)^\top \pmb{r} & & \mathbf{H}_{\mathcal{J}} = \left(\nabla_{\pmb{x}}\pmb{r}\right)^\top \left(\nabla_{\pmb{x}}\pmb{r}\right) + \sum_{k=1}^{N_y} r_k \mathbf{H}_{r_k} \end{align} \]
In most cases, \( \nabla_{\pmb{x}} \pmb{r} \) is easy to calculate; the second-derivative terms \( \mathbf{H}_{r_k} \), by contrast, are typically expensive, and they are precisely the terms the Gauss-Newton approach neglects.
Consider then, for a perturbation \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1} \), we have the second-order Taylor expansion given as,
\[ \begin{align} \mathcal{J}(\pmb{x}_1) = \mathcal{J}(\pmb{x}_0) + \left(\nabla_{\pmb{x}} \mathcal{J}(\pmb{x}_0) \right)^\top\pmb{\delta}_{\pmb{x}_1} + \frac{1}{2} \pmb{\delta}_{\pmb{x}_1}^\top \mathbf{H}_{\mathcal{J}} \pmb{\delta}_{\pmb{x}_1} +\mathcal{O}\left(\parallel \pmb{\delta}_{\pmb{x}_1}\parallel^3\right). \end{align} \]
This leads to a direct approximation of Newton's descent method where we define the second-order approximation as
\[ \begin{align} m(\pmb{\delta}_{\pmb{x}_1}) = \mathcal{J}(\pmb{x}_0) + \pmb{r}^\top\left(\nabla_{\pmb{x}}\pmb{r}\right) \pmb{\delta}_{\pmb{x}_1} + \frac{1}{2}\boldsymbol{\delta}_{\pmb{x}_1}^\top \left(\nabla_{\pmb{x}}\pmb{r}\right)^\top \left(\nabla_{\pmb{x}}\pmb{r}\right)\boldsymbol{\delta}_{x_1} \end{align} \] neglecting the trailing second order terms.
Recall the second-order approximation from the last slide,
\[ \begin{align} m(\pmb{\delta}_{\pmb{x}_1}) = \mathcal{J}(\pmb{x}_0) +\pmb{r}^\top \left(\nabla_{\pmb{x}}\pmb{r}\right) \pmb{\delta}_{\pmb{x}_1} + \frac{1}{2}\boldsymbol{\delta}_{\pmb{x}_1}^\top \left(\nabla_{\pmb{x}}\pmb{r}\right)^\top \left(\nabla_{\pmb{x}}\pmb{r}\right)\boldsymbol{\delta}_{x_1}. \end{align} \]
Setting the derivative of the second-order approximation to zero with respect to the perturbation, we get the Gauss-Newton approximation
\[ \boldsymbol{\delta}_{x_1} = - \left[ \left(\nabla_{\pmb{x}}\pmb{r}\right)^\top \left(\nabla_{\pmb{x}}\pmb{r}\right) \right]^{-1}\left(\nabla_{\pmb{x}}\pmb{r}\right)^\top \pmb{r} \] which can be computed entirely in terms of the residuals and their Jacobian.
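A minimal sketch of the resulting iteration for a generic residual function (with hypothetical callables residual and jacobian) follows.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, n_iter=20, tol=1e-10):
    """Minimize 0.5 * ||r(x)||^2 with Gauss-Newton iterations.

    residual(x) returns r(x); jacobian(x) returns the Jacobian of r at x.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        r = residual(x)
        J = jacobian(x)
        # Gauss-Newton step: solve (J^T J) dx = -J^T r
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x
```

In practice, the step is often computed from a factorization of \( \nabla_{\pmb{x}}\pmb{r} \) itself (e.g., via np.linalg.lstsq) rather than from the explicit normal equations, for better conditioning; library routines such as scipy.optimize.least_squares implement closely related schemes.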
The Gauss-Newton approximation is the basis of a variety of techniques for minimizing nonlinear least-squares objective functions, with behavior approximating that of Newton's method.
This has several benefits: we can guarantee a descent direction similarly to Newton, and this can be exploited with, e.g., line search methods to find a “best approximation” for the descent step.
This also tends to be quite numerically efficient in a variety of scenarios, where \( \nabla_{\pmb{x}} \pmb{r} \) can often be a sparse Jacobian, even when \( \left(\nabla_{\pmb{x}} \pmb{r}\right)^\top\nabla_{\pmb{x}} \pmb{r} \) is not sparse.
In such a setting, we reduce the cost of the explicit Hessian calculation considerably, but maintain many of its benefits in utilizing the local geometry.
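As a sketch of that point (assuming the residual Jacobian is available in a SciPy sparse format), the Gauss-Newton step can be computed without ever forming \( \left(\nabla_{\pmb{x}} \pmb{r}\right)^\top\nabla_{\pmb{x}} \pmb{r} \), by solving the equivalent linear least-squares problem iteratively.

```python
from scipy.sparse.linalg import lsqr

def gauss_newton_step_sparse(J_sparse, r):
    """One Gauss-Newton step with a sparse residual Jacobian.

    The step -(J^T J)^{-1} J^T r is the solution of the linear least-squares
    problem min_dx || J dx + r ||, which LSQR solves using only products
    with J and J^T, so the normal matrix is never formed.
    """
    return lsqr(J_sparse, -r)[0]   # the first entry of the returned tuple is the solution
```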