Presented at the Joint ICTP-IUGG Workshop on Data Assimilation and Inverse Problems in Geophysical Sciences
19 October, 2021
Setting the gradient \( \nabla_{\pmb{w}} \mathcal{J} \) to zero and solving for \( \overline{\pmb{w}} \), we find the critical value as
\[ \begin{align} \overline{\pmb{w}} = \pmb{0} - {\mathbf{H}_\mathcal{J}}^{-1} \nabla_{\pmb{w}} \mathcal{J}|_{\pmb{w}=\pmb{0}}. \end{align} \] where \( \mathbf{H}_\mathcal{J}:= \nabla^2_{\pmb{w}}\mathcal{J} \) is the Hessian of the cost function.
The forecast mean is thus updated as,
\[ \begin{align} \overline{\pmb{x}}_L^\mathrm{filt}:= \overline{\pmb{x}}_{L}^\mathrm{fore} + \boldsymbol{\Sigma}_{L}^\mathrm{fore} \overline{\pmb{w}}. \end{align} \]
If we define the right-transform matrix \( \mathbf{T}:= \mathbf{H}^{-\frac{1}{2}}_{\mathcal{J}} \), we similarly have the update for the covariance as
\[ \begin{align} \mathbf{B}^\mathrm{filt}_L = \left(\boldsymbol{\Sigma}_L^\mathrm{fore} \mathbf{T} \right)\left(\boldsymbol{\Sigma}_L^\mathrm{fore} \mathbf{T} \right)^\top. \end{align} \]
The derivation above describes the square root Kalman filter recursion,3 written for the exact mean and covariance, computed recursively in the linear-Gaussian model.
In particular, under the perfect model assumption, the forecast can furthermore be written as,
\[ \begin{align} \overline{\pmb{x}}_{L+1}^\mathrm{fore}:= \mathbf{M}_{L+1} \overline{\pmb{x}}_{L}^{\mathrm{filt}} & & \boldsymbol{\Sigma}_{L+1}^\mathrm{fore} := \mathbf{M}_{L+1}\left(\boldsymbol{\Sigma}_{L}^\mathrm{filt}\right) \end{align} \] giving a complete recursion in time in terms of the matrix factor.
The mean and covariance update in the square root Kalman filter is then written entirely in terms of the weight vector \( \overline{\pmb{w}} \) and the right-transform matrix \( \mathbf{T} \).
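As a concrete illustration, below is a minimal Julia sketch of this analysis step, written directly in terms of \( \overline{\pmb{w}} \) and \( \mathbf{T} \); the whitened quantities it forms from \( \mathbf{H}_L \), \( \mathbf{R}_L \), and \( \boldsymbol{\Sigma}_L^\mathrm{fore} \) mirror the smoother notation used later, and all function and variable names are illustrative rather than taken from any particular package.

```julia
using LinearAlgebra

# Minimal sketch of the square root Kalman filter analysis step in weight space.
# The whitened quantities Gamma = R^{-1/2} H Sigma and delta = R^{-1/2}(y - H xbar)
# are assumptions made to mirror the smoother notation introduced later.
function sqrt_kf_analysis(xbar_fore::Vector{Float64}, Sigma_fore::Matrix{Float64},
                          H::Matrix{Float64}, R::Matrix{Float64}, y::Vector{Float64})
    R_inv_sqrt = inv(sqrt(Symmetric(R)))
    Gamma = R_inv_sqrt * H * Sigma_fore        # whitened observed covariance factor
    delta = R_inv_sqrt * (y - H * xbar_fore)   # whitened innovation
    hess  = Symmetric(I + Gamma' * Gamma)      # Hessian of the quadratic cost in w
    wbar  = hess \ (Gamma' * delta)            # single Newton step from w = 0
    T     = inv(sqrt(hess))                    # right-transform T = H_J^{-1/2}
    xbar_filt  = xbar_fore + Sigma_fore * wbar # updated mean
    Sigma_filt = Sigma_fore * T                # updated covariance square root factor
    return xbar_filt, Sigma_filt
end
```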
These reductions are at the core of the efficiency of the ensemble transform Kalman filter (ETKF),4 in which we will typically make a reduced-rank approximation to the background covariances \( \mathbf{B}^\mathrm{i}_L \).
Suppose, instead, that we have an ensemble matrix \( \mathbf{E}^\mathrm{i}_L \in \mathbb{R}^{N_x \times N_e} \), whose columns are the ensemble members.
Writing the cost function restricted to the ensemble span, we obtain a restricted Hessian \( \mathbf{H}_{\widetilde{\mathcal{J}}} \) of the form
\[ \begin{align} \mathbf{H}_{\widetilde{\mathcal{J}}} \in \mathbb{R}^{N_e \times N_e}. \end{align} \]
Specifically, we will use the ensemble-based, empirical estimates,
\[ \begin{align} \hat{\pmb{x}}_L^\mathrm{fore} &= \mathbf{E}_L^\mathrm{fore} \pmb{1} / N_e ; & \hat{\pmb{\delta}}_L &= \mathbf{R}_L^{-\frac{1}{2}}\left(\pmb{y}_L - \mathbf{H}_L \hat{\pmb{x}}_L^\mathrm{fore}\right);\\ \mathbf{X}_L^\mathrm{fore} &= \mathbf{E}_L^\mathrm{fore} - \hat{\pmb{x}}^\mathrm{fore}_L \pmb{1}^\top ; & \mathbf{P}^\mathrm{fore}_L &= \mathbf{X}_L^\mathrm{fore} \left(\mathbf{X}_L^\mathrm{fore}\right)^\top / \left(N_e - 1\right);\\ \mathbf{S}_L &:=\mathbf{R}_L^{-\frac{1}{2}}\mathbf{H}_L \mathbf{X}_L^\mathrm{fore}. \end{align} \]
Then the ensemble-based cost function is written as
\[ \begin{alignat}{2} & & {\color{#d95f02} {\widetilde{\mathcal{J}} (\pmb{w})} } &= {\color{#1b9e77} {\frac{1}{2} \parallel \hat{\pmb{x}}_L^\mathrm{fore} - \mathbf{X}^\mathrm{fore}_L \pmb{w}- \hat{\pmb{x}}^\mathrm{fore}_L \parallel_{\mathbf{P}^\mathrm{fore}_L}^2} } + {\color{#7570b3} {\frac{1}{2} \parallel \pmb{y}_L - \mathbf{H}_L\hat{\pmb{x}}^\mathrm{fore}_L - \mathbf{H}_L \mathbf{X}^\mathrm{fore}_L \pmb{w} \parallel_{\mathbf{R}_L}^2 } }\\ \Leftrightarrow& & {\color{#d95f02} {\widetilde{\mathcal{J}}(\pmb{w})} } &= {\color{#1b9e77} {\frac{1}{2}(N_e - 1) \parallel\pmb{w}\parallel^2} } + {\color{#7570b3} {\frac{1}{2}\parallel \hat{\pmb{\delta}}_L - \mathbf{S}_L \pmb{w}\parallel^2 } } \end{alignat} \] which is an optimization over a weight vector \( \pmb{w} \) in the ensemble dimension \( N_e \) rather than in the state space dimension \( N_x \).
This is a key reduction that makes Monte Carlo methods feasible for the large dimension of geophysical models.
This is a sketch of the derivation of the local ensemble transform Kalman filter (LETKF) of Hunt et al.,8 which is among the DA algorithms currently in wide operational use.
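Below is a hedged Julia sketch of a plain (non-localized) ETKF-style analysis built from the empirical estimates above; the localization and inflation used in the operational LETKF are omitted, and the function name and interface are illustrative only.

```julia
using LinearAlgebra

# Hedged sketch of a plain ETKF analysis in the ensemble span (no localization
# or inflation). E_fore has one ensemble member per column; H and R are a linear
# observation operator and its error covariance.
function etkf_analysis(E_fore::Matrix{Float64}, H::Matrix{Float64},
                       R::Matrix{Float64}, y::Vector{Float64})
    Ne   = size(E_fore, 2)
    xhat = E_fore * ones(Ne) / Ne                   # ensemble mean
    X    = E_fore .- xhat                           # anomaly (perturbation) matrix
    R_inv_sqrt = inv(sqrt(Symmetric(R)))
    S     = R_inv_sqrt * H * X                      # whitened observed anomalies
    delta = R_inv_sqrt * (y - H * xhat)             # whitened innovation
    hess  = Symmetric((Ne - 1) * I + S' * S)        # restricted Hessian, Ne x Ne
    wbar  = hess \ (S' * delta)                     # optimal weights in ensemble space
    T     = sqrt(Ne - 1) * inv(sqrt(hess))          # anomaly right-transform
    return xhat .+ X * (wbar .+ T)                  # analysis (filtering) ensemble
end
```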
In this formalism, we can appropriately define an ensemble right-transform \( \boldsymbol{\Psi}_k \) such that for any \( t_k \),
\[ \begin{align} \mathbf{E}^\mathrm{filt}_k = \mathbf{E}^\mathrm{fore}_k \boldsymbol{\Psi}_k \end{align} \] where in the above we would say that \[ \begin{align} \mathbf{E}^\mathrm{filt}_k &\sim p(\pmb{x}_k \vert \pmb{y}_{k:1}) \\ \mathbf{E}^\mathrm{fore}_k &\sim p(\pmb{x}_k \vert \pmb{y}_{k-1:1}) \end{align} \]
We will associate \( \mathbf{E}^\mathrm{filt}_L \equiv \mathbf{E}^\mathrm{smth}_{L|L} \); then, for any \( k \leq L \),
\[ \begin{align} \mathbf{E}^\mathrm{smth}_{k |L} = \mathbf{E}^\mathrm{smth}_{k|L-1}\boldsymbol{\Psi}_{L} & & \mathbf{E}^\mathrm{smth}_{k|L} \sim p(\pmb{x}_k \vert \pmb{y}_{L:1}). \end{align} \]
Then we can perform a retrospective smoothing analysis on all past states stored in memory by using the latest right-transform update from the filtering step.
This form of retrospective analysis is the basis of the ensemble Kalman smoother (EnKS).9
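A schematic Julia sketch of this retrospective analysis follows; it assumes the weights and transform (wbar, T) from the ETKF sketch above, forms the corresponding right-transform \( \boldsymbol{\Psi}_L \), and applies it to a stored collection of smoother ensembles. The names and storage layout are assumptions for illustration.

```julia
using LinearAlgebra

# Schematic sketch of the EnKS retrospective analysis: build the filter
# right-transform Psi_L from the ETKF weights (wbar, T as in the sketch above)
# and apply it to every stored smoother ensemble in place.
function enks_retrospective!(E_store::Vector{Matrix{Float64}},
                             wbar::Vector{Float64}, T::AbstractMatrix)
    Ne  = length(wbar)
    Pi1 = ones(Ne, Ne) / Ne                          # projection onto the ensemble mean
    Psi = Pi1 + (I - Pi1) * (wbar * ones(Ne)' + T)   # right-transform Psi_L
    for k in eachindex(E_store)
        E_store[k] = E_store[k] * Psi                # E^smth_{k|L} = E^smth_{k|L-1} Psi_L
    end
    return E_store
end
```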
Under the linear-Gaussian assumption, the resulting cost function takes the form
\[ \begin{alignat}{2} & & {\color{#d95f02} {\mathcal{J} (\pmb{w})} } &= {\color{#d95f02} {\frac{1}{2} \parallel \overline{\pmb{x}}_{0|L-S}^\mathrm{smth} - \boldsymbol{\Sigma}^\mathrm{smth}_{0|L-S} \pmb{w}- \overline{\pmb{x}}^\mathrm{smth}_{0|L-S} \parallel_{\mathbf{B}^\mathrm{smth}_{0|L-S}}^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathbf{H}_k {\color{#1b9e77} { \mathbf{M}_{k:1}} } {\color{#d95f02} {\overline{\pmb{x}}^\mathrm{smth}_{0|L-S}} } - \mathbf{H}_k {\color{#1b9e77} {\mathbf{M}_{k:1}}} {\color{#d95f02} {\boldsymbol{\Sigma}^\mathrm{smth}_{0|L-S} \pmb{w} }} \parallel_{\mathbf{R}_k}^2 } } \\ \Leftrightarrow& & \mathcal{J}(\pmb{w}) &= \frac{1}{2}\parallel\pmb{w}\parallel^2 + \sum_{k=L-S+1}^L \frac{1}{2}\parallel \overline{\pmb{\delta}}_k - \boldsymbol{\Gamma}_k \pmb{w}\parallel^2 . \end{alignat} \]
In the linear-Gaussian case, the solution can again be found by a single iteration of Newton's descent for the cost function above,
\[ \begin{align} \nabla_{\pmb{w}} \mathcal{J}:= \pmb{w} - \sum_{k=L-S+1}^{L}\boldsymbol{\Gamma}^\top_k \left(\overline{\pmb{\delta}}_k - \boldsymbol{\Gamma}_k \pmb{w}\right), & & \mathbf{H}_{\mathcal{J}} := \mathbf{I} + \sum_{k=L-S+1}^L \boldsymbol{\Gamma}_k^\top \boldsymbol{\Gamma}_k, & & \overline{\pmb{w}} := \pmb{0} - \mathbf{H}_{\mathcal{J}}^{-1} \nabla_{\pmb{w}} \mathcal{J}|_{\pmb{w}=\pmb{0}}\\ \\ \overline{\pmb{x}}_{0|L}^\mathrm{smth} = \overline{\pmb{x}}^\mathrm{smth}_{0|L-S} + \boldsymbol{\Sigma}_{0|L-S}^\mathrm{smth} \overline{\pmb{w}} & & \mathbf{T}:= \mathbf{H}_{\mathcal{J}}^{-\frac{1}{2}} & &\boldsymbol{\Sigma}_{0|L}^\mathrm{smth}:= \boldsymbol{\Sigma}_{0|L-S}^\mathrm{smth}\mathbf{T} \end{align} \]
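The single Newton iteration above translates almost line-for-line into code; the sketch below assumes the whitened matrices \( \boldsymbol{\Gamma}_k \) and innovations \( \overline{\pmb{\delta}}_k \) have already been formed, with illustrative names throughout.

```julia
using LinearAlgebra

# Sketch of the single Newton iteration for the linear-Gaussian 4D cost function.
# Gammas[j] and deltas[j] are assumed to hold the whitened quantities for the
# observation times k = L-S+1, ..., L (names illustrative).
function lg_4d_update(x_smth::Vector{Float64}, Sigma_smth::Matrix{Float64},
                      Gammas::Vector{Matrix{Float64}}, deltas::Vector{Vector{Float64}})
    grad0 = -sum(Gammas[j]' * deltas[j] for j in eachindex(Gammas))    # grad of J at w = 0
    hess  = Symmetric(I + sum(Gammas[j]' * Gammas[j] for j in eachindex(Gammas)))
    wbar  = -(hess \ grad0)                 # Newton step from w = 0
    T     = inv(sqrt(hess))                 # right-transform H_J^{-1/2}
    return x_smth + Sigma_smth * wbar, Sigma_smth * T
end
```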
However, when the state and observation models are nonlinear, with \( \mathcal{H}_k \) and \( \mathcal{M}_{k:1} \) appearing in the cost function, the optimization must be performed iteratively to find a local minimum.
In 4D-VAR, this is performed by an incremental linearization and back propagation of sensitivities by the adjoint model.
In particular, suppose that the equations of motion are generated by a nonlinear function, independent of time for simplicity,
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \pmb{x} = \pmb{f}(\pmb{x}) & &\pmb{x}_k= \mathcal{M}_k(\pmb{x}_{k-1}):= \int_{t_{k-1}}^{t_k} \pmb{f}(\pmb{x}) \mathrm{d}t + \pmb{x}_{k-1} \end{align} \]
We can extend the linear-Gaussian approximation for the forecast density as
\[ \begin{align} &\pmb{x}_{k-1} := \overline{\pmb{x}}_{k-1} + \pmb{\delta}_{k-1} \sim N(\overline{\pmb{x}}_{k-1}, \mathbf{B}_{k-1})\\ \Leftrightarrow &\pmb{\delta}_{k-1} \sim N(\pmb{0}, \mathbf{B}_{k-1}) \end{align} \]
The evolution of the perturbation \( \pmb{\delta} \) can thus be approximated via Taylor's theorem as
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \pmb{\delta}&:= \frac{\mathrm{d}}{\mathrm{d}t}\left( \pmb{x} - \overline{\pmb{x}}\right)\\ &=\pmb{f}(\pmb{x}) - \pmb{f}(\overline{\pmb{x}})\\ &=\nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta} + \mathcal{O}\left(\parallel \pmb{\delta}\parallel^2\right) \end{align} \] where \( \nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}}) \) is the Jacobian of \( \pmb{f} \), with dependence on the underlying mean state.
In particular, the linear evolution defined by the truncated Taylor expansion
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \pmb{\delta}:= \nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta} \end{align} \] is known as the tangent-linear model.
Making the approximation of the tangent-linear model,
\[ \begin{align} &\frac{\mathrm{d}}{\mathrm{d}t} \pmb{x} \approx \pmb{f}(\overline{\pmb{x}}) + \nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta}\\ \\ \Rightarrow&\int_{t_{k-1}}^{t_k}\frac{\mathrm{d}}{\mathrm{d}t}\pmb{x}\mathrm{d}t \approx \int_{t_{k-1}}^{t_k} \pmb{f}(\overline{\pmb{x}}) \mathrm{d}t + \int_{t_{k-1}}^{t_k}\nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta}\mathrm{d}t \\ \\ \Rightarrow & \pmb{x}_{k} \approx \mathcal{M}_{k}\left(\overline{\pmb{x}}_{k-1}\right) + \mathbf{M}_k\pmb{\delta}_{k-1} \end{align} \] where \( \mathbf{M}_k \) is the resolvent of the tangent-linear model.
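As a toy illustration (not from the talk), the sketch below propagates a mean state and a perturbation with a forward-Euler discretization of the tangent-linear model; the Lorenz-96 equations are assumed purely as an example vector field \( \pmb{f} \) with a hand-coded Jacobian.

```julia
# Toy illustration: forward-Euler propagation of a mean state and a tangent-linear
# perturbation, using the Lorenz-96 equations as an assumed example of f.
function l96(x::Vector{Float64}; F::Float64=8.0)
    n = length(x)
    [(x[mod1(i + 1, n)] - x[mod1(i - 2, n)]) * x[mod1(i - 1, n)] - x[i] + F for i in 1:n]
end

function l96_jacobian(x::Vector{Float64})
    n = length(x)
    J = zeros(n, n)
    for i in 1:n
        J[i, mod1(i + 1, n)] += x[mod1(i - 1, n)]
        J[i, mod1(i - 2, n)] -= x[mod1(i - 1, n)]
        J[i, mod1(i - 1, n)] += x[mod1(i + 1, n)] - x[mod1(i - 2, n)]
        J[i, i]              -= 1.0
    end
    return J
end

# Integrate the mean with the nonlinear model and the perturbation with the
# tangent-linear model over [0, T]; the returned delta approximates M_k * delta.
function tlm_propagate(xbar::Vector{Float64}, delta::Vector{Float64};
                       dt::Float64=0.001, T::Float64=0.05)
    for _ in 1:round(Int, T / dt)
        delta = delta + dt * (l96_jacobian(xbar) * delta)  # tangent-linear step
        xbar  = xbar + dt * l96(xbar)                      # nonlinear mean step
    end
    return xbar, delta
end
```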
Because Gaussians are closed under affine transformations, approximating the evolution with the tangent-linear model gives
\[ \begin{align} \pmb{x}_{k} \sim N\left(\mathcal{M}_k\left(\overline{\pmb{x}}_{k-1}\right), \mathbf{M}_k \mathbf{B}_{k-1}\mathbf{M}^\top_{k}\right) \end{align} \]
Therefore, the quadratic cost function is approximated by an incremental linearization along the background mean
\[ \begin{alignat}{2} & & {\color{#d95f02} {\mathcal{J} (\pmb{w})} } &= {\color{#d95f02} {\frac{1}{2} \parallel \pmb{w} \parallel^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathcal{H}_k\circ {\color{#1b9e77} { \mathcal{M}_{k:1} \left( {\color{#d95f02} {\overline{\pmb{x}}^\mathrm{smth}_{0|L-S} } }\right)}} - \mathbf{H}_k {\color{#1b9e77} {\mathbf{M}_{k:1}}} {\color{#d95f02} {\boldsymbol{\Sigma}^\mathrm{smth}_{0|L-S} \pmb{w} } } \parallel_{\mathbf{R}_k}^2 } } \end{alignat} \] describing an approximate linear-Gaussian model / cost function in the space of perturbations.
From the last slide, a very efficient evaluation of the gradient can be performed using the adjoint of the tangent-linear model.
In particular, the adjoint variables are defined by a backward-in-time solution to the linear equation
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \tilde{\pmb{\delta}} = -\left(\nabla_{\pmb{x}} \pmb{f}(\overline{\pmb{x}})\right)^\top \tilde{\pmb{\delta}}, \end{align} \] with the underlying dependence on the nonlinear solution over the time interval.
Therefore, in incremental 4D-VAR, one constructs the gradient of the objective function, differentiating the nonlinear model, via a forward solution of the tangent-linear model and a backward-in-time solution of the adjoint model.
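A minimal sketch of this back-propagation is given below for a discrete-in-time setting: observation-space sensitivities are injected at each time and carried backward through the transposed (adjoint) resolvents. The background term of the gradient is omitted, and all names are illustrative.

```julia
# Minimal sketch: accumulate the observation term of the 4D-VAR gradient by
# back-propagating sensitivities through the transposed (adjoint) resolvents.
# Ms[k] is the assumed tangent-linear resolvent from t_{k-1} to t_k, Hs[k] the
# observation operator, and d[k] = R_k^{-1}(y_k - H_k x_k) the weighted innovation.
function adjoint_gradient(Ms::Vector{Matrix{Float64}}, Hs::Vector{Matrix{Float64}},
                          d::Vector{Vector{Float64}})
    L = length(Ms)
    delta_tilde = zeros(size(Ms[1], 2))             # adjoint variable, run backward in time
    for k in L:-1:1
        delta_tilde = delta_tilde + Hs[k]' * d[k]   # inject the sensitivity at t_k
        delta_tilde = Ms[k]' * delta_tilde          # back-propagate to t_{k-1}
    end
    return -delta_tilde    # observation term of the gradient (background term omitted)
end
```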
This is a very effective and efficient solution, but it relies on the construction of the tangent-linear and adjoint models for the dynamics.
One alternative to constructing the tangent-linear and adjoint models is to perform a hybrid, ensemble-variational (EnVAR) analysis as based on the ETKF.
This approach is at the basis of the iterative ensemble Kalman filter / smoother (IEnKF/S)11.
This technique seeks to perform an ensemble analysis like the square root ETKF by defining the ensemble estimates and the weight vector in the ensemble span
\[ \begin{alignat}{2} & & {\color{#d95f02} {\widetilde{\mathcal{J}} (\pmb{w})} } &= {\color{#d95f02} {\frac{1}{2} \parallel \hat{\pmb{x}}_{0|L-S}^\mathrm{smth} - \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w}- \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} \parallel_{\mathbf{P}^\mathrm{smth}_{0|L-S}}^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathcal{H}_k\circ {\color{#1b9e77} { \mathcal{M}_{k:1}\left( {\color{#d95f02} { \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} + \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w} } } \right)}}\parallel_{\mathbf{R}_k}^2 } }\\ \Leftrightarrow & & {\color{#d95f02} {\widetilde{\mathcal{J}} (\pmb{w})} } &= {\color{#d95f02} {(N_e - 1) \frac{1}{2} \parallel \pmb{w}\parallel^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathcal{H}_k\circ {\color{#1b9e77} { \mathcal{M}_{k:1}\left( {\color{#d95f02} { \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} + \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w} } } \right)}}\parallel_{\mathbf{R}_k}^2 } } \end{alignat} \]
One measures the cost as the discrepancy from the observations with the nonlinear evolution of the perturbation to the ensemble mean,
\[ \begin{align} \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} + \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w} \end{align} \] combined with the size of the perturbation relative to the ensemble spread.
The key, again, is how the gradient of the above cost function is computed.
The gradient of the ensemble-based cost function is given by,
\[ \begin{align} {\color{#d95f02} {\nabla_{\pmb{w}} \widetilde{\mathcal{J}} } }:= {\color{#d95f02} {\left(N_e - 1\right) \pmb{w}}} - {\color{#7570b3} {\sum_{k=L-S+1}^L \widetilde{\mathbf{Y}}_k^\top \mathbf{R}^{-1}_k\left[\pmb{y}_k - \mathcal{H}_k \circ {\color{#1b9e77} { \mathcal{M}_{k:1}\left({\color{#d95f02} {\hat{\pmb{x}}_{0|L-S}^\mathrm{smth} + \mathbf{X}_{0|L-S}^\mathrm{smth} \pmb{w}} }\right) } } \right]}}, \end{align} \]
where \( {\color{#7570b3} { \widetilde{\mathbf{Y}}_k } } \) represents a directional derivative of the observation and state models,
\[ \begin{align} {\color{#7570b3} { \widetilde{\mathbf{Y}}_k } }:= {\color{#d95f02} {\nabla\vert_{\hat{\pmb{x}}^\mathrm{smth}_{0|L-S}} } } {\color{#7570b3} {\left[\mathcal{H}_k \circ {\color{#1b9e77} {\mathcal{M}_{k:1} } } \right] } } {\color{#d95f02} {\mathbf{X}^\mathrm{smth}_{0|L-S}} }. \end{align} \]
In order to avoid the construction of the tangent-linear and adjoint models, the “bundle” version11 makes an explicit finite-difference approximation with the ensemble
\[ \begin{align} {\color{#7570b3} { \widetilde{\mathbf{Y}}_k } }\approx& {\color{#7570b3} { \frac{1}{\epsilon} \mathcal{H}_k \circ {\color{#1b9e77} {\mathcal{M}_{k:1} \left( {\color{#d95f02} { \hat{\pmb{x}}_{0|L-S}^\mathrm{smth} \pmb{1}^\top + \epsilon \mathbf{X}_{0|L-S}^\mathrm{smth} } }\right) } } \left(\mathbf{I}_{N_e} - \pmb{1}\pmb{1}^\top / N_e \right)} }, \end{align} \] for a small constant \( \epsilon \).
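A brief sketch of this bundle approximation is given below, assuming a composed map \( \mathcal{H}_k \circ \mathcal{M}_{k:1} \) is available as a column-wise function; the name `hk_mk` and the interface are hypothetical.

```julia
using LinearAlgebra

# Brief sketch of the bundle approximation of Y_k. The composed, column-wise map
# hk_mk(x) ≈ H_k ∘ M_{k:1}(x) is a hypothetical user-supplied function; xhat and X
# are the smoother ensemble mean and anomalies, and eps is a small constant.
function bundle_Y(hk_mk::Function, xhat::Vector{Float64}, X::Matrix{Float64}, eps::Float64)
    Ne = size(X, 2)
    Eb = xhat * ones(Ne)' + eps * X                     # bundle of scaled perturbations
    Yb = reduce(hcat, [hk_mk(Eb[:, j]) for j in 1:Ne])  # propagate and observe each column
    return (Yb * (I - ones(Ne, Ne) / Ne)) / eps         # re-center and rescale
end
```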
The scheme produces an iterative estimate using a Gauss-Newton-based11 or, e.g., Levenberg-Marquardt-based12 optimization.
A similar scheme used more commonly in reservoir modeling is the ensemble randomized maximum likelihood estimator (EnRML).13
While accuracy increases with iterations in the 4D-MAP estimate, every iteration comes at the cost of an additional model forecast \( {\color{#1b9e77} { \mathcal{M}_{L:1} } } \).
In synoptic meteorology, the linear-Gaussian approximation of the evolution of the densities is in fact adequate;
However, the iterative optimization over a nonlinear observation operator \( \mathcal{H}_k \) or hyper-parameters in the filtering step of the classical EnKS can be run without the additional cost of model forecasts.
Subsequently, the retrospective analysis in the form of the filtering right-transform can be applied to condition the initial ensemble
\[ \begin{align} \mathbf{E}^\mathrm{smth}_{0|L} = \mathbf{E}_{0|L-1}^\mathrm{smth} \boldsymbol{\Psi}_L \end{align} \]
As with the 4D cost function, one can initialize the next DA cycle in terms of the retrospective analysis, and gain the benefit of the improved initial estimate.
This scheme, currently in open review, is the single-iteration ensemble Kalman smoother (SIEnKS).15
These results and a wide variety of other test cases for short-range forecast systems are presented in our manuscript A fast, single-iteration ensemble Kalman smoother for sequential data assimilation in open review.15
However, our mathematical results are supported by extensive numerical demonstration, with the Julia package DataAssimilationBenchmarks.jl.20
The ETKS, (Lin-)IEnKS, and SIEnKS have pseudo-code provided in the manuscript, and implementations in Julia.
We introduce our novel scheme, validating its performance advantages for short-range forecast cycles.
In surveying a variety of schemes, our key point is that the Bayesian MAP analysis can be decomposed in a variety of ways to exploit the operational problem.
Traditional 4D-MAP approaches have largely exploited this analysis for moderately-to-highly nonlinear forecast error dynamics.
However, with a change of perspective, this analysis can similarly be exploited for short-range prediction cycles as in the SIEnKS.
This leads to a simple, outer-loop optimization of the DA cycle itself.