Presented at the Joint ICTP-IUGG Workshop on Data Assimilation and Inverse Problems in Geophysical Sciences
19 October, 2021
Setting the gradient \( \nabla_{\pmb{w}} \mathcal{J} \) to zero and solving for \( \overline{\pmb{w}} \), we find the critical value as
\[ \begin{align} \overline{\pmb{w}} = \pmb{0} - {\mathbf{H}_\mathcal{J}}^{-1} \nabla_{\pmb{w}} \mathcal{J}|_{\pmb{w}=\pmb{0}}. \end{align} \] where \( \mathbf{H}_\mathcal{J}:= \nabla^2_{\pmb{w}}\mathcal{J} \) is the Hessian of the cost function.
The forecast mean is thus updated as,
\[ \begin{align} \overline{\pmb{x}}_L^\mathrm{filt}:= \overline{\pmb{x}}_{L}^\mathrm{fore} + \boldsymbol{\Sigma}_{L}^\mathrm{fore} \overline{\pmb{w}}. \end{align} \]
If we define the right-transform matrix \( \mathbf{T}:= \mathbf{H}^{-\frac{1}{2}}_{\mathcal{J}} \), we similarly have the update for the covariance as
\[ \begin{align} \mathbf{B}^\mathrm{filt}_L = \left(\boldsymbol{\Sigma}_L^\mathrm{fore} \mathbf{T} \right)\left(\boldsymbol{\Sigma}_L^\mathrm{fore} \mathbf{T} \right)^\top. \end{align} \]
The derivation above describes the square root Kalman filter recursion,3 written for the exact mean and covariance, computed recursively in the linear-Gaussian model.
In particular, under the perfect model assumption, the forecast can furthermore be written as,
\[ \begin{align} \overline{\pmb{x}}_{L+1}^\mathrm{fore}:= \mathbf{M}_{L+1} \overline{\pmb{x}}_{L}^{\mathrm{filt}} & & \boldsymbol{\Sigma}_{L+1}^\mathrm{fore} := \mathbf{M}_{L+1}\left(\boldsymbol{\Sigma}_{L}^\mathrm{filt}\right) \end{align} \] giving a complete recursion in time in terms of the matrix factor.
The mean and covariance update in the square root Kalman filter is then written entirely in terms of the weight vector \( \overline{\pmb{w}} \) and the right-transform matrix \( \mathbf{T} \).
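As a concrete illustration, below is a minimal Julia sketch of this analysis step, written directly in terms of \( \overline{\pmb{w}} \) and \( \mathbf{T} \); the whitened quantities it forms from \( \mathbf{H}_L \), \( \mathbf{R}_L \), and \( \boldsymbol{\Sigma}_L^\mathrm{fore} \) mirror the smoother notation used later, and all function and variable names are illustrative rather than taken from any particular package.

```julia
using LinearAlgebra

# Minimal sketch of the square root Kalman filter analysis step in weight space.
# The whitened quantities Gamma = R^{-1/2} H Sigma and delta = R^{-1/2}(y - H xbar)
# are assumptions made to mirror the smoother notation introduced later.
function sqrt_kf_analysis(xbar_fore::Vector{Float64}, Sigma_fore::Matrix{Float64},
                          H::Matrix{Float64}, R::Matrix{Float64}, y::Vector{Float64})
    R_inv_sqrt = inv(sqrt(Symmetric(R)))
    Gamma = R_inv_sqrt * H * Sigma_fore        # whitened observed covariance factor
    delta = R_inv_sqrt * (y - H * xbar_fore)   # whitened innovation
    hess  = Symmetric(I + Gamma' * Gamma)      # Hessian of the quadratic cost in w
    wbar  = hess \ (Gamma' * delta)            # single Newton step from w = 0
    T     = inv(sqrt(hess))                    # right-transform T = H_J^{-1/2}
    xbar_filt  = xbar_fore + Sigma_fore * wbar # updated mean
    Sigma_filt = Sigma_fore * T                # updated covariance square root factor
    return xbar_filt, Sigma_filt
end
```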
These reductions are at the core of the efficiency of the ensemble transform Kalman filter (ETKF),4 in which we will typically make a reduced-rank approximation to the background covariances \( \mathbf{B}^\mathrm{i}_L \).
Suppose, instead, that we have an ensemble matrix \( \mathbf{E}^\mathrm{i}_L \in \mathbb{R}^{N_x \times N_e} \), whose columns are the ensemble members.
Writing the cost function restricted to the ensemble span, we obtain a restricted Hessian \( \mathbf{H}_{\widetilde{\mathcal{J}}} \) of the form
\[ \begin{align} \mathbf{H}_{\widetilde{\mathcal{J}}} \in \mathbb{R}^{N_e \times N_e}. \end{align} \]
Specifically, we will use the ensemble-based, empirical estimates,
\[ \begin{align} \hat{\pmb{x}}_L^\mathrm{fore} &= \mathbf{E}_L^\mathrm{fore} \pmb{1} / N_e ; & \hat{\pmb{\delta}}_L &= \mathbf{R}_L^{-\frac{1}{2}}\left(\pmb{y}_L - \mathbf{H}_L \hat{\pmb{x}}_L^\mathrm{fore}\right);\\ \mathbf{X}_L^\mathrm{fore} &= \mathbf{E}_L^\mathrm{fore} - \hat{\pmb{x}}^\mathrm{fore}_L \pmb{1}^\top ; & \mathbf{P}^\mathrm{fore}_L &= \mathbf{X}_L^\mathrm{fore} \left(\mathbf{X}_L^\mathrm{fore}\right)^\top / \left(N_e - 1\right);\\ \mathbf{S}_L &:=\mathbf{R}_L^{-\frac{1}{2}}\mathbf{H}_L \mathbf{X}_L^\mathrm{fore}. \end{align} \]
Then the ensemble-based cost function is written as
\[ \begin{alignat}{2} & & {\color{#d95f02} {\widetilde{\mathcal{J}} (\pmb{w})} } &= {\color{#1b9e77} {\frac{1}{2} \parallel \hat{\pmb{x}}_L^\mathrm{fore} - \mathbf{X}^\mathrm{fore}_L \pmb{w}- \hat{\pmb{x}}^\mathrm{fore}_L \parallel_{\mathbf{P}^\mathrm{fore}_L}^2} } + {\color{#7570b3} {\frac{1}{2} \parallel \pmb{y}_L - \mathbf{H}_L\hat{\pmb{x}}^\mathrm{fore}_L - \mathbf{H}_L \mathbf{X}^\mathrm{fore}_L \pmb{w} \parallel_{\mathbf{R}_L}^2 } }\\ \Leftrightarrow& & {\color{#d95f02} {\widetilde{\mathcal{J}}(\pmb{w})} } &= {\color{#1b9e77} {\frac{1}{2}(N_e - 1) \parallel\pmb{w}\parallel^2} } + {\color{#7570b3} {\frac{1}{2}\parallel \hat{\pmb{\delta}}_L - \mathbf{S}_L \pmb{w}\parallel^2 } } \end{alignat} \] which is an optimization over a weight vector \( \pmb{w} \) in the ensemble dimension \( N_e \) rather than in the state space dimension \( N_x \).
This is a key reduction that makes Monte Carlo methods feasible for the large dimension of geophysical models.
This is a sketch of the derivation of the local ensemble transform Kalman filter (LETKF) of Hunt et al.,8 which is among the DA algorithms currently in wide operational use.
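Below is a hedged Julia sketch of a plain (non-localized) ETKF-style analysis built from the empirical estimates above; the localization and inflation used in the operational LETKF are omitted, and the function name and interface are illustrative only.

```julia
using LinearAlgebra

# Hedged sketch of a plain ETKF analysis in the ensemble span (no localization
# or inflation). E_fore has one ensemble member per column; H and R are a linear
# observation operator and its error covariance.
function etkf_analysis(E_fore::Matrix{Float64}, H::Matrix{Float64},
                       R::Matrix{Float64}, y::Vector{Float64})
    Ne   = size(E_fore, 2)
    xhat = E_fore * ones(Ne) / Ne                   # ensemble mean
    X    = E_fore .- xhat                           # anomaly (perturbation) matrix
    R_inv_sqrt = inv(sqrt(Symmetric(R)))
    S     = R_inv_sqrt * H * X                      # whitened observed anomalies
    delta = R_inv_sqrt * (y - H * xhat)             # whitened innovation
    hess  = Symmetric((Ne - 1) * I + S' * S)        # restricted Hessian, Ne x Ne
    wbar  = hess \ (S' * delta)                     # optimal weights in ensemble space
    T     = sqrt(Ne - 1) * inv(sqrt(hess))          # anomaly right-transform
    return xhat .+ X * (wbar .+ T)                  # analysis (filtering) ensemble
end
```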
In this formalism, we can appropriately define an ensemble right-transform \( \boldsymbol{\Psi}_k \) such that for any \( t_k \),
\[ \begin{align} \mathbf{E}^\mathrm{filt}_k = \mathbf{E}^\mathrm{fore}_k \boldsymbol{\Psi}_k \end{align} \] where in the above we would say that \[ \begin{align} \mathbf{E}^\mathrm{filt}_k &\sim p(\pmb{x}_k \vert \pmb{y}_{k:1}) \\ \mathbf{E}^\mathrm{fore}_k &\sim p(\pmb{x}_k \vert \pmb{y}_{k-1:1}) \end{align} \]
We will associate \( \mathbf{E}^\mathrm{filt}_L \equiv \mathbf{E}^\mathrm{smth}_{L|L} \); then, for any \( k \leq L \),
\[ \begin{align} \mathbf{E}^\mathrm{smth}_{k |L} = \mathbf{E}^\mathrm{smth}_{k|L-1}\boldsymbol{\Psi}_{L} & & \mathbf{E}^\mathrm{smth}_{k|L} \sim p(\pmb{x}_k \vert \pmb{y}_{L:1}). \end{align} \]
Then we can perform a retrospective smoothing analysis on all past states stored in memory by using the latest right-transform update from the filtering step.
This form of retrospective analysis is the basis of the ensemble Kalman smoother (EnKS).9
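A schematic Julia sketch of this retrospective analysis follows; it assumes the weights and transform (wbar, T) from the ETKF sketch above, forms the corresponding right-transform \( \boldsymbol{\Psi}_L \), and applies it to a stored collection of smoother ensembles. The names and storage layout are assumptions for illustration.

```julia
using LinearAlgebra

# Schematic sketch of the EnKS retrospective analysis: build the filter
# right-transform Psi_L from the ETKF weights (wbar, T as in the sketch above)
# and apply it to every stored smoother ensemble in place.
function enks_retrospective!(E_store::Vector{Matrix{Float64}},
                             wbar::Vector{Float64}, T::AbstractMatrix)
    Ne  = length(wbar)
    Pi1 = ones(Ne, Ne) / Ne                          # projection onto the ensemble mean
    Psi = Pi1 + (I - Pi1) * (wbar * ones(Ne)' + T)   # right-transform Psi_L
    for k in eachindex(E_store)
        E_store[k] = E_store[k] * Psi                # E^smth_{k|L} = E^smth_{k|L-1} Psi_L
    end
    return E_store
end
```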
Under the linear-Gaussian assumption, the resulting cost function takes the form
\[ \begin{alignat}{2} & & {\color{#d95f02} {\mathcal{J} (\pmb{w})} } &= {\color{#d95f02} {\frac{1}{2} \parallel \overline{\pmb{x}}_{0|L-S}^\mathrm{smth} - \boldsymbol{\Sigma}^\mathrm{smth}_{0|L-S} \pmb{w}- \overline{\pmb{x}}^\mathrm{smth}_{0|L-S} \parallel_{\mathbf{B}^\mathrm{smth}_{0|L-S}}^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathbf{H}_k {\color{#1b9e77} { \mathbf{M}_{k:1}} } {\color{#d95f02} {\overline{\pmb{x}}^\mathrm{smth}_{0|L-S}} } - \mathbf{H}_k {\color{#1b9e77} {\mathbf{M}_{k:1}}} {\color{#d95f02} {\boldsymbol{\Sigma}^\mathrm{smth}_{0|L-S} \pmb{w} }} \parallel_{\mathbf{R}_k}^2 } } \\ \Leftrightarrow& & \mathcal{J}(\pmb{w}) &= \frac{1}{2}\parallel\pmb{w}\parallel^2 + \sum_{k=L-S+1}^L \frac{1}{2}\parallel \overline{\pmb{\delta}}_k - \boldsymbol{\Gamma}_k \pmb{w}\parallel^2 . \end{alignat} \]
In the linear-Gaussian case, the solution can again be found by a single iteration of Newton's descent for the cost function above,
\[ \begin{align} \nabla_{\pmb{w}} \mathcal{J}:= \pmb{w} - \sum_{k=L-S+1}^{L}\boldsymbol{\Gamma}^\top_k \left(\overline{\pmb{\delta}}_k - \boldsymbol{\Gamma}_k \pmb{w}\right), & & \mathbf{H}_{\mathcal{J}} := \mathbf{I} + \sum_{k=L-S+1}^L \boldsymbol{\Gamma}_k^\top \boldsymbol{\Gamma}_k, & & \overline{\pmb{w}} := \pmb{0} - \mathbf{H}_{\mathcal{J}}^{-1} \nabla_{\pmb{w}} \mathcal{J}|_{\pmb{w}=\pmb{0}}\\ \\ \overline{\pmb{x}}_{0|L}^\mathrm{smth} = \overline{\pmb{x}}^\mathrm{smth}_{0|L-S} + \boldsymbol{\Sigma}_{0|L-S}^\mathrm{smth} \overline{\pmb{w}} & & \mathbf{T}:= \mathbf{H}_{\mathcal{J}}^{-\frac{1}{2}} & &\boldsymbol{\Sigma}_{0|L}^\mathrm{smth}:= \boldsymbol{\Sigma}_{0|L-S}^\mathrm{smth}\mathbf{T} \end{align} \]
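The single Newton iteration above translates almost line-for-line into code; the sketch below assumes the whitened matrices \( \boldsymbol{\Gamma}_k \) and innovations \( \overline{\pmb{\delta}}_k \) have already been formed, with illustrative names throughout.

```julia
using LinearAlgebra

# Sketch of the single Newton iteration for the linear-Gaussian 4D cost function.
# Gammas[j] and deltas[j] are assumed to hold the whitened quantities for the
# observation times k = L-S+1, ..., L (names illustrative).
function lg_4d_update(x_smth::Vector{Float64}, Sigma_smth::Matrix{Float64},
                      Gammas::Vector{Matrix{Float64}}, deltas::Vector{Vector{Float64}})
    grad0 = -sum(Gammas[j]' * deltas[j] for j in eachindex(Gammas))    # grad of J at w = 0
    hess  = Symmetric(I + sum(Gammas[j]' * Gammas[j] for j in eachindex(Gammas)))
    wbar  = -(hess \ grad0)                 # Newton step from w = 0
    T     = inv(sqrt(hess))                 # right-transform H_J^{-1/2}
    return x_smth + Sigma_smth * wbar, Sigma_smth * T
end
```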
However, when the state and observation models are nonlinear, with \( \mathcal{H}_k \) and \( \mathcal{M}_{k:1} \) appearing in the cost function, the optimization must be performed iteratively to find a local minimum.
In 4D-VAR, this is performed by an incremental linearization and back propagation of sensitivities by the adjoint model.
In particular, suppose that the equations of motion are generated by a nonlinear function, independent of time for simplicity,
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \pmb{x} = \pmb{f}(\pmb{x}) & &\pmb{x}_k= \mathcal{M}_k(\pmb{x}_{k-1}):= \int_{t_{k-1}}^{t_k} \pmb{f}(\pmb{x}) \mathrm{d}t + \pmb{x}_{k-1} \end{align} \]
We can extend the linear-Gaussian approximation for the forecast density as
\[ \begin{align} &\pmb{x}_{k-1} := \overline{\pmb{x}}_{k-1} + \pmb{\delta}_{k-1} \sim N(\overline{\pmb{x}}_{k-1}, \mathbf{B}_{k-1})\\ \Leftrightarrow &\pmb{\delta}_{k-1} \sim N(\pmb{0}, \mathbf{B}_{k-1}) \end{align} \]
The evolution of the perturbation \( \pmb{\delta} \) can thus be approximated via Taylor's theorem as
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \pmb{\delta}&:= \frac{\mathrm{d}}{\mathrm{d}t}\left( \pmb{x} - \overline{\pmb{x}}\right)\\ &=\pmb{f}(\pmb{x}) - \pmb{f}(\overline{\pmb{x}})\\ &=\nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta} + \mathcal{O}\left(\parallel \pmb{\delta}\parallel^2\right) \end{align} \] where \( \nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}}) \) is the Jacobian of \( \pmb{f} \), with dependence on the underlying mean state.
In particular, the linear evolution defined by the truncated Taylor expansion
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \pmb{\delta}:= \nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta} \end{align} \] is known as the tangent-linear model.
Making the approximation of the tangent-linear model,
\[ \begin{align} &\frac{\mathrm{d}}{\mathrm{d}t} \pmb{x} \approx \pmb{f}(\overline{\pmb{x}}) + \nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta}\\ \\ \Rightarrow&\int_{t_{k-1}}^{t_k}\frac{\mathrm{d}}{\mathrm{d}t}\pmb{x}\mathrm{d}t \approx \int_{t_{k-1}}^{t_k} \pmb{f}(\overline{\pmb{x}}) \mathrm{d}t + \int_{t_{k-1}}^{t_k}\nabla_{\pmb{x}}\pmb{f}(\overline{\pmb{x}})\pmb{\delta}\mathrm{d}t \\ \\ \Rightarrow & \pmb{x}_{k} \approx \mathcal{M}_{k}\left(\overline{\pmb{x}}_{k-1}\right) + \mathbf{M}_k\pmb{\delta}_{k-1} \end{align} \] where \( \mathbf{M}_k \) is the resolvent of the tangent-linear model.
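As a toy illustration (not from the talk), the sketch below propagates a mean state and a perturbation with a forward-Euler discretization of the tangent-linear model; the Lorenz-96 equations are assumed purely as an example vector field \( \pmb{f} \) with a hand-coded Jacobian.

```julia
# Toy illustration: forward-Euler propagation of a mean state and a tangent-linear
# perturbation, using the Lorenz-96 equations as an assumed example of f.
function l96(x::Vector{Float64}; F::Float64=8.0)
    n = length(x)
    [(x[mod1(i + 1, n)] - x[mod1(i - 2, n)]) * x[mod1(i - 1, n)] - x[i] + F for i in 1:n]
end

function l96_jacobian(x::Vector{Float64})
    n = length(x)
    J = zeros(n, n)
    for i in 1:n
        J[i, mod1(i + 1, n)] += x[mod1(i - 1, n)]
        J[i, mod1(i - 2, n)] -= x[mod1(i - 1, n)]
        J[i, mod1(i - 1, n)] += x[mod1(i + 1, n)] - x[mod1(i - 2, n)]
        J[i, i]              -= 1.0
    end
    return J
end

# Integrate the mean with the nonlinear model and the perturbation with the
# tangent-linear model over [0, T]; the returned delta approximates M_k * delta.
function tlm_propagate(xbar::Vector{Float64}, delta::Vector{Float64};
                       dt::Float64=0.001, T::Float64=0.05)
    for _ in 1:round(Int, T / dt)
        delta = delta + dt * (l96_jacobian(xbar) * delta)  # tangent-linear step
        xbar  = xbar + dt * l96(xbar)                      # nonlinear mean step
    end
    return xbar, delta
end
```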
Because Gaussians are closed under affine transformations, approximating the evolution with the tangent-linear model gives
\[ \begin{align} \pmb{x}_{k} \sim N\left(\mathcal{M}_k\left(\overline{\pmb{x}}_{k-1}\right), \mathbf{M}_k \mathbf{B}_{k-1}\mathbf{M}^\top_{k}\right) \end{align} \]
Therefore, the quadratic cost function is approximated by an incremental linearization along the background mean
\[ \begin{alignat}{2} & & {\color{#d95f02} {\mathcal{J} (\pmb{w})} } &= {\color{#d95f02} {\frac{1}{2} \parallel \pmb{w} \parallel^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathcal{H}_k\circ {\color{#1b9e77} { \mathcal{M}_{k:1} \left( {\color{#d95f02} {\overline{\pmb{x}}^\mathrm{smth}_{0|L-S} } }\right)}} - \mathbf{H}_k {\color{#1b9e77} {\mathbf{M}_{k:1}}} {\color{#d95f02} {\boldsymbol{\Sigma}^\mathrm{smth}_{0|L-S} \pmb{w} } } \parallel_{\mathbf{R}_k}^2 } } \end{alignat} \] describing an approximate linear-Gaussian model / cost function in the space of perturbations.
From the last slide, a very efficient evaluation of the gradient can be performed using the adjoint of the tangent-linear model.
In particular, the adjoint variables are defined by a backward-in-time solution to the linear equation
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}t} \tilde{\pmb{\delta}} = -\left(\nabla_{\pmb{x}} \pmb{f}(\overline{\pmb{x}})\right)^\top \tilde{\pmb{\delta}}, \end{align} \] with the underlying dependence on the nonlinear solution over the time interval.
Therefore, in incremental 4D-VAR, one constructs the gradient of the objective function, differentiating the nonlinear model, via a forward solution of the tangent-linear model and a backward-in-time solution of the adjoint model.
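A minimal sketch of this back-propagation is given below for a discrete-in-time setting: observation-space sensitivities are injected at each time and carried backward through the transposed (adjoint) resolvents. The background term of the gradient is omitted, and all names are illustrative.

```julia
# Minimal sketch: accumulate the observation term of the 4D-VAR gradient by
# back-propagating sensitivities through the transposed (adjoint) resolvents.
# Ms[k] is the assumed tangent-linear resolvent from t_{k-1} to t_k, Hs[k] the
# observation operator, and d[k] = R_k^{-1}(y_k - H_k x_k) the weighted innovation.
function adjoint_gradient(Ms::Vector{Matrix{Float64}}, Hs::Vector{Matrix{Float64}},
                          d::Vector{Vector{Float64}})
    L = length(Ms)
    delta_tilde = zeros(size(Ms[1], 2))             # adjoint variable, run backward in time
    for k in L:-1:1
        delta_tilde = delta_tilde + Hs[k]' * d[k]   # inject the sensitivity at t_k
        delta_tilde = Ms[k]' * delta_tilde          # back-propagate to t_{k-1}
    end
    return -delta_tilde    # observation term of the gradient (background term omitted)
end
```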
This is a very effective and efficient solution, but it relies on the construction of the tangent-linear and adjoint models for the dynamics.
One alternative to constructing the tangent-linear and adjoint models is to perform a hybrid, ensemble-variational (EnVAR) analysis as based on the ETKF.
This approach is at the basis of the iterative ensemble Kalman filter / smoother (IEnKF/S)11.
This technique seeks to perform an ensemble analysis like the square root ETKF by defining the ensemble estimates and the weight vector in the ensemble span
\[ \begin{alignat}{2} & & {\color{#d95f02} {\widetilde{\mathcal{J}} (\pmb{w})} } &= {\color{#d95f02} {\frac{1}{2} \parallel \hat{\pmb{x}}_{0|L-S}^\mathrm{smth} - \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w}- \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} \parallel_{\mathbf{P}^\mathrm{smth}_{0|L-S}}^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathcal{H}_k\circ {\color{#1b9e77} { \mathcal{M}_{k:1}\left( {\color{#d95f02} { \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} + \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w} } } \right)}}\parallel_{\mathbf{R}_k}^2 } }\\ \Leftrightarrow & & {\color{#d95f02} {\widetilde{\mathcal{J}} (\pmb{w})} } &= {\color{#d95f02} {(N_e - 1) \frac{1}{2} \parallel \pmb{w}\parallel^2} } + {\color{#7570b3} {\sum_{k=L-S+1}^L \frac{1}{2} \parallel \pmb{y}_k - \mathcal{H}_k\circ {\color{#1b9e77} { \mathcal{M}_{k:1}\left( {\color{#d95f02} { \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} + \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w} } } \right)}}\parallel_{\mathbf{R}_k}^2 } } \end{alignat} \]
One measures the cost as the discrepancy from the observations with the nonlinear evolution of the perturbation to the ensemble mean,
\[ \begin{align} \hat{\pmb{x}}^\mathrm{smth}_{0|L-S} + \mathbf{X}^\mathrm{smth}_{0|L-S} \pmb{w} \end{align} \] combined with the size of the perturbation relative to the ensemble spread.
The key, again, is how the gradient of the above cost function is computed.
The gradient of the ensemble-based cost function is given by,
\[ \begin{align} {\color{#d95f02} {\nabla_{\pmb{w}} \widetilde{\mathcal{J}} } }:= {\color{#d95f02} {\left(N_e - 1\right) \pmb{w}}} - {\color{#7570b3} {\sum_{k=L-S+1}^L \widetilde{\mathbf{Y}}_k^\top \mathbf{R}^{-1}_k\left[\pmb{y}_k - \mathcal{H}_k \circ {\color{#1b9e77} { \mathcal{M}_{k:1}\left({\color{#d95f02} {\hat{\pmb{x}}_{0|L-S}^\mathrm{smth} + \mathbf{X}_{0|L-S}^\mathrm{smth} \pmb{w}} }\right) } } \right]}}, \end{align} \]
where \( {\color{#7570b3} { \widetilde{\mathbf{Y}}_k } } \) represents a directional derivative of the observation and state models,
\[ \begin{align} {\color{#7570b3} { \widetilde{\mathbf{Y}}_k } }:= {\color{#d95f02} {\nabla\vert_{\hat{\pmb{x}}^\mathrm{smth}_{0|L-S}} } } {\color{#7570b3} {\left[\mathcal{H}_k \circ {\color{#1b9e77} {\mathcal{M}_{k:1} } } \right] } } {\color{#d95f02} {\mathbf{X}^\mathrm{smth}_{0|L-S}} }. \end{align} \]
In order to avoid the construction of the tangent-linear and adjoint models, the “bundle” version11 makes an explicit finite-difference approximation with the ensemble
\[ \begin{align} {\color{#7570b3} { \widetilde{\mathbf{Y}}_k } }\approx& {\color{#7570b3} { \frac{1}{\epsilon} \mathcal{H}_k \circ {\color{#1b9e77} {\mathcal{M}_{k:1} \left( {\color{#d95f02} { \hat{\pmb{x}}_{0|L-S}^\mathrm{smth} \pmb{1}^\top + \epsilon \mathbf{X}_{0|L-S}^\mathrm{smth} } }\right) } } \left(\mathbf{I}_{N_e} - \pmb{1}\pmb{1}^\top / N_e \right)} }, \end{align} \] for a small constant \( \epsilon \).
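A brief sketch of this bundle approximation is given below, assuming a composed map \( \mathcal{H}_k \circ \mathcal{M}_{k:1} \) is available as a column-wise function; the name `hk_mk` and the interface are hypothetical.

```julia
using LinearAlgebra

# Brief sketch of the bundle approximation of Y_k. The composed, column-wise map
# hk_mk(x) ≈ H_k ∘ M_{k:1}(x) is a hypothetical user-supplied function; xhat and X
# are the smoother ensemble mean and anomalies, and eps is a small constant.
function bundle_Y(hk_mk::Function, xhat::Vector{Float64}, X::Matrix{Float64}, eps::Float64)
    Ne = size(X, 2)
    Eb = xhat * ones(Ne)' + eps * X                     # bundle of scaled perturbations
    Yb = reduce(hcat, [hk_mk(Eb[:, j]) for j in 1:Ne])  # propagate and observe each column
    return (Yb * (I - ones(Ne, Ne) / Ne)) / eps         # re-center and rescale
end
```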
The scheme produces an iterative estimate using a Gauss-Newton-based11 or, e.g., Levenberg-Marquardt-based12 optimization.
A similar scheme used more commonly in reservoir modeling is the ensemble randomized maximum likelihood estimator (EnRML).13
While accuracy increases with iterations in the 4D-MAP estimate, every iteration comes at the cost of an additional model forecast \( {\color{#1b9e77} { \mathcal{M}_{L:1} } } \).
In synoptic meteorology, the linear-Gaussian approximation of the evolution of the densities is in fact adequate;
However, the iterative optimization over a nonlinear observation operator \( \mathcal{H}_k \) or hyper-parameters in the filtering step of the classical EnKS can be run without the additional cost of model forecasts.
Subsequently, the retrospective analysis in the form of the filtering right-transform can be applied to condition the initial ensemble
\[ \begin{align} \mathbf{E}^\mathrm{smth}_{0|L} = \mathbf{E}_{0|L-1}^\mathrm{smth} \boldsymbol{\Psi}_L \end{align} \]
As with the 4D cost function, one can initialize the next DA cycle in terms of the retrospective analysis, and gain the benefit of the improved initial estimate.
This scheme, currently in open review, is the single-iteration ensemble Kalman smoother (SIEnKS).15
These results and a wide variety of other test cases for short-range forecast systems are presented in our manuscript A fast, single-iteration ensemble Kalman smoother for sequential data assimilation in open review.15
However, our mathematical results are supported by extensive numerical demonstration, with the Julia package DataAssimilationBenchmarks.jl.20
The ETKS, (Lin-)IEnKS, and SIEnKS have pseudo-code provided in the manuscript, and implementations in Julia.
We introduce our novel scheme, validating its performance advantages for short-range forecast cycles.
In surveying a variety of schemes, our key point is that the Bayesian MAP analysis can be decomposed in a variety of ways to exploit the operational problem.
Traditional 4D-MAP approaches have largely exploited this analysis for moderately-to-highly nonlinear forecast error dynamics.
However, with a change of perspective, this analysis can similarly be exploited for short-range prediction cycles as in the SIEnKS.
This leads to a simple, outer-loop optimization of the DA cycle itself.