Hidden Markov Models and the Bootstrap Particle Filter

11/25/2020

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website to advance teaching, and copyright-protected materials are used only to the extent necessary to make this class function in a distance-learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

  • The following topics will be covered in this lecture:
    • What is “data assimilation”?
      • Observation-Analysis-Forecast cycles
    • The Bayesian framework for data assimilation
      • Hidden Markov models
    • A simple computational method
      • The bootstrap particle filter

What is “data assimilation”?

Image of global atmospheric circulation.

Courtesy of: Kaidor via Wikimedia Commons (CC 3.0)

  • Data assimilation is a science at the intersection of statistics, physics and applied mathematics.
  • Data assimilation is differentiated by the use of a physics-based process model for the time-evolution of the variables we wish to estimate.
  • We should think of this model as an imperfect representation of the evolution, with uncertainty.
  • On the other hand, suppose we have limited and inaccurate real-world observations of related quantities.
  • The goal of data assimilation:
    • we want to combine the data from:
      1. real-world observations; and
      2. a physics-based model
    • to produce an “optimal” estimate of the physical quantity, its related parameters and the uncertainty therein.

What is “data assimilation”? (continued)

Image of global atmospheric circulation.

Courtesy of: Kaidor via Wikimedia Commons (CC 3.0)

  • Let’s suppose that we have a physics-based model for some process – prototypically we can think of the weather.
  • This model encapsulates our understanding, from first physical principles, of the equations of motion for various time-varying states.
  • This model can represent, e.g.,
    • the atmosphere in terms of:
      1. temperature,
      2. pressure and
      3. humidity
    • over Reno and the western USA.
    • The evolution of these states is described by a system of PDEs, discretized spatially over three dimensions of the atmosphere and surface topography.

What is “data assimilation”? (continued)

Image of global atmospheric circulation.

Courtesy of: Kaidor via Wikimedia Commons (CC 3.0)

  • Let’s suppose that these dynamical (time-varying) states can be written in a vector, \( \mathbf{x}_k \in \mathbb{R}^n \), where \( k \) corresponds to some time \( t_k \).
  • Abstractly, we will represent the time-evolution of these states with the nonlinear map \( \mathcal{M} \), \[ \mathbf{x}_k = \mathcal{M}_{k:k-1} \left(\mathbf{x}_{k-1}, \boldsymbol{\lambda}\right) + \boldsymbol{\eta}_k \] where
    • \( \mathbf{x}_{k-1} \) is the vector of states at an earlier time \( t_{k-1} \);
    • \( \boldsymbol{\lambda} \) is a vector of uncertain physical parameters on which the evolution depends, e.g., energy loss due to friction of wind on the Sierra.
    • \( \boldsymbol{\eta}_k \) is an additive, stochastic noise term, representing errors in our model for the physical process.
  • The states \( \mathbf{x}_k \) are the values we wish to estimate, having a prior distribution on \( \left(\mathbf{x}_{k-1}, \boldsymbol{\lambda}\right) \) and knowledge of \( \mathcal{M}_{k:k-1} \) and of how \( \boldsymbol{\eta}_k \) are distributed.
  • At time \( t_{k-1} \), we will make a forecast for the distribution of \( \mathbf{x}_k \) with our prior knowledge, including the physics-based model.
  • For the rest of this lecture, we will restrict our consideration to the case that \( \boldsymbol{\lambda} \) is a known constant for simplicity.
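The forecast step above can be sketched numerically. The following is a minimal sketch, assuming a hypothetical, stand-in nonlinear map for \( \mathcal{M}_{k:k-1} \) (not any particular atmospheric model) and Gaussian model noise \( \boldsymbol{\eta}_k \):

```python
import numpy as np

rng = np.random.default_rng(0)

def M(x):
    # Hypothetical stand-in for the physics-based model M_{k:k-1}:
    # a simple nonlinear map, chosen only for illustration.
    return x + 0.1 * np.sin(x)

def step(x_prev, model_err_std=0.05):
    # One forecast step: x_k = M(x_{k-1}) + eta_k, with eta_k ~ N(0, sigma^2 I).
    eta = model_err_std * rng.standard_normal(x_prev.shape)
    return M(x_prev) + eta

x = np.zeros(3)        # state vector x_0 in R^n, here n = 3
for k in range(10):    # evolve the state ten steps forward in time
    x = step(x)
```

Iterating `step` propagates the uncertain initial state forward, exactly as in the forecast phase of the cycle.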
What is “data assimilation”? (continued)

    Image of global atmospheric circulation.

    Courtesy of: Kaidor via Wikimedia Commons (CC 3.0)

    • We suppose that we are also given real-world observations \( \mathbf{y}_k\in\mathbb{R}^d \) related to the physical states by, \[ \mathbf{y}_k = \mathcal{H}_k \left(\mathbf{x}_k\right) + \boldsymbol{\epsilon}_k \] where
      • \( \mathcal{H}_k:\mathbb{R}^n \rightarrow \mathbb{R}^d \) is a nonlinear map relating the states we wish to estimate \( \mathbf{x}_k \) to the values that are observed \( \mathbf{y}_k \);
        • e.g., we may wish to estimate humidity, but we only observe the back-scatter of light from a satellite’s laser as it hits particulate in the atmosphere.
        • Typically \( d \ll n \), so this information is sparse and observations are not \( 1:1 \) with the unobserved states.
      • \( \boldsymbol{\epsilon}_k \) is an additive, stochastic noise term representing errors in the measurements.
    • Therefore, at time \( t_k \) we will have a forecast distribution for the states \( \mathbf{x}_k \) generated by our prior on \( \mathbf{x}_{k-1} \) and our physics-based model \( \mathcal{M} \);
    • We will also have an observation \( \mathbf{y}_k \) with uncertainty.
    • We wish to find a posterior distribution for \( \mathbf{x}_k \) conditioned on \( \mathbf{y}_k \).
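The observation model can be sketched in the same style. This is a minimal sketch, assuming a hypothetical nonlinear observation operator \( \mathcal{H}_k \) that observes only a scalar function of the first two state components, so that \( d \ll n \):

```python
import numpy as np

rng = np.random.default_rng(1)

def H(x):
    # Hypothetical nonlinear observation operator H_k: R^n -> R^d with d << n,
    # here mapping the n = 3 hidden states to a single observed quantity.
    return np.array([x[0] ** 2 + x[1]])

def observe(x, obs_err_std=0.1):
    # y_k = H(x_k) + eps_k, with Gaussian measurement noise eps_k.
    eps = obs_err_std * rng.standard_normal(1)
    return H(x) + eps

x_true = np.array([1.0, 2.0, 3.0])  # hidden state x_k
y = observe(x_true)                 # sparse, noisy observation y_k
```

Here a three-dimensional hidden state yields only a one-dimensional, noise-corrupted observation, mirroring the sparsity discussed above.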

    Observation-analysis-forecast cycle

    Diagram of the filter observation-analysis-forecast cycle.

    From: Carrassi, A. et al. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9.5 (2018): e535.

    • Recursive estimation of the distribution for \( \mathbf{x}_k \) conditional on \( \mathbf{y}_k \) can be described as an:
      1. observation
      2. analysis
      3. forecast
    • cycle.
    • We assume that we have an initial forecast-prior for the physics-based numerical model state.
    • Suppose that the current time is \( t_{k-1} \) and we will make a prediction of the atmosphere at time \( t_k \).
    • The prior is generated from the simulation of the geophysical fluid dynamics.
    • Let’s suppose this gives an estimate of the atmosphere \( t_{k} - t_{k-1} =6 \) hours in the future.

    Observation

    Diagram of the filter observation-analysis-forecast cycle.

    From: Carrassi, A. et al. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9.5 (2018): e535.

    • The “true” physical process evolves in time;
      • eventually, the physical state reaches the time \( t_k \) of the first forecast.
      • If we make our first forecast at time \( t_{k-1} \), this corresponds to our real atmosphere reaching the time of the forecast state, \( t_{k} \).
    • At this time, we receive an observation of the atmosphere’s state variables.

    Analysis

    Diagram of the filter observation-analysis-forecast cycle.

    From: Carrassi, A. et al. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9.5 (2018): e535.

    • The observation comes with a likelihood function;
      • we compute the likelihood of the observation given the model forecast.
    • Using the forecast-prior and the likelihood,
      • we estimate the Bayesian update of the prior to the posterior
      • conditioned on the observation.
    • The posterior distribution is denoted the analysis.

    Forecast

    Diagram of the filter observation-analysis-forecast cycle.

    From: Carrassi, A. et al. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9.5 (2018): e535.

    • The analysis is then used as the initialization of the next forecast.
    • Simulating the model forward in time, we produce a new forecast-prior.
    • Chaotic evolution (the butterfly effect) causes the model forecast to drift from the true state.
      • This is in addition to the errors in:
        1. accurately representing the true physical process with the imperfect numerical model; and
        2. estimating the “true” initial condition from incomplete and noisy observations.

    Observation

    Diagram of the filter observation-analysis-forecast cycle.

    From: Carrassi, A. et al. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9.5 (2018): e535.

    • When a new observation becomes available…

    Analysis

    Diagram of the filter observation-analysis-forecast cycle.

    From: Carrassi, A. et al. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9.5 (2018): e535.

    • We produce a new analysis

    Forecast

    Diagram of the filter observation-analysis-forecast cycle.

    From: Carrassi, A. et al. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9.5 (2018): e535.

    • And the cycle continues ad-infinitum…

    Hidden Markov models

    • Recall our physical process model and observation model, \[ \begin{align} \mathbf{x}_k &= \mathcal{M}_{k:k-1} \left(\mathbf{x}_{k-1}\right) + \boldsymbol{\eta}_k\\ \mathbf{y}_k &= \mathcal{H}_k \left(\mathbf{x}_k\right) + \boldsymbol{\epsilon}_k \end{align} \]
    • Let us denote the sequence of the process model states and observation model states as, \[ \begin{align} \mathbf{x}_{K:0} = \{\mathbf{x}_k &: k=0,\cdots, K\} \\ \mathbf{y}_{K:1} = \{\mathbf{y}_k &: k=1,\cdots, K\} \end{align} \]
    • We assume that the model error and observation error sequences \[ \begin{align} \{\boldsymbol{\eta}_k : k=1,\cdots, K\} & & \{\boldsymbol{\epsilon}_k : k=1,\cdots, K\} \end{align} \] will be:
      1. mutually independent;
      2. independent in time; and
      3. distributed according to \( P_\boldsymbol{\eta} \) and \( P_\boldsymbol{\epsilon} \) respectively.
    • In this case, we can write the following conditional probability distributions as, \[ \begin{align} P\left(\mathbf{x}_k \vert \mathbf{x}_{k-1} \right) = P_\boldsymbol{\eta}\big(\mathbf{x}_k - \mathcal{M}\left(\mathbf{x}_{k-1}\right)\big) & & P\left(\mathbf{y}_k \vert \mathbf{x}_{k} \right) = P_\boldsymbol{\epsilon}\big(\mathbf{y}_k - \mathcal{H}\left(\mathbf{x}_{k}\right)\big). \end{align} \]
    • The above formulation is known in some literature as a hidden Markov model;
      • the dynamic state variables \( \mathbf{x}_k \) are known as the hidden variables, because they are not directly observed;
      • the Markov property on the conditional probabilities is a direct consequence of the above assumptions.

    Hidden Markov models continued

    • Recall our physical process model and observation model, \[ \begin{align} \mathbf{x}_k &= \mathcal{M}_{k:k-1} \left(\mathbf{x}_{k-1}\right) + \boldsymbol{\eta}_k\\ \mathbf{y}_k &= \mathcal{H}_k \left(\mathbf{x}_k\right) + \boldsymbol{\epsilon}_k \end{align} \] and our conditional probability distributions \[ \begin{align} P\left(\mathbf{x}_k \vert \mathbf{x}_{k-1} \right) = P_\boldsymbol{\eta}\big(\mathbf{x}_k - \mathcal{M}\left(\mathbf{x}_{k-1}\right)\big) & & P\left(\mathbf{y}_k \vert \mathbf{x}_{k} \right) = P_\boldsymbol{\epsilon}\big(\mathbf{y}_k - \mathcal{H}\left(\mathbf{x}_{k}\right)\big). \end{align} \]
    • The independence in time assumption on the model stochasticity gives Markovian probability distributions for the hidden variables, i.e., \[ \begin{align} P\left(\mathbf{x}_{K:0}\right) &= P\left(\mathbf{x}_{K} \vert \mathbf{x}_{K-1: 0}\right) P\left(\mathbf{x}_{K-1:0}\right)\\ &= P\left(\mathbf{x}_{K} \vert \mathbf{x}_{K-1}\right) P\left(\mathbf{x}_{K-1:0} \right). \end{align} \]
    • Applying the Markovian property recursively, we have that, \[ P\left(\mathbf{x}_{K:0}\right) = P(\mathbf{x}_0) \prod_{k=1}^{K} P_\boldsymbol{\eta}\big(\mathbf{x}_k - \mathcal{M}\left(\mathbf{x}_{k-1}\right)\big) \]
    • Similarly, we can show that, \[ P\left(\mathbf{y}_{k}\vert \mathbf{x}_k, \mathbf{y}_{k-1:1}\right) = P\left(\mathbf{y}_k \vert \mathbf{x}_k\right), \] by the independence assumptions on the model and observation errors.

    Hidden Markov models continued

    • Given our physical process model and observation model, \[ \begin{align} \mathbf{x}_k &= \mathcal{M}_{k:k-1} \left(\mathbf{x}_{k-1}\right) + \boldsymbol{\eta}_k\\ \mathbf{y}_k &= \mathcal{H}_k \left(\mathbf{x}_k\right) + \boldsymbol{\epsilon}_k \end{align} \]
    • The goal of the observation-analysis-forecast cycle is thus to estimate the distribution \( P\left(\mathbf{x}_{K} \vert \mathbf{y}_{K:1}\right) \).
    • Using the definition of conditional probability, we have \[ \begin{align} P\left(\mathbf{x}_{K} \vert \mathbf{y}_{K:1}\right) &= \frac{P\left(\mathbf{y}_{K:1},\mathbf{x}_K \right)}{ P\left(\mathbf{y}_{K:1}\right)}. \end{align} \]
    • Using the independence assumptions, we can thus write \[ \begin{align} P\left(\mathbf{x}_{K} \vert \mathbf{y}_{K:1}\right) &=\frac{P\big(\mathbf{y}_K, (\mathbf{x}_K, \mathbf{y}_{K-1:1})\big)}{P\left(\mathbf{y}_{K:1}\right)}\\ &=\frac{P\left(\mathbf{y}_{K}\vert \mathbf{x}_K, \mathbf{y}_{K-1:1}\right) P\left(\mathbf{x}_K, \mathbf{y}_{K-1:1}\right)}{P\left(\mathbf{y}_{K:1}\right)} =\frac{P\left(\mathbf{y}_{K}\vert \mathbf{x}_K\right) P\left(\mathbf{x}_K, \mathbf{y}_{K-1:1}\right)}{P\left(\mathbf{y}_{K:1}\right)} \end{align} \]
    • Finally, writing the joint distributions in terms of conditional distributions we have \[ \begin{align} P\left(\mathbf{x}_{K} \vert \mathbf{y}_{K:1}\right) &=\frac{P\left(\mathbf{y}_K \vert \mathbf{x}_K\right) P\left(\mathbf{x}_K\vert \mathbf{y}_{K-1:1}\right) P\left(\mathbf{y}_{K-1:1}\right)}{P\left(\mathbf{y}_K\vert \mathbf{y}_{K-1:1}\right) P\left(\mathbf{y}_{K-1:1}\right)} \\ \\ &=\frac{P\left(\mathbf{y}_K \vert \mathbf{x}_K\right) P\left(\mathbf{x}_K\vert \mathbf{y}_{K-1:1}\right)}{P\left(\mathbf{y}_K\vert \mathbf{y}_{K-1:1}\right)} \end{align} \]

    Hidden Markov models continued

    • From the last slide, the observation-analysis-forecast cycle was written as \[ \begin{align} {\color{red} {P\left(\mathbf{x}_{K} \vert \mathbf{y}_{K:1}\right)} } &=\frac{ {\color{blue} { P\left(\mathbf{y}_K \vert \mathbf{x}_K\right) } } {\color{green} { P\left(\mathbf{x}_K\vert \mathbf{y}_{K-1:1}\right) } } }{P\left(\mathbf{y}_K\vert \mathbf{y}_{K-1:1}\right)}. \end{align} \]
    • Then notice that, in terms of Bayes' law, this is given by the following terms:
      • \( {\color{red} {P\left(\mathbf{x}_{K} \vert \mathbf{y}_{K:1}\right)}} \) – this is the posterior estimate for the hidden states (at the current time) given all observations in a time series \( \mathbf{y}_{K:1} \);
      • \( {\color{green} {P\left(\mathbf{x}_K\vert \mathbf{y}_{K-1:1}\right)}} \) – this is the model forecast-prior from our last best estimate of the state.
        • That is, suppose that we had computed the posterior probability density \( p(\mathbf{x}_{K-1} \vert \mathbf{y}_{K-1:1}) \) at the last observation time \( t_{K-1} \);
        • then the model forecast probability density is given by, \[ {\color{green} {p(\mathbf{x}_K\vert \mathbf{y}_{K-1:1})} } = \int p(\mathbf{x}_K \vert \mathbf{x}_{K-1}) {\color{red} {p(\mathbf{x}_{K-1} \vert \mathbf{y}_{K-1:1})} }\mathrm{d}\mathbf{x}_{K-1}. \]
        • In theory, the above can be solved by computing a system of PDEs known as the Fokker-Planck equations or the forward Kolmogorov equations, with the initial data given by the last Bayesian posterior density \( {\color{red} {p(\mathbf{x}_{K-1} \vert \mathbf{y}_{K-1:1})} } \).
        • This corresponds “physically” to evolving the posterior density in the hidden variables \( \mathbf{x} \) with respect to the nonlinear, stochastic differential equations \( \mathcal{M} + \eta \), which defines the Markov transition kernel above.
      • \( {\color{blue} {P\left(\mathbf{y}_{K}\vert \mathbf{x}_{K}\right)}} \) – this is the likelihood of the observed data given our model forecast;
      • \( P\left(\mathbf{y}_K \vert \mathbf{y}_{K-1:1}\right) \) is the marginal of joint distribution \( P( \mathbf{y}_K, \mathbf{x}_K \vert \mathbf{y}_{K-1:1}) \) integrating out the hidden variable, with density \[ p\left(\mathbf{y}_K \vert \mathbf{y}_{K-1:1}\right) = \int p(\mathbf{y}_K \vert \mathbf{x}_K) p(\mathbf{x}_K \vert \mathbf{y}_{K-1:1})\mathrm{d}\mathbf{x}_{K}. \]
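The forecast-density integral above can be approximated directly on a grid in low dimensions. The following is a minimal sketch, assuming a hypothetical one-dimensional example with a linear model \( \mathcal{M}(x) = 0.9x \) and Gaussian model error, so the Markov transition density is Gaussian:

```python
import numpy as np

# 1-D grid approximation of the forecast integral
#   p(x_K | y_{K-1:1}) = ∫ p(x_K | x_{K-1}) p(x_{K-1} | y_{K-1:1}) dx_{K-1},
# with a hypothetical linear model M(x) = 0.9 x and Gaussian errors.
grid = np.linspace(-5.0, 5.0, 401)
dx = grid[1] - grid[0]

def gauss(x, mean, std):
    # Gaussian probability density function.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Last posterior density p(x_{K-1} | y_{K-1:1}), assumed N(1, 0.5^2).
posterior_prev = gauss(grid, 1.0, 0.5)

# Transition density p(x_K | x_{K-1}) evaluated on the grid (rows: x_K).
transition = gauss(grid[:, None], 0.9 * grid[None, :], 0.3)

# Riemann-sum approximation of the integral: the forecast-prior density.
forecast = transition @ posterior_prev * dx
```

The resulting `forecast` array approximates the forecast-prior density, and its grid-sum integrates to one up to discretization error.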

    Hidden Markov models continued

    Spaghetti plot of jetstream.

    Courtesy of: Environmental Modeling Center / Public domain

    • Typically, the probability density for the denominator \[ p\left(\mathbf{y}_K \vert \mathbf{y}_{K-1:1}\right) = \int p(\mathbf{y}_K \vert \mathbf{x}_K) p(\mathbf{x}_K \vert \mathbf{y}_{K-1:1})\mathrm{d}\mathbf{x}_{K} \] is mathematically intractable;
      • However, the denominator is independent of the hidden variable \( \mathbf{x}_{K} \);
      • its purpose is only to normalize the integral of the posterior density to \( 1 \) over all \( \mathbf{x}_{K} \).
    • Instead, we will usually be satisfied with computing empirical, sample-based estimates from a sample drawn from the posterior.
    • This says that the posterior of the observation-analysis-forecast cycle \[ \begin{align} {\color{red} {P\left(\mathbf{x}_{K} \vert \mathbf{y}_{K:1}\right)}} &\propto {\color{blue} {P\left(\mathbf{y}_K \vert \mathbf{x}_K\right)}} {\color{green} {P\left(\mathbf{x}_K\vert \mathbf{y}_{K-1:1}\right)} } \end{align} \] can be computed recursively from the last posterior, up to proportionality.
    • Therefore, in a Bayesian framework, data assimilation seeks to sample the posterior distribution above to compute empirical statistics from the ensemble-members.
    • This was not the original motivation, but it has become one of the bases for computing ensemble forecasts and spaghetti plots of the weather, as above.

    Sampling the posterior

    • Suppose that somehow we had access to the posterior \( P(\mathbf{x}_K\vert \mathbf{y}_{K:1}) \).
    • If we draw \( N \) independent, identically distributed (iid) ensemble members from this distribution, \( \{\mathbf{x}_K^i\}_{i=1}^N \), an empirical representation of the distribution is given by \[ \begin{align} P_N(\mathbf{x}_K\vert \mathbf{y}_{K:1}) = \frac{1}{N} \sum_{i=1}^N \boldsymbol{\delta}_{\mathbf{x}^i_K}\left(\mathrm{d}\mathbf{x}_K\right), \end{align} \] where
      • \( \boldsymbol{\delta}_{\mathbf{x}^i_K} \) – this is the Dirac delta.
        • Heuristically, this is known as the “function” which has the property \[ \delta_{\mathbf{x}_K^i}(\mathbf{x}) = \begin{cases} +\infty & \mathbf{x} = \mathbf{x}_K^i \\ 0 & \text{else}\end{cases}; \]
        • More formally this is defined by the property, \[ \int f(x) \boldsymbol{\delta}_{\mathbf{x}_K^i}\left(\mathrm{d}\mathbf{x}_K\right) = f\left(\mathbf{x}_K^i\right); \] the Dirac delta is a singular measure, not a traditional function but it can be understood by the integral equation.
      • In the above, the factor \( \frac{1}{N} \) represents that all point volumes have equal mass or weight, so that the total density integrates to one.
    • Then for any statistic \( f \) of the posterior, we can recover an estimate of its expected value directly as \[ \begin{align} \mathbb{E}_{P(\mathbf{x}_K\vert \mathbf{y}_{K:1})} \left[f \right] \triangleq\int f(\mathbf{x}_K) p(\mathbf{x}_K\vert \mathbf{y}_{K:1} ) \mathrm{d}\mathbf{x}_K \approx \int f(\mathbf{x}_K) P_N(\mathbf{x}_K\vert \mathbf{y}_{K:1}) &= \frac{1}{N}\sum_{i=1}^N \int f(\mathbf{x}_K)\delta_{\mathbf{x}^i_K}(\mathrm{d}\mathbf{x}_K) = \frac{1}{N} \sum_{i=1}^N f\left(\mathbf{x}_K^i\right). \end{align} \]
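The empirical estimate above is easy to demonstrate numerically. A minimal sketch, assuming (hypothetically) that the posterior is a standard-form Gaussian \( N(2, 1) \), so that \( \mathbb{E}[x^2] = \mu^2 + \sigma^2 = 5 \) exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw N iid ensemble members from the (hypothetical) posterior N(2, 1).
N = 100_000
samples = rng.normal(loc=2.0, scale=1.0, size=N)

# Statistic of interest: f(x) = x^2, with exact expectation 2^2 + 1 = 5.
f = lambda x: x ** 2

# Empirical estimate: (1/N) sum_i f(x^i).
estimate = np.mean(f(samples))
```

With this sample size the empirical mean lies close to the exact value 5, and by the variance result on the next slide the error shrinks like \( 1/\sqrt{N} \).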

    Empirical estimates

    • The empirical estimate, \[ \mathbb{E}_{P(\mathbf{x}_K\vert \mathbf{y}_{K:1})} \left[f \right] \approx \frac{1}{N} \sum_{i=1}^N f\left(\mathbf{x}_K^i\right), \] is also an unbiased estimator of \( \mathbb{E}_{P(\mathbf{x}_K\vert \mathbf{y}_{K:1})}\left[f\right] \).
    • If the posterior variance of \( f(\mathbf{x}) \) satisfies, \[ \begin{align} \sigma^2_f = \mathbb{E}_{P(\mathbf{x}_K \vert \mathbf{y}_{K:1})}\left[f^2(\mathbf{x})\right] - \left(\mathbb{E}_{P(\mathbf{x}_K\vert \mathbf{y}_{K:1})}\left[f\right]\right)^2 < \infty; \end{align} \]
    • then the variance of the empirical estimate is given by \[ \begin{align} \mathrm{var}\left(\mathbb{E}_{P_N(\mathbf{x}\vert \mathbf{y})}\left[f\right]\right) = \frac{\sigma_f^2}{N}, \end{align} \] where the variance is understood as taken over the possible sample outcomes, \( \mathbf{x}_K^i \sim P(\mathbf{x}_K\vert \mathbf{y}_{K:1}) \).
    • For \( \sigma_f^2 < \infty \) as above, the Central Limit Theorem tells us \[ \begin{align} \sqrt{N}\left\{ \mathbb{E}_{P_N(\mathbf{x}_K\vert \mathbf{y}_{K:1})}\left[f\right] - \mathbb{E}_{P(\mathbf{x}_K\vert \mathbf{y}_{K:1})}\left[f\right]\right\} \xrightarrow{d} N(0, \sigma_f^2) \end{align} \] as \( N\rightarrow +\infty \),
    • i.e., the empirical distribution converges to the true distribution in the weak sense as \( N \) gets sufficiently large.

    Importance sampling

    • In practice, we often cannot sample the posterior directly but we may need to sample some other distribution that shares its support.
    • This idea of sampling another distribution with shared support is known as importance sampling.
    • We will suppose that we have access, perhaps not to \( P(\mathbf{x}_K\vert \mathbf{y}_{K:1}) \) but instead \( Q(\mathbf{x}_{K}\vert \mathbf{y}_{K:1}) \) such that \( P \ll Q \),
      • i.e., \( P(A\vert \mathbf{y}_{K:1})>0 \Rightarrow Q(A\vert \mathbf{y}_{K:1})>0 \).
    • This above assumption allows us to take the Radon-Nikodym derivative of the true posterior \( P(\mathbf{x}_K\vert \mathbf{y}_{K:1}) \) with respect to the proposal distribution \( Q(\mathbf{x}_K \vert \mathbf{y}_{K:1}) \).
    • The key innovation relative to the last formulation is that this allows us to evaluate a statistic of the posterior by point volumes, but with non-equal importance weights.
    • Let us define the importance weight function \( w(\mathbf{x}_K \vert \mathbf{y}_{K:1}) \triangleq \frac{p(\mathbf{x}_K\vert \mathbf{y}_{K:1})}{q(\mathbf{x}_K\vert \mathbf{y}_{K:1})} \).
      • We will suppose that the weights can be computed even if \( P(\mathbf{x}_K\vert \mathbf{y}_{K:1}) \) is not available – this will be explained shortly.
    • Then we can re-write the expected value of some statistic \( f \) of the posterior density as, \[ \begin{align} \mathbb{E}_{P(\mathbf{x}_K\vert \mathbf{y}_{K:1})}[f]= \int f(\mathbf{x}_K)p(\mathbf{x}_K\vert \mathbf{y}_{K:1})\mathrm{d}\mathbf{x}_K & = \frac{\int f(\mathbf{x}_K)p(\mathbf{x}_K\vert \mathbf{y}_{K:1})\mathrm{d}\mathbf{x}_K}{\int p(\mathbf{x}_K\vert \mathbf{y}_{K:1})\mathrm{d}\mathbf{x}_K}\\ &=\frac{\int f(\mathbf{x}_{K})w(\mathbf{x}_K\vert \mathbf{y}_{K:1})q(\mathbf{x}_K\vert \mathbf{y}_{K:1})\mathrm{d}\mathbf{x}_K}{\int w(\mathbf{x}_K\vert \mathbf{y}_{K:1})q(\mathbf{x}_K\vert \mathbf{y}_{K:1})\mathrm{d}\mathbf{x}_K} , \end{align} \]
    • The above is shown by the definition of the importance weights above and because for any density, \[ \int p(\mathbf{x}_K\vert \mathbf{y}_{K:1})\mathrm{d}\mathbf{x}_K =1. \]
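The self-normalized importance-sampling identity above can be checked numerically. A minimal sketch, assuming a hypothetical target \( P = N(1, 0.5^2) \) and a broader proposal \( Q = N(0, 2^2) \), which shares its support:

```python
import numpy as np

rng = np.random.default_rng(3)

def gauss_pdf(x, mean, std):
    # Gaussian probability density function.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Draw iid samples from the proposal Q, not the target P.
N = 200_000
x = rng.normal(0.0, 2.0, size=N)

# Importance weights w(x) = p(x) / q(x), then normalized to sum to one.
w = gauss_pdf(x, 1.0, 0.5) / gauss_pdf(x, 0.0, 2.0)
w_tilde = w / w.sum()

# Weighted empirical estimate of E_P[x]; the exact target mean is 1.
estimate = np.sum(w_tilde * x)
```

Even though every sample is drawn from \( Q \), the weighted average recovers the mean of \( P \), illustrating the re-weighted point-volume view of the posterior.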

    Importance sampling continued

    • The benefit for sampling techniques is therefore that we can draw iid \( \mathbf{x}^i_K \sim Q(\mathbf{x}_K\vert \mathbf{y}_{K:1}) \) and write the empirically derived expected value of \( f \) as, \[ \begin{align} \mathbb{E}_{P_N(\mathbf{x}_K \vert \mathbf{y}_{K:1})}[f] = \frac{ \frac{1}{N} \sum_{i=1}^N f(\mathbf{x}^i_K)w(\mathbf{x}^i_K\vert \mathbf{y}_{K:1})}{\frac{1}{N} \sum_{i=1}^N w(\mathbf{x}^i_K \vert \mathbf{y}_{K:1})} = \sum_{i=1}^N f(\mathbf{x}^i_K) \tilde{w}^i_K. \end{align} \]
    • Here, the \( \tilde{w}^i_K \triangleq \frac{w(\mathbf{x}^i_K \vert \mathbf{y}_{K:1})}{\sum_{i=1}^N w(\mathbf{x}^i_K \vert \mathbf{y}_{K:1})} \) are defined as the normalized importance weights.
    • We will write our empirical estimate of the posterior as, \[ \begin{align} P_N(\mathbf{x}_K \vert \mathbf{y}_{K:1}) \triangleq \sum_{i=1}^N \tilde{w}^i_K\delta_{\mathbf{x}^i_K}(\mathrm{d}\mathbf{x}_K ). \end{align} \]
    • Notice, if we take \( f(\mathbf{x}) = 1 \), the expected value is one; this means that the weighted point volumes give a probability measure.
    • Using this formulation, we have an extremely flexible view of the posterior as a combination of positions and weights.
      • In a hidden Markov model, positions correspond to initial conditions for the nonlinear, stochastic differential equations \( \mathcal{M} + \eta \).
      • “Physically” we can evolve all the point volumes, keeping their weights, and construct a forward-in-time density.
    • Therefore for data assimilation, we have a natural choice of how to find the next prior from the last posterior;
      • we evolve the points and possibly find new weights, or resample positions when we condition on new observed information.
      • This is like a “weighted-spaghetti plot”.

    Sequential importance sampling

    • Using Bayes' Law, we can derive up to proportionality, \[ \begin{align} \tilde{w}^i_K \propto \tilde{w}^i_{K-1} \frac{ {\color{blue} {p\left(\mathbf{y}_K \vert \mathbf{x}^i_{K}\right) } } {\color{green} {p\left(\mathbf{x}^i_K \vert \mathbf{y}_{K-1:1}\right)}} }{q\left(\mathbf{x}_K^i \vert \mathbf{y}_{K-1:1} \right)}. \end{align} \]
    • As one special choice, we can choose a proposal of \( Q \) as the forecast-prior distribution \( {\color{green} {P(\mathbf{x}_{K}\vert \mathbf{y}_{K-1:1})} } \);
    • in this case, we have the weights given recursively by \[ \tilde{w}^i_k \propto \tilde{w}^i_{k-1} {\color{blue} {p(\mathbf{y}_k \vert \mathbf{x}_k^i)} }. \]
    • The proportionality statement says that:
      • Suppose we have knowledge of the normalized weights \( \tilde{w}^i_{K-1} \) at time \( t_{K-1} \);
      • we can generate a forecast for each position \( \mathbf{x}_{K-1}^i \) at time \( t_K \) via the model, \[ \mathbf{x}_K^i = {\color{green} {\mathcal{M}_{K:K-1}(}} \mathbf{x}_{K-1}^i {\color{green} {) + \boldsymbol{\eta}_K} }. \]
      • Prior to obtaining the new observation, the forecast weight will remain as \( \tilde{w}_{K-1}^i \);
      • to condition the weights on the new observation, we apply Bayes' Law, \[ \tilde{w}^i_k \propto \tilde{w}^i_{k-1} {\color{blue} {p(\mathbf{y}_k \vert \mathbf{x}_k^i)} }, \] such that we need only compute \( \tilde{w}^i_{k-1} {\color{blue} {p(\mathbf{y}_k \vert \mathbf{x}_k^i)} } \) for each \( i \) and then re-normalize the weights so they sum to \( 1 \).

    • Sequential importance sampling for estimating the posterior is extremely flexible and makes few assumptions on the form of the problem whatsoever;
    • however, the primary issue is that the importance weights become extremely skewed very quickly, leading to all the probability mass landing on a single point after only a few iterations.
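One cycle of sequential importance sampling with the forecast-prior proposal can be sketched as follows. This is a minimal sketch, assuming a hypothetical scalar model \( \mathcal{M}(x) = 0.9x \), an identity observation operator, and Gaussian errors:

```python
import numpy as np

rng = np.random.default_rng(4)

# Ensemble at time t_{K-1}: particle positions and normalized weights.
N = 500
particles = rng.normal(0.0, 1.0, size=N)   # x_{K-1}^i
weights = np.full(N, 1.0 / N)              # tilde-w_{K-1}^i

def M(x):
    # Hypothetical stand-in model for illustration.
    return 0.9 * x

# Forecast step: propagate each particle, x_K^i = M(x_{K-1}^i) + eta_K.
particles = M(particles) + 0.1 * rng.standard_normal(N)

# New scalar observation with identity observation operator H.
y = 0.8
obs_std = 0.5

# Analysis step: reweight by the observation likelihood (Gaussian, up to
# proportionality), then re-normalize so the weights sum to one.
likelihood = np.exp(-0.5 * ((y - particles) / obs_std) ** 2)
weights = weights * likelihood
weights /= weights.sum()
```

Repeating these two steps at each observation time gives the full sequential importance sampling recursion; the weight degeneracy noted above appears as the weight vector concentrating on few particles over many cycles.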

    The bootstrap filter

    • Finding a method for handling the degeneracy of the weights is explicitly the motivation for the bootstrap particle filter, and implicitly one of the motivations for the ensemble Kalman filter.
    • The method of the bootstrap filter proposes to eliminate the degeneracy of the weights by discarding ensemble members with weights close to zero and resampling.
    • At the point of the Bayesian update and re-weighting:
      1. eliminate all ensemble members with weights \( \tilde{w}^i < W \) where \( W\ll 1 \) will be some threshold for the weights;
      2. then, make replicates of the higher weighted ensemble members and reset the importance weights all equal to \( \frac{1}{N} \);
      3. the new empirical posterior is then given by, \[ \begin{align} P_N(\mathbf{x}_{K}\vert \mathbf{y}_{K:1}) = \frac{1}{N} \sum_{i=1}^N N^i \delta_{\mathbf{x}^i_K}(\mathrm{d}\mathbf{x}_K), \end{align} \] where \( N^i \) is the number of replicates \( \left(N^i\in[0, N]\right) \) of sample \( \mathbf{x}^i_K \) such that \( \sum_{i=1}^N N^i =N \)
      4. If the first prior sample is drawn iid, the first weights \( w_0^i \equiv \frac{1}{N} \); this gives a complete data assimilation algorithm for a hidden Markov model.
    • Note: how the number of replicates \( N^i \) is chosen is the basis of several different approaches to particle filters.
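The resampling step can be sketched with one simple choice of the replicate counts \( N^i \), multinomial resampling, where replicates are drawn in proportion to the weights. This is a minimal sketch with a hypothetical, deliberately degenerate toy ensemble:

```python
import numpy as np

rng = np.random.default_rng(5)

def resample(particles, weights):
    # Multinomial resampling: draw N^i replicates of each particle in
    # proportion to its weight, then reset all weights to 1/N. This is one
    # simple choice of the replicate counts N^i among several approaches.
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N)

# Degenerate toy weights: nearly all probability mass on one particle.
particles = np.array([0.0, 1.0, 2.0, 3.0])
weights = np.array([0.97, 0.01, 0.01, 0.01])

particles, weights = resample(particles, weights)
```

After resampling, low-weight members are mostly eliminated, the dominant member is replicated, and the uniform weights \( \frac{1}{N} \) restore a non-degenerate ensemble for the next forecast step.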

    Further reading

    • This is only meant to be a quick introduction to the subject of data assimilation and there are many aspects we haven’t discussed whatsoever.
    • More details on the work presented can be found in the following works consulted:
      1. Doucet, Arnaud, Nando De Freitas, and Neil Gordon. Sequential Monte Carlo methods in practice. Springer, New York, NY, 2001.
      2. Arulampalam, M. Sanjeev, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on signal processing 50, no. 2 (2002): 174-188.
      3. Carrassi, Alberto, Marc Bocquet, Laurent Bertino, and Geir Evensen. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9, no. 5 (2018): e535.
      4. Bocquet, Marc. Introduction to the principles and methods of data assimilation in the geosciences. Lecture notes. Version 0.32. 2019.
      5. Asch, Mark, Marc Bocquet, and Maƫlle Nodet. Data assimilation: methods, algorithms, and applications. Vol. 11. SIAM, 2016.
    • Many of these topics will appear in greater detail, with an emphasis on establishing the theoretical foundations linking dynamical systems, probability and machine learning, in our upcoming book:
      • Carrassi, Alberto, Colin Grudzien, Marc Bocquet, Alban Farchi, and Patrick Raanes. Data assimilation for dynamical systems and their discovery through machine learning. In preparation for Springer. 2022.