The bootstrap particle filter

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- A sampling approach to nonlinear estimation
- Empirical estimates of statistics
- Importance sampling
- The bootstrap filter and resampling strategies

Motivation

We have now begun to introduce some of the key concepts that bridge the estimation problem to nonlinear models.
Rather than the usual Gauss-Markov model, we may generally consider a system of equations

\[ \begin{align} \pmb{x}_k &= \mathcal{M}_k (\pmb{x}_{k-1}) + \pmb{w}_k \\ \pmb{y}_k &= \mathcal{H}_k (\pmb{x}_k) + \pmb{v}_k \end{align} \] where
- \( \mathcal{M}_k:\mathbb{R}^{N_x} \rightarrow \mathbb{R}^{N_x} \) is a nonlinear mapping that encompasses the equations of motion in time;
- \( \mathcal{H}_k: \mathbb{R}^{N_x} \rightarrow \mathbb{R}^{N_y} \) is a nonlinear mapping that transmits the hidden modeled states to the observed variables; and
- \( \{\pmb{w}_k\}_{k=1}^\infty \) and \( \{ \pmb{v}_k\}_{k=1}^\infty \) are mutually independent, white-in-time error sequences, with arbitrary distributions.
Note, even if the error distributions are Gaussian, the nonlinearity of the process model and observation model will deform the forecast and posterior distributions;
- therefore, in the presence of nonlinearity, the Kalman filter approach (in all of its variants) is a sub-optimal estimator, relying on biased assumptions.
While we have seen that the Gauss-Markov model can be used as an approximation within certain restrictions,
- it is of interest to consider, how can we perform fully nonlinear estimation?

Motivation

Recall the highly general hidden Markov model, without the simplification of the linear-Gaussian restriction.

\[ \begin{align} \pmb{x}_k &= \mathcal{M}_k (\pmb{x}_{k-1}) + \pmb{w}_k \\ \pmb{y}_k &= \mathcal{H}_k (\pmb{x}_k) + \pmb{v}_k \end{align} \]
It is possible to define pure Bayesian estimators for the general configuration, with very few assumptions.
- This type of estimator has had a long history in physics and engineering, where they are broadly known as particle filters / smoothers.
These types of estimators are very robust in terms of estimating the nonlinear evolution, and can identify highly-skewed and multi-modal distributions.
The high generality of this type of estimator means that there is very little bias by its construction;
- however, these estimators also tend to be of extremely high variance, due to the lack of any prescribed form like a Gaussian.
This tradeoff means that in order to gain robust estimates as above, the sample size needs to be very large.
- For a small dimensional process / observation model, the large sample size is feasible, and these estimators are very good choices.
However, for too large a system, the computation of standard particle methods becomes unfeasible, unless additional forms of bias are introduced to the estimator.

A sampling approach to nonlinear estimation

Recall that in our discussion of the Kalman filter, we derived that

\[ \begin{align} p(\pmb{x}_k|\pmb{y}_{k:1})\propto p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}), \end{align} \]
so that we interpret:
- Given the last posterior density \( p(\pmb{x}_{k-1}|\pmb{y}_{k-1:1}) \), we can apply the Markov transition kernel
\[ \begin{align} p(\pmb{x}_{k}|\pmb{y}_{k-1:1}) = \int p(\pmb{x}_k|\pmb{x}_{k-1}) p(\pmb{x}_{k-1}|\pmb{y}_{k-1:1})\mathrm{d}\pmb{x}_{k-1} \end{align} \] to obtain the forecast density for \( \pmb{x}_{k|k-1} \).
- We condition based on the likelihood of the observed data, \( \pmb{y}_k \) by multiplication
\[ \begin{align} p(\pmb{y}_k|\pmb{x}_k) p(\pmb{x}_k | \pmb{y}_{k-1:1}); \end{align} \] and
- after re-normalization by a constant, we obtain the posterior \( p(\pmb{x}_k | \pmb{y}_{k:1}) \).
All of these steps are implicitly encoded in the Kalman filter equations for the recursive conditional mean and covariance.
However, the above derivation actually never made use of any linear-Gaussian model assumptions;
- indeed, this only required the Markov assumptions and independence of the observation and model errors.

A sampling approach to nonlinear estimation

Suppose that somehow we had access to the posterior density \( p(\pmb{x}_L\vert \pmb{y}_{L:1}) \).
If we draw \( N_e \) independent, identically distributed (iid) ensemble members from this distribution, \( \{\pmb{x}_L^i\}_{i=1}^{N_e} \),
an empirical representation of the distribution (probability measure) is given by \[ \begin{align} \mathcal{P}_{N_e}(\pmb{x}_L\vert \pmb{y}_{L:1}) = \frac{1}{N_e} \sum_{i=1}^{N_e} \boldsymbol{\delta}_{\pmb{x}^i_L}\left(\mathrm{d}\pmb{x}_L\right), \end{align} \] where

\( \boldsymbol{\delta}_{\pmb{x}^i_L} \) – this the Dirac delta measure centered at the ensemble member \( \pmb{x}_L^i \);

i.e., we write, \[ \begin{align} \int f \boldsymbol{\delta}_{\pmb{x}^i_L} (\mathrm{d}\pmb{x}_L) = f\left(\pmb{x}^i_L\right). \end{align} \]

In the above, the denominator \( \frac{1}{N_e} \) represents that all point volumes have equal mass or weights, so that the measure integrates to one.

Then for any statistic \( f \) of the posterior, we can recover an estimate of its expected value directly as \[ \begin{align} \mathbb{E}_{\mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1})} \left[f \right] := \int f \mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1} ) &\approx \int f \mathcal{P}_{N_e}(\pmb{x}_L\vert \pmb{y}_{L:1})\\ &= \frac{1}{N_e}\sum_{i=1}^{N_e} \int f(\pmb{x})\delta_{\pmb{x}^i_L}(\mathrm{d}\pmb{x}_L) = \frac{1}{N_e} \sum_{i=1}^{N_e} f\left(\pmb{x}_L^i\right), \end{align} \]
that is, by taking the empirical average of the statistic evaluated over the ensemble members.

Empirical estimates

The empirical estimate, \[ \mathbb{E}_{\mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1})} \left[f \right] \approx \frac{1}{N_e} \sum_{i=1}^{N_e} f\left(\pmb{x}_L^i\right), \] is also an unbiased estimator of the statistic \( f \) .
If the posterior variance of \( f(\pmb{x}) \) satisfies, \[ \begin{align} \sigma^2_f = \mathbb{E}_{\mathcal{P}(\pmb{x}_L \vert \pmb{y}_{L:1})}\left[f^2(\pmb{x}_L)\right] - \left(\mathbb{E}_{\mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1})}\left[f(\pmb{x}_L)\right]\right)^2 < \infty; \end{align} \]
then the variance of the empirical estimate, \[ \begin{align} \mathrm{var}\left(\mathbb{E}_{\mathcal{P}_{N_e}(\pmb{x}\vert \pmb{y})}\left[f\right]\right) = \frac{\sigma_f^2}{N_e}, \end{align} \] where the variance is understood as taken over the possible sample outcomes, \( \pmb{x}_L^i \sim \mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1}) \).
For \[ \sigma_f^2 < \infty \] as above, the Central Limit Theorem tells us \[ \begin{align} \lim_{N_e\rightarrow +\infty}\sqrt{N_e}\left\{ \mathbb{E}_{\mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1})}\left[f\right] - \mathbb{E}_{\mathcal{P}_N(\pmb{x}_L\vert \pmb{y}_{L:1})}\left[f\right]\right\} =N(0, \sigma_f^2), \end{align} \]
i.e., the empirical distribution converges to the true distribution in the weak sense as the ensemble size \( N_e \) gets sufficiently large.
In particular, under very general conditions, the empirical probability measure will produce statistics that converge to the statistics of the true posterior in expected value.

Importance sampling

In practice, we often cannot sample the posterior directly but we may need to sample some other distribution that shares its support.
This idea of sampling another distribution with shared support is known as importance sampling.
Importance sampling is a broadly used Bayesian technique which can be given its own detailed treatment in a course of Bayesian estimation.

We will only go over a high-level view of this concept as it relates to particle filters and smoothers.

We will suppose that we have access, perhaps not to \( \mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1}) \) but instead \( \mathcal{Q}(\pmb{x}_{L}\vert \pmb{y}_{L:1}) \) such that \( \mathcal{P} \ll \mathcal{Q} \),

i.e., by contraposition, \( \mathcal{P}(A\vert \pmb{y}_{L:1})>0 \Rightarrow \mathcal{Q}(A\vert \pmb{y}_{L:1})>0 \).

This above assumption allows us to take the Radon-Nikodym derivative of the true posterior \( \mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1}) \) with respect to the proposal distribution \( \mathcal{Q}(\pmb{x}_L \vert \pmb{y}_{L:1}) \).
The key innovation to the last formulation is that this allows us to evaluate a statistic of the posterior by point volumes but with non-equal importance weights.
For each of the probabilities \( \mathcal{P} / \mathcal{Q} \) define the associated densities as \( p / q \).
Let us define the importance weight function \( w(\pmb{x}_L \vert \pmb{y}_{L:1}) := \frac{p(\pmb{x}_L\vert \pmb{y}_{L:1})}{q(\pmb{x}_L\vert \pmb{y}_{L:1})} \).

We will suppose that the weights can be computed even if \( \mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1}) \) is not available – this will be explained shortly.

Importance sampling

Given importance weights, we can re-write the expected value of some statistic \( f \) of the posterior density as, \[ \begin{align} \mathbb{E}_{\mathcal{P}(\pmb{x}_L\vert \pmb{y}_{L:1})}[f]= \int f(\pmb{x}_L)p(\pmb{x}_L\vert \pmb{y}_{L:1})\mathrm{d}\pmb{x}_L & = \frac{\int f(\pmb{x}_L)p(\pmb{x}_L\vert \pmb{y}_{L:1})\mathrm{d}\pmb{x}_L}{\int p(\pmb{x}_L\vert \pmb{y}_{L:1})\mathrm{d}\pmb{x}_L}\\ &=\frac{\int f(\pmb{x}_{L})w(\pmb{x}_L\vert \pmb{y}_{L:1})q(\pmb{x}_L\vert \pmb{y}_{L:1})\mathrm{d}\pmb{x}_L}{\int w(\pmb{x}_L\vert \pmb{y}_{L:1})q(\pmb{x}_L\vert \pmb{y}_{L:1})\mathrm{d}\pmb{x}_L} , \end{align} \]
The above is shown by the definition of the importance weights above and because for any density, \[ \int p(\pmb{x}_L\vert \pmb{y}_{L:1})\mathrm{d}\pmb{x}_L =1. \]
The benefit for sampling techniques is therefore to draw iid \( \pmb{x}^i_L \sim \mathcal{Q}(\pmb{x}_L\vert \pmb{y}_{L:1}) \) which we assume that we can sample.
The empirical measure is thus, \[ \begin{align} \mathcal{Q}_{N_e} := \frac{1}{N_e} \sum_{i=1}^{N_e} \pmb{\delta}_{\pmb{x}^i_{L}} \end{align} \]
Using the importance weights, then one defines the empirical expected value of \( f \) as, \[ \begin{align} \mathbb{E}_{\mathcal{P}_{N_e}(\pmb{x}_L \vert \pmb{y}_{L:1})}[f] = \frac{ \frac{1}{N_e} \sum_{i=1}^{N_e} f(\pmb{x}^i_L)w(\pmb{x}^i_L\vert \pmb{y}_{L:1})}{\frac{1}{N_e} \sum_{i=1}^{N_e} w(\pmb{x}^i_L \vert \pmb{y}_{L:1})} = \sum_{i=1}^{N_e} f(\pmb{x}^i_L) \tilde{w}^i_L. \end{align} \]
Here, the \( \tilde{w}^i_L := \frac{w(\pmb{x}^i_L \vert \pmb{y}_{L:1})}{\sum_{i=1}^{N_e} w(\pmb{x}^i_L \vert \pmb{y}_{L:1})} \) are defined as the normalized importance weights, ensuring integration to one.

Importance sampling

We will write our empirical estimate of the posterior as, \[ \begin{align} \mathcal{P}_{N_e}(\pmb{x}_L \vert \pmb{y}_{L:1}) := \sum_{i=1}^{N_e} \tilde{w}^i_L\delta_{\pmb{x}^i_L}(\mathrm{d}\pmb{x}_L ). \end{align} \]
Notice, if we take \( f(\pmb{x}) = 1 \), the expected value is indeed one.
Using this formulation, we have an extremely flexible view of the posterior as combination of positions and weights.

In a hidden Markov model, positions correspond to initial conditions for, e.g., a nonlinear, stochastic differential equation \( \mathcal{M} + \pmb{w}_k \).
Therefore, we can evolve all the point volumes, keeping their weights, and construct a forward-in-time density.

For particle filters in physical process models, we have a natural choice of how to find the next prior from the last posterior;

we simply evolve the points within the process model, carrying the importance weights with them.

When we condition on new information, the goal then is to find new appropriate weights, and / or resample positions.
Despite the heavy mathematical machinery, this is an elegant algorithm that has a very natural, empirical interpretation.
The next key is in how we update the weights to condition the sample appropriately

Sequential importance sampling

Using Bayes' Law, we can derive up to proportionality, \[ \begin{align} \tilde{w}^i_L \propto \tilde{w}^i_{L-1} \frac{ {\color{#d24693} {p\left(\pmb{y}_L \vert \pmb{x}^i_{L}\right) } } {\color{#1b9e77} {p\left(\pmb{x}^i_L \vert \pmb{y}_{L-1:1}\right)}} }{q\left(\pmb{x}_L^i \vert \pmb{y}_{L-1:1} \right)}. \end{align} \]
The key again, is how we define a proposal density as above that can be sampled even when the posterior cannot be sampled directly.
As one special choice, we can choose a proposal of \( \mathcal{Q} \) as the forecast-prior distribution \( {\color{#1b9e77} {\mathcal{P}(\pmb{x}_{L}\vert \pmb{y}_{L-1:1})} } \);
in this case, \( p=q \) and we have the weights given recursively by \[ \tilde{w}^i_L \propto \tilde{w}^i_{L-1} {\color{#d24693} {p(\pmb{y}_L \vert \pmb{x}_L^i)} }. \]
The proportionality statement says that:

Suppose we have knowledge of the normalized weights \( \tilde{w}^i_{L-1} \) at time \( t_{L-1} \);
we can generate a forecast for each position \( \pmb{x}_{L-1}^i \) at time \( t_L \) via the model, \[ \pmb{x}_L^i = {\color{#1b9e77} {\mathcal{M}_{L}(}} \pmb{x}_{L-1}^i {\color{#1b9e77} {) + \pmb{w}_L} }. \]
Prior to obtaining the new observation, the forecast weight will remain as \( \tilde{w}_{L-1}^i \);
to condition the weights on the new observation, we apply Bayes' Law, \[ \tilde{w}^i_L \propto \tilde{w}^i_{L-1} {\color{#d24693} {p(\pmb{y}_L \vert \pmb{x}_L^i)} }. \] such that we need only compute \( \tilde{w}^i_{L-1} {\color{#d24693} {p(\pmb{y}_L \vert \pmb{x}_L^i)} } \) for each \( i \) and then re-normalize the weights so they sum to \( 1 \).
This technique is known in the stochastic filtering literature as a sequential importance sampling (SIS) particle filter.

Sequential importance sampling

SIS particle filters are extremely flexible and makes few assumptions on the form of the problem whatsoever;
- however, the primary issue is that the importance weights become extremely skewed very quickly, leading to all the probability mass landing on a single point after only a few iterations.
This is one example of a concept more broadly known in nonlinear filtering as ensemble collapse / filter divergence.
In effect, the empirical estimate becomes overly self-certain, and will no longer be receptive to new data.
Because a single point mass cannot represent the spread of the data, this also cannot be used to represent any of the variation in the estimate.
Finding a method for handling the degeneracy of the weights is explicitly the motivation for the bootstrap particle filter, and implicitly one of the motivations for the ensemble Kalman filter.
- In particular, the weights tend to be of too high variance in the particle filter generally, and one rectification is to impose a bias in the estimate.
We will return to the idea of the ensemble Kalman filter later, but for now consider the classical particle filter rectification.

The bootstrap filter

The method of the bootstrap filter essentially proposes to eliminate the degeneracy of the weights by eliminating ensemble members with weights close to zero and resampling.
At the point of the Bayesian update and re-weighting, one:

eliminates all ensemble members with weights \( \tilde{w}^i < W \) where \( W\ll 1 \) will be some threshold for the weights;
then, make replicates of the higher weighted ensemble members and reset the importance weights all equal to \( \frac{1}{N_e} \);
then the new empirical posterior is then given by, \[ \begin{align} \mathcal{P}_{N_e}(\pmb{x}_{L}\vert \pmb{y}_{L:1}) = \frac{1}{N_e} \sum_{i=1}^{N} N^i \delta_{\pmb{x}^i_L}(\mathrm{d}\pmb{x}_L), \end{align} \] where \( N^i \) is the number of replicates \( \left(N^i\in[0, N_e]\right) \) of sample \( \pmb{x}^i_L \) such that \( \sum_{i=1}^{N} N^i =N_e \)
If the first prior sample is drawn iid, the first weights \( w_0^i \equiv \frac{1}{N_e} \); this gives a complete algorithm for a general hidden Markov model.

Note, how the number of replicates \( N^i \) is chosen is the basis of several different approaches to particle filters.

Basic resampling algorithm

The basic resampling algorithm simply utilizes the strategy of the inverse CDF transformation of a uniform sample.
In particular, the weights \( \tilde{w}^i_k \) at any given time provide an estimate of the empirical CDF for the posterior.
We draw a uniform \( u \) on \( [0,1] \) and find the first index \( i \) for which the total sum of all the weights below index \( i \) fall below the realized value \( u \).
- We select this index and create a particle replicate of the associated \( \pmb{x}_k^i \).
This is repeated until we have \( N_e \) total ensemble members once again, all given equal weights to restart the algorithm.
Notice that assigning the map of \( u \) to the associated particle \( \pmb{x}^{i}_k \) is precisely the inverse empirical CDF map.
In particular, under generic convergence conditions, and the limit in the sample size \( N_e \rightarrow \infty \),
- the ensemble will generate statistics that follow the appropriate theoretical CDF in expectation, giving weak convergence as before.

Systematic resampling algorithm

More commonly, the standard technique for the bootstrap particle filter is the systematic resampling algorithm.
- Rather than using multiple draws of the uniform, this uses a single draw and make a stratified sample of the empirical CDF.
Firstly, we draw \( u^1 \) uniform on \( [0, N_e^{-1} \), i.e., on the restricted range up to one over the ensemble size.
- Then, we make a replicate of the first ensemble member for which the cumulative weight is greater than \( u^1 \).
The first draw creates exactly one “representative” replicate corresponding to all particles with combined weight in the range \( [0,N_e^{-1}] \).
From this point, a new \( u^j \) is defined as \( u^j= u^1 + N_e^{-1}(j-1) \), where the same replication strategy follows:
- the first ensemble member for which the cumulative weight is greater than \( u_j \) is selected as \( j \) ranges up to \( N_e \).
With this strategy, we are guaranteed to draw exactly one “representative” replicate among particles for which the empirical CDF, \( c^i \) falls in the range \( [(j-1)/N_e, j/N_e] \).
Particularly, \( u^j \) is uniform on \( [(j-1)/N, j/N] \), and this decides which particular particle will be replicated in this weight interval.