Particle filters, as seen in the last lecture, provide an extremely flexible Bayesian filtering scheme.
Similar schemes can be constructed to sample the joint posterior for the model state in time, given the time series of observations,
\[ \begin{align} p(\pmb{x}_{L:0}|\pmb{y}_{L:1}) \end{align} \] leading to what are known as particle smoothing methods.
Metropolis-Hastings is a key classical scheme that provides a basis for many methods of particle smoothing.
This is a general scheme designed to sample an arbitrary distribution when a proportional distribution can be sampled in place of the target.
This follows the same idea as importance sampling seen in particle filtering, where we will use the proportional density to draw values;
Together, provided we can sample and evaluate the appropriate densities in proportionality, this gives a scheme that can sample an arbitrary posterior.
In order to introduce this algorithm, we will need to start with a few more general statistical techniques.
A general statistical technique that will be useful for our later development is known as the acceptance-rejection method of sampling.
Suppose that \( X \) and \( Y \) are random variables with densities \( f \) and \( g \) respectively, and that there exists a constant \( c \) such that
\[ \begin{align} \frac{f(l)}{g(l)} \leq c \end{align} \] for all \( l \) such that \( f(l) > 0 \).
Equivalently, \[ \begin{align} \frac{f(l)}{cg(l)} \leq 1. \end{align} \]
Then the acceptance-rejection method can be applied to generate the random variable \( X \), provided we have a means to generate \( Y \).
To simplify the analysis, let's suppose that \( X \) and \( Y \) are actually discrete random variables and that \( f \) and \( g \) are mass functions.
The steps of this acceptance-rejection sampling are performed as follows: (i) generate a candidate \( Y = y \) from the density \( g \); (ii) generate \( U \) uniformly on \( (0,1) \), independently of \( Y \); (iii) if \( U < \frac{f(Y)}{cg(Y)} \), accept and return \( X = Y \); otherwise, discard the candidate and return to step (i).
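As a concrete illustration, below is a minimal sketch of these steps in Python; the toy target \( f(x) = 6x(1-x) \) on \( [0,1] \), the uniform proposal \( g \), and the bound \( c = 1.5 \) are assumptions made purely for this example and are not part of the lecture.

```python
# Minimal sketch of acceptance-rejection sampling (assumed toy example):
# target density f(x) = 6 x (1 - x) on [0, 1], proposal g = Uniform(0, 1),
# and c = 1.5 bounds f(x) / g(x) for all x in [0, 1].
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Assumed target density on [0, 1]."""
    return 6.0 * x * (1.0 - x)

def sample_acceptance_rejection(c=1.5):
    """Propose Y ~ g until U < f(Y) / (c g(Y)); the accepted Y has density f."""
    while True:
        y = rng.uniform()          # step (i): candidate Y ~ g
        u = rng.uniform()          # step (ii): U ~ Uniform(0, 1), independent of Y
        if u < f(y) / (c * 1.0):   # step (iii): note g(y) = 1 on [0, 1]
            return y               # accept and return X = Y

samples = np.array([sample_acceptance_rejection() for _ in range(10_000)])
print(samples.mean())              # close to the target mean 0.5
```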
Notice, if we follow the above steps, we recover that
\[ \begin{align} \mathcal{P}(\text{accept} | Y=y) = \mathcal{P}\left( U < \frac{f(Y)}{cg(Y)}|Y=y\right) = \frac{f(y)}{cg(y)} \end{align} \] where the last equality is simply evaluating the cdf of \( U \).
The total probability of acceptance for any iteration is therefore given as the sum over all possible-to-observe \( y \) as
\[ \begin{align} \sum_{y} \mathcal{P}\left(\text{accept}|y\right) \mathcal{P}(Y=y) = \sum_{y}\frac{f(y)}{cg(y)}g(y) = \frac{1}{c} \end{align} \]
From the last slide, it is easy to see that the number of iterations until acceptance has the geometric distribution with mean \( c \).
\[ \begin{align} p(k) = (1-r)^{k-1} r, \end{align} \] where \( r = 1/c \) is the probability of acceptance at each iteration.
In expected value, each sampled value of \( X \) requires \( c \) iterations of the algorithm;
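As a quick numerical check, using the same assumed toy example as above (target \( f(x) = 6x(1-x) \), uniform proposal, \( c = 1.5 \)), we can count proposals until the first acceptance and compare the empirical mean to \( c \):

```python
# Empirical check that the number of proposals until acceptance is
# geometric with success probability 1 / c, and hence has mean c.
import numpy as np

rng = np.random.default_rng(1)
c = 1.5
f = lambda x: 6.0 * x * (1.0 - x)     # same assumed toy target as above

def iterations_until_accept():
    """Count proposals until the first acceptance."""
    k = 0
    while True:
        k += 1
        y, u = rng.uniform(), rng.uniform()
        if u < f(y) / c:              # g(y) = 1 on [0, 1]
            return k

counts = [iterations_until_accept() for _ in range(20_000)]
print(np.mean(counts))                # close to c = 1.5
```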
To see that the accepted sample has the same distribution as \( X \), we apply Bayes’ law for each possible-to-observe \( k \)
\[ \begin{align} \mathcal{P}(k|\text{accepted}) = \frac{\mathcal{P}(\text{accepted}|k)g(k)}{\mathcal{P}(\text{accepted})} = \frac{\frac{f(k)}{cg(k)}g(k)}{\frac{1}{c}} = f(k). \end{align} \]
Therefore, this shows in the discrete case how the acceptance-rejection scheme can sample the proper distribution by proportionality.
The acceptance-rejection method just discussed will play an important role in the Metropolis-Hastings algorithm; before developing that algorithm, we first need some background on Markov chains.
Consider a generic Markovian transition probability, i.e.,
\[ \begin{align} \mathcal{P}(\pmb{x}_{k}\in \mathrm{d}\pmb{x}| \pmb{x}_{k-1:0}) = \mathcal{P}(\pmb{x}_k\in \mathrm{d}\pmb{x} | \pmb{x}_{k-1}). \end{align} \]
Furthermore, we will impose that it is possible that a state will make a transition back to itself, i.e.,
\[ \begin{align} \mathcal{P}(\pmb{x}_{k}\in \{\pmb{x}_{k-1}\}| \pmb{x}_{k-1}) \neq 0. \end{align} \]
A key question in the study of Markov chains is whether an invariant distribution / measure exists, and under what conditions it does.
In general, we say that a measure \( \mathcal{P} \) is invariant with respect to some transformation \( T \) whenever,
\[ \begin{align} \mathcal{P}\left(T^{-1}(\mathrm{d}\pmb{x})\right) = \mathcal{P}(\mathrm{d}\pmb{x}). \end{align} \]
In particular, for a Markov chain it is of interest when the iterations of the transition probabilities converge to an invariant measure.
Let us denote an invariant measure for the Markov process as \( \pmb{\pi}^\ast \); it satisfies
\[ \begin{align} \pmb{\pi}^\ast (\mathrm{d}\pmb{y}) = \int \mathcal{P}(\pmb{x}_k\in\mathrm{d}\pmb{y}|\pmb{x}_{k-1})\pmb{\pi}(\pmb{x}_{k-1}) \mathrm{d}\pmb{x}_{k-1}, \end{align} \] where \( \pmb{\pi} \) is the associated density with respect to the Lebesgue measure, i.e., \( \pmb{\pi}^\ast (\mathrm{d}\pmb{y}) = \pmb{\pi}(\pmb{y})\mathrm{d}\pmb{y} \).
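Although the lecture works in a continuous state space, a minimal discrete analogue may help fix ideas; the 3-state transition matrix below is an assumption made only for the example, and the invariant density becomes a probability vector \( \pmb{\pi} \) satisfying \( \pmb{\pi} = \pmb{\pi} P \), the discrete counterpart of the integral relation above.

```python
# Discrete analogue of the invariance relation: for an assumed 3-state
# transition matrix P (rows sum to one), the stationary row vector pi
# satisfies pi = pi @ P.
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

# The stationary distribution is the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

print(pi)          # stationary probability vector
print(pi @ P)      # equals pi up to numerical precision
```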
To simplify some analysis, we will use slightly different notation to denote the \( k \)-th iteration of the chain as,
\[ \begin{align} \mathcal{P}^k(\mathbf{A}|\pmb{x}) :=\int\mathcal{P}^{k-1}( \mathrm{d}\pmb{y}| \pmb{x}) \mathcal{P}(\mathbf{A}|\pmb{y}) \end{align} \] where \( \mathcal{P}^1(\mathbf{A}|\pmb{x}) := \mathcal{P}(\pmb{x}_k \in \mathbf{A} | \pmb{x}_{k-1} = \pmb{x}) \) is the one-step transition kernel.
Therefore, in the above, we interpret \( \mathcal{P}^k(\mathbf{A}|\pmb{x}) \) as the probability (measure) of the set \( \mathbf{A}\subset\mathbb{R}^{N_x} \) after \( k \) transitions starting from \( \pmb{x} \), obtained by recursively averaging over all intermediate states of the chain.
Recall our \( k \)-th iterate from the last slide, \( \mathcal{P}^k(\mathbf{A}|\pmb{x}) \); under suitable regularity conditions on the chain, these iterates converge to the invariant measure,
\[ \begin{align} \lim_{k\rightarrow \infty} \mathcal{P}^k = \pmb{\pi}^\ast. \end{align} \]
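Continuing the assumed discrete 3-state analogue, taking matrix powers of the kernel illustrates this convergence numerically: for large \( k \), every row of \( P^k \) approaches the same stationary vector, regardless of the starting state.

```python
# Illustration of lim_{k -> infinity} P^k = pi* for the assumed 3-state kernel:
# all rows of P^k become (approximately) the stationary distribution.
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

Pk = np.linalg.matrix_power(P, 50)
print(Pk)          # every row is approximately the same stationary vector
```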
Instead of considering the classical Markov chain problem, where we wish to determine convergence to an unknown invariant distribution, MCMC methods reverse the question.
We will suppose that the invariant density \( \pmb{\pi} \) that we wish to sample is known up to proportionality;
To generate a sample from \( \pmb{\pi} \), MCMC methods derive and utilize a transition kernel \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) \) whose \( k \)-th iterate converges to \( \pmb{\pi}^\ast \) for large \( k \).
The process is started at an arbitrary \( \pmb{x} \) and iterated a large number of times.
The problem then is to find an appropriate \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) \).
We make an ansatz for the transition kernel, given as follows for some function \( p(\pmb{x},\pmb{y}) \):
\[ \begin{align} \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) = p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}), \end{align} \] where \( \pmb{\delta}_{\pmb{x}} \) is the Dirac measure concentrated at \( \pmb{x} \), and \( r(\pmb{x}) := 1 - \int p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} \) is the probability that the chain remains at \( \pmb{x} \).
Allowing \( r(\pmb{x})\neq 0 \) means that the integral of \( p(\pmb{x},\pmb{y}) \) with respect to \( \mathrm{d}\pmb{y} \) need not equal one.
Assume that \( p(\pmb{x},\pmb{y}) \) satisfies the following reversibility condition,
\[ \begin{align} \pmb{\pi}(\pmb{x}) p(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) p(\pmb{y},\pmb{x}); \end{align} \]
If this condition holds, then \( \pmb{\pi} \) is invariant with respect to the transition kernel; therefore, it can be used as a criterion for constructing a sampling algorithm with the desired convergence property.
We will demonstrate this as follows…
Recall our ansatz \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) = p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}) \), such that
\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int \left[ \int_{\mathbf{A}} p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} \right]\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} + \int r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathbf{A})\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]
Suppose then that we can exchange the order of integration of \( \mathrm{d}\pmb{y} \) and \( \mathrm{d}\pmb{x} \) in the first term; for the second term, evaluating the Dirac measure gives
\[ \begin{align} \int r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathbf{A})\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]
Using both of the above, we recover
\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{x},\pmb{y}) \pmb{\pi}(\pmb{x}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]
Using the reversibility condition, we have
\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{y},\pmb{x}) \pmb{\pi}(\pmb{y}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]
Recall the last equation,
\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{y},\pmb{x}) \pmb{\pi}(\pmb{y}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]
Noting that \( r \) is defined as
\[ \begin{align} & r(\pmb{x})=1 - \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} \\ \Leftrightarrow & \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} = 1 - r(\pmb{x}), \end{align} \]
we recover that
\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} &= \int_{\mathbf{A}} \left[1 - r(\pmb{y}) \right]\pmb{\pi}(\pmb{y})\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}\\ &=\int_{\mathbf{A}} \pmb{\pi}(\pmb{y})\mathrm{d}\pmb{y}, \end{align} \] so that \( \pmb{\pi} \) is invariant with respect to the transition kernel.
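As a numerical illustration of this argument in an assumed discrete 3-state setting, the sketch below builds a kernel of the ansatz form, with an off-diagonal part satisfying the reversibility condition and the remaining mass \( r \) placed on the diagonal, and checks that \( \pmb{\pi} \) is left invariant; the particular acceptance-ratio choice of \( p \) is just one convenient way to enforce reversibility, anticipating the Metropolis-Hastings construction of the next lecture.

```python
# Assumed discrete 3-state illustration: build P = p + diag(r) with the
# off-diagonal part p satisfying pi(x) p(x, y) = pi(y) p(y, x), then check
# that pi is invariant under P.
import numpy as np

pi = np.array([0.2, 0.3, 0.5])           # assumed target probability vector
q = 0.5 * (np.ones((3, 3)) - np.eye(3))  # symmetric proposal over the other states

# One convenient reversible choice of p (an acceptance-ratio rule that
# anticipates Metropolis-Hastings): p(x, y) = q(x, y) * min(1, pi(y) / pi(x)).
p = q * np.minimum(1.0, pi[None, :] / pi[:, None])
np.fill_diagonal(p, 0.0)

r = 1.0 - p.sum(axis=1)                  # probability of remaining in place
P = p + np.diag(r)                       # full transition kernel of the ansatz form

flow = pi[:, None] * p                   # entries pi(x) p(x, y)
print(np.allclose(flow, flow.T))         # reversibility (detailed balance) holds
print(np.allclose(pi @ P, pi))           # pi is invariant under P
```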
Let's recall the reversibility condition,
\[ \begin{align} \pmb{\pi}(\pmb{x}) p(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) p(\pmb{y},\pmb{x}); \end{align} \]
Intuitively, \( \pmb{\pi}(\pmb{x}) p(\pmb{x},\pmb{y}) \) can be read as the probability flow from \( \pmb{x} \) to \( \pmb{y} \) when the chain is distributed according to \( \pmb{\pi} \), while \( \pmb{\pi}(\pmb{y}) p(\pmb{y},\pmb{x}) \) is the flow in the reverse direction.
The reversibility condition says that these two flows are equal; transitions into and out of each state balance, so the distribution \( \pmb{\pi} \) is left unchanged by the kernel.
This sketch establishes the sufficient condition needed to construct a chain that eventually samples the invariant distribution.
Again, the goal is to construct the transition kernel to eventually sample \( \pmb{\pi}^\ast \).
In the next lecture, we will discuss how the Metropolis-Hastings algorithm uses the acceptance-rejection technique to construct a sufficient \( p(\pmb{x},\pmb{y}) \) to sample the target, invariant distribution.