Metropolis-Hastings Part II

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- Metropolis-Hastings and acceptance rejection
- The Metropolis-Hastings chain
- The Metropolis-Hastings candidate generating density
- The Metropolis-Hastings algorithm
- Metropolis-Hastings as data assimilation

Motivation

In the first part, we discussed the development of each of:
1. the acceptance rejection method for sampling; and
2. the Markov chain Monte Carlo (MCMC) method for sampling an invariant distribution.
In this section, we will now discuss how these pieces are put together into the Metropolis-Hastings algorithm.
Likewise, we will discuss how this can be used as a data assimilation technique to sample the posterior.
Particularly, the invariant distribution we would like to sample with this scheme is precisely

\[ \begin{align} \pmb{\pi}(\pmb{x}_{L:0}) = p(\pmb{x}_{L:0}|\pmb{y}_{L:1}) \end{align} \] for an arbitrary time series of model and observation states.
Once we have given a form of the general Metropolis-Hastings algorithm, the application to this specific case will be illustrated.

Metropolis-Hastings and acceptance rejection

As in the acceptance-rejection method, suppose we have a density that can generate candidates.
Since we are dealing with Markov chains, however, we permit that density to depend on the current state of the process.
Accordingly, the candidate-generating density is denoted $ q(\pmb{x}, \pmb{y}) $, where

\[ \begin{align} \int q(\pmb{x},\pmb{y}) \mathrm{d} \pmb{y} = 1. \end{align} \]
This density is to be interpreted as, when a process is at the point $ \pmb{x} $,
- the density generates a value $ \pmb{y} $ from $ q(\pmb{x}, \pmb{y}) $.
If it happens that $ q(\pmb{x}, \pmb{y}) $ itself satisfies the reversibility condition

\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x}) \end{align} \] for all $ \pmb{x}, \pmb{y} $, our search for the Markov chain is over.
However, the reversibility condition is not generically satisfied;
- rather, we might find for some $ \pmb{x},\pmb{y} $ that
\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y}) > \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x}). \end{align} \]
In this case, somewhat loosely, the process moves from $ \pmb{x} $ to $ \pmb{y} $ too often and from $ \pmb{y} $ to $ \pmb{x} $ too infrequently.

Metropolis-Hastings and acceptance rejection

One way to correct this is to reduce the number of moves from $ \pmb{x} $ to $ \pmb{y} $ by introducing a probability

$$\begin{align} \alpha(\pmb{x}, \pmb{y}) < 1 \end{align}$$
where we refer to $ \alpha(\pmb{x},\pmb{y}) $ as the probability of move.
If the move is not made, the process again returns $ \pmb{x} $ as a value from the target distribution.
This is in contrast with the acceptance rejection method, where if a $ \pmb{y} $ is rejected, a new pair $ (\pmb{y}, u) $ is drawn independently of the previous value of $ \pmb{y} $.
Thus transitions from $ \pmb{x} $ to $ \pmb{y} $, where $ \pmb{y} \neq \pmb{x} $, are made according to Metropolis-Hastings by

\[ \begin{align} p_{\mathrm{MH}}(\pmb{x},\pmb{y}):= q(\pmb{x},\pmb{y}) \alpha(\pmb{x},\pmb{y}) \end{align} \] when $ \pmb{x}\neq \pmb{y} $ for some $ \alpha $ yet-to-be-determined.

Metropolis-Hastings and acceptance rejection

Consider again the inequality,

\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y}) > \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x}). \end{align} \]
This tells us that a move from $ \pmb{y} $ to $ \pmb{x} $ is not made often enough and we should therefore define $ \pmb{\alpha}(\pmb{y}, \pmb{x}) $ to be as large as possible.
- since this is a probability, the upper limit is $ \alpha(\pmb{y},\pmb{x})=1 $.
The probability of move $ \alpha(\pmb{x}, \pmb{y}) $ is determined by requiring that

\[ \begin{align} p_{\mathrm{MH}}(\pmb{x}, \pmb{y}) \pmb{\pi}(\pmb{x}) = p_{\mathrm{MH}}(\pmb{y}, \pmb{x}) \pmb{\pi}(\pmb{y}) \end{align} \] i.e., such that it satisfies the reversibility condition.
Notice, by substitution, we recover

\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y})\alpha(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x})\alpha(\pmb{y},\pmb{x}), \end{align} \]
- but we set $ \alpha(\pmb{y},\pmb{x})=1 $ as discussed above.
Therefore, we define the appropriate probability of move $ \alpha(\pmb{x},\pmb{y}) $ as

\[ \begin{align} \alpha(\pmb{x},\pmb{y}) := \frac{\pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x})}{\pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y})} \end{align} \]
If the above inequality is reversed, we may simply reverse the argument.

The Metropolis-Hastings chain

The construction of the probabilities of move as before,

\[ \begin{align} \alpha(\pmb{x},\pmb{y}) := \frac{\pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x})}{\pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y})} & & \alpha(\pmb{y},\pmb{x}) := 1 \end{align} \] are defined to ensure that $ p_{\mathrm{MH}} $ is reversible with respect to the invariant distribution.
This again will ensure that the Metropolis-Hastings chain will converge to the appropriate invariant distribution $ \pmb{\pi}^\ast $ given sufficiently many iterates of the process.
Thus, the criterion to ensure reversability becomes

\[ \begin{align} \alpha(\pmb{x},\pmb{y}) := \begin{cases} \mathrm{min}\left[\frac{\pmb{\pi}(\pmb{y})q(\pmb{y},\pmb{x})}{\pmb{\pi}(\pmb{x})q(\pmb{x},\pmb{y})}, 1 \right] & \text{if}\quad\pmb{\pi}(\pmb{x})q(\pmb{x},\pmb{y}) > 0\\ 1 & \text{else } \end{cases} \end{align} \]

The Metropolis-Hastings chain

To complete the definition of the transition kernel for the Metropolis-Hastings chain,
- we must consider the possible nonzero probability that the process remains at $ \pmb{x} $.
As defined previously, we wrote

\[ \begin{align} r(\pmb{x})=1 - \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} \end{align} \]
Consequently, the transition kernel of the Metropolis-Hastings chain is given as

\[ \begin{align} \mathcal{P}_{\mathrm{MH}}(\mathrm{d}\pmb{y}| \pmb{x}):= q(\pmb{x},\pmb{y})\alpha(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + \left[1 - \int q(\pmb{x},\pmb{y})\alpha(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} \right] \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}) \end{align} \] by applying our construction to the previous ansatz.
By constructing the chain in this way, we guarantee that this converges to the invariant distribution $ \pmb{\pi}^\ast $ after sufficiently many iterates of the chain.

The Metropolis-Hastings candidate generating density

We should discuss a few remarks about the last construction:
1. The Metropolis-Hastings algorithm is specified by its candidate-generating density $ q(\pmb{x},\pmb{y}) $ which we have yet to discuss how this is selected.
2. If a candidate value is rejected, the current value is taken as the next item in the sequence.
3. The calculation of $ \alpha(\pmb{x}, \pmb{y}) $ does not actually require knowledge of the invariant density $ \pmb{\pi} $ precisely, only up to proportionality.
4. If generating density is symmetric, i.e., $ q(\pmb{x},\pmb{y})=q(\pmb{y},\pmb{x}) $, the probability of move reduces to the above ratio alone.

The Metropolis-Hastings candidate generating density

From the last statement, we had $ \frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})} \geq 1 $ implies the chain moves to $ \pmb{y} $;or
for $ \frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})} < 1 $ the chain moves to $ \pmb{y} $ with precisely this probability.
Intuitively, this says that if the jump goes uphill, the selection is automatic, but if the jump is downhill, this selection occurs with non-zero probability.

Courtesy of Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4), 327-335.

For the figure to the right, this says that the jump from $ \pmb{x} $ to $ \pmb{y}_1 $ is automatic.
However, the jump from $ \pmb{x} $ to $ \pmb{y}_2 $ is made with the probability $ \frac{\pmb{\pi}(\pmb{y}_2)}{\pmb{\pi}(\pmb{x})} $.
This is essentially the original algorithm proposed by Metropolis et al., and forms the basis of other optimization techniques such as simulated annealing.

The draws are regarded as a sample from the target density $ \pmb{\pi}(\pmb{x}) $ only after the chain has passed a transient stage.

This allows the initial condition generating the process to be ignored in terms of its impact on the subsequent statistics.

The Metropolis-Hastings algorithm

Provided we have constructed the Metropolis-Hastings chain according to the above, this is guaranteed to converge to the invariant measure $ \pmb{\pi}^\ast $ under very general regularity conditions:

Irreducibility – if $ \pmb{x} $ and $ \pmb{y} $ are in the domain of $ \pmb{\pi} $, it must be possible to move from $ \pmb{x} $ to $ \pmb{y} $ in a finite number of moves.
Aperiodicity – at any given time, the return time from a state back to itself isn’t given by a fixed integer.

Together, these give a mixing property of the system that is similar to ergodicity.

These conditions are usually satisfied if $ q(\pmb{x}, \pmb{y}) $ has a positive density on the same support as that of $ \pmb{\pi} $.
However, these do not guarantee the rate of convergence, and various diagnostics are used in practice to determine if the chain has reached the invariant distribution.
The Metropolis-Hastings algorithm is given by the following steps, initializing an arbitrary $ \pmb{x}^0 $ and a burn-in of $ N_{\mathrm{burn}}< N $:
- Repeat for $ j=1,2,\cdots,N $:
- Return $ \{\pmb{x}^{N_{\mathrm{burn}}}, \cdots, \pmb{x}^N\} $.
The final aspect then is simply how to specify $ q(\pmb{x},\pmb{y}) $ and how this can be used for data assimilation.

The Metropolis-Hastings algorithm

One simple choice for the $ q(\pmb{x},\pmb{y}) $ that is symmetric in $ \pmb{x},\pmb{y} $ is as follows.
Suppose that $ \phi $ is the multivariate Gaussian density with mean zero and some selected covariance.
We take $ q(\pmb{x},\pmb{y}) = \phi(\pmb{y} - \pmb{x}) $;
- the candidate $ \pmb{y} $ is thus drawn as $ \pmb{y} = \pmb{x} + \pmb{z} $, where $ \pmb{z} $ is called the increment.
Particularly, we take $ \pmb{z}\sim \phi $ so that this simply becomes a random walk with Gaussian noise, used to explore the state space.
Therefore, with the symmetric choice above, the probability of move is given by

\[ \begin{align} \alpha(\pmb{x},\pmb{y}) = \mathrm{min}\left[\frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})},1 \right]. \end{align} \]
This fully specifies the Metropolis-Hastings algorithm if we have knowledge of $ \pmb{\pi} $ up to proportionality.
The last step is now to identify how this is related to the Bayesian smoothing posteior…

Metropolis-Hastings as data assimilation

In particular, recall that recursively applying the Markov assumption and independence assumptions, we can write \[ \begin{align} p(\pmb{x}_{L:0} \vert \pmb{y}_{L:1})& \propto \left[ \prod_{k=1}^L p(\pmb{y}_k \vert \pmb{x}_k ) \right]\left[\prod_{k=1}^{L} p(\pmb{x}_k \vert \pmb{x}_{k-1})\right]p\left(\pmb{x}_0\right) \end{align} \]
where in the above the joint posterior is proportional to

the product of the likelihoods of the time series data; with
the product of the (model state) transition probabilities; with
the prior for the initial condition.

This was used to frame the traditional 4D optimization cost function for perfect models.
However, this above result only requires the general hidden Markov model framework to derive the proportionality, \[ \begin{align} \pmb{x}_k &= \mathcal{M}_k (\pmb{x}_{k-1}) + \pmb{w}_k \\ \pmb{y}_k &= \mathcal{H}_k (\pmb{x}_k) + \pmb{v}_k \end{align} \] with arbitrary error distributions.
Therefore, we evaluate $ \pmb{\pi} $ up to proportionality using the right-hand-side, by sampling the prior, evaluating the joint likelihood of the data given the simulation through the hidden Markov model, and thus compute \[ \begin{align} \alpha(\pmb{x},\pmb{y}) = \mathrm{min}\left[\frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})},1 \right]. \end{align} \]
Given enough simulation time, we can sample an arbitrary joint posterior, giving an empirical representation.
This also obviously extends to joint state-parameter estimation with the extended state formalism.