Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:

This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

- The following topics will be covered in this lecture:
- Metropolis-Hastings and acceptance rejection
- The Metropolis-Hastings chain
- The Metropolis-Hastings candidate generating density
- The Metropolis-Hastings algorithm
- Metropolis-Hastings as data assimilation

In the

**first part**, we discussed the development of each of:- the
**acceptance rejection method for sampling**; and - the
**Markov chain Monte Carlo (MCMC) method for sampling an invariant distribution**.

- the
In this section, we will now discuss how these pieces are

**put together into the Metropolis-Hastings algorithm**.Likewise, we will discuss how this can be used as a

**data assimilation technique**to**sample the posterior**.Particularly, the

**invariant distribution**we would like to sample with this scheme is precisely\[ \begin{align} \pmb{\pi}(\pmb{x}_{L:0}) = p(\pmb{x}_{L:0}|\pmb{y}_{L:1}) \end{align} \]

**for an arbitrary time series of model and observation states**.Once we have given a form of the general Metropolis-Hastings algorithm, the application to this specific case will be illustrated.

As in the

**acceptance-rejection method**,**suppose we have a density that can generate candidates**.Since we are dealing with Markov chains, however, we permit that density to depend on the current state of the process.

Accordingly, the

**candidate-generating density**is denoted \( q(\pmb{x}, \pmb{y}) \), where\[ \begin{align} \int q(\pmb{x},\pmb{y}) \mathrm{d} \pmb{y} = 1. \end{align} \]

This density is to be interpreted as,

**when a process is at the point**\( \pmb{x} \),- the
**density generates a value**\( \pmb{y} \) from \( q(\pmb{x}, \pmb{y}) \).

- the
If it happens that \( q(\pmb{x}, \pmb{y}) \) itself satisfies the

**reversibility condition**\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x}) \end{align} \] for all \( \pmb{x}, \pmb{y} \),

**our search for the Markov chain is over**.However,

**the reversibility condition is not generically satisfied**;- rather, we might find for some \( \pmb{x},\pmb{y} \) that

\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y}) > \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x}). \end{align} \]

In this case, somewhat loosely, the

**process moves from \( \pmb{x} \) to \( \pmb{y} \) too often**and**from \( \pmb{y} \) to \( \pmb{x} \) too infrequently**.

One way to correct this is to reduce the number of moves from \( \pmb{x} \) to \( \pmb{y} \) by

**introducing a probability**$$\begin{align} \alpha(\pmb{x}, \pmb{y}) < 1 \end{align}$$

where we refer to \( \alpha(\pmb{x},\pmb{y}) \) as the

**probability of move**.**If the move is not made**, the process again**returns \( \pmb{x} \) as a value from the target distribution**.This is in contrast with the

**acceptance rejection method**, where if a \( \pmb{y} \) is rejected, a new pair \( (\pmb{y}, u) \) is**drawn independently of the previous value of \( \pmb{y} \)**.Thus transitions from \( \pmb{x} \) to \( \pmb{y} \), where \( \pmb{y} \neq \pmb{x} \), are made according to

**Metropolis-Hastings**by\[ \begin{align} p_{\mathrm{MH}}(\pmb{x},\pmb{y}):= q(\pmb{x},\pmb{y}) \alpha(\pmb{x},\pmb{y}) \end{align} \] when \( \pmb{x}\neq \pmb{y} \) for some \( \alpha \)

**yet-to-be-determined**.

Consider again the inequality,

\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y}) > \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x}). \end{align} \]

This tells us that a move from \( \pmb{y} \) to \( \pmb{x} \) is

**not made often enough**and we should therefore define \( \pmb{\alpha}(\pmb{y}, \pmb{x}) \) to be**as large as possible**.- since this is a probability, the upper limit is \( \alpha(\pmb{y},\pmb{x})=1 \).

The

**probability of move**\( \alpha(\pmb{x}, \pmb{y}) \) is determined by requiring that\[ \begin{align} p_{\mathrm{MH}}(\pmb{x}, \pmb{y}) \pmb{\pi}(\pmb{x}) = p_{\mathrm{MH}}(\pmb{y}, \pmb{x}) \pmb{\pi}(\pmb{y}) \end{align} \] i.e., such that it

**satisfies the reversibility condition**.Notice, by substitution, we recover

\[ \begin{align} \pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y})\alpha(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x})\alpha(\pmb{y},\pmb{x}), \end{align} \]

- but we set \( \alpha(\pmb{y},\pmb{x})=1 \) as discussed above.

Therefore, we

**define the appropriate probability of move**\( \alpha(\pmb{x},\pmb{y}) \) as\[ \begin{align} \alpha(\pmb{x},\pmb{y}) := \frac{\pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x})}{\pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y})} \end{align} \]

If the above inequality is reversed, we may simply reverse the argument.

The construction of the probabilities of move as before,

\[ \begin{align} \alpha(\pmb{x},\pmb{y}) := \frac{\pmb{\pi}(\pmb{y}) q(\pmb{y},\pmb{x})}{\pmb{\pi}(\pmb{x}) q(\pmb{x},\pmb{y})} & & \alpha(\pmb{y},\pmb{x}) := 1 \end{align} \] are

**defined to ensure that**\( p_{\mathrm{MH}} \)**is reversible with respect to the invariant distribution**.This again will ensure that the Metropolis-Hastings chain will converge to the appropriate invariant distribution \( \pmb{\pi}^\ast \) given sufficiently many iterates of the process.

Thus, the criterion to ensure reversability becomes

\[ \begin{align} \alpha(\pmb{x},\pmb{y}) := \begin{cases} \mathrm{min}\left[\frac{\pmb{\pi}(\pmb{y})q(\pmb{y},\pmb{x})}{\pmb{\pi}(\pmb{x})q(\pmb{x},\pmb{y})}, 1 \right] & \text{if}\quad\pmb{\pi}(\pmb{x})q(\pmb{x},\pmb{y}) > 0\\ 1 & \text{else } \end{cases} \end{align} \]

To complete the definition of the transition kernel for the Metropolis-Hastings chain,

- we must
**consider the possible nonzero probability**that**the process remains at \( \pmb{x} \)**.

- we must
As defined previously, we wrote

\[ \begin{align} r(\pmb{x})=1 - \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} \end{align} \]

Consequently, the transition kernel of the

**Metropolis-Hastings chain**is given as\[ \begin{align} \mathcal{P}_{\mathrm{MH}}(\mathrm{d}\pmb{y}| \pmb{x}):= q(\pmb{x},\pmb{y})\alpha(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + \left[1 - \int q(\pmb{x},\pmb{y})\alpha(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} \right] \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}) \end{align} \] by

**applying our construction to the previous ansatz**.By constructing the chain in this way, we guarantee that this converges to the invariant distribution \( \pmb{\pi}^\ast \) after sufficiently many iterates of the chain.

- We should discuss a few remarks about the last construction:
- The
**Metropolis-Hastings algorithm**is**specified by its candidate-generating density**\( q(\pmb{x},\pmb{y}) \) which we have yet to discuss how this is selected. **If a candidate value is rejected**,**the current value is taken as the next item in the sequence**.- The
**calculation**of \( \alpha(\pmb{x}, \pmb{y}) \)**does not actually require knowledge of the invariant density**\( \pmb{\pi} \) precisely, only up to proportionality. - Indeed, if \( \pmb{\pi} \) is
**known to proportionality**, we can evaluate \[ \begin{align} \frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})} \end{align} \] without exact knowledge,**canceling the normalizing constants in ratio**. - If
**generating density is symmetric**, i.e., \( q(\pmb{x},\pmb{y})=q(\pmb{y},\pmb{x}) \), the**probability of move reduces to the above ratio alone**. - Therefore, if \( \frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})} \geq 1 \) then the chain moves to \( \pmb{y} \);
- otherwise, it moves with probability of \( \frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})} < 1 \) precisely.

- The

- From the last statement, we had \( \frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})} \geq 1 \) implies the chain moves to \( \pmb{y} \);or
- for \( \frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})} < 1 \) the chain moves to \( \pmb{y} \) with precisely this probability.
- Intuitively, this says that
**if the jump goes uphill**, the**selection is automatic**, but**if the jump is downhill**, this**selection occurs with non-zero probability**.

Courtesy of Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4), 327-335.

- For the figure to the right, this says that the jump from \( \pmb{x} \) to \( \pmb{y}_1 \) is automatic.
- However, the jump from \( \pmb{x} \) to \( \pmb{y}_2 \) is made with the probability \( \frac{\pmb{\pi}(\pmb{y}_2)}{\pmb{\pi}(\pmb{x})} \).
- This is essentially the original algorithm proposed by Metropolis et al., and forms the basis of other optimization techniques such as simulated annealing.

- The draws are regarded as a sample from the target density \( \pmb{\pi}(\pmb{x}) \) only after the chain has passed a transient stage.
- This allows the
**initial condition**generating the process**to be ignored in terms of its impact on the subsequent statistics**.

- Provided we have constructed the Metropolis-Hastings chain according to the above, this is
**guaranteed to converge to the invariant measure**\( \pmb{\pi}^\ast \) under**very general regularity conditions**: **Irreducibility**– if \( \pmb{x} \) and \( \pmb{y} \) are in the domain of \( \pmb{\pi} \), it must be possible to move from \( \pmb{x} \) to \( \pmb{y} \) in a finite number of moves.**Aperiodicity**– at any given time, the return time from a state back to itself isn’t given by a fixed integer.- Together, these give a mixing property of the system that is similar to ergodicity.
- These conditions are usually satisfied if \( q(\pmb{x}, \pmb{y}) \) has a positive density on the same support as that of \( \pmb{\pi} \).
- However, these do not guarantee the rate of convergence, and various diagnostics are used in practice to determine if the chain has reached the invariant distribution.
- The
**Metropolis-Hastings algorithm**is given by the following steps, initializing an arbitrary \( \pmb{x}^0 \) and a burn-in of \( N_{\mathrm{burn}}< N \):**Repeat for**\( j=1,2,\cdots,N \):- Generate \( \pmb{y} \) from \( q(\pmb{x}^j, \cdot ) \) and \( u \sim \mathcal{U}(0,1) \).
**If**\( u < \alpha(\pmb{x}^j, \pmb{y}) \):- Set \( \pmb{x}^{j+1} = \pmb{y} \)
**Else**:- Set \( \pmb{x}^{j+1} = \pmb{x}^j \)
**Return**\( \{\pmb{x}^{N_{\mathrm{burn}}}, \cdots, \pmb{x}^N\} \).

- The final aspect then is simply how to specify \( q(\pmb{x},\pmb{y}) \) and how this can be used for data assimilation.

One simple choice for the \( q(\pmb{x},\pmb{y}) \) that is symmetric in \( \pmb{x},\pmb{y} \) is as follows.

Suppose that \( \phi \) is the

**multivariate Gaussian density**with mean zero and some selected covariance.We take \( q(\pmb{x},\pmb{y}) = \phi(\pmb{y} - \pmb{x}) \);

- the candidate \( \pmb{y} \) is thus drawn as \( \pmb{y} = \pmb{x} + \pmb{z} \), where \( \pmb{z} \) is called the increment.

Particularly, we take \( \pmb{z}\sim \phi \) so that this

**simply becomes a random walk with Gaussian noise, used to explore the state space**.Therefore, with the symmetric choice above, the probability of move is given by

\[ \begin{align} \alpha(\pmb{x},\pmb{y}) = \mathrm{min}\left[\frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})},1 \right]. \end{align} \]

This

**fully specifies the Metropolis-Hastings algorithm**if we have knowledge of \( \pmb{\pi} \) up to proportionality.The last step is now to identify how this is related to the Bayesian smoothing posteior…

- In particular, recall that
**recursively applying the Markov assumption and independence assumptions**, we can write \[ \begin{align} p(\pmb{x}_{L:0} \vert \pmb{y}_{L:1})& \propto \left[ \prod_{k=1}^L p(\pmb{y}_k \vert \pmb{x}_k ) \right]\left[\prod_{k=1}^{L} p(\pmb{x}_k \vert \pmb{x}_{k-1})\right]p\left(\pmb{x}_0\right) \end{align} \] - where in the above the
**joint posterior is proportional**to - the
**product of the likelihoods of the time series data**; with - the
**product of the (model state) transition probabilities**; with - the
**prior for the initial condition**. - This was used to frame the traditional 4D optimization cost function for perfect models.
- However, this above result
**only requires the general hidden Markov model framework to derive the proportionality**, \[ \begin{align} \pmb{x}_k &= \mathcal{M}_k (\pmb{x}_{k-1}) + \pmb{w}_k \\ \pmb{y}_k &= \mathcal{H}_k (\pmb{x}_k) + \pmb{v}_k \end{align} \] with**arbitrary error distributions**. - Therefore, we evaluate \( \pmb{\pi} \) up to proportionality using the right-hand-side, by
**sampling the prior**,**evaluating the joint likelihood of the data****given the simulation through the hidden Markov model**, and thus compute \[ \begin{align} \alpha(\pmb{x},\pmb{y}) = \mathrm{min}\left[\frac{\pmb{\pi}(\pmb{y})}{\pmb{\pi}(\pmb{x})},1 \right]. \end{align} \] - Given enough simulation time, we can
**sample an arbitrary joint posterior**,**giving an empirical representation**. - This also obviously extends to joint state-parameter estimation with the extended state formalism.