Metropolis-Hastings Part I

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

  • The following topics will be covered in this lecture:
    • Acceptance rejection sampling
    • Markov Chain Monte Carlo methods

Motivation

  • Particle filters as seen in the last lecture provide an extremely flexible Bayesian filtering scheme.

  • Similar schemes can be constructed to sample the joint posterior for the model state in time, given the time series of observations,

    \[ \begin{align} p(\pmb{x}_{L:0}|\pmb{y}_{L:1}) \end{align} \] leading to what are known as particle smoothing methods.

  • Metropolis-Hastings is key classical scheme provides a basis for many methods of particle smoothing.

  • This is a general scheme that is designed to sample and arbitrary distribution, when proportional distribution can be sampled in place of the target.

  • This follows the same idea as importance sampling seen in particle filtering, where we will use the proportional density to draw values;

    • subsequently, we will use an evaluation of a proportionality statement to determine the likelihood of drawing such a value.
  • Together, provided we can sample and evaluate the appropriate densities in proportionality, this gives a scheme that can sample an arbitrary posterior.

  • In order to introduce this algorithm, we will need to start with a few more general statistical techniques.

Acceptance-rejection sampling

  • A general statistical technique that will be useful for our later development is known as the acceptance-rejection method of sampling.

  • Suppose that \( X \) and \( Y \) are random variables with density \( f \) and \( g \) respectively,

    • furthermore, suppose that there exists a constant \( c \) such that

    \[ \begin{align} \frac{f(l)}{g(l)} \leq c \end{align} \] for all \( l \) such that \( f(l) > 0 \).

    • Notice that this implies that

    \[ \begin{align} \frac{f(l)}{cg(l)} \leq 1 \end{align} \]

  • Then the acceptance-rejection method can be applied to generate the random variable \( X \), provided we have a means to generate \( Y \).

  • To simplify the analysis, let's suppose that \( X \) and \( Y \) are actually discrete random variables and \( f/g \) are mass functions.

    • The results are similar for continuous random variables, but require more technical tools to describe the results (such as conditional expectations).

Acceptance-rejection sampling

  • The steps of this acceptance-rejection sampling are performed as:

    • Generate a random \( y\sim g \).
    • Generate a random \( u\sim \mathcal{U}(0, 1) \).
    • If \( u < \frac{f (y)}{cg(y)} \)
      • accept \( y \) and return \( x = y \).

    • else
      • reject \( y \) and repeat the above steps again.

  • Notice, if we follow the above steps, we recover that

    \[ \begin{align} \mathcal{P}(\text{accept} | Y=y) = \mathcal{P}\left( U < \frac{f(Y)}{cg(Y)}|Y=y\right) = \frac{f(y)}{cg(y)} \end{align} \] where the last equality is simply evaluating the cdf of \( U \).

  • The total probability of acceptance for any iteration is therefore given as the sum over all possible-to-observe \( y \) as

    \[ \begin{align} \sum_{y} \mathcal{P}\left(\text{accept}|y\right) \mathcal{P}(Y=y) = \sum_{y}\frac{f(y)}{cg(y)}g(y) = \frac{1}{c} \end{align} \]

Acceptance-rejection sampling

  • From the last slide, it is easy to see that the number of iterations until acceptance has the geometric distribution with mean \( c \).

    • The geometric distribution gives the probability that a success requires \( k \) independent trials, with each independent trial's success having probability \( r \).
    • The probability that the \( k \)-th trial (out of \( k \) repeated trials) is the first success is given as

    \[ \begin{align} p(k) = (1-r)^{k-1} r. \end{align} \]

    • The geometric distribution in the above has mean \( \frac{1}{r} \), so that viewing the probability of acceptance as \( r=c \) gives the distribution.
  • In expected value, each sampled value of \( X \) requires \( c \) iterations of the algorithm;

    • for efficiency, one needs to choose \( Y \) as a RV that is easy to simulate and such that \( c \) small.
  • To see that the accepted sample has the same distribution as \( X \), we apply Bayes’ law for each possible-to-observe \( k \)

    \[ \begin{align} \mathcal{P}(k|\text{accepted}) = \frac{\mathcal{P}(\text{accepted}|k)g(k)}{\mathcal{P}(\text{accepted})} = \frac{\frac{f(k)}{cg(k)}g(k)}{\frac{1}{c}} = f(k). \end{align} \]

  • Therefore, this shows in the discrete case how the acceptance-rejection scheme can sample the proper distribution by proportionality.

Markov Chain Monte Carlo methods

  • The acceptance-rejection method just discussed will play an important role in the Metropolis-Hastings algorithm,

    • however, to get intuition for the rationale of this procedure, we need to extend our understanding of Markov processes.
  • Consider a generic Markovian transition probability, i.e.,

    \[ \begin{align} \mathcal{P}(\pmb{x}_{k}\in \mathrm{d}\pmb{x}| \pmb{x}_{k-1:0}) = \mathcal{P}(\pmb{x}_k\in \mathrm{d}\pmb{x} | \pmb{x}_{k-1}). \end{align} \]

  • Furthermore, we will impose that it is possible that a state will make a transition back to itself, i.e.,

    \[ \begin{align} \mathcal{P}(\pmb{x}_{k}\in \{\pmb{x}_{k-1}\}| \pmb{x}_{k-1}) \neq 0. \end{align} \]

  • A key study of Markov chains is to determine if there exists, and the conditions that permit the existence, of an invariant distribution / measure.

  • In general, we say that a measure \( \mathcal{P} \) is invariant with respect to some transformation \( T \) whenever,

    \[ \begin{align} \mathcal{P}\left(T^{-1}(\mathrm{d}\pmb{x})\right) = \mathcal{P}(\mathrm{d}\pmb{x}). \end{align} \]

  • In particular, for a Markov chain it is of interest when the iterations of the transition probabilities converge to an invariant measure.

Markov Chain Monte Carlo methods

  • Let us denote an invariant measure for the Markov process as \( \pmb{\pi}^\ast \);

    • this satisfies the relationship where

    \[ \begin{align} \pmb{\pi}^\ast (\mathrm{d}\pmb{y}) = \int \mathcal{P}(\pmb{x}_k\in\mathrm{d}\pmb{y}|\pmb{x}_{k-1})\pmb{\pi}(\pmb{x}_{k-1}) \mathrm{d}\pmb{x}_{k-1} \end{align} \] where \( \pmb{\pi} \) is the density with respect to the Lebesgue measure, i.e., \( \pmb{\pi}^\ast (\mathrm{d}\pmb{y}) = \pmb{\pi}(y)\mathrm{d}\pmb{y} \).

  • To simplify some analysis, we will use slightly different notation to denote the \( k \)-th iteration of the chain as,

    \[ \begin{align} \mathcal{P}^k(\mathbf{A}|\pmb{x}) :=\int\mathcal{P}^{k-1}( \mathrm{d}\pmb{y}| \pmb{x}) \mathcal{P}(\mathbf{A}|\pmb{y}) \end{align} \] where:

    1. \( \mathcal{P}^k(\mathbf{A}|\pmb{x}) \) is the transition probability that the next instance of the process lies in \( \mathbf{A} \) given the last realization \( \pmb{x} \).
    2. \( \mathcal{P}^{k-1}( \mathrm{d}\pmb{y}| \pmb{x}) \) is the transition probability that the next instance of the process lies in \( \mathrm{d}\pmb{y} \) given the last realization \( \pmb{x} \).
    3. \( \mathcal{P}(\mathbf{A}|\pmb{y}) \) is the transition probability that the next instance of the process lies in \( \mathbf{A} \) given the last realization \( \pmb{y} \).
    4. The above integral is with respect to \( \mathrm{d}\pmb{y} \).
    5. The chain is initialized as \( \mathcal{P}^1(\mathbf{A}|\pmb{y})=\mathcal{P}(\mathbf{A}|\pmb{y}) \).
  • Therefore, in the above, we see this as the probability (measure) of \( \mathbf{A}\in\mathbb{R}^{N_x} \) given \( \pmb{x} \) as we average over a mix over all possible initial conditions recursively.

Markov Chain Monte Carlo methods

  • Recall our \( k \)-th iterate from the last slide \( \mathcal{P}^k(\mathbf{A}|\pmb{x}) \);

    • the conditions discussed in the following will guarantee that

    \[ \begin{align} \lim_{k\rightarrow \infty} \mathcal{P}^k = \pmb{\pi}^\ast. \end{align} \]

  • Instead of considering the classical Markov chain motivation, where we wish to determine the convergence to an unknown invariant distribution,

    • Markov chain Monte Carlo (MCMC) methods reverse the question.
  • We will suppose that the invariant density \( \pmb{\pi} \) that we wish to sample is known up to proportionality;

    • however, we will suppose that the transition kernel that converges to this invariant density is unknown.
  • To generate a sample from \( \pmb{\pi} \), MCMC methods derive and utilize a transition kernel \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) \) whose \( k \)-th iterate converges to \( \pmb{\pi}^\ast \) for large \( k \).

  • The process is started at an arbitrary \( \pmb{x} \) and iterated a large number of times.

    • After the large number of iterates, the realizations of the iterative process are drawn from the invariant density.
  • The problem then is to find an appropriate \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) \).

Markov Chain Monte Carlo methods

  • We make an ansatz for the transition kernel given as follows for some function \( p(x,y) \)

    \[ \begin{align} \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) = p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}) \end{align} \] where

    1. \( p(\pmb{x},\pmb{x}) = \pmb{0} \);
    2. \( \pmb{\delta}_{\pmb{x}} \) is the Dirac delta measure at \( \pmb{x} \); and
    3. \( r(\pmb{x})=1 - \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} \) is the probability that the chain remains at \( \pmb{x} \) given its last realization being \( \pmb{x} \).
  • Requiring that \( r(\pmb{x})\neq 0 \) means that the integral of \( p(\pmb{x},\pmb{y}) \) with respect to \( \mathrm{d}\pmb{y} \) may not be equal to one.

    • In particular, this need not be a density function.
  • Assume that \( p(\pmb{x},\pmb{y}) \) has the following reversability condition,

    \[ \begin{align} \pmb{\pi}(\pmb{x}) p(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) p(\pmb{y},\pmb{x}); \end{align} \]

    • this is sufficient then to show that \( \pmb{\pi} \) is the invariant density of \( \mathcal{P}(\mathrm{d}\pmb{x}|\pmb{x}) \).
  • Therefore, this can be used as a criterion for constructing a sampling algorithm with the desired convergence property.

  • We will demonstrate this as follows…

Markov Chain Monte Carlo methods

  • Recall our ansatz \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) = p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}) \), such that

    \[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int \left[ \int_{\mathbf{A}} p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} \right]\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} + \int r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathbf{A})\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

  • Suppose then that we can exchange the order of integration of \( \mathrm{d}\pmb{y} \) and \( \mathrm{d}\pmb{x} \).

    • Furthermore, notice that, as a property of the Dirac delta, we can write

    \[ \begin{align} \int r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathbf{A})\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

  • Using both of the above, we recover

    \[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{x},\pmb{y}) \pmb{\pi}(\pmb{x}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

  • Using the reversability, we have

    \[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{y},\pmb{x}) \pmb{\pi}(\pmb{y}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

Markov Chain Monte Carlo methods

  • Recall the last equation,

    \[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{y},\pmb{x}) \pmb{\pi}(\pmb{y}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

  • Noting that \( r \) is defined as

    \[ \begin{align} & r(\pmb{x})=1 - \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} \\ \Leftrightarrow & \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} = 1 - r(\pmb{x}), \end{align} \]

  • we recover that

    \[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} &= \int_{\mathbf{A}} \left[1 - r(\pmb{y}) \right]\pmb{\pi}(\pmb{y})\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}\\ &=\int_{\mathbf{A}} \pmb{\pi}(\pmb{y})\mathrm{d}\pmb{y}, \end{align} \] so that \( \pmb{\pi} \) is invariant with respect to the transition kernel.

Markov Chain Monte Carlo methods

  • Let's recall the reversability condition,

    \[ \begin{align} \pmb{\pi}(\pmb{x}) p(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) p(\pmb{y},\pmb{x}); \end{align} \]

  • Intuitively:

    • the left-hand-side represents the unconditional probability of moving from \( \pmb{x} \) to \( \pmb{y} \) where \( \pmb{x} \) is generated from \( \pmb{\pi} \); and
    • the right-hand-side represents the unconditional probability of moving from \( \pmb{y} \) to \( \pmb{x} \), where \( \pmb{y} \) is generated by \( \pmb{\pi} \).
  • The reversability condition says that the two sides are equal;

    • the previous derivation furthermore says that \( \pmb{\pi}^\ast \) is the invariant measure with respect to this transition kernel.
  • This is a sketch of the result that gives the sufficiency property needed to construct a chain that eventually samples the invariant distribution.

  • Again, the goal is to construct the transition kernel to eventually sample \( \pmb{\pi}^\ast \).

  • In the next lecture, we will discuss how the Metropolis-Hastings algorithm uses the acceptance rejection technique to construct a sufficient \( p(\pmb{x},\pmb{y}) \) to sample the target, invariant distribution.