Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:

This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

- The following topics will be covered in this lecture:
- Acceptance rejection sampling
- Markov Chain Monte Carlo methods

**Particle filters**as seen in the last lecture provide an**extremely flexible Bayesian filtering scheme**.Similar schemes can be constructed to

**sample the joint posterior**for the model state in time, given the time series of observations,\[ \begin{align} p(\pmb{x}_{L:0}|\pmb{y}_{L:1}) \end{align} \] leading to what are known as

**particle smoothing methods**.**Metropolis-Hastings**is key classical scheme provides a**basis for many methods of particle smoothing**.This is a general scheme that is designed to

**sample and arbitrary distribution**, when**proportional distribution can be sampled in place of the target**.This follows the same idea as

**importance sampling**seen in particle filtering, where we will use the proportional density to draw values;- subsequently, we will use an
**evaluation of a proportionality statement**to determine the likelihood of drawing such a value.

- subsequently, we will use an
Together, provided we can sample and

**evaluate the appropriate densities in proportionality**, this gives a scheme that can**sample an arbitrary posterior**.In order to introduce this algorithm, we will need to start with a few more general statistical techniques.

A general statistical technique that will be useful for our later development is known as the

**acceptance-rejection method**of sampling.Suppose that

**\( X \)**and**\( Y \)**are**random variables with density****\( f \)**and**\( g \)**respectively,- furthermore, suppose that there
**exists a constant**\( c \) such that

\[ \begin{align} \frac{f(l)}{g(l)} \leq c \end{align} \] for all \( l \) such that \( f(l) > 0 \).

- Notice that this implies that

\[ \begin{align} \frac{f(l)}{cg(l)} \leq 1 \end{align} \]

- furthermore, suppose that there
Then the acceptance-rejection method can be applied to

**generate the random variable \( X \)**,**provided we have a means to generate \( Y \)**.To simplify the analysis, let's suppose that \( X \) and \( Y \) are actually

**discrete random variables**and \( f/g \) are**mass functions**.- The results are similar for continuous random variables, but require more technical tools to describe the results (such as conditional expectations).

The steps of this

**acceptance-rejection sampling**are performed as:- Generate a random \( y\sim g \).
- Generate a random \( u\sim \mathcal{U}(0, 1) \).
**If**\( u < \frac{f (y)}{cg(y)} \)**accept \( y \) and return \( x = y \)**.**else****reject \( y \) and repeat the above steps again**.

Notice, if we follow the above steps, we recover that

\[ \begin{align} \mathcal{P}(\text{accept} | Y=y) = \mathcal{P}\left( U < \frac{f(Y)}{cg(Y)}|Y=y\right) = \frac{f(y)}{cg(y)} \end{align} \] where the last equality is simply evaluating the cdf of \( U \).

The

**total probability of acceptance**for any iteration is therefore given as the**sum over all possible-to-observe**\( y \) as\[ \begin{align} \sum_{y} \mathcal{P}\left(\text{accept}|y\right) \mathcal{P}(Y=y) = \sum_{y}\frac{f(y)}{cg(y)}g(y) = \frac{1}{c} \end{align} \]

From the last slide, it is easy to see that the

**number of iterations until acceptance**has the**geometric distribution**with mean \( c \).- The geometric distribution gives the probability that a success requires \( k \) independent trials, with each independent trial's success having probability \( r \).
- The probability that the \( k \)-th trial (out of \( k \) repeated trials) is the
**first success**is given as

\[ \begin{align} p(k) = (1-r)^{k-1} r. \end{align} \]

- The geometric distribution in the above has mean \( \frac{1}{r} \), so that
**viewing the probability of acceptance as \( r=c \)**gives the distribution.

In

**expected value**, each sampled value of \( X \)**requires \( c \) iterations of the algorithm**;- for efficiency, one needs to choose \( Y \) as a RV that is easy to simulate and such that \( c \) small.

To see that the

**accepted sample**has the**same distribution as \( X \)**, we apply Bayes’ law for each possible-to-observe \( k \)\[ \begin{align} \mathcal{P}(k|\text{accepted}) = \frac{\mathcal{P}(\text{accepted}|k)g(k)}{\mathcal{P}(\text{accepted})} = \frac{\frac{f(k)}{cg(k)}g(k)}{\frac{1}{c}} = f(k). \end{align} \]

Therefore, this shows in the discrete case how the

**acceptance-rejection scheme**can**sample the proper distribution by proportionality**.

The acceptance-rejection method just discussed will play an important role in the Metropolis-Hastings algorithm,

- however, to get intuition for the rationale of this procedure, we need to extend our understanding of Markov processes.

Consider a

**generic Markovian transition probability**, i.e.,\[ \begin{align} \mathcal{P}(\pmb{x}_{k}\in \mathrm{d}\pmb{x}| \pmb{x}_{k-1:0}) = \mathcal{P}(\pmb{x}_k\in \mathrm{d}\pmb{x} | \pmb{x}_{k-1}). \end{align} \]

Furthermore, we will

**impose that it is possible that a state will make a transition back to itself**, i.e.,\[ \begin{align} \mathcal{P}(\pmb{x}_{k}\in \{\pmb{x}_{k-1}\}| \pmb{x}_{k-1}) \neq 0. \end{align} \]

A key study of

**Markov chains**is to**determine if there exists, and the conditions that permit the existence, of an invariant distribution / measure**.In general, we say that a

**measure \( \mathcal{P} \)**is**invariant with respect to some transformation \( T \) whenever**,\[ \begin{align} \mathcal{P}\left(T^{-1}(\mathrm{d}\pmb{x})\right) = \mathcal{P}(\mathrm{d}\pmb{x}). \end{align} \]

In particular, for a Markov chain it is of interest

**when the iterations of the transition probabilities converge**to an**invariant measure**.

Let us denote an

**invariant measure**for the Markov process as \( \pmb{\pi}^\ast \);- this satisfies the relationship where

\[ \begin{align} \pmb{\pi}^\ast (\mathrm{d}\pmb{y}) = \int \mathcal{P}(\pmb{x}_k\in\mathrm{d}\pmb{y}|\pmb{x}_{k-1})\pmb{\pi}(\pmb{x}_{k-1}) \mathrm{d}\pmb{x}_{k-1} \end{align} \] where \( \pmb{\pi} \) is the

**density with respect to the Lebesgue measure**, i.e., \( \pmb{\pi}^\ast (\mathrm{d}\pmb{y}) = \pmb{\pi}(y)\mathrm{d}\pmb{y} \).To simplify some analysis, we will use

**slightly different notation**to denote the \( k \)-th iteration of the chain as,\[ \begin{align} \mathcal{P}^k(\mathbf{A}|\pmb{x}) :=\int\mathcal{P}^{k-1}( \mathrm{d}\pmb{y}| \pmb{x}) \mathcal{P}(\mathbf{A}|\pmb{y}) \end{align} \] where:

- \( \mathcal{P}^k(\mathbf{A}|\pmb{x}) \) is the transition probability that the
**next instance of the process lies in \( \mathbf{A} \)**given the**last realization \( \pmb{x} \)**. - \( \mathcal{P}^{k-1}( \mathrm{d}\pmb{y}| \pmb{x}) \) is the transition probability that the
**next instance of the process lies in \( \mathrm{d}\pmb{y} \)**given the**last realization \( \pmb{x} \)**. - \( \mathcal{P}(\mathbf{A}|\pmb{y}) \) is the transition probability that the
**next instance of the process lies in \( \mathbf{A} \)**given the**last realization \( \pmb{y} \)**. - The above
**integral is with respect to \( \mathrm{d}\pmb{y} \)**. - The
**chain is initialized**as \( \mathcal{P}^1(\mathbf{A}|\pmb{y})=\mathcal{P}(\mathbf{A}|\pmb{y}) \).

- \( \mathcal{P}^k(\mathbf{A}|\pmb{x}) \) is the transition probability that the
Therefore, in the above, we see this as the

**probability (measure) of \( \mathbf{A}\in\mathbb{R}^{N_x} \)****given \( \pmb{x} \) as we average over a mix over all possible initial conditions recursively**.

Recall our \( k \)-th iterate from the last slide \( \mathcal{P}^k(\mathbf{A}|\pmb{x}) \);

- the
**conditions discussed in the following**will**guarantee that**

\[ \begin{align} \lim_{k\rightarrow \infty} \mathcal{P}^k = \pmb{\pi}^\ast. \end{align} \]

- the
Instead of considering the

**classical Markov chain motivation**, where we wish to**determine the convergence to an unknown invariant distribution**,**Markov chain Monte Carlo (MCMC)**methods**reverse the question**.

We will suppose that the

**invariant density \( \pmb{\pi} \)**that**we wish to sample is known up to proportionality**;- however, we will suppose that the
**transition kernel that converges to this invariant density is unknown**.

- however, we will suppose that the
To generate a sample from \( \pmb{\pi} \),

**MCMC methods derive and utilize a transition kernel**\( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) \) whose**\( k \)-th iterate converges to \( \pmb{\pi}^\ast \)**for large \( k \).The process is started at an arbitrary \( \pmb{x} \) and iterated a large number of times.

- After the
**large number of iterates**, the**realizations of the iterative process are drawn from the invariant density**.

- After the
The

**problem then is to find an appropriate \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) \)**.

We make an

**ansatz for the transition kernel**given as follows for some function \( p(x,y) \)\[ \begin{align} \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) = p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}) \end{align} \] where

- \( p(\pmb{x},\pmb{x}) = \pmb{0} \);
- \( \pmb{\delta}_{\pmb{x}} \) is the
**Dirac delta measure**at \( \pmb{x} \); and - \( r(\pmb{x})=1 - \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} \) is the
**probability that the chain remains at \( \pmb{x} \)**given its last realization being \( \pmb{x} \).

**Requiring that \( r(\pmb{x})\neq 0 \)**means that the integral of \( p(\pmb{x},\pmb{y}) \) with respect to \( \mathrm{d}\pmb{y} \)**may not be equal to one**.- In particular, this need not be a density function.

Assume that \( p(\pmb{x},\pmb{y}) \) has the following

**reversability condition**,\[ \begin{align} \pmb{\pi}(\pmb{x}) p(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) p(\pmb{y},\pmb{x}); \end{align} \]

**this is sufficient**then to show that \( \pmb{\pi} \) is the**invariant density**of \( \mathcal{P}(\mathrm{d}\pmb{x}|\pmb{x}) \).

Therefore, this can be used as a

**criterion for constructing a sampling algorithm**with the**desired convergence property**.We will demonstrate this as follows…

Recall our ansatz \( \mathcal{P}(\mathrm{d}\pmb{y}|\pmb{x}) = p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} + r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathrm{d}\pmb{y}) \), such that

\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int \left[ \int_{\mathbf{A}} p(\pmb{x},\pmb{y})\mathrm{d}\pmb{y} \right]\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} + \int r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathbf{A})\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

Suppose then that we can

**exchange the order of integration**of \( \mathrm{d}\pmb{y} \) and \( \mathrm{d}\pmb{x} \).- Furthermore, notice that, as a
**property of the Dirac delta**, we can write

\[ \begin{align} \int r(\pmb{x}) \pmb{\delta}_{\pmb{x}}(\mathbf{A})\pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

- Furthermore, notice that, as a
Using

**both of the above**, we recover\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{x},\pmb{y}) \pmb{\pi}(\pmb{x}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

Using the

**reversability**, we have\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{y},\pmb{x}) \pmb{\pi}(\pmb{y}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

Recall the last equation,

\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} = \int_{\mathbf{A}} \left[ \int p(\pmb{y},\pmb{x}) \pmb{\pi}(\pmb{y}) \mathrm{d}\pmb{x} \right]\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}. \end{align} \]

Noting that

**\( r \) is defined as**\[ \begin{align} & r(\pmb{x})=1 - \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} \\ \Leftrightarrow & \int p(\pmb{x},\pmb{y}) \mathrm{d}\pmb{y} = 1 - r(\pmb{x}), \end{align} \]

we recover that

\[ \begin{align} \int \mathcal{P}(\mathbf{A}|\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x} &= \int_{\mathbf{A}} \left[1 - r(\pmb{y}) \right]\pmb{\pi}(\pmb{y})\mathrm{d}\pmb{y} +\int_{\mathbf{A}} r(\pmb{x}) \pmb{\pi}(\pmb{x})\mathrm{d}\pmb{x}\\ &=\int_{\mathbf{A}} \pmb{\pi}(\pmb{y})\mathrm{d}\pmb{y}, \end{align} \] so that \( \pmb{\pi} \)

**is invariant with respect to the transition kernel**.

Let's recall the

**reversability condition**,\[ \begin{align} \pmb{\pi}(\pmb{x}) p(\pmb{x},\pmb{y}) = \pmb{\pi}(\pmb{y}) p(\pmb{y},\pmb{x}); \end{align} \]

Intuitively:

- the
**left-hand-side**represents the**unconditional probability of moving from \( \pmb{x} \) to \( \pmb{y} \)**where \( \pmb{x} \) is generated from \( \pmb{\pi} \); and - the
**right-hand-side**represents the**unconditional probability of moving from \( \pmb{y} \) to \( \pmb{x} \)**, where \( \pmb{y} \) is generated by \( \pmb{\pi} \).

- the
The

**reversability condition**says that the**two sides are equal**;- the
**previous derivation**furthermore says that**\( \pmb{\pi}^\ast \) is the invariant measure with respect to this transition kernel**.

- the
This is a sketch of the result that gives the sufficiency property needed to construct a chain that eventually samples the invariant distribution.

Again, the

**goal**is to**construct the transition kernel to eventually sample \( \pmb{\pi}^\ast \)**.In the next lecture, we will discuss how the

**Metropolis-Hastings algorithm**uses the**acceptance rejection technique to construct a sufficient \( p(\pmb{x},\pmb{y}) \) to sample the target, invariant distribution**.