Minimum variance and maximum likelihood estimation Part II

Outline

The following topics will be covered in this lecture:
- A simple example of maximum likelihood estimation
- Bayesian maximum a posteriori estimation

Motivation

In the last lecture, we saw that when our modeled random state \( \pmb{x} \) and some observed piece of data \( \pmb{y} \) are jointly Gaussian, the conditional Gaussian mean is precisely the BLUE.
The BLUE and its covariance parameterized the conditional Gaussian distribution for \( \pmb{x}|\pmb{y} \).
- The Gauss-Markov theorem does not require that the underlying distributions are actually Gaussian, however.
- Without the Gaussian assumption, we can still construct the BLUE and its covariance as discussed already, though it will not generally parameterize the conditional distribution for \( \pmb{x}|\pmb{y} \), which may be non-Gaussian.
However, when the underlying distributions are Gaussian, as above, the conditional mean also coincides with the maximum likelihood estimator.
To explain this notion, we must first introduce the idea of a likelihood function.

Likelihood function
Let \( p_{\pmb{\theta}}(\pmb{x}) \) be a probability density that depends on the parameter vector \( \pmb{\theta} \) as a free variable. If \( \pmb{x}_0 \) is an observed realization of the random variable, then we denote the likelihood function \[ \begin{align} L_{\pmb{x}_0}(\pmb{\theta}):= p_{\pmb{\theta}}(\pmb{x}_0), \end{align} \] i.e., we evaluate the density for \( \pmb{x}_0 \) with respect to the particular choice of \( \pmb{\theta} \) as a free variable.

The definition above simply changes which variable we treat as the argument of the density: the observation \( \pmb{x}_0 \) is held fixed and the parameter \( \pmb{\theta} \) varies.
This provides a means, for an unknown value of the free parameter \( \pmb{\theta} \), to consider which form of the density best matches the observed data.
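As a minimal sketch of this change of argument, assume a scalar Gaussian density with known standard deviation, where the unknown parameter \( \theta \) is the mean. The function names and the observed value below are hypothetical and chosen only for illustration.

```python
import math


def gaussian_density(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)


def likelihood(theta, x0, std=1.0):
    """L_{x0}(theta): the same density, with x0 held fixed and theta as the argument."""
    return gaussian_density(x0, mean=theta, std=std)


x0 = 1.3  # hypothetical observed realization
candidates = [0.0, 1.0, 1.3, 2.0]  # hypothetical candidate values of theta

# Evaluate the likelihood of each candidate for the fixed observation.
values = {theta: likelihood(theta, x0) for theta in candidates}
best_theta = max(values, key=values.get)
```

Among these candidates, the one equal to the observation gives the largest likelihood, since the Gaussian density peaks at its mean.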

Maximum likelihood estimation

If we suppose, furthermore, we have a random sample \( \pmb{x}_{k:0} \), independently and identically distributed according to the parent distribution for some unknown choice of \( \pmb{\theta} \);
- the joint likelihood for the random sample is given by
\[ \begin{align} L_{\pmb{x}_{k:0}}(\pmb{\theta})&:= p_{\pmb{\theta}}(\pmb{x}_{k:0})\\ &=\prod_{i=0}^k p_{\pmb{\theta}}(\pmb{x}_i), \end{align} \] due to independence.
Another way in which we might thus consider an estimate “optimal” is if it maximizes the joint likelihood of our observed data:

Maximum likelihood estimation
Let \( \hat{\pmb{\Theta}} \) be a point estimator for an unknown parameter \( \pmb{\theta} \), depending on the random sample \( \pmb{x}_{k:0} \). We say that \( \hat{\pmb{\Theta}} \) is a maximum likelihood estimator of \( \pmb{\theta} \) if for any other point estimator \( \tilde{\pmb{\Theta}} \), \[ \begin{align} L_{\pmb{x}_{k:0}}\left(\tilde{\pmb{\Theta}}\right)\leq L_{\pmb{x}_{k:0}}\left(\hat{\pmb{\Theta}}\right), \end{align} \] i.e., for any particular realization \( \hat{\pmb{\theta}} \) of the random variable \( \hat{\pmb{\Theta}} \) depending on the outcome of the random sample \( \pmb{x}_{k:0} \), \( \hat{\pmb{\theta}} \) is the value that maximizes the joint density for \( \pmb{x}_{k:0} \).
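To make the definition concrete, the sketch below assumes an exponential parent distribution with unknown rate \( \theta \), for which the maximum likelihood estimate is known to be the reciprocal of the sample mean. The sample values and the search grid are hypothetical; the grid search is only a brute-force stand-in for the maximization.

```python
import math


def exponential_density(x, rate):
    """Density rate * exp(-rate * x) of the exponential distribution, for x >= 0."""
    return rate * math.exp(-rate * x)


def joint_likelihood(rate, sample):
    """Product of the individual densities, valid for an independent sample."""
    value = 1.0
    for x in sample:
        value *= exponential_density(x, rate)
    return value


sample = [0.25, 0.5, 0.25, 1.0]  # hypothetical observed sample
grid = [i / 100 for i in range(1, 501)]  # candidate rates in (0, 5]

# Brute-force maximizer of the joint likelihood over the grid.
rate_hat = max(grid, key=lambda r: joint_likelihood(r, sample))

# Known closed form for this family: one over the sample mean.
rate_closed = len(sample) / sum(sample)
```

No other candidate on the grid attains a larger joint likelihood than `rate_hat`, which is the inequality in the definition restricted to the grid.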

Maximum likelihood estimation

It is important to recognize that the joint likelihood,

\[ \begin{align} L_{\pmb{x}_{k:0}}(\pmb{\theta}) &=\prod_{i=0}^k p_{\pmb{\theta}}(\pmb{x}_i), \end{align} \] can rarely be maximized analytically.
However, we note that \( \log \) is strictly increasing, so that a larger argument always gives a larger output.
It is equivalent, then, to maximize the log-likelihood, defined

\[ \begin{align} \ell_{\pmb{x}_{k:0}}(\pmb{\theta})&:= \log\left(\prod_{i=0}^k p_{\pmb{\theta}}(\pmb{x}_i)\right) \\ &=\sum_{i=0}^k \log\left(p_{\pmb{\theta}}(\pmb{x}_i) \right). \end{align} \]
Furthermore, writing the minus-log-likelihood,

\[ \begin{align} \mathcal{J}(\pmb{\theta}):= - \ell_{\pmb{x}_{k:0}}(\pmb{\theta}) =-\sum_{i=0}^k \log\left(p_{\pmb{\theta}}(\pmb{x}_i) \right), \end{align} \] finding the maximum likelihood estimate is equivalent to an objective function minimization problem in optimization.
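Working with the sum of logarithms also has a practical benefit in floating point arithmetic. The sketch below assumes a scalar Gaussian density with known unit standard deviation and unknown mean, together with a hypothetical deterministic sample. It compares the direct product with the minus-log-likelihood, and then minimizes the latter by a simple grid search.

```python
import math


def gaussian_density(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)


def joint_likelihood(theta, sample, std=1.0):
    """Direct product of the densities; this underflows for long samples."""
    value = 1.0
    for x in sample:
        value *= gaussian_density(x, theta, std)
    return value


def minus_log_likelihood(theta, sample, std=1.0):
    """J(theta) = -sum_i log p_theta(x_i), written out for the Gaussian density."""
    const = len(sample) * math.log(math.sqrt(2.0 * math.pi) * std)
    return const + sum(0.5 * ((x - theta) / std) ** 2 for x in sample)


# Hypothetical sample: 2000 deterministic points scattered around 3.0.
sample = [3.0 + math.sin(i) for i in range(2000)]

product_value = joint_likelihood(3.0, sample)  # every factor is below 1
objective_value = minus_log_likelihood(3.0, sample)  # remains finite

# Grid search for the minimizer of J over candidates in [2.5, 3.5].
grid = [2.5 + i / 100 for i in range(101)]
theta_hat = min(grid, key=lambda t: minus_log_likelihood(t, sample))
sample_mean = sum(sample) / len(sample)
```

Here the product of 2000 factors, each smaller than one, underflows to zero and carries no information about \( \theta \). The minus-log-likelihood stays finite, and its grid minimizer agrees with the sample mean up to the grid spacing.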

Maximum likelihood estimation

Let's consider again the simple example of estimating the fixed, true temperature \( T_t \) from two random observations,

\[ \begin{align} T_1 = T_t + \epsilon_1 & &\epsilon_1 \sim N\left(0, \sigma_1^2\right) \\ T_2 = T_t + \epsilon_2 & & \epsilon_2 \sim N\left(0, \sigma_2^2\right) \end{align} \]
The probability density of an observation \( T_i \) given the true value \( T_t \) and the standard deviation \( \sigma_i \) is given as

\[ \begin{align} p_{\sigma_i, T_t}(T_i) = \frac{1}{\sqrt{2\pi}\sigma_i}e^{-\frac{\left(T_i - T_t\right)^2}{2\sigma_i^2}} \end{align} \]
Equivalently, the likelihood of a candidate value \( T \) for the true temperature, given the observed \( T_i \), is \( L_{\sigma_i,T_i}(T) = p_{\sigma_i,T}\left( T_i\right) \).
Treating the two observations as independent, the joint likelihood is the product of the two densities, and its minus-log-likelihood is equal to

\[ \begin{align} \mathcal{J}(T) =\text{constants} + \frac{1}{2}\left[\frac{\left(T -T_1\right)^2}{\sigma_1^2} + \frac{\left(T -T_2\right)^2}{\sigma_2^2} \right], \end{align} \] where “constants” refer to terms that do not involve the free variable \( T \).
We write the minus-log-likelihood this way, because the constant terms have no bearing on which choice of \( T \) minimizes the above objective function.
Rather, we see this as a penalty function on the squared deviation of \( T \) from each observation, weighted by the observation precisions \( 1/\sigma_i^2 \).

Maximum likelihood estimation

The derivative of \( \mathcal{J} \) with respect to \( T \) equals zero precisely where

\[ \begin{align} &0 = \frac{T-T_1}{\sigma_1^2} + \frac{T-T_2}{\sigma_2^2} \\ \Leftrightarrow & T= \frac{T_1 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} + \frac{T_2 \sigma_1^2}{\sigma_1^2 + \sigma_2^2} \end{align} \] as with the minimum variance estimator.
This agreement between the maximum likelihood and minimum variance estimates is actually a general property for Gaussian distributions.
It follows from the geometry of the Gaussian: its density has a single peak at the mean and is symmetric about this value.
This means that the mean of the Gaussian and the mode (the density maximizing value) always coincide.
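The two-observation temperature example can be checked numerically. The sketch below uses hypothetical readings and standard deviations, and compares the closed-form precision-weighted average with a generic one-dimensional minimization of \( \mathcal{J}(T) \). The golden-section search is just one convenient choice of minimizer, not a method prescribed by the lecture.

```python
def minus_log_likelihood(T, T1, T2, sigma1, sigma2):
    """J(T) for the two observations, without the constant terms."""
    return 0.5 * ((T - T1) ** 2 / sigma1**2 + (T - T2) ** 2 / sigma2**2)


def closed_form_estimate(T1, T2, sigma1, sigma2):
    """Precision-weighted average obtained by setting dJ/dT = 0."""
    total = sigma1**2 + sigma2**2
    return T1 * sigma2**2 / total + T2 * sigma1**2 / total


def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Generic minimizer for a unimodal function f on the interval [lo, hi]."""
    ratio = (5**0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return 0.5 * (a + b)


T1, T2 = 20.0, 22.0  # hypothetical temperature readings
sigma1, sigma2 = 1.0, 2.0  # hypothetical observation standard deviations

T_closed = closed_form_estimate(T1, T2, sigma1, sigma2)
T_numeric = golden_section_minimize(
    lambda T: minus_log_likelihood(T, T1, T2, sigma1, sigma2), 15.0, 27.0
)
```

With these values the estimate lies closer to \( T_1 \), the reading with the smaller standard deviation, and the numerical minimizer matches the closed form up to the search tolerance.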
In this simple example, we again assumed that \( T_t \) was a fixed, unknown value;
- however, we want to consider again the case where \( \pmb{x} \) is actually a random variable due to uncertain initial data and evolution.
If we generally suppose that we have a joint density for \( \pmb{x} \) and \( \pmb{y} \), we can instead write this as a case of Bayesian maximum a posteriori estimation.

Maximum a posteriori estimation

Let's consider the relationship of conditional probability, supposing we have a joint density on the vectors \( \pmb{x} \) and \( \pmb{y} \),

\[ \begin{align} p(\pmb{x},\pmb{y})& = p(\pmb{x}|\pmb{y}) p(\pmb{y}), \\ p(\pmb{x},\pmb{y})&= p(\pmb{y}|\pmb{x}) p(\pmb{x}), \end{align} \]
which together give Bayes' law as follows.

\[ \begin{align} p(\pmb{x}|\pmb{y}) = \frac{p(\pmb{y}|\pmb{x}) p(\pmb{x})}{p(\pmb{y})}. \end{align} \]
Viewing this like maximum likelihood estimation, we can find the value \( \hat{\pmb{x}} \) that maximizes the conditional density for \( \pmb{x}|\pmb{y} \).
Therefore, up to proportionality, we say

\[ \begin{align} p(\pmb{x}|\pmb{y})\propto p(\pmb{y}|\pmb{x}) p(\pmb{x}); \end{align} \] where
- \( p(\pmb{x}|\pmb{y}) \) is known as the posterior;
- \( p(\pmb{y}|\pmb{x}) \) is known as the likelihood of the data; and
- \( p(\pmb{x}) \) is the prior knowledge of \( \pmb{x} \).
Particularly, it is thus sufficient to maximize the product of the likelihood and the prior to find \( \hat{\pmb{x}} \) that maximizes the posterior.
- Again, the marginal density \( p(\pmb{y}) \) for the observed data makes no difference in the maximal solution with respect to \( \pmb{x} \).
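As a minimal sketch of this point, assume a scalar state with a Gaussian prior and an observation \( y = x + \epsilon \) with Gaussian error. All numerical values are hypothetical. The code maximizes the product of likelihood and prior on a grid, then normalizes by a Riemann-sum approximation of \( p(y) \) and maximizes again.

```python
import math


def gaussian_density(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)


prior_mean, prior_std = 1.0, 1.0  # hypothetical prior for x
obs_std = 0.5  # hypothetical observation error standard deviation
y = 2.0  # hypothetical observed value

dx = 0.001
grid = [-5.0 + i * dx for i in range(10001)]  # candidate states in [-5, 5]

# Likelihood times prior, evaluated at every candidate state.
unnormalized = [
    gaussian_density(y, x, obs_std) * gaussian_density(x, prior_mean, prior_std)
    for x in grid
]

# Riemann-sum approximation of the marginal density p(y).
evidence = sum(unnormalized) * dx
posterior = [u / evidence for u in unnormalized]

i_unnormalized = max(range(len(grid)), key=unnormalized.__getitem__)
i_posterior = max(range(len(grid)), key=posterior.__getitem__)
x_map = grid[i_unnormalized]
```

Dividing by the positive constant \( p(y) \) rescales the curve without moving its peak, so both searches select the same grid point. In this Gaussian case that point also matches the precision-weighted combination of the prior mean and the observation.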

Maximum a posteriori estimation

We recall that if \( p(\pmb{x},\pmb{y}) \) is jointly Gaussian, then the posterior \( p(\pmb{x}|\pmb{y}) \) is also Gaussian.
Therefore, the conditional Gaussian mean is both the minimum variance and maximum a posteriori estimator.
Particularly, if we recall that for jointly Gaussian distributed variables,

\[ \begin{align} \begin{pmatrix} \pmb{x} \\ \pmb{y} \end{pmatrix} \sim N\left(\begin{pmatrix}\overline{\pmb{x}} \\ \overline{\pmb{y}}\end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{\pmb{x}} & \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \\ \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}}^\top & \boldsymbol{\Sigma}_{\pmb{y}} \end{pmatrix} \right), \end{align} \]
the posterior for \( \pmb{x} \) given a particular realization of \( \pmb{y} \), with \( \pmb{\delta}_{\pmb{y}} := \pmb{y} - \overline{\pmb{y}} \), is given by the conditional density, proportional to

\[ \begin{align} p(\pmb{x}|\pmb{y}) \propto \exp\left\{-\frac{1}{2}\parallel \pmb{x} - \overline{\pmb{x}} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}}\boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}}\parallel_{\boldsymbol{\Sigma}_{\pmb{x}} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}}^\top}^2\right\}. \end{align} \]
Because the exponent above is a quadratic penalty on how far \( \pmb{x} \) lies from the conditional mean, the density is clearly maximized at the conditional mean.
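A small numerical check of this statement can be sketched with NumPy. The means, covariance blocks and realization of \( \pmb{y} \) below are hypothetical, and the lower-left block of the joint covariance is taken to be the transpose of the cross covariance. The check uses the fact that, for fixed \( \pmb{y} \), the maximizer in \( \pmb{x} \) of the joint density is the point where the gradient of the joint quadratic form vanishes.

```python
import numpy as np

# Hypothetical joint Gaussian with x in R^2 and y in R^1.
x_bar = np.array([1.0, -1.0])
y_bar = np.array([0.5])
Sigma_x = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_xy = np.array([[0.8], [0.2]])
Sigma_y = np.array([[1.5]])
y = np.array([1.7])  # hypothetical realization of y

# Conditional mean and covariance of x given y.
delta_y = y - y_bar
gain = Sigma_xy @ np.linalg.inv(Sigma_y)
cond_mean = x_bar + gain @ delta_y
cond_cov = Sigma_x - gain @ Sigma_xy.T

# Joint covariance and its inverse (the joint precision matrix).
Sigma = np.block([[Sigma_x, Sigma_xy], [Sigma_xy.T, Sigma_y]])
precision = np.linalg.inv(Sigma)

# Gradient in x of the joint minus-log-density, evaluated at the conditional mean.
delta = np.concatenate([cond_mean - x_bar, delta_y])
grad_x = (precision @ delta)[:2]
```

The gradient vanishes at the conditional mean up to rounding, and the upper-left block of the joint precision matrix is the inverse of the conditional covariance.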
However, the conditional mean is not always the maximum a posteriori estimator for a generic density.
Nonetheless, the Bayesian proportionality statement

\[ \begin{align} p(\pmb{x}|\pmb{y}) \propto p(\pmb{y}|\pmb{x}) p(\pmb{x}) \end{align} \] gives a very flexible means to construct a Bayesian maximum a posteriori estimator.
Taking the minus-log of this product once again, we obtain a more general objective function minimization problem.
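To illustrate both remarks, the sketch below assumes a non-Gaussian setting: a scalar state with an exponential prior on \( x \geq 0 \) and a Gaussian observation error. The prior family and all numerical values are hypothetical choices. The objective is \( \mathcal{J}(x) = -\log p(y|x) - \log p(x) \) with the terms not involving \( x \) dropped.

```python
import math


def minus_log_posterior(x, y, obs_std, rate):
    """J(x) = -log p(y|x) - log p(x), without the terms that do not involve x."""
    return 0.5 * ((y - x) / obs_std) ** 2 + rate * x


y, obs_std, rate = 2.0, 1.0, 0.5  # hypothetical observation and parameters

dx = 0.001
grid = [i * dx for i in range(12001)]  # the prior only supports x >= 0

# Maximum a posteriori estimate: minimizer of the objective on the grid.
objective = [minus_log_posterior(x, y, obs_std, rate) for x in grid]
x_map = grid[min(range(len(grid)), key=objective.__getitem__)]

# Setting dJ/dx = 0 on x >= 0 gives this closed form for comparison.
x_map_closed = max(0.0, y - rate * obs_std**2)

# Posterior mean from the unnormalized posterior exp(-J) on the same grid.
weights = [math.exp(-j) for j in objective]
posterior_mean = sum(x * w for x, w in zip(grid, weights)) / sum(weights)
```

Because this posterior is a Gaussian shape cut off at zero, it is no longer symmetric about its peak, and the posterior mean lies above the maximum a posteriori estimate.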