Minimum variance and maximum likelihood estimation Part I

Outline

  • The following topics will be covered in this lecture:
    • Minimum variance estimation in a simple example
    • Gauss-Markov theorem

Motivation

  • In the first part of this course, we developed a variety of tools for modeling a random variable as it evolves in time according to a Markov model.

  • Our prototypical example is the discrete Gauss-Markov model;

    \[ \begin{align} \pmb{x}_k := \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k \end{align} \]

  • When the governing process law is linear, as above, and the distributions are Gaussian, we have a full solution for the evolution in terms of the joint Gaussian density of the forecast.

  • When we consider a nonlinear initial value problem, if we have a Gaussian prior on the initial data defining the problem,

    • the tangent-linear model gives an approximation for the evolution by a Gauss-Markov model on the space of perturbations (the tangent space).
  • We can furthermore consider SDEs to define a random flow map, when we believe that the shocks to the mechanistic model should be represented continuously in time.

    • If the SDEs are linear, this directly defines a Gauss-Markov model.
    • When the system of SDEs is nonlinear, we can still use the tangent-linear approximation, though the details become more complicated.
  • What we have not introduced, yet, is how we update such a forecast model when new information becomes available.

Motivation

  • Let's add an equation now, supposing that there are observations given sequentially in time:

    \[ \begin{align} \pmb{x}_k& = \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k\\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k \end{align} \] where in the above:

    1. \( \pmb{y}_k\in \mathbb{R}^{N_y} \) is an observed random variable, connected to the modeled state \( \pmb{x}_k \) by
    2. the observation mapping \( \mathbf{H}_k \), which takes the modeled states to the observed variables; these may be entirely different quantities, and often \( N_y \ll N_x \); plus
    3. observation noise \( \pmb{v}_k \sim N(\pmb{0}, \mathbf{R}_k) \), modeling a random, unbiased measurement error.
  • This is to say that,

    \[ \begin{align} \mathbb{E}\left[\pmb{y}_k\right] = \mathbb{E}\left[\mathbf{H}_k \pmb{x}_k + \pmb{v}_k\right] =\mathbf{H}_k \overline{\pmb{x}}_k \end{align} \]

  • In this case, \( \mathbf{M}_k \) represents the (discretized) time evolution between observation times, at which we receive noisy information about the modeled state \( \pmb{x}_k \) in the observed variables \( \mathbb{R}^{N_y} \) (a simulation sketch of this model is given below).

  • The question then is,

    What do we do with the information \( \pmb{y}_k \) to produce a better estimate of the modeled state \( \pmb{x}_k \) and in what sense do we mean a “better” estimate?
  • The next two lectures will develop the statistical tools that allow us to describe the “optimal” estimation algorithm for Gauss-Markov models, i.e., the Kalman filter.
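  • To make the model above concrete, here is a minimal simulation sketch of the signal and observation equations; the matrices \( \mathbf{M}_k \), \( \mathbf{H}_k \) and the noise covariances below are illustrative assumptions, with \( \pmb{w}_k \sim N(\pmb{0}, \mathbf{Q}_k) \) taken as Gaussian in keeping with the Gauss-Markov model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) system matrices: a two-dimensional state observed
# only in its first component, so that N_y = 1 < N_x = 2.
M = np.array([[0.9, 0.1],
              [-0.1, 0.9]])         # state transition M_k (time-invariant here)
H = np.array([[1.0, 0.0]])          # observation map H_k
Q = 0.01 * np.eye(2)                # assumed model error covariance, w_k ~ N(0, Q)
R = 0.25 * np.eye(1)                # observation error covariance, v_k ~ N(0, R)

n_steps = 100
x = np.zeros(2)                     # initial state, taken as known here
xs, ys = [], []

for _ in range(n_steps):
    # signal model: x_k = M_k x_{k-1} + w_k
    x = M @ x + rng.multivariate_normal(np.zeros(2), Q)
    # observation model: y_k = H_k x_k + v_k
    y = H @ x + rng.multivariate_normal(np.zeros(1), R)
    xs.append(x)
    ys.append(y)

xs, ys = np.array(xs), np.array(ys)
print(xs.shape, ys.shape)           # (100, 2) and (100, 1)
```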

Minimum variance estimation

  • To begin our discussion of estimation, we will consider a simpler case in which the modeled variable \( \pmb{x} \) does not depend on time.

  • We will also begin with a scalar form of the estimation problem for simplicity, in which we wish to estimate the temperature of some system.

  • Suppose we have two independent observations \( T_1, T_2 \) of an unknown, scalar temperature denoted \( T_t \).

  • We assume (for the moment) that the temperature is deterministic (and unknown), but the observations will be random, i.e.,

    \[ \begin{align} T_1 &= T_t + \epsilon_1 \\ T_2 &= T_t + \epsilon_2 \end{align} \]

  • We will assume furthermore that

    \[ \begin{align} \epsilon_1 &\sim N\left(0, \sigma_1^2\right)\\ \epsilon_2 &\sim N\left(0, \sigma_2^2\right) \end{align} \]

Minimum variance estimation

  • Suppose that we will estimate the true temperature by a linear combination of the two measurements.

  • That is, we will define an “analyzed” temperature as \( T_a \) where,

    \[ \begin{align} T_a := a_1 T_1 + a_2 T_2 \end{align} \]

  • We will require that the analysis is unbiased

    \[ \begin{align} \mathbb{E}[T_a] = T_t &\Leftrightarrow a_1 + a_2 = 1 \end{align} \]

    • i.e., with \( \mathbb{E}[\epsilon_i]=0 \), we see that \( \mathbb{E}[T_a] = (a_1 + a_2)T_t \).
  • We choose \( a_1 \) and \( a_2 \) to be “optimal” in the sense that they minimize the mean-square-error of the analysis, defined as

    \[ \begin{align} \sigma_a^2 &= \mathbb{E}\left[\left(T_a - T_t\right)^2\right] \\ &=\mathbb{E}\left[ \left( a_1\left\{T_1 - T_t\right\} + a_2\left\{T_2 - T_t\right\}\right)^2 \right]. \end{align} \]

  • Substituting the relationship \( a_2 = 1 - a_1 \),

    \[ \begin{align} \sigma^2_a &= \mathbb{E}\left[\left(a_1 \epsilon_1 + \left\{1-a_1\right\}\epsilon_2\right)^2\right] \\ & =a_1^2 \sigma_1^2 + \left(1 - a_1\right)^2 \sigma_2^2 \end{align} \] as \( \epsilon_1 \) and \( \epsilon_2 \) are uncorrelated.

Minimum variance estimation

  • From the relationship on the last slide,

    \[ \begin{align} \sigma^2_a = a_1^2 \sigma_1^2 + \left(1 - a_1\right)^2 \sigma_2^2 \end{align} \] we can compute the derivative of the variance of the analysis solution above with respect to \( a_1 \).

  • This expression is convex with respect to \( a_1 \), so that it achieves its global minimum exactly at the critical value.

  • Therefore, setting \( \partial_{a_1} \sigma_a^2 = 0 \), we recover,

    \[ \begin{align} & 0= 2 a_1\sigma_1^2 - 2 \sigma_2^2 + 2 a_1 \sigma_2^2 \\ \Leftrightarrow & a_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} \end{align} \]

  • The value for \( a_2 \) can be derived symmetrically in the index.

  • We can thus derive the choice of weights that minimizes the analysis variance as

    \[ \begin{align} a_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} & & a_2 = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} . \end{align} \]
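  • The critical-point computation above can be checked symbolically; the following sketch uses sympy purely as a verification of the algebra, not as part of the estimation method itself.

```python
import sympy as sp

a1, s1, s2 = sp.symbols('a_1 sigma_1 sigma_2', positive=True)

# analysis variance: sigma_a^2 = a_1^2 sigma_1^2 + (1 - a_1)^2 sigma_2^2
var_a = a1**2 * s1**2 + (1 - a1)**2 * s2**2

# critical point: set d(sigma_a^2)/d(a_1) = 0 and solve for a_1
a1_opt = sp.solve(sp.diff(var_a, a1), a1)[0]
print(sp.simplify(a1_opt))        # sigma_2**2/(sigma_1**2 + sigma_2**2)
print(sp.simplify(1 - a1_opt))    # sigma_1**2/(sigma_1**2 + sigma_2**2), i.e., a_2

# the second derivative is strictly positive, confirming a global minimum
print(sp.diff(var_a, a1, 2))      # 2*sigma_1**2 + 2*sigma_2**2
```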

Minimum variance estimation

  • Notice that

    \[ \begin{align} a_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} & & \Leftrightarrow & & a_1 = \frac{\frac{1}{\sigma_1^2}}{\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}} \end{align} \]

  • The inverse of the variance of the observation is known as the precision of the observation.

  • This tells us that we weight the observations in the optimal solution proportionately to their precision.

  • Equivalently, we can say we weight the observations inverse-proportionately to their uncertainty.

  • This simple example is a case of a minimum variance estimator.

Unbiased minimum variance estimator
Let \( \hat{\Theta} \) be an unbiased estimator of an unknown fixed parameter \( \theta \). Then, \( \hat{\Theta} \) is the unbiased minimum variance estimator of \( \theta \) if for any other unbiased point estimator \( \tilde{\Theta} \) \[ \begin{align} \mathbb{E}\left[(\theta - \hat{\Theta})^2 \right] \leq \mathbb{E}\left[\left(\theta - \tilde{\Theta}\right)^2 \right]. \end{align} \]
  • In the above, the expectation is taken over all possible realizations of the point estimator, i.e., over the observed data used to infer \( \theta \).

  • This is to say, the sampling distribution for \( \hat{\Theta} \) has the least spread about the true value compared to any other sampling distribution centered at \( \theta \).
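  • A brief Monte Carlo sketch of this idea, using the temperature example with assumed illustrative values for \( T_t \), \( \sigma_1 \), and \( \sigma_2 \): both the minimum variance weights and a naive equal-weight average are unbiased, but the former has the smaller sampling spread.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustrative values for the true temperature and error variances
T_t, sig1, sig2 = 20.0, 1.0, 2.0
n_trials = 100_000

# Independent noisy observations of T_t
T1 = T_t + sig1 * rng.standard_normal(n_trials)
T2 = T_t + sig2 * rng.standard_normal(n_trials)

# Minimum variance weights versus a naive equal-weight average
a1 = sig2**2 / (sig1**2 + sig2**2)
T_opt = a1 * T1 + (1 - a1) * T2
T_avg = 0.5 * T1 + 0.5 * T2

for name, est in [("optimal weights", T_opt), ("equal weights", T_avg)]:
    bias = est.mean() - T_t
    mse = np.mean((est - T_t) ** 2)
    print(f"{name:>15s}: bias ~ {bias:+.3f}, MSE ~ {mse:.3f}")

# theoretical minimum variance: sigma_1^2 sigma_2^2 / (sigma_1^2 + sigma_2^2) = 0.8
print("theoretical minimum:", sig1**2 * sig2**2 / (sig1**2 + sig2**2))
```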

Minimum variance estimation

  • Note, in this simple example, we were considering the frequentist case where \( T_t \) is a fixed constant, not a random variable.

    • Likewise, in the definition, we described \( \theta \) as a fixed but unknown deterministic parameter.
  • This isn't entirely applicable to our case of interest, where we will generically take \( \pmb{x} \) to be a random vector as defined by uncertain initial data (and possibly an uncertain evolution in time).

  • A wider class of problems can be handled in a multiple regression setting by considering, e.g., a conditional Gaussian model for both the observations and the modeled state.

  • If we assume that \( \pmb{x} \) and \( \pmb{y} \) are jointly Gaussian distributed, we can construct a correlation model for \( \pmb{x} \) given an outcome of \( \pmb{y} \) by using the conditional Gaussian.

  • In particular, we can then similarly define a loss function

    \[ \begin{align} \mathcal{J}\left(\hat{\pmb{x}}\right):=\mathbb{E}\left[ \parallel\pmb{x} - \hat{\pmb{x}}\parallel^2 \right] \end{align} \] which measures the expected square-difference of our predicted quantity \( \hat{\pmb{x}} \) from the true realization.

  • Linear regression posits that there is a linear relationship that can be defined between observed data and the estimator \( \hat{\pmb{x}} \).

  • Such a random estimator \( \hat{\pmb{x}} \) is known as a linear unbiased minimum variance estimator for the random variable \( \pmb{x} \).

Gauss-Markov theorem

  • The existence and uniqueness of such an optimal solution is given by the Gauss-Markov theorem.

  • We will suppose we have two vectors \( \pmb{x}:= \overline{\pmb{x}} + \pmb{\delta}_{\pmb{x}} \) and \( \pmb{y}:= \overline{\pmb{y}} + \pmb{\delta}_{\pmb{y}} \); where

  • we assume that \( \mathbb{E}[\pmb{\delta}_{\pmb{x}}]=\mathbb{E}[\pmb{\delta}_{\pmb{y}}]=\pmb{0} \), i.e., these are vectors of anomalies from the means.

  • We will assume that there is some linear relationship between \( \pmb{\delta}_{\pmb{y}} \) and \( \pmb{\delta}_{\pmb{x}} \) that is represented by a matrix of weights \( \mathbf{W} \),

    \[ \begin{align} \pmb{\delta}_{\pmb{x}} = \mathbf{W} \pmb{\delta}_{\pmb{y}} + \boldsymbol{\epsilon} \end{align} \] where \( \mathbb{E}\left[\boldsymbol{\epsilon}\right]=\pmb{0} \) and \( \mathrm{cov}(\pmb{\epsilon}) = \mathbf{I} \).

  • As a multiple regression, we will write the estimated value for this relationship by,

    \[ \begin{align} \hat{\pmb{\delta}}_{\pmb{x}} = \widehat{\mathbf{W}} \pmb{\delta}_{\pmb{y}} \end{align} \]

  • such that

    \[ \begin{align} \pmb{x} - \hat{\pmb{x}} &:= \overline{\pmb{x}} + \pmb{\delta}_{\pmb{x}} - \overline{\pmb{x}} - \hat{\pmb{\delta}}_{\pmb{x}}\\ &=\pmb{\delta}_{\pmb{x}} - \hat{\pmb{\delta}}_{\pmb{x}} \\ &=\pmb{\delta}_{\pmb{x}} - \widehat{\mathbf{W}} \pmb{\delta}_{\pmb{y}} \\ &= \hat{\pmb{\epsilon}} \end{align} \] where \( \hat{\pmb{\epsilon}} \) is known as the residual error.

Gauss-Markov theorem

  • The Gauss-Markov theorem loosely states that the weight matrix \( \widehat{\mathbf{W}} \) found by minimizing the expected residual sum of squares,

    \[ \begin{align} \text{RSS} = \hat{\pmb{\epsilon}}^\top \hat{\pmb{\epsilon}}, \end{align} \] is the Best-Linear-Unbiased-Estimator of the true relationship \( \mathbf{W} \).

  • This is commonly known as the BLUE, and this choice of weights is unique.

  • The estimator \( \widehat{\mathbf{W}} \) is “best” in the sense that this is the linear unbiased minimum variance estimator.

  • The expected RSS is also convex in the matrix argument \( \mathbf{W} \); to find the minimizing \( \widehat{\mathbf{W}} \), we differentiate it with respect to the weight matrix

    \[ \begin{align} \partial_{\mathbf{W}}\mathbb{E}\left[ \hat{\pmb{\epsilon}}^\top \hat{\pmb{\epsilon}}\right] &:=\mathbb{E}\left\{\partial_{\mathbf{W}}\left[\pmb{\delta}_{\pmb{x}}^\top \pmb{\delta}_{\pmb{x}} - \pmb{\delta}_{\pmb{x}}^\top\left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right)- \left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right)^\top\pmb{\delta}_{\pmb{x}} + \left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right)^\top \left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right) \right] \right\}\\ &=\mathbb{E}\left\{ 2\left[ \left(\mathbf{W}\pmb{\delta}_{\pmb{y}}\right)\pmb{\delta}_{\pmb{y}}^\top- \pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{y}}^\top\right] \right\} \end{align} \]

  • Setting the equation to zero for \( \widehat{\mathbf{W}} \), we obtain the normal equation

    \[ \begin{align} & &\widehat{\mathbf{W}}\mathbb{E}\left[\pmb{\delta}_{\pmb{y}}\pmb{\delta}_{\pmb{y}}^\top\right] - \mathbb{E}\left[ \pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{y}}^\top\right]&= \pmb{0}\\ \Leftrightarrow & & \widehat{\mathbf{W}}\mathrm{cov}(\pmb{y}) - \mathrm{cov}(\pmb{x},\pmb{y}) &= \pmb{0}\\ \Leftrightarrow & & \widehat{\mathbf{W}}&= \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}. \end{align} \]
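  • As a numerical sketch of the normal equation, the sample analogue of \( \widehat{\mathbf{W}} = \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \) can be computed from anomalies and compared to a generating weight matrix; the dimensions, the anomaly covariance, and the "true" \( \mathbf{W} \) below are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed illustrative dimensions and "true" weight matrix W
Nx, Ny, n_samples = 3, 2, 50_000
W_true = rng.normal(size=(Nx, Ny))

# Anomalies: delta_y drawn with a chosen covariance, delta_x = W delta_y + eps
C_y = np.array([[2.0, 0.5],
                [0.5, 1.0]])
delta_y = rng.multivariate_normal(np.zeros(Ny), C_y, size=n_samples)  # (n, Ny)
eps = rng.normal(size=(n_samples, Nx))                                # cov(eps) = I
delta_x = delta_y @ W_true.T + eps                                    # (n, Nx)

# Sample covariances Sigma_{x,y} and Sigma_y (the anomalies have zero mean)
Sigma_xy = delta_x.T @ delta_y / n_samples        # (Nx, Ny)
Sigma_y = delta_y.T @ delta_y / n_samples         # (Ny, Ny)

# Normal equation solution: W_hat = Sigma_{x,y} Sigma_y^{-1}
W_hat = Sigma_xy @ np.linalg.inv(Sigma_y)

print(np.round(W_true, 2))
print(np.round(W_hat, 2))     # close to W_true for a large sample size
```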

Gauss-Markov theorem

  • Consider, we estimate \( \hat{\pmb{\delta}}_{\pmb{x}}= \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}} \), where

    • this is the regression of the deviation from the mean of \( \pmb{x} \);
    • in terms of the deviation from the mean of \( \pmb{y} \).
  • Recalling that

    \[ \begin{align} \pmb{\delta}_{\pmb{x}}&:= \pmb{x} - \overline{\pmb{x}}\\ \pmb{\delta}_{\pmb{y}}&:= \pmb{y} - \overline{\pmb{y}} \end{align} \] we find

    \[ \begin{align} &\hat{\pmb{\delta}}_{\pmb{x}}= \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}}\\ \Leftrightarrow & \hat{\pmb{x}} = \overline{\pmb{x}} + \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\left(\pmb{y} - \overline{\pmb{y}} \right) \end{align} \]

  • Suppose that we compute the covariance of the estimation error \( \pmb{x} - \hat{\pmb{x}} \), i.e., the multi-dimensional spread of the minimum variance estimate about the truth.

    • From the above, it is obvious that this is an unbiased estimator

    \[ \begin{align} \mathbb{E}\left[\hat{\pmb{x}}\right] &= \mathbb{E}\left[\overline{\pmb{x}} + \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\left(\pmb{y} - \overline{\pmb{y}} \right) \right] \\ &= \overline{\pmb{x}}, \end{align} \] such that \[ \begin{align} \mathbb{E}\left[\hat{\pmb{x}} - \pmb{x} \right] = \pmb{0}. \end{align} \]

Gauss-Markov theorem

  • From the last slide, we recover that,

    \[ \begin{align} \mathrm{cov}\left(\pmb{x}-\hat{\pmb{x}} \right)&= \mathbb{E}\left[\left(\pmb{x} - \overline{\pmb{x}} - \hat{\pmb{x}} + \overline{\pmb{x}}\right)\left(\pmb{x} - \overline{\pmb{x}} - \hat{\pmb{x}} + \overline{\pmb{x}}\right)^\top \right]\\ &= \mathbb{E}\left[\left(\pmb{\delta}_{\pmb{x}}- \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}}\right)\left( \pmb{\delta}_{\pmb{x}} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}}\right)^\top \right]\\ &= \mathbb{E}\left[\pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{x}}^\top \right] - \mathbb{E}\left[\pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{y}}^\top \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \boldsymbol{\Sigma}_{\pmb{y},\pmb{x}} \right] - \mathbb{E}\left[\boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \pmb{\delta}_{\pmb{y}}\pmb{\delta}_{\pmb{x}}^\top \right] + \mathbb{E}\left[\boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \pmb{\delta}_{\pmb{y}} \pmb{\delta}_{\pmb{y}}^\top \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\boldsymbol{\Sigma}_{\pmb{y},\pmb{x}} \right]\\ &=\boldsymbol{\Sigma}_{x} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \boldsymbol{\Sigma}_{\pmb{y},\pmb{x}} \end{align} \]

  • Recall the conditional Gaussian as previously seen, where we assume \( \pmb{x},\pmb{y} \) are jointly Gaussian distributed,

    \[ \begin{align} &\pmb{x}| \pmb{y} \sim N\left(\overline{\pmb{x}} + \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}}\boldsymbol{\Sigma}_{\pmb{y}}^{-1}\left(\pmb{y} - \overline{\pmb{y}}\right), \boldsymbol{\Sigma}_{\pmb{x}} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \boldsymbol{\Sigma}_{\pmb{y},\pmb{x}}\right)\\ \Leftrightarrow & \pmb{x}| \pmb{y} \sim N\left(\hat{\pmb{x}}, \mathrm{cov}\left(\pmb{x} - \hat{\pmb{x}}\right)\right). \end{align} \]

  • This tells us that, for jointly Gaussian distributed variables,

    • the conditional Gaussian mean is precisely the BLUE.
  • Furthermore, the BLUE and its covariance parameterize the conditional Gaussian distribution for \( \pmb{x}|\pmb{y} \).

  • The Gauss-Markov theorem does not require that the underlying distributions are actually Gaussian, however.

    • Without the Gaussian assumption, we can still construct the BLUE and its covariance as discussed already, though they will not generally parameterize the conditional distribution for \( \pmb{x}|\pmb{y} \), which may be non-Gaussian.
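  • As a closing numerical sketch, the BLUE and the error covariance derived above can be checked against samples from an assumed joint Gaussian for \( (\pmb{x},\pmb{y}) \); the mean vector and joint covariance below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed joint Gaussian for (x, y), with x and y each two-dimensional
mean = np.array([1.0, -1.0, 0.5, 2.0])       # stacked (x_bar, y_bar)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 0.5 * np.eye(4)            # symmetric positive definite joint covariance

x_bar, y_bar = mean[:2], mean[2:]
S_x, S_xy = Sigma[:2, :2], Sigma[:2, 2:]
S_yx, S_y = Sigma[2:, :2], Sigma[2:, 2:]

# BLUE: x_hat = x_bar + Sigma_{x,y} Sigma_y^{-1} (y - y_bar)
W_hat = S_xy @ np.linalg.inv(S_y)
# Error covariance: Sigma_x - Sigma_{x,y} Sigma_y^{-1} Sigma_{y,x}
P = S_x - W_hat @ S_yx

# Monte Carlo check: sample jointly, form the BLUE, and inspect the error
z = rng.multivariate_normal(mean, Sigma, size=200_000)
x_s, y_s = z[:, :2], z[:, 2:]
x_hat = x_bar + (y_s - y_bar) @ W_hat.T
err = x_s - x_hat

print(np.round(err.mean(axis=0), 3))          # ~ 0: the estimator is unbiased
print(np.round(np.cov(err.T), 3))             # ~ P, the error covariance above
print(np.round(P, 3))
```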