Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under the Fair Use Copyright Disclaimer, section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
In the first part of this course, we developed a variety of tools for modeling a random variable as it evolves in time according to a Markov model.
Our prototypical example is the discrete Gauss-Markov model:
\[ \begin{align} \pmb{x}_k := \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k \end{align} \]
When the governing process law is linear, as above, and the distributions are Gaussian, we have a full solution for the evolution in terms of the joint Gaussian density of the forecast.
When we consider a nonlinear initial value problem with a Gaussian prior on the initial data defining the problem, the forecast distribution generally does not remain Gaussian, and its evolution must typically be approximated.
We can furthermore consider SDEs to define a random flow map, when we believe that the shocks to the mechanistic model should be represented continuously in time.
What we have not introduced, yet, is how we update such a forecast model when new information becomes available.
Let's add an equation now, supposing that there are observations given sequentially in time:
\[ \begin{align} \pmb{x}_k& = \mathbf{M}_k \pmb{x}_{k-1} + \pmb{w}_k\\ \pmb{y}_k &= \mathbf{H}_k \pmb{x}_k + \pmb{v}_k \end{align} \] where, in the above, \( \pmb{y}_k \in \mathbb{R}^{N_y} \) is the vector of observations at time \( k \), \( \mathbf{H}_k \) is the linear observation operator mapping the modeled state into the observed variables, and \( \pmb{v}_k \) is the observation error, assumed to have mean zero.
This is to say that,
\[ \begin{align} \mathbb{E}\left[\pmb{y}_k\right] = \mathbb{E}\left[\mathbf{H}_k \pmb{x}_k + \pmb{v}_k\right] =\mathbf{H}_k \overline{\pmb{x}}_k \end{align} \]
In this case, \( \mathbf{M}_k \) represents the (discretized) time evolution between observation times, at which we receive noisy information about the modeled state \( \pmb{x}_k \) in the observed variables \( \mathbb{R}^{N_y} \).
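To make this concrete, a minimal NumPy sketch of the state-space model is given below; the particular dimensions, matrices, and noise covariances are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, assumed choices for the system -- time-invariant for simplicity.
N_x, N_y, K = 2, 1, 50
M = np.array([[1.0, 0.1],
              [0.0, 1.0]])           # state transition matrix M_k
H = np.array([[1.0, 0.0]])           # observation operator H_k
Q = 0.01 * np.eye(N_x)               # cov(w_k), the model error
R = 0.25 * np.eye(N_y)               # cov(v_k), the observation error

# Gaussian prior on the initial data.
x = rng.multivariate_normal(np.zeros(N_x), np.eye(N_x))

xs, ys = [], []
for k in range(K):
    # x_k = M_k x_{k-1} + w_k
    x = M @ x + rng.multivariate_normal(np.zeros(N_x), Q)
    # y_k = H_k x_k + v_k
    y = H @ x + rng.multivariate_normal(np.zeros(N_y), R)
    xs.append(x)
    ys.append(y)

xs, ys = np.array(xs), np.array(ys)   # shapes (K, N_x) and (K, N_y)
```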
The question then is,
What do we do with the information \( \pmb{y}_k \) to produce a better estimate of the modeled state \( \pmb{x}_k \) and in what sense do we mean a “better” estimate?
The next two lectures will develop the statistical tools that allow us to describe the “optimal” estimation algorithm for Gauss-Markov models, i.e., the Kalman filter.
To begin our discussion of estimation, we will consider a simpler case in which the modeled variable \( \pmb{x} \) does not depend on time.
We will also begin with a scalar form of the estimation problem for simplicity, in which we wish to estimate the temperature of some system.
Suppose we have two independent observations \( T_1, T_2 \) of an unknown, scalar temperature defined as \( T_t \).
We assume (for the moment) that the temperature is deterministic (and unknown), but the observations will be random, i.e.,
\[ \begin{align} T_1 &= T_t + \epsilon_1 \\ T_2 &= T_t + \epsilon_2 \end{align} \]
We will assume furthermore that
\[ \begin{align} \epsilon_1 &\sim N\left(0, \sigma_1^2\right)\\ \epsilon_2 &\sim N\left(0, \sigma_2^2\right) \end{align} \]
Suppose that we will estimate the true temperature by a linear combination of the two measurements.
That is, we will define an “analyzed” temperature as \( T_a \) where,
\[ \begin{align} T_a := a_1 T_1 + a_2 T_2 \end{align} \]
We will require that the analysis is unbiased, i.e.,
\[ \begin{align} \mathbb{E}[T_a] = T_t &\Leftrightarrow a_1 + a_2 = 1, \end{align} \] which follows since \( \mathbb{E}\left[T_a\right] = \left(a_1 + a_2\right) T_t \) when \( \mathbb{E}\left[\epsilon_1\right] = \mathbb{E}\left[\epsilon_2\right] = 0 \).
We choose \( a_1 \) and \( a_2 \) to be “optimal” in the sense that they minimize the mean-square-error of the analysis, defined as
\[ \begin{align} \sigma_a^2 &= \mathbb{E}\left[\left(T_a - T_t\right)^2\right] \\ &=\mathbb{E}\left[ \left( a_1\left\{T_1 - T_t\right\} + a_2\left\{T_2 - T_t\right\}\right)^2 \right]. \end{align} \]
Substituting the relationship \( a_2 = 1 - a_1 \),
\[ \begin{align} \sigma^2_a &= \mathbb{E}\left[\left(a_1 \epsilon_1 + \left\{1-a_1\right\}\epsilon_2\right)^2\right] \\ & =a_1^2 \sigma_1^2 + \left(1 - a_1\right)^2 \sigma_2^2 \end{align} \] as \( \epsilon_1 \) and \( \epsilon_2 \) are uncorrelated.
From the relationship on the last slide,
\[ \begin{align} \sigma^2_a = a_1^2 \sigma_1^2 + \left(1 - a_1\right)^2 \sigma_2^2 \end{align} \] we can compute the derivative of the variance of the analysis solution above with respect to \( a_1 \).
This expression is a convex (quadratic) function of \( a_1 \), so it achieves its global minimum exactly at the critical value.
Therefore, setting \( \partial_{a_1} \sigma_a^2 = 0 \), we recover,
\[ \begin{align} & 0= 2 a_1\sigma_1^2 - 2 \sigma_2^2 + 2 a_1 \sigma_2^2 \\ \Leftrightarrow & a_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} \end{align} \]
The value for \( a_2 \) can be derived symmetrically in the index.
We can thus derive the choice of weights that minimizes the analysis variance as
\[ \begin{align} a_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} & & a_2 = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} . \end{align} \]
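As a quick sanity check, the short sketch below scans \( \sigma_a^2 \) over a grid of \( a_1 \) values for assumed error variances (\( \sigma_1^2 = 1 \), \( \sigma_2^2 = 4 \), chosen only for illustration) and recovers the closed-form minimizer above.

```python
import numpy as np

# Assumed, illustrative observation-error variances.
sigma1_sq, sigma2_sq = 1.0, 4.0

# Analysis variance as a function of a_1, with a_2 = 1 - a_1.
a1_grid = np.linspace(0.0, 1.0, 10001)
var_a = a1_grid**2 * sigma1_sq + (1.0 - a1_grid)**2 * sigma2_sq

a1_numeric = a1_grid[np.argmin(var_a)]
a1_closed_form = sigma2_sq / (sigma1_sq + sigma2_sq)
print(a1_numeric, a1_closed_form)    # both equal 0.8 for these variances
```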
Notice that
\[ \begin{align} a_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} & & \Leftrightarrow & & a_1 = \frac{\frac{1}{\sigma_1^2}}{\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}} \end{align} \]
The inverse of the variance of the observation is known as the precision of the observation.
This tells us that we weight the observations in the optimal solution proportionately to their precision.
Equivalently, we can say that we weight the observations in inverse proportion to their uncertainties (variances).
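The following Monte Carlo sketch illustrates the precision weighting for an assumed true temperature and error variances (again purely illustrative); the sample variance of the analysis comes out close to \( \sigma_1^2\sigma_2^2 / \left(\sigma_1^2 + \sigma_2^2\right) \), smaller than either observation-error variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed, illustrative truth and observation-error variances.
T_t = 20.0
sigma1_sq, sigma2_sq = 1.0, 4.0
n = 100_000

# Independent noisy observations of the same true temperature.
T1 = T_t + rng.normal(0.0, np.sqrt(sigma1_sq), n)
T2 = T_t + rng.normal(0.0, np.sqrt(sigma2_sq), n)

# Precision (inverse-variance) weights, equivalent to the optimal a_1, a_2.
a1 = (1.0 / sigma1_sq) / (1.0 / sigma1_sq + 1.0 / sigma2_sq)
a2 = 1.0 - a1
T_a = a1 * T1 + a2 * T2

print(T_a.mean())   # close to 20.0 (unbiased)
print(T_a.var())    # close to 0.8, below both 1.0 and 4.0
```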
This simple example is a case of a minimum variance estimator.
Unbiased minimum variance estimator
Let \( \hat{\Theta} \) be an unbiased estimator of an unknown fixed parameter \( \theta \). Then, \( \hat{\Theta} \) is the unbiased minimum variance estimator of \( \theta \) if for any other unbiased point estimator \( \tilde{\Theta} \) \[ \begin{align} \mathbb{E}\left[(\theta - \hat{\Theta})^2 \right] \leq \mathbb{E}\left[\left(\theta - \tilde{\Theta}\right)^2 \right]. \end{align} \]
In the above, we are referring to the expectation over all possible realizations of the point estimator, i.e., over the observed data used to infer \( \theta \).
This is to say, the sampling distribution for \( \hat{\Theta} \) has the least spread about the true value compared to any other sampling distribution centered at \( \theta \).
Note, in this simple example, we were considering the frequentist case where \( T_t \) is a fixed constant, not a random variable.
This isn't entirely applicable to our case of interest, where we will generically take \( \pmb{x} \) to be a random vector as defined by uncertain initial data (and possibly an uncertain evolution in time).
A wider class of problems can be handled in a multiple regression setting by considering, e.g., a conditional Gaussian model for both the observations and the modeled state.
If we assume that \( \pmb{x} \) and \( \pmb{y} \) are jointly Gaussian distributed, we can construct a correlation model for \( \pmb{x} \) given an outcome of \( \pmb{y} \) by using the conditional Gaussian.
In particular, we can then similarly define a loss function
\[ \begin{align} \mathcal{J}\left(\hat{\pmb{x}}\right):=\mathbb{E}\left[ \parallel\pmb{x} - \hat{\pmb{x}}\parallel^2 \right] \end{align} \] which measures the expected square-difference of our predicted quantity \( \hat{\pmb{x}} \) from the true realization.
Linear regression posits that there is a linear relationship that can be defined between observed data and the estimator \( \hat{\pmb{x}} \).
This random estimator \( \hat{\pmb{x}} \) is also known as a linear unbiased minimum variance estimator for the random variable \( \pmb{x} \).
The existence and uniqueness of such an optimal solution is given by the Gauss-Markov theorem.
We will suppose we have two vectors \( \pmb{x}:= \overline{\pmb{x}} + \pmb{\delta}_{\pmb{x}} \) and \( \pmb{y}:= \overline{\pmb{y}} + \pmb{\delta}_{\pmb{y}} \); where
we assume that \( \mathbb{E}[\pmb{\delta}_{\pmb{x}}]=\mathbb{E}[\pmb{\delta}_{\pmb{y}}]=\pmb{0} \), i.e., these are vectors of anomalies from the means.
We will assume that there is some linear relationship between \( \pmb{\delta}_{\pmb{y}} \) and \( \pmb{\delta}_{\pmb{x}} \) that is represented by a matrix of weights \( \mathbf{W} \),
\[ \begin{align} \pmb{\delta}_{\pmb{x}} = \mathbf{W} \pmb{\delta}_{\pmb{y}} + \boldsymbol{\epsilon} \end{align} \] where \( \mathbb{E}\left[\boldsymbol{\epsilon}\right]=\pmb{0} \) and \( \mathrm{cov}(\pmb{\epsilon}) = \mathbf{I} \).
In the multiple regression setting, we write the estimate of this relationship as,
\[ \begin{align} \hat{\pmb{\delta}}_{\pmb{x}} = \widehat{\mathbf{W}} \pmb{\delta}_{\pmb{y}} \end{align} \]
such that
\[ \begin{align} \pmb{x} - \hat{\pmb{x}} &:= \overline{\pmb{x}} + \pmb{\delta}_{\pmb{x}} - \overline{\pmb{x}} - \hat{\pmb{\delta}}_{\pmb{x}}\\ &=\pmb{\delta}_{\pmb{x}} - \hat{\pmb{\delta}}_{\pmb{x}} \\ &=\pmb{\delta}_{\pmb{x}} - \widehat{\mathbf{W}} \pmb{\delta}_{\pmb{y}} \\ &= \hat{\pmb{\epsilon}} \end{align} \] where \( \hat{\pmb{\epsilon}} \) is known as the residual error.
The Gauss-Markov theorem loosely states that the weights \( \widehat{\mathbf{W}} \) found by minimizing the expected residual sum of squares
\[ \begin{align} \text{RSS} = \hat{\pmb{\epsilon}}^\top \hat{\pmb{\epsilon}} \end{align} \] give the Best-Linear-Unbiased-Estimator of the true relationship \( \mathbf{W} \).
This is commonly known as the BLUE, and this choice of weights is unique.
The estimator \( \widehat{\mathbf{W}} \) is “best” in the sense that this is the linear unbiased minimum variance estimator.
This objective is also convex in the matrix argument \( \mathbf{W} \); to find the minimizing \( \widehat{\mathbf{W}} \), we differentiate the expected RSS with respect to the weight matrix
\[ \begin{align} \partial_{\mathbf{W}}\mathbb{E}\left[ \hat{\pmb{\epsilon}}^\top \hat{\pmb{\epsilon}}\right] &:=\mathbb{E}\left\{\partial_{\mathbf{W}}\left[\pmb{\delta}_{\pmb{x}}^\top \pmb{\delta}_{\pmb{x}} - \pmb{\delta}_{\pmb{x}}^\top\left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right)- \left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right)^\top\pmb{\delta}_{\pmb{x}} + \left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right)^\top \left(\mathbf{W} \pmb{\delta}_{\pmb{y}} \right) \right] \right\}\\ &=\mathbb{E}\left\{ 2\left[ \left(\mathbf{W}\pmb{\delta}_{\pmb{y}}\right)\pmb{\delta}_{\pmb{y}}^\top- \pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{y}}^\top\right] \right\} \end{align} \]
Setting this derivative to zero at \( \widehat{\mathbf{W}} \), we obtain the normal equation
\[ \begin{align} & &\widehat{\mathbf{W}}\mathbb{E}\left[\pmb{\delta}_{\pmb{y}}\pmb{\delta}_{\pmb{y}}^\top\right] - \mathbb{E}\left[ \pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{y}}^\top\right]&= \pmb{0}\\ \Leftrightarrow & & \widehat{\mathbf{W}}\mathrm{cov}(\pmb{y}) - \mathrm{cov}(\pmb{x},\pmb{y}) &= \pmb{0}\\ \Leftrightarrow & & \widehat{\mathbf{W}}&= \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}. \end{align} \]
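As a sketch of this result (with an assumed joint covariance chosen only for illustration), we can draw zero-mean anomalies and compare the normal-equation weights \( \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}}\boldsymbol{\Sigma}_{\pmb{y}}^{-1} \), estimated from sample covariances, with an ordinary least-squares regression of \( \pmb{\delta}_{\pmb{x}} \) on \( \pmb{\delta}_{\pmb{y}} \); because the anomalies are mean zero, the two agree up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed joint covariance of (delta_x, delta_y), each in R^2 -- illustrative only.
Sigma = np.array([[1.0, 0.6, 0.5, 0.2],
                  [0.6, 1.5, 0.3, 0.4],
                  [0.5, 0.3, 1.0, 0.1],
                  [0.2, 0.4, 0.1, 2.0]])
n = 200_000
z = rng.multivariate_normal(np.zeros(4), Sigma, size=n)
dx, dy = z[:, :2], z[:, 2:]          # zero-mean anomalies

# Normal-equation weights from the sample covariances: W_hat = Sigma_xy Sigma_y^{-1}.
S = np.cov(np.hstack([dx, dy]).T)
W_normal = S[:2, 2:] @ np.linalg.inv(S[2:, 2:])

# The same weights from an ordinary least-squares fit of dx on dy.
W_lstsq = np.linalg.lstsq(dy, dx, rcond=None)[0].T

print(np.allclose(W_normal, W_lstsq, atol=1e-2))   # True
```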
Consider the estimate \( \hat{\pmb{\delta}}_{\pmb{x}}= \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}} \).
Recalling that
\[ \begin{align} \pmb{\delta}_{\pmb{x}}&:= \pmb{x} - \overline{\pmb{x}}\\ \pmb{\delta}_{\pmb{y}}&:= \pmb{y} - \overline{\pmb{y}} \end{align} \] we find
\[ \begin{align} &\hat{\pmb{\delta}}_{\pmb{x}}= \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}}\\ \Leftrightarrow & \hat{\pmb{x}} = \overline{\pmb{x}} + \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\left(\pmb{y} - \overline{\pmb{y}} \right) \end{align} \]
Suppose that we compute the statistics of this estimate: first its mean, and then the covariance of its error, i.e., the multi-dimensional spread of the minimum variance estimate about the true state.
\[ \begin{align} \mathbb{E}\left[\hat{\pmb{x}}\right] &= \mathbb{E}\left[\overline{\pmb{x}} + \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\left(\pmb{y} - \overline{\pmb{y}} \right) \right] \\ &= \overline{\pmb{x}}, \end{align} \] such that \[ \begin{align} \mathbb{E}\left[\hat{\pmb{x}} - \pmb{x} \right] = \pmb{0}. \end{align} \]
From the last slide, we recover that,
\[ \begin{align} \mathrm{cov}\left(\pmb{x}-\hat{\pmb{x}} \right)&= \mathbb{E}\left[\left(\pmb{x} - \overline{\pmb{x}} - \hat{\pmb{x}} + \overline{\pmb{x}}\right)\left(\pmb{x} - \overline{\pmb{x}} - \hat{\pmb{x}} + \overline{\pmb{x}}\right)^\top \right]\\ &= \mathbb{E}\left[\left(\pmb{\delta}_{\pmb{x}}- \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}}\right)\left( \pmb{\delta}_{\pmb{x}} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\pmb{\delta}_{\pmb{y}}\right)^\top \right]\\ &= \mathbb{E}\left[\pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{x}}^\top \right] - \mathbb{E}\left[\pmb{\delta}_{\pmb{x}}\pmb{\delta}_{\pmb{y}}^\top \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \boldsymbol{\Sigma}_{\pmb{y},\pmb{x}} \right] - \mathbb{E}\left[\boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \pmb{\delta}_{\pmb{y}}\pmb{\delta}_{\pmb{x}}^\top \right] + \mathbb{E}\left[\boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \pmb{\delta}_{\pmb{y}} \pmb{\delta}_{\pmb{y}}^\top \boldsymbol{\Sigma}_{\pmb{y}}^{-1}\boldsymbol{\Sigma}_{\pmb{y},\pmb{x}} \right]\\ &=\boldsymbol{\Sigma}_{x} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \boldsymbol{\Sigma}_{\pmb{y},\pmb{x}} \end{align} \]
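A brief sketch (reusing an assumed joint Gaussian for \( (\pmb{x}, \pmb{y}) \), with illustrative means and covariance) applies the estimate \( \hat{\pmb{x}} = \overline{\pmb{x}} + \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}}\boldsymbol{\Sigma}_{\pmb{y}}^{-1}\left(\pmb{y} - \overline{\pmb{y}}\right) \) sample-wise and checks that the empirical covariance of \( \pmb{x} - \hat{\pmb{x}} \) matches the closed form above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed, illustrative means and joint covariance of (x, y), each in R^2.
x_bar = np.array([1.0, -1.0])
y_bar = np.array([0.5, 2.0])
Sigma = np.array([[1.0, 0.6, 0.5, 0.2],
                  [0.6, 1.5, 0.3, 0.4],
                  [0.5, 0.3, 1.0, 0.1],
                  [0.2, 0.4, 0.1, 2.0]])
Sigma_x, Sigma_xy, Sigma_y = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

# Draw jointly Gaussian samples of (x, y).
n = 200_000
z = rng.multivariate_normal(np.concatenate([x_bar, y_bar]), Sigma, size=n)
x, y = z[:, :2], z[:, 2:]

# BLUE applied to each sample: x_hat = x_bar + Sigma_xy Sigma_y^{-1} (y - y_bar).
K = Sigma_xy @ np.linalg.inv(Sigma_y)
x_hat = x_bar + (y - y_bar) @ K.T

# Empirical residual covariance versus the closed form, which also gives the
# covariance of the conditional Gaussian x | y.
print(np.round(np.cov((x - x_hat).T), 3))
print(np.round(Sigma_x - K @ Sigma_xy.T, 3))
```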
Recall the conditional Gaussian as previously seen, where we assume \( \pmb{x},\pmb{y} \) are jointly Gaussian distributed,
\[ \begin{align} &\pmb{x}| \pmb{y} \sim N\left(\overline{\pmb{x}} + \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}}\boldsymbol{\Sigma}_{\pmb{y}}^{-1}\left(\pmb{y} - \overline{\pmb{y}}\right), \boldsymbol{\Sigma}_{\pmb{x}} - \boldsymbol{\Sigma}_{\pmb{x},\pmb{y}} \boldsymbol{\Sigma}_{\pmb{y}}^{-1} \boldsymbol{\Sigma}_{\pmb{y},\pmb{x}}\right)\\ \Leftrightarrow & \pmb{x}| \pmb{y} \sim N\left(\hat{\pmb{x}}, \mathrm{cov}\left(\pmb{x} - \hat{\pmb{x}}\right)\right). \end{align} \]
This tells us that, for jointly Gaussian distributed variables, the BLUE coincides with the conditional mean \( \mathbb{E}\left[\pmb{x}\vert\pmb{y}\right] \).
Furthermore, the BLUE and its error covariance parameterize the conditional Gaussian distribution for \( \pmb{x}\vert\pmb{y} \).
The Gauss-Markov theorem does not require that the underlying distributions are actually Gaussian, however.