Maximum likelihood estimation


Outline

  • The following topics will be covered in this lecture:
    • Basic ideas in MLE
    • The log-likelihood function
    • The score and the Fisher information
    • The expected score and the expected Fisher information
    • The covariance and the inverse Hessian

Introduction to maximum likelihood estimation

  • Maximum likelihood is a method for estimating the parameters of a distribution when we have a proposed “form” for the data, but the parameters determining its exact shape and location are unknown.

  • The abbreviation MLE may refer to:

    1. maximum likelihood estimation (the method);
    2. the estimate itself produced from a real data set; or
    3. the statistical estimator, i.e., the random variable generated by the procedure, subject to different realizations under resampling.
  • The method finds a value of the parameter that maximizes a “likelihood function”.

  • The likelihood function is a simple change of perspective on our earlier statistical tool, the density.

  • However, many densities are highly nonlinear in their construction, and the notion of maximization is thus inherently tied to numerical optimization.

  • We can attempt to maximize a likelihood by sampling values many times, but this tends to be inefficient if not simply impossible when the data is high dimensional and the number of parameters to optimize over is large.

  • Thus, an important class of optimization problems in statistics are maximum likelihood problems.

Defining the likelihood

  • In the following, we will discuss the construction of the likelihood function.
  • Consider a probability density, denoted \( f(y,\theta) \), which depends on the value of the parameter \( \theta \).
    • Suppose we are investigating some random variable \( Y \), for which there is some “true” parameter \( \theta_0 \), such that \[ P(a \leq Y \leq b)= \int_a^b f(y,\theta_0) \mathrm{d}y. \]
  • We assume that even though \( \theta_0 \) is not known, we can evaluate the density \[ \begin{align} f(Y=y, \theta) \end{align} \] for any choice of \( \theta \) and some observed piece of data \( y \).
  • Then, the likelihood of \( \theta \) based on an observed \( y \) is defined \[ \begin{align} \mathcal{L}_y(\theta) = f(Y=y,\theta) \end{align} \] where we evaluate the probability of \( Y \) attaining an observed piece of data \( y \) given a choice of \( \theta \).
  • The “likelihood” is thus a measure of how well our choice of parameter makes the distribution describe the observed data.
  • Maximum likelihood estimation is thus the process of finding a \( \hat{\theta} \) which maximizes the likelihood,
    • i.e., \( \hat{\theta} \) which maximizes the probability density for the observed data \( y \) given the density \( f(y,\theta) \).
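  • As a minimal sketch of this change of perspective (the Gaussian density with a known standard deviation of 1 and the observed value below are purely illustrative choices, not part of the lecture's example):

y <- 1.7                                                      # a single observed data point
likelihood <- function(theta) dnorm(y, mean = theta, sd = 1)  # density of y, viewed as a function of theta
likelihood(0)    # the likelihood of theta = 0
likelihood(1.7)  # larger: for a single Gaussian observation, theta = y maximizes the likelihood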

Defining the likelihood

  • Suppose we have a collection of observations of the random variable \( Y \), denoted \( \{y_i\}_{i=1}^n \);

    • we again assume that the \( y_i \) are observations of \( Y\sim F(y,\theta) \)
  • For a single observation \( y_i \), the likelihood of observing this value, as a function of the unknown \( \theta \), is thus given in terms of

    \[ \begin{align} \mathcal{L}_{y_i}(\theta) = f(Y=y_i, \theta) \end{align} \]

  • When we have multiple observations, and they are independent and identically distributed (iid), the joint likelihood function is defined by the product of the individual likelihoods.

  • I.e., if we define the vector of observations as,

    \[ \begin{align} \mathbf{y} =\begin{pmatrix} y_1 \\ \vdots \\ y_n\end{pmatrix} \end{align} \]

  • the joint likelihood of observing these together can be derived from the joint density and the iid assumption.

Defining the likelihood

  • We consider

    \[ \begin{align} \mathcal{L}_{\mathbf{y}}(\theta) &= f\left( \cup_{i=1}^n\{Y_i = y_i\}, \theta\right) \end{align} \] where we can consider the above as the density of the joint event for an independent random sample of size \( n \).

  • Given that we suppose each of the sample random variables \( Y_i \) are independent, and all distributed to the parent distribution

    \[ Y \sim F(y,\theta) \] (i.e., the iid assumption)

  • we can break the joint density of these variables into their product, rendering,

    \[ \begin{align} \mathcal{L}_{\mathbf{y}}(\theta) &= f\left( \cup_{i=1}^n\{Y_i = y_i\}, \theta\right) \\ &=\prod_{i=1}^n f(Y_i = y_i, \theta) \end{align} \]

  • We will consider what this looks like in a short toy example.

Estimation of parameters by maximum likelihood – an example

Image of Gaussian distribution with mean 230, standard deviation 10, and observations at y_1 = 250, y_2=265, y_3=259

Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition

  • Suppose that we have three observations \( y_1,y_2,y_3 \) from what we know is a population that is normally distributed.
    • Moreover, we suppose that while we know the standard deviation \( \sigma=10 \), we do not know the true mean \( \mu \).
  • Let \( \mathrm{exp}\left\{x\right\} \) denote the function \( e^x \).
  • By varying the choice of \( \mu \), we can evaluate the density of any particular sample with the Gaussian density \[ f_j = \frac{1}{\sqrt{2\pi}10}\mathrm{exp}\left\{-\frac{1}{2}\left(\frac{y_j - \mu}{10}\right)^2\right\}, \] denoted \( f_j \) for sample \( j \), based on the choice of \( \mu \).
  • We suppose that the observed data points are \( y_1 = 250, y_2=265, y_3=259 \)
  • Q: given the above data points, and the associated graph to the left, does \( \mu=230 \) appear to be the “most likely” choice for the true mean?

Estimation of parameters by maximum likelihood – an example continued

Image of Gaussian distribution with mean 259, standard deviation 10, and samples at y_1 = 250, y_2=265, y_3=259

Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition

  • Intuitively, we can tell that there are better choices for the “center of mass” of the data, given by our choice of \( \mu \).
  • One particular choice may be to set \( \mu= 259= y_3 \) as on the left.
  • Indeed, we can compare the values for the density function for each choice of \( \mu \) as in the table below:
    Density        \( \mu=230 \)   \( \mu=259 \)
    \( f_1 \)      0.005399        0.026609
    \( f_2 \)      0.000087        0.033322
    \( f_3 \)      0.000595        0.039894
  • The likelihood value for the data above, assuming that these are independent samples of the true Gaussian distribution, is given by the product of the respective densities.
  • By performing a numerical optimization, we can recover the “most likely” choice of \( \mu \) given our observations.
  • However, for a Gaussian distribution as above, it can be demonstrated that the sample mean is the maximum likelihood estimate of \( \mu \); a short numerical check of these values follows below.
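  • A short numerical check of the table and the likelihood values above, using R's dnorm for the Gaussian density (the values below are the three observations from the example):

y <- c(250, 265, 259)
dnorm(y, mean = 230, sd = 10)            # the first column of the table: 0.005399, 0.000087, 0.000595
dnorm(y, mean = 259, sd = 10)            # the second column of the table: 0.026609, 0.033322, 0.039894
prod(dnorm(y, mean = 230, sd = 10))      # the likelihood of mu = 230
prod(dnorm(y, mean = 259, sd = 10))      # a much larger likelihood for mu = 259
prod(dnorm(y, mean = mean(y), sd = 10))  # larger still: the sample mean 258 maximizes the likelihood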

The log-likelihood function

  • For a variety of computational reasons, it is preferable to replace a likelihood function with a log-likelihood function.

  • We note the following property of the \( \log \), i.e.,

    \[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}x}\log(x) = \frac{1}{x} \end{align} \]

  • Also, we should note that the \( \log \) is the inverse function of the exponential \( e^x \), so that \( \log \) is only defined on the range of the exponential function, i.e., over \( (0,+\infty) \).

  • This tells us that for all \( x \) in the domain of \( \log \), the derivative is positive and thus \( \log \) is monotonically increasing.

    • In a practical sense this means that,

    \[ \begin{align} a \leq b & &\Leftrightarrow & & \log(a) \leq \log(b) \end{align} \]

  • Therefore, if we want to maximize the likelihood function, \( \mathcal{L}_\mathbf{y}(\theta) \),

    \[ \begin{align} \mathcal{L}_\mathbf{y}(\theta_0) \leq \mathcal{L}_\mathbf{y}(\theta_1) & & \Leftrightarrow & & \log(\mathcal{L}_\mathbf{y}(\theta_0)) \leq \log(\mathcal{L}_\mathbf{y}(\theta_1)) \end{align} \]

  • We will refer to the log-likelihood as \( L_\mathbf{y}(\theta) = \log(\mathcal{L}_\mathbf{y}(\theta)) \);

    • using the properties of the \( \log \), this also comes with a benefit that,

    \[ \begin{align} L_\mathbf{y}(\theta) = \sum_{i=1}^n \log(f(Y=y_i,\theta)). \end{align} \]
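
  • As a small numerical illustration of the sum form above (reusing the Gaussian example values from the earlier slides; note that the product form can underflow for large samples, which is one of the computational reasons to prefer the log-likelihood):

y <- c(250, 265, 259)
log(prod(dnorm(y, mean = 259, sd = 10)))        # log of the likelihood, computed as a product
sum(dnorm(y, mean = 259, sd = 10, log = TRUE))  # log-likelihood as a sum of log densities; same value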

Another toy example

  • We will consider another toy example analytically and in R.

  • Suppose \( Y_1 \) and \( Y_2 \) are iid with density \( f(y) = \theta e^{-\theta y} \) for \( y > 0 \).

  • We wish to find the MLE of \( \theta \).

  • By independence,

    \[ \begin{align} \mathcal{L}_\mathbf{y}(\theta)&= \left(\theta e^{-\theta y_1}\right)\left(\theta e^{-\theta y_2}\right)\\ &=\theta^2 e^{-\theta\left(y_1 + y_2\right)} \end{align} \]

  • Thus the log-likelihood

    \[ \begin{align} L_\mathbf{y}(\theta) = 2\log(\theta) - \theta\left(y_1+y_2\right) \end{align} \]

  • If we wish to find the maximum log-likelihood of the above equation, we can utilize the first derivative in \( \theta \) to find critical points with respect to certain values of \( \theta \),

    \[ \begin{align} & \frac{\mathrm{d}}{\mathrm{d}\theta} L_\mathbf{y}(\theta) = 0 \\ \Leftrightarrow & 2 \frac{1}{\theta} - \left(y_1 + y_2\right) = 0 \\ \Leftrightarrow & \theta = \frac{2}{y_1 + y_2} \end{align} \]

Another toy example

  • From the previous slide, we note that there is a critical point precisely at,

    \[ \begin{align} \hat{\theta} = \frac{2}{y_1 +y_2} \end{align} \]

  • Q: using the above, how can we determine that the critical point for the log-likelihood is actually a maximum?

    • A: we can use the second derivative test as,

    \[ \begin{align} \left(\frac{\mathrm{d}}{\mathrm{d}\theta}\right)^2 L_\mathbf{y}(\theta) = - 2 \frac{1}{\theta^2} \end{align} \]

  • Because the second derivative with respect to \( \theta \) is negative evaluated at the critical point,

    \[ \begin{align} \left.\left(\frac{\mathrm{d}}{\mathrm{d}\theta}\right)^2 L_\mathbf{y}(\theta)\right\vert_{\theta=\hat{\theta}} = - \frac{\left(y_1 + y_2\right)^2}{2} \end{align} \] for \( y_1,y_2 >0 \),

  • we know that the log-likelihood decreases as we move away from the critical point in a neighborhood of it, so the critical point is indeed a maximum; a quick numerical check follows below.
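
  • A quick numerical check of the above (the two observations below are arbitrary illustrative values):

y <- c(0.5, 1.5)
logL <- function(theta) 2 * log(theta) - theta * sum(y)  # the log-likelihood from the previous slide
theta_hat <- 2 / sum(y)                                  # the analytic critical point, here equal to 1
logL(theta_hat)                                          # -2
logL(theta_hat - 0.1); logL(theta_hat + 0.1)             # both smaller than logL(theta_hat), consistent with a maximum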

Another toy example

  • The unique MLE solution in the previous example is \( \hat{\theta} = \frac{2}{y_1 + y_2} \) which maximizes \( L_\mathbf{y}(\theta) \), and equivalently \( \mathcal{L}_\mathbf{y}(\theta) \).

  • Therefore the MLE is the reciprocal of the sample mean in this example.

  • Although we have the analytical solution, let us see how the problem can be solved numerically using the mle (stats4) function.

  • We note that, traditionally, optimization problems (e.g., those solved with methods like Newton descent) are posed in terms of minimization.

  • Recall, a maximization problem can be phrased as

    \[ \begin{align} \text{given }f:&\mathbb{R}^n \rightarrow \mathbb{R} \\ \text{find }\mathbf{x}^\ast &\text{ such that:}\\ &f(\mathbf{x}) \leq f\left(\mathbf{x}^\ast\right) \end{align} \] for all \( \mathbf{x} \) in a neighborhood of \( \mathbf{x}^\ast \).

Another toy example

  • Using the previous fact, a maximization problem can be turned into a minimization problem as follows.

  • If we redefine \( \tilde{f}(x) = - f(x) \), then note the minimization problem:

    \[ \begin{align} \text{given }\tilde{f}:&\mathbb{R}^n \rightarrow \mathbb{R} \\ \text{find }\mathbf{x}^\ast &\text{ such that:}\\ &\tilde{f}(\mathbf{x}) \geq \tilde{f}\left(\mathbf{x}^\ast\right) \end{align} \] for all \( \mathbf{x} \) in a neighborhood of \( \mathbf{x}^\ast \), gives the solution to the maximization problem above.

  • Specifically, we have the chain of equivalences,

    \[ \begin{align} &\tilde{f}(\mathbf{x}) \geq \tilde{f}\left(\mathbf{x}^\ast\right) \\ \Leftrightarrow &-f(\mathbf{x}) \geq -f\left(\mathbf{x}^\ast\right) \\ \Leftrightarrow &f(\mathbf{x}) \leq f\left(\mathbf{x}^\ast\right) \\ \end{align} \]

  • The mle function takes as its first argument the function that evaluates \( −L(\theta) = − \log(\mathcal{L}(\theta)) \).

  • The negative log-likelihood is minimized by a call to optim, an optimization routine, shown in the next slide.

Another toy example

library(stats4)
y <- c(0.04304550, 0.50263474) #the observed sample
mlogL <- function(theta=1) { # minus log-likelihood of the exp. density with rate theta (mean 1/theta)
  return( - (length(y) * log(theta) - theta * sum(y))) # negative of: n*log(theta) - theta*sum(y_i)
}
fit <- mle(mlogL)
summary(fit)
Maximum likelihood estimation

Call:
mle(minuslogl = mlogL)

Coefficients:
      Estimate Std. Error
theta  3.66515   2.591652

-2 log L: -1.195477 
  • Considering our earlier calculation,
2 / (y[1] + y[2])
[1] 3.66515
  • we find our earlier analytical solution agrees with the numerical one.

The score and the Fisher information

  • We saw in the summary output that there was information about the standard error of the estimate \( \hat{\theta} \);

    • this reflects the fact that the MLE is considered a random variable, which is subject to the sampling error that results from a finite sample from the true population.
  • In general, \( \hat{\theta} \) will not equal the true population parameter for the distribution from which the observations \( \mathbf{y} \) are derived.

    • As we saw analytically in the example, the MLE in this case is the reciprocal of the sample mean, which only converges to the true parameter in probability in the large-sample limit.
  • For this reason, we would like to have more information about the estimate than can be derived simply by the “optimal” value \( \hat{\theta} \) for a given sample.

  • We saw in the earlier example, and in our discussion of optimization, the central role that the second derivative plays in identifying local maxima and minima.

  • In fact, this is intrinsically related to the statistics of the estimator, with a similar result to the central limit theorem.

    • In MLE, the second derivative is also used to identify the standard error of the estimator \( \hat{\theta} \), or at least to provide some approximation to it.

The score and the Fisher information

  • Assume for the moment that the log-likelihood is a function of a scalar alone, and can be well approximated by a quadratic function;

    • this is basically to say that this can be well-approximated by the second order Taylor expansion around the maximum likelihood estimate.
  • From the perspective of optimization, this is again like a least squares problem in which we want to minimize a sum of squared residuals.

  • We will define the score of the log-likelihood as the first derivative of \( L(\theta) \),

    \[ \begin{align} S(\theta) = \partial_\theta L(\theta) \end{align} \]

  • The negative of the second derivative of the log-likelihood is known as the Fisher information function, given by,

    \[ \begin{align} I(\theta) = -\partial^2_\theta L(\theta) = \partial^2_\theta \left(-\log(\mathcal{L}(\theta))\right) \end{align} \] this again measures the curvature about a value \( \theta \) with respect to changes in \( \theta \).

  • For the minus-log-likelihood, this corresponds exactly to measuring the convexity of the function for a minimization problem;

    • the above Fisher information function generalizes to the Hessian of the minimization problem when we make a search over multiple parameters.
  • When we evaluate the Fisher information function at the MLE, \( I(\hat{\theta}) \), this is called the observed Fisher information.

    • A large curvature \( I(\hat{\theta}) \) corresponds to a tight, strongly peaked likelihood, intuitively indicating less uncertainty about \( \theta \); a short numerical check using the earlier exponential fit follows below.
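
  • As a quick check (assuming the fit object and the data y from the earlier mle() example are still in the workspace): for the exponential model, \( L(\theta) = n\log(\theta) - \theta\sum y_i \), so the observed Fisher information is \( I(\hat{\theta}) = n/\hat{\theta}^2 \), and its inverse square root reproduces the reported standard error.

theta_hat <- coef(fit)            # about 3.66515
I_obs <- length(y) / theta_hat^2  # observed Fisher information n / theta_hat^2
1 / sqrt(I_obs)                   # about 2.5917, matching the Std. Error from summary(fit)
vcov(fit)                         # the numerical inverse Hessian, approximately 1 / I_obs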

An example of the score and the Fisher information

  • Let's suppose that \( \{y_i\}_{i=1}^n \) is an iid sample of observations, normally distributed with known \( \sigma \) but an unknown \( \mu \).

  • If we ignore the irrelevant constant terms for the optimization, we find

    \[ \begin{align} -L_\mathbf{y}(\theta) &= -\log(\mathcal{L}_\mathbf{y}(\theta)) \\ &= \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i -\theta\right)^2. \end{align} \]

  • Note, the above is basically equivalent to the least squares optimization, where we have the objective function defined as

    \[ \begin{align} f:\mathbb{R} &\rightarrow \mathbb{R} \\ r_i : \mathbb{R} &\rightarrow \mathbb{R}\\ \theta &\mapsto f(\theta) = \sum_{i=1}^n r_i(\theta)^2 \end{align} \] and residuals \( r_i(\theta) = y_i - \theta \), and the additional (uniform) weights given as \( \frac{1}{2\sigma^2} \).

  • This is a common theme: least squares problems are often derived from Gaussian maximum likelihood, or from an approximation by a Gaussian maximum likelihood.

An example of the score and the Fisher information

  • The above objective function immediately tells us that the score is given as,

    \[ \begin{align} S(\theta)& = \partial_\theta L_\mathbf{y}(\theta) \\ &= \frac{1}{\sigma^2} \sum_{i=1}^n \left(y_i - \theta\right). \end{align} \]

  • Following this, if we solve for \( S(\hat{\theta}) = 0 \), we will obtain that \( \hat{\theta}=\overline{y} \), the sample mean, as

    \[ \begin{align} \frac{1}{\sigma^2} \sum_{i=1}^n \left(y_i - \overline{y}\right) &=\frac{1}{\sigma^2}\sum_{i=1}^n \left(y_i - \frac{1}{n}\sum_{j=1}^n y_j\right) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n y_i - \sum_{j=1}^n y_j\right) =0 \end{align} \]

  • Then notice, the observed Fisher information is given as,

    \[ \begin{align} -\partial_{\theta}^2 L_\mathbf{y}(\theta) = \frac{n}{\sigma^2} \end{align} \]

  • Note, in general the observed Fisher information depends on the particular sample \( \mathbf{y} \) and can therefore itself be considered random; in this Gaussian case it happens to be constant.

  • Particularly, it can be shown that the variance of the MLE is given as \( \frac{\sigma^2}{n} = \frac{1}{I(\hat{\theta})} \).

  • This is also a common theme in MLE: when we perform a Gaussian MLE, or approximate the MLE with a Gaussian, we can estimate the variance / covariance by the inverse of the observed Fisher information; a small simulation check follows below.
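
  • A minimal simulation sketch of this (the sample size, true mean, and sigma below are arbitrary illustrative choices): over repeated samples, the variance of the MLE (the sample mean) is close to \( \sigma^2/n = 1/I(\hat{\theta}) \).

set.seed(2)
sigma <- 10; n <- 25; mu_true <- 230
theta_hat <- replicate(5000, mean(rnorm(n, mean = mu_true, sd = sigma)))  # MLE over resampled data
var(theta_hat)  # close to 4
sigma^2 / n     # exactly 4, the inverse of the observed Fisher information n / sigma^2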

The observed Fisher information in the Taylor approximation

  • Given that the Fisher information is a second order derivative of the minus-log-likelihood, we might consider how this relates to the second order Taylor approximation which is the basis of Newton descent.

  • Consider, if we write the Taylor approximation for the likelihood of \( \theta \) as a small perturbation of the MLE \( \hat{\theta} \), we can say,

    \[ \begin{align} L(\theta) \approx L(\hat{\theta}) + S(\hat{\theta})(\theta - \hat{\theta}) - \frac{1}{2}I(\hat{\theta}) (\theta - \hat{\theta})^2 \end{align} \]

  • Noting that \( S(\hat{\theta}) =0 \) by its construction, we obtain

    \[ \begin{align} \log\left(\frac{\mathcal{L}(\theta)}{\mathcal{L}(\hat{\theta})}\right) \approx - \frac{1}{2}I(\hat{\theta}) (\theta - \hat{\theta})^2. \end{align} \]

  • This is to say, if we normalize the log-likelihood of nearby choices of \( \theta \) by the log-likelihood of the MLE, we can obtain a good approximation of this value with the observed Fisher information as a quadratic form.

  • Note, the above approximation is actually exact for the Gaussian MLE, and provides a good approximation where we can make an effective Gaussian approximation.

  • A quadratic approximation of the log-likelihood thus actually corresponds to a normal approximation of \( \hat{\theta} \).

  • We have here a practical rule in all likelihood applications: a reasonably regular likelihood means \( \hat{\theta} \) is approximately normal, so statements which are exactly true for the normal model will be approximately true for \( \hat{\theta} \).
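
  • A numerical sketch of this approximation for the earlier exponential example (using the two observed values from the mle() slide):

y <- c(0.04304550, 0.50263474)
logL <- function(theta) 2 * log(theta) - theta * sum(y)
theta_hat <- 2 / sum(y)               # the MLE, about 3.665
I_hat <- 2 / theta_hat^2              # observed Fisher information
theta <- theta_hat + 0.5              # a nearby value of the parameter
logL(theta) - logL(theta_hat)         # exact log-likelihood ratio, about -0.017
-0.5 * I_hat * (theta - theta_hat)^2  # quadratic approximation, about -0.019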

The expected score and the expected Fisher information

  • We would like to know generally how the score and the Fisher information behave with respect to the true parameter, which may differ from the MLE.

  • We will begin by showing that the score function, evaluated at the true parameter, has expected value equal to zero (under regularity conditions).

  • Under the regularity condition that the log-likelihood is well approximated by a quadratic form, it can be shown that if \( \theta^\ast \) is the true parameter of the likelihood function,

    \[ \begin{align} \mathbb{E}\left[\partial_\theta \log(\mathcal{L}(Y,\theta)) \vert \theta=\theta^\ast \right]& = \int_\mathbb{R} \frac{\partial_\theta f(y,\theta^\ast)}{f(y,\theta^\ast)}f(y,\theta^\ast) \mathrm{d}y \\ &=\partial_\theta \int_\mathbb{R} f(y,\theta^\ast)\mathrm{d}y \\ &= \partial_\theta 1 = 0 \end{align} \]

  • We can say therefore that the score \( S(\theta^\ast) \), as a random variable subject to the randomness of a sample, will vary around zero for the true parameter value.

  • Specifically,

    \[ \begin{align} \mathbb{E}\left[S(\theta^\ast)\right] = 0 \end{align} \] for the true parameter \( \theta^\ast \).

  • This differs from \( S(\hat{\theta})=0 \), which holds by construction, since \( \hat{\theta} \) is chosen as a critical point with respect to the given sample of observations; a short Monte Carlo sketch of this statement follows below.
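
  • A short Monte Carlo sketch of this for the exponential model (theta_star and the sample size below are arbitrary choices): the score evaluated at the true parameter averages to roughly zero over repeated samples.

set.seed(3)
theta_star <- 2; n <- 10
score <- replicate(10000, {
  y <- rexp(n, rate = theta_star)
  n / theta_star - sum(y)   # S(theta_star) for this sample
})
mean(score)                 # close to zero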

The expected score and the expected Fisher information

  • From the last derivation, we say that

    \[ \begin{align} \mathbb{E}\left[S(\theta^\ast)\right] = 0 \end{align} \] for the true parameter \( \theta^\ast \).

  • The variance of the score, taken over random realizations of \( Y \), is known as the expected Fisher information, which can be defined thus as,

    \[ \begin{align} \mathcal{I}(\theta^\ast) = \mathbb{E}\left[\left(S(\theta^\ast)\right)^2 \right] \end{align} \]

  • The link between the expected Fisher information and the second derivative of the minus-log-likelihood can be derived as,

    \[ \begin{align} \partial_\theta^2 L(\theta) &= \partial_\theta^2 \log(f(y,\theta)) \\ &=\frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} - \left(\frac{\partial_\theta f(y,\theta)}{f(y,\theta)}\right)^2\\ &= \frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} - \left( \partial_\theta \log (f(y,\theta))\right)^2 \end{align} \]

  • The above is obtained using the quotient and the chain rule, including the derivative of the \( \log \).

The expected score and the expected Fisher information

  • Noting the previous equation

    \[ \begin{align} \partial_\theta^2 L(\theta) = \frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} - \left( \partial_\theta \log (f(y,\theta))\right)^2, \end{align} \]

  • and following the expectation,

    \[ \begin{align} \mathbb{E}\left[\partial_\theta^2 L(\theta) \vert \theta=\theta^\ast \right]& = \int_\mathbb{R} \left(\frac{\partial_\theta^2 f(y,\theta^\ast)}{f(y,\theta^\ast)} - \left( \partial_\theta \log (f(y,\theta^\ast))\right)^2 \right)f(y,\theta^\ast)\mathrm{d}y \\ & = -\int_{\mathbb{R}}\left( \partial_\theta \log (f(y,\theta^\ast))\right)^2 f(y,\theta^\ast)\mathrm{d}y\\ &= -\mathbb{E}\left[(S(\theta))^2\vert \theta = \theta^\ast\right] = -\mathcal{I}(\theta^\ast) \end{align} \]

  • as,

    \[ \begin{align} \mathbb{E}\left[ \frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} \vert \theta = \theta^\ast\right] &= \int_\mathbb{R} \frac{\partial_\theta^2 f(y,\theta^\ast)}{f(y,\theta^\ast)} f(y,\theta^\ast) \mathrm{d}y \\ &= \partial_\theta^2 \int_\mathbb{R} f(y,\theta^\ast) \mathrm{d}y = \partial_\theta^2 1 = 0 \end{align} \]
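
  • Continuing the Monte Carlo sketch from a few slides back (same theta_star, n, and simulated scores as in that example), we can check that the variance of the score at the true parameter is close to the expected Fisher information, which for the exponential model equals \( n/\theta^{\ast 2} \):

var(score)        # Monte Carlo estimate of the expected Fisher information
n / theta_star^2  # the analytic value, 2.5 here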

The expected score and the expected Fisher information

  • From the last derivation, under our regularity condition with \( L(\theta) \) being well approximated by a quadratic form,

    \[ \begin{align} \mathbb{E}\left[-\partial_\theta^2 L(\theta) \vert \theta=\theta^\ast \right] = \mathcal{I}(\theta^\ast) \end{align} \]

  • This tells us that the expected value of the second derivative of the minus-log-likelihood (at the true parameter value) is the expected Fisher information, i.e., the variance of the score function at the true value \( \theta^\ast \).

  • This says that the curvature of the minus-log-likelihood function at the true value, described by the second derivative, can be used (via its inverse) to approximate the variance of the MLE.

  • Specifically, using the Gaussian approximation, we will say that the variance is approximately given as \( \frac{1}{I(\hat{\theta})} \).

  • This is a highly technical result, similar to the central limit theorem, of which we will just sketch the statement in the following.

  • We must make, in addition to the well-posed second order approximation, an assumption on the consistency of the MLE estimator \( \hat{\theta} \).

  • Given an estimation procedure it is reasonable to require that it produces a 'good' estimate if the experiment is large enough, and a 'better' estimate as the experiment becomes larger.

The covariance and the inverse Hessian

  • Suppose \( \theta^\ast \) is the true parameter, and \( \epsilon \) is a small positive value.

  • For any choice of \( \epsilon \), suppose that by making the experiment large enough, we can guarantee (with large probability) that the estimate will fall within \( \epsilon \) of \( \theta^\ast \).

  • This is a frequentist requirement:

    • if we repeat the large experiment many times then a large proportion of the resulting experiments will be within \( \epsilon \) of \( \theta^\ast \).
  • In the following, we will assume that the MLE is consistent, i.e., it achieves better precision in probability with large sample sizes.

  • We suppose that \( \{y_i\}_{i=1}^n \) is a sample of observations of \( Y \sim F_{\theta^\ast} \) for some known distribution, but unknown, true and fixed parameter value \( \theta^\ast \).

  • Then, under the good approximation by the quadratic form assumption, the MLE \( \hat{\theta} \) is approximately distributed according to

    \[ \begin{align} \hat{\theta} \sim N\left(\theta^\ast, \frac{1}{I(\hat{\theta})}\right) \end{align} \] for \( n\gg 1 \).
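
  • A simulation sketch of this approximation for the exponential model (theta_star and n below are arbitrary, with n taken moderately large): the MLE \( n/\sum y_i \) is roughly normal around theta_star with variance close to \( \theta^{\ast 2}/n \).

set.seed(4)
theta_star <- 2; n <- 200
theta_hat <- replicate(5000, n / sum(rexp(n, rate = theta_star)))  # MLE over resampled experiments
mean(theta_hat)  # close to theta_star = 2
var(theta_hat)   # close to theta_star^2 / n = 0.02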

The covariance and the inverse Hessian

  • When \( \boldsymbol{\theta}^\ast \) and \( \hat{\boldsymbol{\theta}} \) are vectors of parameters, this translates to say that, for large sample sizes \( n\gg 1 \), we have the MLE distributed approximately according to

    \[ \begin{align} \hat{\boldsymbol{\theta}} \sim N\left(\boldsymbol{\theta}^\ast, \left(\mathbf{H}_{-L}\left(\hat{\boldsymbol{\theta}}\right)\right)^{-1}\right) \end{align} \] where \( \mathbf{H}_{-L}\left(\hat{\boldsymbol{\theta}}\right) \) denotes the Hessian of the minus-log-likelihood evaluated at \( \hat{\boldsymbol{\theta}} \).

  • That is to say, the inverse Hessian of the minimization of the minus-log-likelihood function at the MLE \( \hat{\boldsymbol{\theta}} \) is approximately the covariance of the MLE.

  • This is one of the reasons why methods like Newton descent provide much more information than standard gradient descent;

    • the geometry of the curvature actually gives fundamental statistical information about the estimate, and how precisely we can believe we have estimated the true value.
  • This is also why, in optimization routines, additional importance is often placed on computing or at least approximating the Hessian, as with BFGS.

The covariance and the inverse Hessian

  • Intuitively, the last statement makes sense.

  • If the log-likelihood is well approximated by a quadratic function, then we need at least two quantities to represent it:

    1. the location of its maximum; and
    2. the curvature at the maximum.
  • The curvature at the maximum gives information particularly about how much a given estimate might tend to vary with respect to resampling.

  • If the eigenvalues of the Hessian are relatively close to zero, this indicates some “flatness” to the area around the estimate;

    • likewise, this says that the inverse eigenvalues tend to be large and the variance of the estimate tends to be large over resampling.
  • Conversely, if the eigenvalues of the Hessian are large, this says that the geometry is strongly convex (sharply peaked) around the estimate;

    • likewise, this says that the inverse eigenvalues tend to be small and the variance of the estimate tends to be small over resampling.
  • In this way, the geometry of the optimization and the statistics of the estimation are fundamentally linked.

The covariance and the inverse Hessian

  • Particularly, we estimate the standard error of the MLE as,

    \[ \begin{align} se(\hat{\theta}) = I(\hat{\theta})^{-\frac{1}{2}}, \end{align} \] when we have a single parameter estimate.

  • Similarly, we use the square roots of the diagonal entries of the inverse Hessian from the minus-log-likelihood minimization to estimate the standard errors of a vector of parameters.

  • This thus explains the meaning of the standard error outputs from MLE solvers in R.

  • This is the kind of result that we state informally as “the inverse Hessian of the minus-log-likelihood is the covariance of the estimator”;

    • however, it is important to remember that this is an approximation that holds under good second order Taylor approximations and large sample sizes.
  • If the two former statements do not hold, the approximation may be far off; a short sketch of this procedure in R follows.
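
  • As a closing sketch of this procedure (the simulated data and the two-parameter Gaussian model below are chosen purely for illustration, not taken from the lecture's examples): minimize the minus-log-likelihood with optim, request the numerically approximated Hessian, and use its inverse to estimate the covariance and standard errors of the MLE.

set.seed(1)
y <- rnorm(50, mean = 10, sd = 2)  # simulated sample: true mu = 10, sigma = 2
negLL <- function(par) {           # minus log-likelihood in (mu, sigma)
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(1e10)     # keep the search away from invalid sigma
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
opt <- optim(c(mean(y), sd(y)), negLL, hessian = TRUE)  # Nelder-Mead, plus a numeric Hessian at the optimum
opt$par                        # MLE of (mu, sigma)
cov_hat <- solve(opt$hessian)  # inverse Hessian: approximate covariance of the MLE
sqrt(diag(cov_hat))            # estimated standard errors
eigen(opt$hessian)$values      # large eigenvalues correspond to a tight peak and small variance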