Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.
Maximum likelihood is a method of estimation of parameters of a distribution, where there is a proposal for the “form” of the data, but unknown details for the parameters determining the exact shape and location.
The abbreviation MLE may refer to:
The method finds a value of the parameter that maximizes a “likelihood function”.
The likelihood function is a simple change of perspective on our earlier statistical tool, the density.
However, many densities are highly nonlinear in their construction, and the notion of maximization is thus inherently tied to numerical optimization.
We can attempt to maximize a likelihood by sampling values many times, but this tends to be inefficient if not simply impossible when the data is high dimensional and the number of parameters to optimize over is large.
Thus, an important class of optimization problems in statistics are maximum likelihood problems.
Suppose we have a collection of observations of the random variable \( Y \), denoted \( \{y_i\}_{i=1}^n \);
For a single observation \( y_i \), the likelihood of observing this value is a function of the unknown \( \theta \) is thus given in terms of
\[ \begin{align} \mathcal{L}_{y_i}(\theta) = f(Y=y_i, \theta) \end{align} \]
When we have multiple observations, and they are independent and identically distributed (iid), the joint likelihood function is defined by the product of the individual likelihoods.
I.e., if we define the vector of observations as,
\[ \begin{align} \mathbf{y} =\begin{pmatrix} y_1 \\ \vdots \\ y_n\end{pmatrix} \end{align} \]
the joint likelihood of observing these together can be derived from the joint density and the iid assumption.
We consider
\[ \begin{align} \mathcal{L}_{\mathbf{y}}(\theta) &= f\left( \cup_{i=1}^n\{Y_i = y_i\}, \theta\right) \end{align} \] where we can consider the above as the density of the joint event for an independent random sample of size \( n \).
Given that we suppose each of the sample random variables \( Y_i \) are independent, and all distributed to the parent distribution
\[ Y \sim F(y,\theta) \] (i.e., the iid assumption)
we can break the joint density of these variables into their product, rendering,
\[ \begin{align} \mathcal{L}_{\mathbf{y}}(\theta) &= f\left( \cup_{i=1}^n\{Y_i = y_i\}, \theta\right) \\ &=\prod_{i=1}^n f(Y_i = y_i, \theta) \end{align} \]
We will consider what this looks like in a short toy example.
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition
Density | \( \mu=230 \) | \( \mu= 259 \) |
---|---|---|
\( f_1 \) | .005399 | .026609 |
\( f_2 \) | .000087 | .033322 |
\( f_3 \) | .000595 | .039894 |
For a variety of computational reasons, it is preferable to replace a likelihood function with a log-likelihood function.
We note the following property of the \( \log \), i.e.,
\[ \begin{align} \frac{\mathrm{d}}{\mathrm{d}x}\log(x) = \frac{1}{x} \end{align} \]
Also, we should note that the \( \log \) is the inverse function of the exponential \( e^x \), so that \( \log \) is only defined on the range of the exponential function, i.e., over \( (0,+\infty) \).
This tells us that for all \( x \) in the domain of \( \log \), the derivative is positive and thus \( \log \) is monotonically increasing.
\[ \begin{align} a \leq b & &\Leftrightarrow & & \log(a) \leq \log(b) \end{align} \]
Therefore, if we want to maximize the likelihood function, \( \mathcal{L}_\mathbf{y}(\theta) \),
\[ \begin{align} \mathcal{L}_\mathbf{y}(\theta_0) \leq \mathcal{L}_\mathbf{y}(\theta_1) & & \Leftrightarrow & & \log(\mathcal{L}_\mathbf{y}(\theta_0)) \leq \log(\mathcal{L}_\mathbf{y}(\theta_1)) \end{align} \]
We will refer to the log-likelihood as \( L_\mathbf{y}(\theta) = \log(\mathcal{L}_\mathbf{y}(\theta)) \);
\[ \begin{align} L_\mathbf{y}(\theta) = \sum_{i=1}^n \log(f(Y=y_i,\theta)). \end{align} \]
We will consider another toy example analytically and in R.
Suppose \( Y_1 \) and \( Y_2 \) are iid with density \( f(y) = \theta e^{-\theta y} \) for \( y > 0 \).
We wish to find the MLE of \( \theta \).
By independence,
\[ \begin{align} \mathcal{L}_\mathbf{y}(\theta)&= \left(−\theta e^{-\theta y_1}\right)\left(-\theta e^{-\theta y_2}\right)\\ &=\theta^2 e^{-\theta\left(y_1 + y_2\right)} \end{align} \]
Thus the log-likelihood
\[ \begin{align} L_\mathbf{y}(\theta) = 2\log(\theta) - \theta\left(y_1+y_2\right) \end{align} \]
If we wish to find the maximum log-likelihood of the above equation, we can utilize the first derivative in \( \theta \) to find critical points with respect to certain values of \( \theta \),
\[ \begin{align} & \frac{\mathrm{d}}{\mathrm{d}\theta} L_\mathbf{y}(\theta) = 0 \\ \Leftrightarrow & 2 \frac{1}{\theta} - \left(y_1 + y_2\right) = 0 \\ \Leftrightarrow & \theta = \frac{2}{y_1 + y_2} \end{align} \]
From the previous slide, we note that there is a critical point precisely at,
\[ \begin{align} \hat{\theta} = \frac{2}{y_1 +y_2} \end{align} \]
Q: using the above, how can we determine that the critical point for the log-likelihood is actually a maximum?
\[ \begin{align} \left(\frac{\mathrm{d}}{\mathrm{d}\theta}\right)^2 L_\mathbf{y}(\theta) = - 2 \frac{1}{\theta^2} \end{align} \]
Because the second derivative with respect to \( \theta \) is negative evaluated at the critical point,
\[ \begin{align} \left(\frac{\mathrm{d}}{\mathrm{d}\theta}\right)^2 \vert_{\hat{\theta}} L_\mathbf{y}(\theta) = - \frac{\left(y_1 + y_2\right)^2}{2} \end{align} \] for \( y_1,y_2 >0 \),
we know that in a neighborhood of the critical point, the log likelihood is decreasing for all other values.
The unique MLE solution in the previous example is \( \hat{\theta} = \frac{2}{y_1 + y_2} \) which maximizes \( L_\mathbf{y}(\theta) \), and equivalently \( \mathcal{L}_\mathbf{y}(\theta) \).
Therefore the MLE is the reciprocal of the sample mean in this example.
Although we have the analytical solution, let us see how the problem can be solved numerically using the mle (stats4)
function.
We note that traditionally optimization problems, e.g., using methods like Newton descent, are formed in terms of a minimization.
Recall, a maximization problem can be phrased as
\[ \begin{align} \text{given }f:&\mathbb{R}^n \rightarrow \mathbb{R} \\ \text{find }\mathbf{x}^\ast &\text{ such that:}\\ &f(\mathbf{x}) \leq f\left(\mathbf{x}^\ast\right) \end{align} \] for all \( \mathbf{x} \) in a neighborhood of \( \mathbf{x}^\ast \).
Using the previous fact, a maximization problem can be turned into a minimization problem as follows.
If we redefine \( \tilde{f}(x) = - f(x) \), then note the minimization problem:
\[ \begin{align} \text{given }\tilde{f}:&\mathbb{R}^n \rightarrow \mathbb{R} \\ \text{find }\mathbf{x}^\ast &\text{ such that:}\\ &\tilde{f}(\mathbf{x}) \geq \tilde{f}\left(\mathbf{x}^\ast\right) \end{align} \] for all \( \mathbf{x} \) in a neighborhood of \( \mathbf{x}^\ast \), gives the solution to the maximization problem above.
Specifically, if
\[ \begin{align} &\tilde{f}(\mathbf{x}) \geq \tilde{f}\left(\mathbf{x}^\ast\right) \\ \Leftrightarrow &-f(\mathbf{x}) \geq -f\left(\mathbf{x}^\ast\right) \\ \Leftrightarrow &f(\mathbf{x}) \leq f\left(\mathbf{x}^\ast\right) \\ \end{align} \]
The mle
function takes as its first argument the function that evaluates \( −L(\theta) = − \log(\mathcal{L}(\theta)) \).
The negative log-likelihood is minimized by a call to optim
, an optimization routine, shown in the next slide.
library(stats4)
y <- c(0.04304550, 0.50263474) #the observed sample
mlogL <- function(theta=1) { #minus log-likelihood of exp. density, rate 1/theta
return( - (length(y) * log(theta) - theta * sum(y)))
}
fit <- mle(mlogL)
summary(fit)
Maximum likelihood estimation
Call:
mle(minuslogl = mlogL)
Coefficients:
Estimate Std. Error
theta 3.66515 2.591652
-2 log L: -1.195477
2 / (y[1] + y[2])
[1] 3.66515
We saw that there was information about the standard error of the estimate \( \hat{\theta} \);
In general, \( \hat{\theta} \) will not equal the true population parameter for the distribution from which the observations \( \mathbf{y} \) are derived.
For this reason, we would like to have more information about the estimate than can be derived simply by the “optimal” value \( \hat{\theta} \) for a given sample.
We saw in the earlier example, and in our discussion of optimization, the central role that the second derivative plays in identifying local maxima and minima.
In fact, this is intrinsically related to the statistics of the estimator, with a similar result to the central limit theorem.
Assume for the moment that the log-likelihood is a function of a scalar alone, and can well-approximated by a quadratic function;
From optimization, this is again a least squares problem in which we want to minimize the sum of squares residuals.
We will define the score of the log-likelihood as the first derivative of \( L(\theta) \),
\[ \begin{align} S(\theta) = \partial_\theta L(\theta) \end{align} \]
The minus the second derivative of the log-likelihood is known as the Fisher information function, given by,
\[ \begin{align} I(\theta) = -\partial^2_\theta L(\theta) = \partial^2_\theta \left(-\log(\mathcal{L}(\theta))\right) \end{align} \] this again measures the curvature about a value \( \theta \) with respect to changes in \( \theta \).
For the minus-log-likelihood, this corresponds exactly to measuring the convexity of the function for a minimization problem;
When we evaluate the Fisher information function at the MLE, \( I(\hat{\theta}) \), this is called the observed Fisher information.
Let's suppose that \( \{y_i\}_{i=1}^n \) is an iid sample of observations, normally distributed with known \( \sigma \) but an unknown \( \mu \).
If we ignore the irrelevant constant terms for the optimization, we find
\[ \begin{align} -L_\mathbf{y}(\theta) &= -\log(\mathcal{L}_\mathbf{y}(\theta)) \\ &= \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i -\theta\right)^2. \end{align} \]
Note, the above is basically equivalent to the least squares optimization, where we have the objective function defined as
\[ \begin{align} f:\mathbb{R} &\rightarrow \mathbb{R} \\ r_i : \mathbb{R} &\rightarrow \mathbb{R}\\ \theta :&\rightarrow f(\theta) = \sum_{i=1}^m r_i(\theta)^2 \end{align} \] and residuals \( r_i(\theta) = y_i - \theta \), and the additional (uniform) weights given as \( \frac{1}{2\sigma^2} \).
This is one of the common themes in which least squares problems are often derived from Gaussian maximum likelihood, or derived from the approximation by Gaussian maximum likelihood.
The above objective function immediately tells us that the score is given as,
\[ \begin{align} S(\theta)& = \partial_\theta L_\mathbf{y}(\theta) \\ &= \frac{1}{\sigma^2} \sum_{i=1}^n \left(y_i - \theta\right). \end{align} \]
Following this, if we solve for \( S(\hat{\theta}) = 0 \), we will obtain that \( \hat{\theta}=\overline{y} \), the sample mean, as
\[ \begin{align} \frac{1}{\sigma^2} \sum_{i=1}^n \left(y_i - \overline{y}\right) &=\frac{1}{\sigma^2}\sum_{i=1}^n \left(y_i - \frac{1}{n}\sum_{j=1}^n y_j\right) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n y_i - \sum_{j=1}^n y_j\right) =0 \end{align} \]
Then notice, the observed Fisher information is given as,
\[ \begin{align} -\partial_{\theta}^2 L_\mathbf{y}(\theta) = \frac{n}{\sigma^2} \end{align} \]
Note, the above equation can be considered to be random, with respect to a particular sample \( \mathbf{y} \);
Particularly, it can be shown that the variance of the MLE is given as \( \frac{\sigma^2}{n} = \frac{1}{I(\hat{\sigma})} \).
This is also a common theme in MLE, where if we are making a Gaussian MLE, or approximating the MLE with a Gaussian, we can estimate the variance / covariance by the inverse, observed Fisher information.
Given that the Fisher information is a second order derivative of the minus-log-likelihood, we might consider how this relates to the second order Taylor approximation which is the basis of Newton descent.
Consider, if we write the Taylor approximation for the likelihood of \( \theta \) as a small perturbation of the MLE \( \hat{\theta} \), we can say,
\[ \begin{align} L(\theta) \approx L(\hat{\theta}) + S(\hat{\theta})(\theta - \hat{\theta}) - \frac{1}{2}I(\hat{\theta}) (\theta - \hat{\theta})^2 \end{align} \]
Noting that \( S(\hat{\theta}) =0 \) by its construction, we obtain
\[ \begin{align} \log\left(\frac{\mathcal{L}(\theta)}{\mathcal{L}(\hat{\theta})}\right) \approx - \frac{1}{2}I(\hat{\theta}) (\theta - \hat{\theta})^2. \end{align} \]
This is to say, if we normalize the log-likelihood of nearby choices of \( \theta \) by the log-likelihood of the MLE, we can obtain a good approximation of this value with the observed Fisher information as a quadratic form.
Note, the above approximation is actually exact for the Gaussian MLE, and provides a good approximation where we can make an effective Gaussian approximation.
A quadratic approximation of the log-likelihood thus actually corresponds to a normal approximation of \( \hat{\theta} \).
We have here a practical rule in all likelihood applications: a reasonably regular likelihood means \( \hat{\theta} \) is approximately normal, so statements which are exactly true for the normal model will be approximately true for \( \hat{\theta} \).
We would like to know generally how the the score and the Fisher information behave with respect to the true parameter, which may be different than with the MLE.
We will begin by showing that the expected value of the score function (under regularity conditions) will also have an expected value equal to zero.
Under the regularity condition that the log-likelihood is well approximated by a quadratic form, it can be shown that if \( \theta^\ast \) is the true parameter of the likelihood function,
\[ \begin{align} \mathbb{E}\left[\partial_\theta \log(\mathcal{L}(Y,\theta) \vert \theta=\theta^\ast \right]& = \int_\mathbb{R} \frac{\partial_\theta f(y,\theta^\ast)}{f(y,\theta^\ast)}f(y,\theta^\ast) \mathrm{d}y \\ &=\partial_\theta \int_\mathbb{R} f(y,\theta^\ast)\mathrm{d}y \\ &= \partial_\theta 1 = 0 \end{align} \]
We can say therefore that the score \( S(\theta^\ast) \), as a random variable subject to the randomness of a sample, will vary around zero for the true parameter value.
Specifically,
\[ \begin{align} \mathbb{E}\left[S(\theta^\ast)\right] = 0 \end{align} \] for the true parameter \( \theta^\ast \).
This differs from \( S(\hat{\theta})=0 \) by default, as it is chosen as a critical point with respect to the given sample of observations.
From the last derivation, we say that
\[ \begin{align} \mathbb{E}\left[S(\theta^\ast)\right] = 0 \end{align} \] for the true parameter \( \theta^\ast \).
The variance of the score, taken over random realizations of \( Y \), is known as the expected Fisher information, which can be defined thus as,
\[ \begin{align} \mathcal{I}(\theta^\ast) = \mathbb{E}\left[\left(S(\theta^\ast)\right)^2 \right] \end{align} \]
The link between the expected Fisher information and the second derivative of the minus-log-likelihood can be derived as,
\[ \begin{align} \partial_\theta^2 L(\theta) &= \partial_\theta^2 \log(f(y,\theta)) \\ &=\frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} - \left(\frac{\partial_\theta f(y,\theta)}{f(y,\theta)}\right)^2\\ &= \frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} - \left( \partial_\theta \log (f(y,\theta))\right)^2 \end{align} \]
The above is obtained using the quotient and the chain rule, including the derivative of the \( \log \).
Noting the previous equation
\[ \begin{align} \partial_\theta^2 L(\theta) = \frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} - \left( \partial_\theta \log (f(y,\theta))\right)^2, \end{align} \]
and following the expectation,
\[ \begin{align} \mathbb{E}\left[\partial_\theta^2 L(\theta) \vert \theta=\theta^\ast \right]& = \int_\mathbb{R} \left(\frac{\partial_\theta^2 f(y,\theta^\ast)}{f(y,\theta^\ast)} - \left( \partial_\theta \log (f(y,\theta^\ast))\right)^2 \right)f(y,\theta^\ast)\mathrm{d}y \\ & = -\int_{\mathbb{R}}\left( \partial_\theta \log (f(y,\theta^\ast))\right)^2 f(y,\theta^\ast)\mathrm{d}y\\ &= -\mathbb{E}\left[(S(\theta))^2\vert \theta = \theta^\ast\right] = -\mathcal{I}(\theta^\ast) \end{align} \]
as,
\[ \begin{align} \mathbb{E}\left[ \frac{\partial_\theta^2 f(y,\theta)}{f(y,\theta)} \vert \theta = \theta^\ast\right] &= \int_\mathbb{R} \frac{\partial_\theta^2 f(y,\theta^\ast)}{f(y,\theta^\ast)} f(y,\theta^\ast) \mathrm{d}y \\ &= \partial_\theta^2 \int_\mathbb{R} f(y,\theta^\ast) \mathrm{d}y = \partial_\theta^2 1 = 0 \end{align} \]
From the last derivation, under our regularity condition with \( L(\theta) \) being well approximated by a quadratic form,
\[ \begin{align} \mathbb{E}\left[-\partial_\theta^2 L(\theta) \vert \theta=\theta^\ast \right] = \mathcal{I}(\theta^\ast) \end{align} \]
This tells us that the expected value of the second derivative of the minus-log-likelihood (at the true parameter value) is the expected Fisher information, i.e., the variance of the score function at the true value \( \theta^\ast \).
This says that the curvature of the minus-log-likelihood function at the true value, described by the second derivative, can give an approximation of the variance of the MLE.
Specifically, using the Gaussian approximation, we will say that the variance is approximately given as \( \frac{1}{I(\hat{\theta})} \).
This is a highly technical result, similar to the central limit theorem, which we will just sketch the statement of in the following.
We must make, in addition to the well-posed second order approximation, an assumption on the consistency of the MLE estimator \( \hat{\theta} \).
Given an estimation procedure it is reasonable to require that it produces a 'good' estimate if the experiment is large enough, and a 'better' estimate as the experiment becomes larger.
Suppose \( \theta^\ast \) is the true parameter, and \( \epsilon \) is a small positive value.
For any choice of \( \epsilon \), suppose that by making the experiment large enough, we we can guarantee (with large probability) that the estimate will fall within \( \epsilon \) of \( \theta^\ast \).
This is a frequentist requirement:
In the following, we will assume that the MLE is consistent, i.e., it acheives better precision in probability with large sample sizes.
We suppose that \( \{y_i\}_{i=1}^n \) is a sample of observations of \( Y \sim F_{\theta^\ast} \) for some known distribution, but unknown, true and fixed parameter value \( \theta^\ast \).
Then, under the good approximation by the quadratic form assumption, the MLE \( \hat{\theta} \) is approximately distributed according to
\[ \begin{align} \hat{\theta} \sim N\left(\theta^\ast, \frac{1}{I(\hat{\theta})}\right) \end{align} \] for \( n\gg 1 \).
When \( \boldsymbol{\theta}^\ast \) and \( \hat{\boldsymbol{\theta}} \) are vectors of parameters, this translates to say that, for large sample sizes \( n\gg 1 \), we have the MLE distributed approximately according to
\[ \begin{align} \hat{\boldsymbol{\theta}} \sim N\left(\boldsymbol{\theta}^\ast, \left(H_{-L(\hat{\boldsymbol{\theta}})}\right)^{-1}\right) \end{align} \]
That is to say, the inverse Hessian of the minimization of the minus-log-likelihood function at the MLE \( \hat{\boldsymbol{\theta}} \) is approximately the covariance of the MLE.
This is one of the reasons why methods like Newton's descent provides much more information than standard gradient descent;
This is often why in optimization routines there is additional importance in computing or at least approximating the Hessian, as with BFGS.
Intuitively, the last statement makes sense.
If the log-likelihood is well approximated by a quadratic function, then we need at least two quantities to represent it:
The curvature at the maximum gives information particularly about how much a given estimate might tend to vary with respect to resampling.
If the eigenvalues of the Hessian are relatively close to zero, this indicates some “flatness” to the area around the estimate;
Conversely, if the eigenvalues of the Hessian are strongly peaked, this says that the geometry is very convex around the estimate;
In this way, the geometry of the optimization and the statistics of the estimation are fundamentally linked.
Particularly, we estimate the standard error of the MLE as,
\[ \begin{align} se(\hat{\theta}) = I(\hat{\theta})^{-\frac{1}{2}}, \end{align} \] when we have a single parameter estimate.
Similarly, we use the diagonal entries of the inverse Hessian for the minus-log-likelihood minimization to estimate the standard errors of a vector of parameters.
This thus explains the meaning of the standard error outputs from MLE solvers in R.
This is typically the kind of result that we say informally “the inverse Hessian is the covariance of the minus-log-likelihood function”;
If the two former statements do not hold, the approximation may be well off.