Use the left and right arrow keys to navigate the presentation forward and backward, respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
The maximization and minimization of functions, or optimization problems, contain two components: an objective function that we wish to optimize, and (possibly) constraints that a solution must satisfy.
E.g., we may wish to optimize factory output \( f(x) \) as a function of hours \( x \) in a week, with a measure of our active machine-hours \( g(x) \) not exceeding a pre-specified limit, \( g(x)\leq C \).
Optimization problems can thus be classified into two categories: constrained problems and unconstrained problems.
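To make the constrained case concrete, the following is a minimal sketch of the factory example above in Python with scipy.optimize.minimize; the particular functions \( f \), \( g \), the limit \( C \), and the bounds are invented purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins for the factory example (purely illustrative choices):
# f(x): weekly output as a function of hours x, with diminishing returns,
# g(x): active machine-hours used, required to satisfy g(x) <= C.
def f(x):
    return 10.0 * np.sqrt(x[0]) - 0.05 * x[0]

def g(x):
    return 2.5 * x[0]

C = 120.0  # pre-specified limit on machine-hours

# SciPy minimizes by convention, so we maximize f by minimizing -f;
# the inequality constraint g(x) <= C is encoded as C - g(x) >= 0.
result = minimize(
    fun=lambda x: -f(x),
    x0=np.array([10.0]),
    method="SLSQP",
    bounds=[(0.0, 168.0)],  # no more than 168 hours in a week
    constraints=[{"type": "ineq", "fun": lambda x: C - g(x)}],
)
print("optimal hours:", result.x[0], "output:", f(result.x))
```

In this toy setup the machine-hour constraint is active at the solution, which is what distinguishes it from the unconstrained problems considered next.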
We will focus on unconstrained optimization, as this often arises in MLE; it is formulated as the following problem,
\[ \begin{align} f: \mathbb{R}^n &\rightarrow \mathbb{R}\\ \mathbf{x} &\mapsto f(\mathbf{x})\\ f(\mathbf{x}^\ast) &= \mathrm{max}_{\mathbf{x} \in \mathcal{D}} f \end{align} \]
We note that the above problem is equivalent to a minimization problem by the substitution \( \tilde{f} = -f \), i.e.,
\[ \begin{align} \tilde{f}: \mathbb{R}^n &\rightarrow \mathbb{R}\\ \mathbf{x} &\mapsto -f(\mathbf{x})\\ \tilde{f}(\mathbf{x}^\ast) = \mathrm{min}_{\mathbf{x}\in \mathcal{D}} \tilde{f} \quad &\Leftrightarrow \quad f(\mathbf{x}^\ast) = \mathrm{max}_{\mathbf{x} \in \mathcal{D}} f \end{align} \]
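As a one-dimensional check of this substitution (the particular function here is chosen only for illustration), take \( f(x) = 3 - (x-2)^2 \), so that \( \tilde{f}(x) = (x-2)^2 - 3 \); then \[ \begin{align} \tilde{f}'(x) = 2(x-2) = 0 \quad &\Rightarrow \quad x^\ast = 2, \\ \mathrm{min}_{x} \tilde{f} = \tilde{f}(2) = -3 &= -f(2) = -\mathrm{max}_{x} f, \end{align} \] so the same point \( x^\ast = 2 \) minimizes \( \tilde{f} \) and maximizes \( f \).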
A point \( \mathbf{x}^\ast \) is a global minimizer if \( f(\mathbf{x}^\ast) \leq f(\mathbf{x}) \) for all \( \mathbf{x} \) in the domain of consideration \( \mathcal{D}\subset \mathbb{R}^n \).
Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
Let \( \mathcal{N}\subset \mathcal{D} \subset\mathbb{R}^n \) be a neighborhood of the point \( \mathbf{x}^\ast \) in the domain of consideration. We say \( \mathbf{x}^\ast \) is a local minimizer in the neighborhood \( \mathcal{N} \) if \[ \begin{align} f(\mathbf{x}^\ast) \leq f(\mathbf{x}) & & \text{ for all }\mathbf{x}\in \mathcal{N} \end{align} \]
For finding a local minimizer, the main tools will be derived directly from the second-order Taylor approximation of the objective function \( f \) about a point \( \mathbf{x}_0 \), defined by
\[ \begin{align} f(\mathbf{x}_1) \approx f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1}+\frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \end{align} \]
where \( \boldsymbol{\delta}_{x_1} = \mathbf{x}_1 - \mathbf{x}_0 \).
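As a quick numerical sanity check of this expansion (the quadratic function and the expansion point below are arbitrary choices, not taken from the lecture), we can compare the second-order model against the true function value in Python:

```python
import numpy as np

# Illustrative objective f(x, y) = x**2 + 3*y**2 + x*y + y and its derivatives
def f(x):
    return x[0]**2 + 3.0 * x[1]**2 + x[0] * x[1] + x[1]

def grad_f(x):
    return np.array([2.0 * x[0] + x[1], 6.0 * x[1] + x[0] + 1.0])

def hess_f(x):
    return np.array([[2.0, 1.0], [1.0, 6.0]])

x0 = np.array([1.0, -1.0])   # expansion point
x1 = np.array([1.1, -0.9])   # nearby point
delta = x1 - x0              # the perturbation delta_{x_1}

model = f(x0) + grad_f(x0) @ delta + 0.5 * delta @ hess_f(x0) @ delta
print(model, f(x1))  # agree exactly here since f is quadratic;
                     # for general smooth f the gap is of third order in ||delta||
```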
We will consider how this is related to the notion of convexity as follows.
Courtesy of: Oleg Alexandrov. Public domain, via Wikimedia Commons.
For a function of one variable \( f(x) \), the second derivative test says that \( x^\ast \) is a local minimizer if \( f'(x^\ast)=0 \) and \( f''(x^\ast)> 0 \).
Courtesy of: Ag2gaeh, CC BY-SA 4.0, via Wikimedia Commons.
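As a concrete one-dimensional example of this test (a standard textbook choice, not specific to these notes), take \( f(x) = x^3 - 3x \); then \[ \begin{align} f'(x) = 3x^2 - 3 = 0 \quad &\Rightarrow \quad x = \pm 1, \\ f''(1) = 6 > 0, & \quad f''(-1) = -6 < 0, \end{align} \] so \( x^\ast = 1 \) is a local minimizer, while \( x = -1 \) is a local maximizer.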
We can understand the second derivative test with the Hessian using our second order Taylor approximation as follows.
Let \( \mathbf{x}^\ast \) be a critical point and \( \mathbf{x}_1 = \mathbf{x}^\ast + \boldsymbol{\delta}_{x_1} \) be a perturbation of this in the neighborhood \( \mathcal{N} \).
Suppose that the Hessian has only positive eigenvalues at \( \mathbf{x}^\ast \), then our approximation at second order gives \[ \begin{align} f(\mathbf{x}_1) &\approx f(\mathbf{x}^\ast) + \left(\nabla f(\mathbf{x}^\ast)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1}+ \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}^\ast) \boldsymbol{\delta}_{x_1} \\ &= f(\mathbf{x}^\ast) + \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}^\ast) \boldsymbol{\delta}_{x_1} \end{align} \] where the first-order term vanishes because \( \nabla f(\mathbf{x}^\ast) = \mathbf{0} \) at the critical point.
Provided \( \mathcal{N} \) is a small enough neighborhood, the truncation error of order \( \mathcal{O}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel^3\right) \) in this approximation will remain very small.
However, \( \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}^\ast) \boldsymbol{\delta}_{x_1} \) must be positive for any nonzero \( \boldsymbol{\delta}_{x_1} \), because the eigenvalues of the Hessian are all positive.
This says that, for \( \parallel \boldsymbol{\delta}_{x_1}\parallel \) sufficiently small, any perturbation of the point \( \mathbf{x}^\ast \) defined as \( \mathbf{x}_1 = \mathbf{x}^\ast + \boldsymbol{\delta}_{x_1} \) satisfies \[ f(\mathbf{x}_1) \geq f(\mathbf{x}^\ast). \]
Therefore, we can identify a local minimizer whenever the gradient is zero and the Hessian has positive eigenvalues, due to the local convexity.
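A small numerical illustration of this test (the objective below is an arbitrary choice): at a point where the gradient vanishes, check whether all eigenvalues of the Hessian are positive.

```python
import numpy as np

# Illustrative objective f(x, y) = x**2 + x*y + 2*y**2, critical point at the origin
def grad_f(x):
    return np.array([2.0 * x[0] + x[1], x[0] + 4.0 * x[1]])

def hess_f(x):
    return np.array([[2.0, 1.0], [1.0, 4.0]])

x_star = np.array([0.0, 0.0])
print(grad_f(x_star))                          # [0. 0.] -> critical point
eigvals = np.linalg.eigvalsh(hess_f(x_star))   # eigvalsh: for symmetric matrices
print(eigvals, np.all(eigvals > 0))            # all positive -> local minimizer
```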
Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
In Newton descent, for each \( k \) we repeat the step to define the \( (k+1) \)-st approximation, \[ \begin{align} & 0 = \nabla f(\mathbf{x}_k) + \mathbf{H}_f(\mathbf{x}_k) \boldsymbol{\delta}_{x_{k+1}} \\ \Leftrightarrow & \boldsymbol{\delta}_{x_{k+1}} = -\left(\mathbf{H}_f(\mathbf{x}_k)\right)^{-1} \nabla f(\mathbf{x}_k)\\ &\mathbf{x}_{k+1} = \mathbf{x}_k + \boldsymbol{\delta}_{x_{k+1}}. \end{align} \]
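A minimal sketch of this iteration in Python (the stopping tolerance, iteration cap, and the example objective are illustrative assumptions rather than prescriptions from the lecture):

```python
import numpy as np

def newton_descent(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Repeat the Newton step x_{k+1} = x_k - H_f(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:              # error tolerance reached
            return x
        delta = np.linalg.solve(hess_f(x), -g)   # solve H delta = -grad, no explicit inverse
        x = x + delta
    return x  # timed out; may not have converged if x0 was outside a good neighborhood

# Example on the quadratic f(x, y) = x**2 + x*y + 2*y**2 + y (arbitrary choice);
# Newton converges in one step here since the quadratic model is exact.
grad_f = lambda x: np.array([2.0 * x[0] + x[1], x[0] + 4.0 * x[1] + 1.0])
hess_f = lambda x: np.array([[2.0, 1.0], [1.0, 4.0]])
print(newton_descent(grad_f, hess_f, x0=[1.0, 1.0]))  # approx [ 0.142857, -0.285714]
```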
This process continues until the approximation reaches an error tolerance (when we have a good initial guess \( \mathbf{x}_0 \)), or it terminates after a maximum number of iterations, failing to converge if we are not in an appropriate neighborhood \( \mathcal{N} \) of a minimizer \( \mathbf{x}^\ast \).
Unlike using the gradient vector alone, this gives a step derived from the local geometry, and it tends to converge to the local minimizer quickly as long as the initial choice \( \mathbf{x}_0 \) is in the neighborhood \( \mathcal{N} \).
However, this method does not know if there is a better minimizing solution \( \mathbf{x}_0^\ast \) that lies in a different neighborhood \( \mathcal{N}_0 \).
Newton descent is the basis for a wide class of line-search methods in optimization, i.e., methods that try to optimize a function by finding a good descent direction and choosing an appropriate step size along that direction.
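To illustrate the line-search idea in its simplest form (this is a generic backtracking sketch with conventional parameter choices, not a specific method from the lecture): choose a descent direction, then shrink the step size until the objective decreases sufficiently.

```python
import numpy as np

def backtracking_step(f, grad_f, x, direction, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink the step along a descent direction until a sufficient-decrease
    (Armijo) condition holds; alpha, rho, c are conventional illustrative values."""
    g = grad_f(x)
    while f(x + alpha * direction) > f(x) + c * alpha * g @ direction:
        alpha *= rho
    return x + alpha * direction

# One steepest-descent step with backtracking on an arbitrary example objective
f = lambda x: x[0]**2 + 5.0 * x[1]**2
grad_f = lambda x: np.array([2.0 * x[0], 10.0 * x[1]])
x = np.array([1.0, 1.0])
print(backtracking_step(f, grad_f, x, direction=-grad_f(x)))
```

In a full line-search method, the Newton direction from the previous slides can be used in place of the negative gradient as the descent direction.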