Unconstrained optimization part I


Outline

  • The following topics will be covered in this lecture:
    • Concepts in optimization
    • Local versus global optimum
    • Convexity
    • Gradient descent versus Newton descent

Concepts in optimization

  • Optimization problems, i.e., the maximization or minimization of functions, consist of two components:

    1. an objective function \( f(\mathbf{x}) \); and
    2. constraints \( g(\mathbf{x}) \).
  • E.g., we may wish to optimize factory output \( f(x) \) as a function of hours \( x \) in a week, subject to our active machine-hours \( g(x) \) not exceeding a pre-specified limit, \( g(x)\leq C \).

  • Optimization problems can thus be classified into two categories:

    • If there are constraints \( g(\mathbf{x}) \) associated with the objective function \( f(\mathbf{x}) \), then it is a constrained optimization problem; otherwise, it is an unconstrained optimization problem.
  • We will focus on unconstrained optimization, as it often arises in maximum likelihood estimation (MLE); this is formulated as the following problem,

    \[ \begin{align} f: \mathbb{R}^n &\rightarrow \mathbb{R}\\ \mathbf{x} &\mapsto f(\mathbf{x})\\ f(\mathbf{x}^\ast) &= \max_{\mathbf{x} \in \mathcal{D}} f(\mathbf{x}) \end{align} \]

  • We note that the above problem is equivalent to a minimization problem by the substitution \( \tilde{f} = -f \), i.e.,

    \[ \begin{align} \tilde{f}: \mathbb{R}^n &\rightarrow \mathbb{R}\\ \mathbf{x} &\mapsto -f(\mathbf{x})\\ \tilde{f}(\mathbf{x}^\ast) = \min_{\mathbf{x}\in \mathcal{D}} \tilde{f}(\mathbf{x}) &\quad\Leftrightarrow\quad f(\mathbf{x}^\ast) = \max_{\mathbf{x} \in \mathcal{D}} f(\mathbf{x}) \end{align} \]
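  • As a minimal sketch of this equivalence (assuming NumPy and SciPy are available; the concave quadratic \( f \) below is a hypothetical example, not one from the lecture), we can maximize \( f \) by minimizing \( \tilde{f} = -f \):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy objective: a concave quadratic with maximum at (1, 2)
def f(x):
    return -(x[0] - 1.0)**2 - (x[1] - 2.0)**2

# Maximize f by minimizing its negation f_tilde = -f
f_tilde = lambda x: -f(x)

result = minimize(f_tilde, x0=np.zeros(2))  # unconstrained minimization
print(result.x)      # approximately [1., 2.], the maximizer of f
print(f(result.x))   # approximately 0., the maximum value
```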

Concepts in optimization

  • Because these problems are equivalent, we focus for the moment on the minimization of functions, as optimization problems are traditionally phrased this way in the literature.
  • The same techniques will apply for maximization by a simple change of variables.
  • We will need to identify a few key concepts: global and local minimizers.
  • Suppose we are trying to minimize an objective function \( f \).
  • We would ideally find a global minimizer of \( f \), a point where the function attains its least value over all possible values under consideration:
    A point \( \mathbf{x}^\ast \) is a global minimizer if \( f(\mathbf{x}^\ast) \leq f(\mathbf{x}) \) for all \( \mathbf{x} \) in the domain of consideration \( \mathcal{D}\subset \mathbb{R}^n \).
  • A global minimizer can be difficult to find, because our knowledge of \( f \) is usually only local;
A difficult global minimization.

Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.

    • i.e., we can only approximate the behavior of the function \( f \) within small perturbations \( \boldsymbol{\delta}_{x} \) of values \( \mathbf{x} \) where we already know \( f(\mathbf{x}) \).
  • Since a practical algorithm should not need to evaluate \( f \) at very many points, we usually do not have a good picture of the overall shape of \( f \),
    • generally, we can never be sure that the function does not take a sharp dip in some region that has not been sampled by the algorithm.

Concepts in optimization

  • Most algorithms are able to find only a local minimizer, which is a point that achieves the smallest value of \( f \) in its neighborhood, i.e.,
Let \( \mathcal{N}\subset \mathcal{D} \subset\mathbb{R}^n \) be a neighborhood of the point \( \mathbf{x}^\ast \) in the domain of consideration. We say \( \mathbf{x}^\ast \) is a local minimizer in the neighborhood \( \mathcal{N} \) if \[ \begin{align} f(\mathbf{x}^\ast) \leq f(\mathbf{x}) & & \text{ for all }\mathbf{x}\in \mathcal{N}. \end{align} \]
  • For finding a local minimizer, the main tools will be derived directly from the second order approximation of the objective function \( f \): for a perturbation \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \), this is defined by

    \[ \begin{align} f(\mathbf{x}_1) \approx f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1}+\frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \end{align} \]

  • We will consider how this is related to the notion of convexity as follows; a numerical sketch of this approximation is given below.
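  • As a quick numerical check (a sketch assuming NumPy; the test function below is a hypothetical example, chosen so the approximation error visibly shrinks at cubic order in the perturbation size):

```python
import numpy as np

# Hypothetical test function f(x) = exp(x0) + x0 * x1^2, with its exact
# gradient and Hessian:
def f(x):
    return np.exp(x[0]) + x[0] * x[1]**2

def grad_f(x):
    return np.array([np.exp(x[0]) + x[1]**2, 2.0 * x[0] * x[1]])

def hess_f(x):
    return np.array([[np.exp(x[0]), 2.0 * x[1]],
                     [2.0 * x[1], 2.0 * x[0]]])

x0 = np.array([0.5, -0.5])
for scale in (1e-1, 1e-2):
    delta = scale * np.array([1.0, 1.0])            # perturbation delta_{x_1}
    taylor2 = (f(x0) + grad_f(x0) @ delta
               + 0.5 * delta @ hess_f(x0) @ delta)  # second order approximation
    print(scale, abs(f(x0 + delta) - taylor2))      # error shrinks ~ cubically
```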

Convexity

Image of the upper region enclosed by a convex function.

Courtesy of: Oleg Alexandrov. Public domain, via Wikimedia Commons.

  • Throughout mathematics, the notion of convexity is a powerful tool, often used in optimization.
  • Particularly, a function is convex if and only if the region above its graph is a convex set.
    • The convexity of the full epigraph means that any local minimum of the function is a global minimum over its entire domain.
  • A non-convex function can still have regions in which its graph is locally convex.
  • For such regions, we can find local minimizers as defined in the last slide.
  • In a single variable, this is phrased in terms of the second derivative test, i.e.,
    For the function of one variable \( f(x) \) we say that \( x^\ast \) is a local minimizer if \( f'(x^\ast)=0 \) and \( f''(x^\ast)> 0 \) (a worked example follows this list).
  • There is a direct analogy for a function of multiple variables, but this needs to be rephrased slightly.
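  • As a worked example of the one-variable test: for \( f(x) = x^3 - 3x \) we have \( f'(x) = 3x^2 - 3 \) and \( f''(x) = 6x \), so the critical points are \( x = \pm 1 \); since \( f''(1) = 6 > 0 \) while \( f''(-1) = -6 < 0 \), only \( x^\ast = 1 \) is a local minimizer.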

The second derivative test with the Hessian

  • For a real-valued function of multiple variables, \[ f:\mathbb{R}^n \rightarrow \mathbb{R}, \] we will instead phrase the second derivative test in terms of the Hessian of \( f \), \[ \begin{align} \mathbf{H}_{f} = \begin{pmatrix} \partial_{x_1}^2 f & \cdots & \partial_{x_1}\partial_{x_n}f \\ \vdots &\ddots & \vdots \\ \partial_{x_n}\partial_{x_1} f & \cdots & \partial_{x_n}^2 f \end{pmatrix} \end{align} \]
  • Particularly, because the Hessian is symmetric, the spectral theorem says it has an eigendecomposition such that, by an orthogonal change of coordinates, \( \mathbf{H}_f \) becomes diagonal.
Different critical points for different Hessian spectra.

Courtesy of: Ag2gaeh, CC BY-SA 4.0, via Wikimedia Commons.

  • The different combinations of eigenvalues of the Hessian will determine the local curvature of the graph of \( f \) with examples in two dimensions pictured above:
    1. Left: we see a convex surface, the simple paraboloid – around this critical point, the eigenvalues of \( \mathbf{H}_f \) are all strictly positive.
    2. Middle: we see a surface with no unique minimizer but infinitely many local minimizers – all points \( (x,y) \) with \( x=0 \) are critical points; the function is convex only in the \( x \) direction, and the \( y \) direction corresponds to a zero eigenvalue of \( \mathbf{H}_f \).
    3. Right: there is a critical saddle point at \( (0,0) \), but this is not even a local minimizer – here the Hessian \( \mathbf{H}_f \) has one positive and one negative eigenvalue.
  • These examples extend into higher dimensions, and generally we say that the function is locally convex at \( \mathbf{x}^\ast \) when the Hessian has only positive eigenvalues at \( \mathbf{x}^\ast \).
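  • The classification above can be checked numerically; the following is a minimal sketch (assuming NumPy, with the standard surfaces \( z = x^2 + y^2 \), \( z = x^2 \), and \( z = x^2 - y^2 \) assumed to match the pictured examples):

```python
import numpy as np

# Hessians at the critical point (0, 0) of three standard surfaces:
hessians = {
    "paraboloid z = x^2 + y^2": np.array([[2.0, 0.0], [0.0, 2.0]]),
    "cylinder   z = x^2":       np.array([[2.0, 0.0], [0.0, 0.0]]),
    "saddle     z = x^2 - y^2": np.array([[2.0, 0.0], [0.0, -2.0]]),
}

for name, H in hessians.items():
    eigs = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric Hessian
    if np.all(eigs > 0):
        verdict = "strict local minimizer (locally convex)"
    elif np.any(eigs < 0) and np.any(eigs > 0):
        verdict = "saddle point, not a local minimizer"
    else:
        verdict = "degenerate: zero eigenvalue, test is inconclusive"
    print(f"{name}: eigenvalues {eigs} -> {verdict}")
```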

The second derivative test with the Hessian

  • We can understand the second derivative test with the Hessian using our second order Taylor approximation as follows.

  • Let \( \mathbf{x}^\ast \) be a critical point and \( \mathbf{x}_1 = \mathbf{x}^\ast + \boldsymbol{\delta}_{x_1} \) be a perturbation of this in the neighborhood \( \mathcal{N} \).

  • Suppose that the Hessian has only positive eigenvalues at \( \mathbf{x}^\ast \), then our approximation at second order gives \[ \begin{align} f(\mathbf{x}_1) &\approx f(\mathbf{x}^\ast) + \left(\nabla f(\mathbf{x}^\ast)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1}+ \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}^\ast) \boldsymbol{\delta}_{x_1} \\ &= f(\mathbf{x}^\ast) + \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}^\ast) \boldsymbol{\delta}_{x_1} \end{align} \]

  • Provided \( \mathcal{N} \) is a small enough neighborhood, the error of the above approximation, of order \( \mathcal{O}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel^3\right) \), will remain very small.

  • However, \( \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}^\ast) \boldsymbol{\delta}_{x_1} \) must be positive, because a Hessian with only positive eigenvalues is positive definite.

  • This says that for a sufficiently small radius \( \parallel \boldsymbol{\delta}_{x_1}\parallel \), any perturbation of the point \( \mathbf{x}^\ast \) defined as \( \mathbf{x}_1 = \mathbf{x}^\ast + \boldsymbol{\delta}_{x_1} \) satisfies \[ f(\mathbf{x}_1) \geq f(\mathbf{x}^\ast). \]

  • Therefore, we can identify a local minimizer whenever the gradient is zero and the Hessian has positive eigenvalues, due to the local convexity.
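  • As a minimal numerical sketch of this argument (assuming NumPy; the quadratic below is a hypothetical example whose critical point is the origin, with positive definite Hessian \( \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \)):

```python
import numpy as np

# Hypothetical example: critical point at the origin, Hessian [[2, 1], [1, 2]]
def f(x):
    return x[0]**2 + x[0] * x[1] + x[1]**2

x_star = np.zeros(2)
rng = np.random.default_rng(0)

# Sample many small random perturbations and verify f(x* + delta) >= f(x*)
deltas = 1e-3 * rng.standard_normal((1000, 2))
values = np.array([f(x_star + d) for d in deltas])
print(np.all(values >= f(x_star)))  # True: x* is a local minimizer
```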

Gradient descent vs Newton descent

  • We noted before that the gradient \( \nabla f \) gives the direction and rate of the greatest increase of the function \( f \).
  • There are thus some good reasons to consider following the direction \( -\nabla f \) to find a local minimizer.
  • However, it is possible that we will overshoot the local minimum if we take too long a step along this direction.
  • We also may not know a good choice for the step length; for example, consider the following figure.
Gradient descent.

Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.

  • The contours represent fixed values of the function \( f \); i.e.,
  • if \( \mathcal{C} \) is a contour, then for all \( \mathbf{c}_0 \in \mathcal{C} \), \[ f(\mathbf{c}_0 ) = C \] for a fixed value \( C \).
  • For \( t > 0 \) we define a perturbation along the negative gradient direction by \( \mathbf{x}_k - t \nabla f(\mathbf{x}_k) \).
  • For a range of step sizes \( t \) along this direction, \[ f(\mathbf{x}_k - t \nabla f(\mathbf{x}_k)) \leq f(\mathbf{x}_k) \] as this moves to the inner contours around \( \mathbf{x}^\ast \).
  • However, for \( t \) large enough this is not true, and we do not know by default what size of \( t \) is appropriate, or whether \( f \) is extremely sensitive to this choice; the sketch below illustrates the issue.
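  • A minimal gradient descent sketch (assuming NumPy; the ill-conditioned quadratic is a hypothetical example) showing how the fixed step size \( t \) controls success or failure:

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: f(x) = 0.5 * (x0^2 + 25 * x1^2)
def f(x):
    return 0.5 * (x[0]**2 + 25.0 * x[1]**2)

def grad_f(x):
    return np.array([x[0], 25.0 * x[1]])

x0 = np.array([5.0, 1.0])
for t in (0.01, 0.09):  # a conservative and an aggressive fixed step size
    x = x0.copy()
    for _ in range(100):
        x = x - t * grad_f(x)  # step along the negative gradient
    print(f"t = {t}: f(x) = {f(x):.3e}")

# t = 0.01 is stable but crawls along the flat x0 direction, while t = 0.09
# exceeds the stability threshold 2/25 = 0.08 in the stiff x1 direction and
# diverges: the step length matters as much as the direction.
```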

Gradient descent vs Newton descent

  • Let’s recall Newton’s method to find the appropriate direction and step length:
    • When we were looking for the zero of a function, we used the first order approximation and found where this approximation takes the value zero.
  • We will consider a similar idea using the second order approximation, \[ \begin{align} m(\boldsymbol{\delta}_{x_1}) = f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1}+\frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}_0) \boldsymbol{\delta}_{x_1}. \end{align} \]
  • Setting the derivative of \( m \) with respect to the perturbation equal to zero, we solve for \( \boldsymbol{\delta}_{x_1} \): \[ \begin{align} & 0 = \nabla f(\mathbf{x}_0) + \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \\ \Leftrightarrow & \boldsymbol{\delta}_{x_1} = -\left(\mathbf{H}_f(\mathbf{x}_0)\right)^{-1} \nabla f(\mathbf{x}_0) \end{align} \]
  • If \( \mathbf{H}_f(\mathbf{x}_0) \) has strictly positive eigenvalues (and is therefore invertible), this gives a descent direction in \( f \).
Newton descent.

Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.

  • Particularly, our new choice for the estimated minimum will be given by \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \) for which \( f(\mathbf{x}_0)\geq f(\mathbf{x}_1) \).
  • We may find a lower value somewhere along the direction of gradient descent, but without sampling this direction many times over, we do not know by default where that value will lie.

Newton descent

  • In Newton descent, for each \( k \) we repeat the step to define the \( (k+1) \)-st approximation (a code sketch of this iteration is given at the end of this slide), \[ \begin{align} & 0 = \nabla f(\mathbf{x}_k) + \mathbf{H}_f(\mathbf{x}_k) \boldsymbol{\delta}_{x_{k+1}} \\ \Leftrightarrow & \boldsymbol{\delta}_{x_{k+1}} = -\left(\mathbf{H}_f(\mathbf{x}_k)\right)^{-1} \nabla f(\mathbf{x}_k)\\ &\mathbf{x}_{k+1} = \mathbf{x}_k + \boldsymbol{\delta}_{x_{k+1}}. \end{align} \]

  • This process continues until the approximation reaches an error tolerance (when we have a good initial guess \( \mathbf{x}_0 \)), or it terminates after timing out, failing to converge when we are not in an appropriate neighborhood \( \mathcal{N} \) of a minimizer \( \mathbf{x}^\ast \).

    • When we are in a locally convex neighborhood \( \mathcal{N} \) containing \( \mathbf{x}^\ast \), we can be assured that \( \mathbf{H}_f \) is invertible with strictly positive eigenvalues; because these guarantees hold only in such a neighborhood, the initial choice is very important in producing a result.
  • Unlike the gradient vector alone, this gives a step size derived from the local geometry, and it tends to converge to the local minimum quickly as long as the initial choice \( \mathbf{x}_0 \) is in the neighborhood \( \mathcal{N} \).

  • However, this method does not know if there is a better minimizing solution \( \mathbf{x}_0^\ast \) that lies in a different neighborhood \( \mathcal{N}_0 \).

    • The results will depend strongly on the shape of the objective function \( f \), our initial choice \( \mathbf{x}_0 \) and whatever prior knowledge about the problem we include in making such an initial guess.
  • Newton descent is the basis for a wide class of line-search methods in optimization, i.e., methods that try to optimize a function by finding a good descent direction and then choosing an appropriate step size along that direction.
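  • A minimal sketch of the Newton iteration above (assuming NumPy; the Rosenbrock test function and the starting point are my choices, not from the lecture, with \( \mathbf{x}_0 \) chosen inside a locally convex neighborhood of the minimizer \( (1,1) \)):

```python
import numpy as np

# Rosenbrock function f(x) = 100*(x1 - x0^2)^2 + (1 - x0)^2, minimizer (1, 1)
def grad_f(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

def hess_f(x):
    return np.array([[1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
                     [-400.0 * x[0], 200.0]])

x = np.array([1.2, 1.2])  # initial guess in a locally convex neighborhood
for k in range(100):
    # Solve H_f(x_k) delta = -grad f(x_k) rather than forming the inverse
    delta = np.linalg.solve(hess_f(x), -grad_f(x))
    x = x + delta
    if np.linalg.norm(delta) < 1e-10:  # error tolerance on the step
        break

print(k, x)  # converges to (1., 1.) in a handful of iterations
```
  • Solving the linear system rather than explicitly inverting \( \mathbf{H}_f \) is both cheaper and numerically more stable.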