Unconstrained optimization part II

Outline

  • The following topics will be covered in this lecture:
    • BFGS
    • Least-squares
    • Gauss-Newton

BFGS

  • The main issue with solving the minimization problem by a direct Newton descent, as discussed previously, is that \( \mathbf{H}_f \) is computationally expensive to compute for even a moderate number of variables.

    • If the system has \( n \) variables, \( \mathbf{x}_0 \in \mathbb{R}^n \), then \( \mathbf{H}_f(\mathbf{x}_0) \in \mathbb{R}^{n\times n} \).
  • Moreover, we need to recompute the Hessian at each new iterate when we take multiple steps.

  • One popular way of avoiding the Hessian computation is the BFGS formula, named after its inventors, Broyden, Fletcher, Goldfarb, and Shanno.

    • The authors (approximately) simultaneously invented the algorithm independently, and the algorithm is named for them alphabetically.
  • We will go through a short description of how the approximation is made, but we will not belabor the details, which go beyond our scope.

BFGS

  • Taylor's theorem can also be applied to the gradient of \( f \) as follows to derive the necessary approximation.

  • Suppose that \( f \) has continuous second partial derivatives and that \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \) is a small perturbation of an initial point \( \mathbf{x}_0 \).

  • Suppose we take \( \mathbf{f}(\mathbf{x}) \) to be the function defined by the gradient of \( f \), i.e.,

    \[ \begin{align} \mathbf{f}:\mathbb{R}^n& \rightarrow \mathbb{R}^n \\ \mathbf{x} &\mapsto \nabla f (\mathbf{x}) \end{align} \]

  • \( \mathbf{f} \) also has a Taylor expansion; taking the gradient again, but now of a vector-valued function, we have

    \[ \nabla \mathbf{f} = \nabla^2 f = \mathbf{H}_f \] which follows from the definition of the gradient of a vector-valued function and the definition of the Hessian.

  • Therefore, taking the first order Taylor series of \( \mathbf{f} \), we have

    \[ \begin{align} \mathbf{f}(\mathbf{x}_1) &= \mathbf{f}(\mathbf{x}_0) + \nabla \mathbf{f}(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \Leftrightarrow \nabla f (\mathbf{x}_1)&= \nabla f (\mathbf{x}_0) + \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \Leftrightarrow \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0) &= \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \end{align} \]
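
  • As a quick numerical check, the following is a minimal sketch, assuming a hypothetical, non-quadratic test function and using the numDeriv package that appears later in this lecture, of how the difference of gradients approximates the action of the Hessian on a small perturbation:

require("numDeriv")
f_test <- function(x){exp(x[1]) + x[2]^4}  # hypothetical smooth, non-quadratic test function
x_0 <- c(0.5, 1.0)                         # initial point
delta_1 <- c(1e-3, -2e-3)                  # small perturbation
x_1 <- x_0 + delta_1
grad(f_test, x=x_1) - grad(f_test, x=x_0)  # finite difference of the gradients
hessian(f_test, x=x_0) %*% delta_1         # action of the Hessian on the perturbation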

BFGS

  • Recall our approximation from the last slide

    \[ \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0) = \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right). \]

  • We say that for a small perturbation \( \boldsymbol{\delta}_{x_1} \) and for \( \mathbf{x}_1,\mathbf{x}_0 \in \mathcal{N} \), the locally convex neighborhood,

    • the action of the Hessian on the perturbation

    \[ \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \] is well approximated by the finite difference:

    \[ \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \approx \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0). \]

  • We thus choose a Hessian approximation \( \widetilde{\mathbf{H}}_f \) so that it mimics the finite-difference relation satisfied by the true Hessian;

    • that is, we require it to satisfy the following condition, known as the secant equation:

    \[ \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \boldsymbol{\delta}_{x_{k+1}} = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_{k}) \]

  • We will also impose that the estimate \( \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \) is symmetric;

    • also, we prefer that \( \widetilde{\mathbf{H}}_f(\mathbf{x}_{k+1}) \) differs only by a small amount from the last-computed approximation \( \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \).

BFGS

  • For simplicity in the notations, let us denote \( \mathbf{B}_k = \widetilde{\mathbf{H}}_f(\mathbf{x}_k) \), \( \boldsymbol{\delta}_{x_k} = \boldsymbol{\delta}_k \) and \( \mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k) \).

  • We will define the BFGS Hessian approximation as,

    \[ \begin{align} \mathbf{B}_{k+1} = \mathbf{B}_{k} - \frac{\mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k}}{\boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k} \boldsymbol{\delta}_{k+1}} + \frac{\mathbf{y}_k \mathbf{y}_k^\mathrm{T}}{\mathbf{y}_k^\mathrm{T} \boldsymbol{\delta}_{k+1}} \end{align} \] where:

    • the two terms on the right-hand side are matrix-valued, weighted outer products of the vectors \( \mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \) and \( \mathbf{y}_k \) with themselves;
    • this implies that the next approximation differs from the last one only by a rank-2 update, i.e.,

    \[ - \frac{\mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k}}{\boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k} \boldsymbol{\delta}_{k+1}} + \frac{\mathbf{y}_k \mathbf{y}_k^\mathrm{T}}{\mathbf{y}_k^\mathrm{T} \boldsymbol{\delta}_{k+1}} \] has at most two directions with non-zero eigenvalues (the only ones that are non-trivial to compute), regardless of the dimension \( n>2 \);

    • the above \( \mathbf{B}_{k+1} \) remains symmetric and has only positive eigenvalues when \( \mathbf{B}_0 \) satisfies these conditions and \( \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{y}_k > 0 \) at every iteration.
    • the approximation \( \mathbf{B}_{k+1} \) then replaces the Hessian in the Newton step, producing the next perturbation \[ \boldsymbol{\delta}_{x_{k+2}} = - \mathbf{B}_{k+1}^{-1} \nabla f(\mathbf{x}_{k+1}) \]
  • We will not discuss the approximation further, only noting that this works as a fairly general and efficient approximation; a minimal numerical sketch of the update is given below.
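
  • The following is a minimal sketch, assuming hypothetical values for \( \mathbf{B}_k \), \( \boldsymbol{\delta}_{k+1} \) and \( \mathbf{y}_k \), of the rank-2 update and the resulting Newton-type step in R:

# rank-2 BFGS update of the Hessian approximation B_k
bfgs_update <- function(B_k, delta_k1, y_k){
  B_delta <- drop(B_k %*% delta_k1)                          # B_k acting on the step
  B_k - tcrossprod(B_delta) / drop(delta_k1 %*% B_delta) +   # subtract the weighted outer product of B_k delta
    tcrossprod(y_k) / drop(y_k %*% delta_k1)                 # add the weighted outer product of y_k
}
B_0 <- diag(2)                                               # hypothetical symmetric, positive definite initialization
delta_1 <- c(0.1, -0.2)                                      # hypothetical step
y_0 <- c(0.3, 0.1)                                           # hypothetical difference of gradients, with delta_1 . y_0 > 0
B_1 <- bfgs_update(B_0, delta_1, y_0)
B_1                                                          # remains symmetric with positive eigenvalues
-solve(B_1, c(0.05, -0.02))                                  # Newton-type step for a hypothetical gradient value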

BFGS example

  • Let us now consider a simple example with BFGS.

  • Let us define the paraboloid function, \[ f(x_1, x_2)= \left(3 - x_1 \right)^2 + \left(5 - x_2 \right)^2, \]

  • or in R as

f_exp <- expression((3 - x_1)^2 + (5 - x_2)^2)
f <- function(x){(3 - x[1])^2 + (5 - x[2])^2}

BFGS

  • Let us compute the gradient and the Hessian of the function analytically first:
print(grad_f <- c(D(f_exp, "x_1"), D(f_exp, "x_2")))
[[1]]
-(2 * (3 - x_1))

[[2]]
-(2 * (5 - x_2))
print(hessian_f_11 <- D(D(f_exp,"x_1"),"x_1"))
[1] 2
print(hessian_f_12 <- D(D(f_exp,"x_1"),"x_2"))
[1] 0
print(hessian_f_22 <- D(D(f_exp,"x_2"),"x_2"))
[1] 2
  • Q: what can be deduced from the above gradient and Hessian equation for the minimization problem?

    • A: the gradient vanishes at \( (3,5) \), and the constant, positive definite Hessian tells us once again that the function is globally convex, so the critical point \( (3,5) \) is the unique global minimum.
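
  • As a quick check, the symbolic derivatives returned by D above can be evaluated at the critical point; here we assume the grad_f list computed on this slide:

eval(grad_f[[1]], list(x_1 = 3, x_2 = 5))
[1] 0
eval(grad_f[[2]], list(x_1 = 3, x_2 = 5))
[1] 0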

BFGS

  • We can verify this with the numerical gradient and Hessian as:
require("numDeriv")
grad(f, x=c(3,5))
[1] 0 0
hessian(f, x=c(3,5))
              [,1]          [,2]
[1,]  2.000000e+00 -4.993067e-18
[2,] -4.993067e-18  2.000000e+00

BFGS

  • The BFGS algorithm is provided in the package optimx.

  • We will call the minimization procedure on the function f as follows:

require("optimx")
fBFGS = optimx(fn = f, # define the objective function
   par = c(0, 0), # define the initial first guess at the minimum
   method ="BFGS") # define the method of solution
print(fBFGS)
     p1 p2        value fevals gevals niter convcode kkt1 kkt2 xtime
BFGS  3  5 7.949837e-24     11      3    NA        0 TRUE TRUE     0
  • Here, the method rapidly finds the true minimum of the function f, due to the global convexity.

  • More generally, it is possible that the minimization procedure will not converge due to structural issues in the problem.

  • We will see more examples of using BFGS and the second derivative test in multiple variables in our activities and homework.

Least squares problems

  • In statistics and optimization problems in which we need to fit a model to data, one of the most common forms of objective functions is known as least-squares.

  • In least-squares problems, the objective function \( f \) has the following special form:

    \[ \begin{align} f:\mathbb{R}^n &\rightarrow \mathbb{R} \\ r_i : \mathbb{R}^n &\rightarrow \mathbb{R}\\ \mathbf{x} &\mapsto f(\mathbf{x}) = \sum_{i=1}^m r_i(\mathbf{x})^2 \end{align} \]

  • In the above expression, each of the \( r_i \), for \( i=1,\cdots, m \), is called a residual, where we assume \( m > n \);

    • \( r_i \) measures the discrepancy of some prediction as a function of the free variables \( \mathbf{x} \) from a measured value in reality.
    • \( r_i \) thus refers to the discrepancy of the \( i \)-th prediction from the \( i \)-th case over \( m > n \) total observed cases.
  • These types of problems arise in nearly any field that involves a predictive model and data, and this problem is at the heart of regression and machine learning; a small sketch in R of such an objective function follows.
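
  • Below is a minimal sketch, assuming a hypothetical linear prediction and synthetic observations, of assembling such a least-squares objective from its residuals in R:

set.seed(0)                                                 # for reproducibility
t_obs <- 1:10                                               # hypothetical independent variable
y_obs <- 2 * t_obs + 1 + rnorm(10, sd = 0.5)                # synthetic observations of a hypothetical linear model
residuals_ls <- function(x){(x[1] + x[2] * t_obs) - y_obs}  # r_i(x) for the prediction x_1 + x_2 * t_i
f_ls <- function(x){sum(residuals_ls(x)^2)}                 # f(x) = sum of squared residuals
f_ls(c(1, 2))                                               # evaluate at a trial choice of parameters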

Linear least squares problems

  • Assume that \( \boldsymbol{\phi} \) represents some prediction of a vector of observed values \( \mathbf{y} \), but the prediction depends on the vector of free, tuneable parameters \( \mathbf{x} \).

  • The goal then is to find the “best” choice of tuneable parameters \( \mathbf{x} \) that can minimize the difference between a prediction and reality.

  • This gives an objective function of the form,

    \[ \begin{align} f(\mathbf{x}) = \sum_{i=1}^m \frac{1}{2} \vert \boldsymbol{\phi}_i(\mathbf{x}) - \mathbf{y}_i\vert^2 \end{align} \] where:

    1. \( \boldsymbol{\phi}_i(\mathbf{x}) \) and \( \mathbf{y}_i \) represent the predicted and observed \( i \)-th case respectively;
    2. in the above, the residual is identified as \( r_i(\mathbf{x}) = \vert \boldsymbol{\phi}_i(\mathbf{x}) - \mathbf{y}_i\vert \), up to the conventional factor of \( \frac{1}{2} \) included for convenience when differentiating; but
    3. we can generally take many other forms for the residual, amplifying specific features of the analysis with the specific choice.
  • One very special case is where \( \boldsymbol{\phi}(\mathbf{x}) \) is actually linear in the values \( \mathbf{x} \), in which case the prediction can be represented with the matrix equation

    \[ \boldsymbol{\Phi} \mathbf{x} = \mathbf{y} \] for some matrix \( \boldsymbol{\Phi} \).

  • Unlike the typical linear inverse problem, \( \boldsymbol{\Phi}\in \mathbb{R}^{m \times n} \) is not square (assuming \( m > n \)), and no linear inverse exists in this problem for \( \boldsymbol{\Phi} \).

The normal equations

  • Using relationships of vector calculus, one can show

    \[ \begin{align} \nabla f = \boldsymbol{\Phi}^\mathrm{T} \left(\boldsymbol{\Phi} \mathbf{x} - \mathbf{y}\right) & & \mathbf{H}_f= \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \end{align} \]

  • We note that \( \mathbf{H}_f= \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \) is a constant matrix, given by the product of \( \boldsymbol{\Phi} \) with its own transpose, and (as long as \( \boldsymbol{\Phi} \) has linearly independent columns) the Hessian has strictly positive eigenvalues;

    • this says that linear least squares is a globally convex problem when well posed.
  • Similarly, if we take the unique minimizer to be defined as \( \mathbf{x}^\ast \),

    \[ \begin{align} & \nabla f (\mathbf{x}^\ast) = 0 \\ \Leftrightarrow & \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \mathbf{x}^\ast = \boldsymbol{\Phi}^\mathrm{T}\mathbf{y} \end{align} \] by definition.

  • The above equations are known as the normal equations and have a unique solution by the properties of the linear inverse problem,

    \[ \mathbf{x}^\ast = \left(\boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\mathrm{T}\mathbf{y}, \] as long as \( \boldsymbol{\Phi} \) has linearly independent columns (the problem is well posed), as illustrated in the sketch below.

  • This is the basis of linear regression, which we will return to after the midterm.
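
  • Below is a minimal sketch, with a synthetic design matrix \( \boldsymbol{\Phi} \) and synthetic observations \( \mathbf{y} \), of solving the normal equations in R and cross-checking against the built-in least-squares fit from lm:

set.seed(0)                                            # for reproducibility
m <- 50; n <- 3
Phi <- cbind(1, matrix(rnorm(m * (n - 1)), m, n - 1))  # synthetic m x n design matrix with an intercept column
x_true <- c(2, -1, 0.5)                                # hypothetical true parameters
y <- drop(Phi %*% x_true) + rnorm(m, sd = 0.1)         # synthetic noisy observations
x_star <- solve(crossprod(Phi), crossprod(Phi, y))     # solve Phi^T Phi x = Phi^T y
drop(x_star)
coef(lm(y ~ Phi - 1))                                  # cross-check with the built-in least-squares fit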

Nonlinear least squares – Gauss-Newton

  • Generally, we may take \( \boldsymbol{\phi} \) to be nonlinear, and a direct approach as above will not work.

  • We can appeal to BFGS in this case to find a local minimum, but least squares has an extremely special structure that will lead to a more efficient approximation.

  • Let us recall the general form for \( f \) in terms of the residuals \( r_i \), where we now include the conventional factor of \( \frac{1}{2} \) for convenience when differentiating,

    \[ \begin{align} f(\mathbf{x}) = \frac{1}{2}\sum_{i=1}^m r_i(\mathbf{x})^2 \end{align} \]

  • Let us denote the vector of residuals and its gradient as

    \[ \begin{align} \mathbf{r} = \begin{pmatrix} r_1(\mathbf{x}) \\ \vdots \\ r_m(\mathbf{x}) \end{pmatrix} & & \nabla \mathbf{r} = \begin{pmatrix} \partial_{x_1} r_1(\mathbf{x}) & \cdots & \partial_{x_n} r_1(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \partial_{x_1} r_m(\mathbf{x}) & \cdots & \partial_{x_n} r_m(\mathbf{x}) \end{pmatrix} \end{align} \]

  • We can thus write,

    \[ \begin{align} f(\mathbf{x}) = \frac{1}{2}\mathbf{r}^\mathrm{T}\mathbf{r} \end{align} \] and utilize the special structure with the Taylor expansion.

Nonlinear least squares – Gauss-Newton

  • Specifically, we can write

    \[ \begin{align} \nabla f = \left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} & & \mathbf{H}_f = \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} + \sum_{i=1}^m r_i \mathbf{H}_{r_i} \end{align} \]

  • In most cases, \( \nabla \mathbf{r} \) is easy to calculate;

    • moreover, the term \( \sum_{i=1}^m r_i \mathbf{H}_{r_i} \) tends to be much smaller than \( \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} \) when in a neighborhood \( \mathcal{N} \) of a minimizer \( \mathbf{x}^\ast \).
  • This leads to a direct approximation of the Newton descent method, where we define the second-order approximation as

    \[ \begin{align} m(\boldsymbol{\delta}_{x_1}) = f(\mathbf{x}_0) + \left[\left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r}\right]^\mathrm{T} \boldsymbol{\delta}_{x_1} + \frac{1}{2}\boldsymbol{\delta}_{x_1}^\mathrm{T} \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r}\, \boldsymbol{\delta}_{x_1} \end{align} \] where the residuals and their gradient are evaluated at \( \mathbf{x}_0 \).

  • Setting the derivative of the second order approximation to zero with respect to the perturbation, we get the Gauss-Newton approximation

    \[ \boldsymbol{\delta}_{x_1} = - \left[ \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} \right]^{-1}\left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} \] which can be computed entirely in terms of the residuals and their gradient.

  • The Gauss-Newton approximation is the basis of a variety of techniques for minimizing nonlinear least-squares objective functions; a minimal sketch of a single Gauss-Newton step is given below.
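
  • The following is a minimal sketch of one Gauss-Newton step in R, assuming a hypothetical exponential model, synthetic observations and the numDeriv package for the gradient of the residuals:

require("numDeriv")
set.seed(0)                                            # for reproducibility
t_obs <- seq(0, 2, length.out = 20)                    # hypothetical independent variable
y_obs <- 2 * exp(-1.5 * t_obs) + rnorm(20, sd = 0.01)  # synthetic observations of x_1 * exp(-x_2 * t)
r <- function(x){x[1] * exp(-x[2] * t_obs) - y_obs}    # the vector of residuals r(x)
x_0 <- c(1, 1)                                         # initial guess for the parameters
J <- jacobian(r, x_0)                                  # gradient of the residuals at x_0
delta_1 <- -solve(crossprod(J), crossprod(J, r(x_0)))  # Gauss-Newton perturbation
x_1 <- x_0 + drop(delta_1)                             # updated estimate of the parameters
x_1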