Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:

This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

- The following topics will be covered in this lecture:
- BFGS
- Least-squares
- Gauss-Newton

The main issue with computing the minimization problem with a direct Newton descent as above is that \( \mathbf{H}_f \)

**is very computationally expensive to compute for any moderate number of variables**.- If the system size has \( n \) variables, \( \mathbf{x}_0 \in \mathbb{R}^n \), then \( \mathbf{H}_f(\mathbf{x}_0) \in \mathbb{R}^{n\times n} \).

Moreover, we need to recompute this quite often if we need to make multiple steps.

One popular way of avoiding the Hessian computation is the BFGS formula, named after its inventors, Broyden, Fletcher, Goldfarb, and Shanno.

- The authors (approximately) simultaneously invented the algorithm independently, and the algorithm is named for them alphabetically.

We will go through a short description on how the approximation is made, but we will not belabor the details which go beyond our scope.

Taylor's theorem can also be applied to the gradient of \( f \) as follows to derive the necessary approximation.

Suppose that \( f \) has two continuous partial derivatives and \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \) is a small perturbation from an initial point \( \mathbf{x}_0 \).

Suppose we take \( \mathbf{f}(x) \) to be the function defined by the gradient of \( f \), i.e.,

\[ \begin{align} \mathbf{f}:\mathbb{R}^n& \rightarrow \mathbb{R}^n \\ \mathbf{x} &\rightarrow \nabla f (\mathbf{x}) \end{align} \]

\( \mathbf{f} \) also has a Taylor expansion, again taking the gradient, but of a vector valued function, i.e.,

\[ \nabla \mathbf{f} = \nabla^2 f = \mathbf{H}_f \] which can be shown by following the definition of taking the gradient of a vector valued function, and the definition of the Hessian.

Therefore, taking the

**first order Taylor series**of \( \mathbf{f} \), we have\[ \begin{align} \mathbf{f}(\mathbf{x}_1) &= \mathbf{f}(\mathbf{x}_0) + \nabla \mathbf{f}(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \Leftrightarrow \nabla f (\mathbf{x}_1)&= \nabla f (\mathbf{x}_0) + \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \Leftrightarrow \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0) &= \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \end{align} \]

Recall our approximation from the last slide

\[ \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0) = \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right). \]

We say that for a small perturbation \( \boldsymbol{\delta}_{x_1} \) and for \( \mathbf{x}_1,\mathbf{x}_0 \in \mathcal{N} \), the locally convex neighborhood,

- the action of the Hessian on the perturbation

\[ \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \] is well approximated by the finite difference:

\[ \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \approx \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0). \]

We thus

**choose a Hessian approximation**\( \widetilde{\mathbf{H}}_f \) so that it**mimics the finite difference equation of the true Hessian**;- that is, we require it to satisfy the following condition, known as the secant equation:

\[ \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \boldsymbol{\delta}_{x_{k+1}} = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_{k}) \]

We will also need to impose that the estimate \( \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \) is symmetric by default;

- also, we prefer that \( \widetilde{\mathbf{H}}_f(\mathbf{x}_{k+1}) \) differs only by a small amount the last-computed approximation \( \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \).

For simplicity in the notations, let us denote \( \mathbf{B}_k = \widetilde{\mathbf{H}}_f(\mathbf{x}_k) \), \( \boldsymbol{\delta}_{x_k} = \boldsymbol{\delta}_k \) and \( \mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k) \).

We will define the

**BFGS Hessian approximation**as,\[ \begin{align} \mathbf{B}_{k+1} = \mathbf{B}_{k} - \frac{\mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k}}{\boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k} \boldsymbol{\delta}_{k+1}} + \frac{\mathbf{y}_k \mathbf{y}_k^\mathrm{T}}{\mathbf{y}_k^\mathrm{T} \boldsymbol{\delta}_{k+1}} \end{align} \] where:

- the two right-hand-side terms are actually two
**matrix-valued, weighted outer products**of the vectors \( \mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \) and \( \mathbf{y}_k \) with themselves; - the above implies that the next approximation is only a difference from the last one by a
**rank-2 difference**, i.e.,

\[ - \frac{\mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k}}{\boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k} \boldsymbol{\delta}_{k+1}} + \frac{\mathbf{y}_k \mathbf{y}_k^\mathrm{T}}{\mathbf{y}_k^\mathrm{T} \boldsymbol{\delta}_{k+1}} \] has only

**2 non-zero (non-trivial to compute) eigen directions**, regardless of the dimension \( n>2 \);- the above \( \mathbf{B}_{k+1} \) remains
**symmetric and has only positive eigenvalues**when \( \mathbf{B}_0 \) satisfies this condition and \( \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{y}_k > 0 \). - the above can be solved for \( \mathbf{B}_{k+1} \) to produce \[ \boldsymbol{\delta}_{x_{k+2}} = - \mathbf{B}_{k+1}^{-1} \nabla f_{k+1} \]

- the two right-hand-side terms are actually two
We will not discuss the approximation further, only note that this works as a fairly general and efficient approximation.

Let us now consider a simple example with BFGS.

Let us define the paraboloid function, \[ f(x_1, x_2)= \left(3 - x_1 \right)^2 + \left(5 - x_2 \right)^2, \]

or in R as

```
f_exp <- expression( ( 3- x_1)^2 + (5 - x_2)^2)
f <- function(x){(3 - x[1])^2 + (5 - x[2])^2}
```

- Let us compute the gradient and the Hessian of the function analytically first:

```
print(grad_f <- c(D(f_exp, "x_1"), D(f_exp, "x_2")))
```

```
[[1]]
-(2 * (3 - x_1))
[[2]]
-(2 * (5 - x_2))
```

```
print(hessian_f_11 <- D(D(f_exp,"x_1"),"x_1"))
```

```
[1] 2
```

```
print(hessian_f_12 <- D(D(f_exp,"x_1"),"x_2"))
```

```
[1] 0
```

```
print(hessian_f_22 <- D(D(f_exp,"x_2"),"x_2"))
```

```
[1] 2
```

**Q:**what can be deduced from the above gradient and Hessian equation for the minimization problem?**A:**the Hessian tells us once again the function is globally convex and there is a critical point at \( (3,5) \).

- We can verify this with the numerical gradient and Hessian as:

```
require("numDeriv")
grad(f, x=c(3,5))
```

```
[1] 0 0
```

```
hessian(f, x=c(3,5))
```

```
[,1] [,2]
[1,] 2.000000e+00 -4.993067e-18
[2,] -4.993067e-18 2.000000e+00
```

The BFGS algorithm is provided in the package

`optimx`

.We will call the minimization procedure on the function

`f`

as follows:

```
require("optimx")
fBFGS = optimx(fn = f, # define the objective function
par = c(0, 0), # define the initial first guess at the minimum
method ="BFGS") # define the method of solution
print(fBFGS)
```

```
p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
BFGS 3 5 7.949837e-24 11 3 NA 0 TRUE TRUE 0
```

Here, this quickly finds the true minimum of the function

`f`

rapidly, due to the global convexity.More generally, it is possible that the minimization procedure will not converge due to structural issues in the problem.

We will more examples of using BFGS and the second derivative test in multiple variables in our activities and homework.

In statistics and optimization problems in which we need to

**fit a model to data**, one of the most common forms of objective functions is known as**least-squares**.In least-squares problems, the objective function \( f \) has the following special form:

\[ \begin{align} f:\mathbb{R}^n &\rightarrow \mathbb{R} \\ r_i : \mathbb{R}^n &\rightarrow \mathbb{R}\\ \mathbf{x} :&\rightarrow f(\mathbf{x}) = \sum_{i=1}^m r_i(\mathbf{x})^2 \end{align} \]

In the above expression, each of the \( r_i \) ranging from \( i=1,\cdots, m \) is called a

**residual**where we assume \( m > n \);- \( r_i \)
**measures the discrepancy of some prediction as a function of the free variables \( \mathbf{x} \) from a measured value in reality**. - \( r_i \) thus refers to the discrepancy of the \( i \)-th prediction from the \( i \)-th case over \( m > n \) total observed cases.

- \( r_i \)
These types of problems arise in nearly any field that involve a predictive model and data, and this problem is at the heart of regression and machine learning.

Assume that \( \boldsymbol{\phi} \) represents some prediction of a vector of observed values \( \mathbf{y} \), but the prediction depends on the vector of free, tuneable parameters \( \mathbf{x} \).

The

**goal then is to find the “best” choice of tuneable parameters \( \mathbf{x} \)**that can**minimize the difference between a prediction and reality**.This gives an objective function of the form,

\[ \begin{align} f(\mathbf{x}) = \sum_{i=1}^m \frac{1}{2} \vert \boldsymbol{\phi}_i(\mathbf{x}) - \mathbf{y}_i\vert^2 \end{align} \] where:

- \( \boldsymbol{\phi}_i(\mathbf{x}) \) and \( \mathbf{y}_i \) represent the predicted and observed \( i \)-th case respectively;
- in the above, we would call \( r_i(\mathbf{x}) = \frac{1}{2}\vert \boldsymbol{\phi}_i(\mathbf{x}) - \mathbf{y}_i\vert \); but
- we can generally take many other forms for the residual, amplifying specific features of the analysis with the specific choice.

One very special case is where \( \boldsymbol{\phi}(\mathbf{x}) \) is actually a linear relationship in the values \( \mathbf{x} \), in which case

\[ \boldsymbol{\Phi} \mathbf{x} = \mathbf{y} \] can be represented with the matrix equation in \( \boldsymbol{\Phi} \).

Unlike the typical linear inverse problem, \( \mathbf{\Phi}\in \mathbb{R}^{m \times n} \) is not square (assuming \( m > n \)), and no linear inverse exists in this problem for \( \boldsymbol{\Phi} \).

Using relationships of vector calculus, one can show

\[ \begin{align} \nabla f = \boldsymbol{\Phi}^\mathrm{T} \left(\boldsymbol{\Phi} \mathbf{x} - \mathbf{y}\right) & & \mathbf{H}_f= \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \end{align} \]

We note, \( \mathbf{H}_f= \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \) is a

**constant-valued, matrix squared**and (as long as \( \boldsymbol{\Phi} \) has independent columns) the**Hessian has positive eigenvalues**;- this says that linear least squares is a globally convex problem when well posed.

Similarly, if we take the unique minimizer to be defined as \( \mathbf{x}^\ast \),

\[ \begin{align} & \nabla f (\mathbf{x}^\ast) = 0 \\ \Leftrightarrow & \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \mathbf{x} = \boldsymbol{\Phi}^\mathrm{T}\mathbf{y} \end{align} \] by definition.

The above equations are known as the

**normal equations**and have a**unique solution by the property of the linear inverse problem**\[ \mathbf{x} = \left(\boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\mathrm{T}\mathbf{y} \] as long as the original \( \boldsymbol{\Phi} \) is well-posed.

This is the basis of linear regression, which we will return to after the midterm.

Generally, we may take

**\( \boldsymbol{\phi} \) to be nonlinear**, and a**direct approach as above will not work**.We can appeal to BFGS in this case to find a local minimum, but

**least squares has an extremely special structure that will lead to a more efficient approximation**.Let us recall the general form for \( f \) in terms of the residuals \( r_i \)

\[ \begin{align} f(\mathbf{x}) = \sum_{i=1}^m r_i(\mathbf{x})^2 \end{align} \]

Let us denote a vector of residuals its gradient as

\[ \begin{align} \mathbf{r} = \begin{pmatrix} r_1(\mathbf{x}) \\ \vdots \\ r_m(\mathbf{x}) \end{pmatrix} & & \nabla \mathbf{r} = \begin{pmatrix} \partial_{x_1} r_1(\mathbf{x}) & \cdots & \partial_{x_m} r_1(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \partial_{x_1} r_n(\mathbf{x}) & \cdots & \partial_{x_m} r_n(\mathbf{x}) \end{pmatrix} \end{align} \]

We can thus write,

\[ \begin{align} f(\mathbf{x}) = \mathbf{r}^\mathrm{T}\mathbf{r} \end{align} \] and utilize the special structure with the Taylor expansion.

Specifically, we can write

\[ \begin{align} \nabla f = \left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} & & \mathbf{H}_f = \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} + \sum_{i=1}^m r_i \mathbf{H}_{r_i} \end{align} \]

In most cases, \( \nabla \mathbf{r} \) is easy to calculate;

- moreover, the term \( \sum_{i=1}^m r_i \mathbf{H}_{r_i} \) tends to be much smaller than \( \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} \) when in a neighborhood \( \mathcal{N} \) of a minimizer \( \mathbf{x}^\ast \).

This leads to a direct approximation of the Newton decent method where we define the second order approximation as

\[ \begin{align} m(\boldsymbol{\delta}_{x_1}) = f(\mathbf{x}_0) + \left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} \boldsymbol{\delta}_{x_1} + \frac{1}{2}\boldsymbol{\delta}_{x_1}^\mathrm{T} \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r}\boldsymbol{\delta}_{x_1} \end{align} \]

Setting the derivative of the second order approximation to zero with respect to the perturbation, we get the

**Gauss-Newton approximation**\[ \boldsymbol{\delta}_{x_1} = - \left[ \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} \right]^{-1}\left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} \] which can be computed entirely in terms of the residuals and their gradient.

The Gauss-Newton approximation is the basis of a variety of techniques to minimize a nonlinear least squares objective functions.