Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.
The main issue with computing the minimization problem with a direct Newton descent as above is that \( \mathbf{H}_f \) is very computationally expensive to compute for any moderate number of variables.
Moreover, we need to recompute this quite often if we need to make multiple steps.
One popular way of avoiding the Hessian computation is the BFGS formula, named after its inventors, Broyden, Fletcher, Goldfarb, and Shanno.
We will go through a short description on how the approximation is made, but we will not belabor the details which go beyond our scope.
Taylor's theorem can also be applied to the gradient of \( f \) as follows to derive the necessary approximation.
Suppose that \( f \) has two continuous partial derivatives and \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \) is a small perturbation from an initial point \( \mathbf{x}_0 \).
Suppose we take \( \mathbf{f}(x) \) to be the function defined by the gradient of \( f \), i.e.,
\[ \begin{align} \mathbf{f}:\mathbb{R}^n& \rightarrow \mathbb{R}^n \\ \mathbf{x} &\rightarrow \nabla f (\mathbf{x}) \end{align} \]
\( \mathbf{f} \) also has a Taylor expansion, again taking the gradient, but of a vector valued function, i.e.,
\[ \nabla \mathbf{f} = \nabla^2 f = \mathbf{H}_f \] which can be shown by following the definition of taking the gradient of a vector valued function, and the definition of the Hessian.
Therefore, taking the first order Taylor series of \( \mathbf{f} \), we have
\[ \begin{align} \mathbf{f}(\mathbf{x}_1) &= \mathbf{f}(\mathbf{x}_0) + \nabla \mathbf{f}(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \Leftrightarrow \nabla f (\mathbf{x}_1)&= \nabla f (\mathbf{x}_0) + \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \Leftrightarrow \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0) &= \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right) \\ \end{align} \]
Recall our approximation from the last slide
\[ \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0) = \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel \right). \]
We say that for a small perturbation \( \boldsymbol{\delta}_{x_1} \) and for \( \mathbf{x}_1,\mathbf{x}_0 \in \mathcal{N} \), the locally convex neighborhood,
\[ \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \] is well approximated by the finite difference:
\[ \mathbf{H}_f(\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \approx \nabla f (\mathbf{x}_1) - \nabla f (\mathbf{x}_0). \]
We thus choose a Hessian approximation \( \widetilde{\mathbf{H}}_f \) so that it mimics the finite difference equation of the true Hessian;
\[ \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \boldsymbol{\delta}_{x_{k+1}} = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_{k}) \]
We will also need to impose that the estimate \( \widetilde{\mathbf{H}}_f(\mathbf{x}_{k}) \) is symmetric by default;
For simplicity in the notations, let us denote \( \mathbf{B}_k = \widetilde{\mathbf{H}}_f(\mathbf{x}_k) \), \( \boldsymbol{\delta}_{x_k} = \boldsymbol{\delta}_k \) and \( \mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k) \).
We will define the BFGS Hessian approximation as,
\[ \begin{align} \mathbf{B}_{k+1} = \mathbf{B}_{k} - \frac{\mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k}}{\boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k} \boldsymbol{\delta}_{k+1}} + \frac{\mathbf{y}_k \mathbf{y}_k^\mathrm{T}}{\mathbf{y}_k^\mathrm{T} \boldsymbol{\delta}_{k+1}} \end{align} \] where:
\[ - \frac{\mathbf{B}_{k}\boldsymbol{\delta}_{k+1} \boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k}}{\boldsymbol{\delta}_{k+1}^\mathrm{T} \mathbf{B}_{k} \boldsymbol{\delta}_{k+1}} + \frac{\mathbf{y}_k \mathbf{y}_k^\mathrm{T}}{\mathbf{y}_k^\mathrm{T} \boldsymbol{\delta}_{k+1}} \] has only 2 non-zero (non-trivial to compute) eigen directions, regardless of the dimension \( n>2 \);
We will not discuss the approximation further, only note that this works as a fairly general and efficient approximation.
Let us now consider a simple example with BFGS.
Let us define the paraboloid function, \[ f(x_1, x_2)= \left(3 - x_1 \right)^2 + \left(5 - x_2 \right)^2, \]
or in R as
f_exp <- expression( ( 3- x_1)^2 + (5 - x_2)^2)
f <- function(x){(3 - x[1])^2 + (5 - x[2])^2}
print(grad_f <- c(D(f_exp, "x_1"), D(f_exp, "x_2")))
[[1]]
-(2 * (3 - x_1))
[[2]]
-(2 * (5 - x_2))
print(hessian_f_11 <- D(D(f_exp,"x_1"),"x_1"))
[1] 2
print(hessian_f_12 <- D(D(f_exp,"x_1"),"x_2"))
[1] 0
print(hessian_f_22 <- D(D(f_exp,"x_2"),"x_2"))
[1] 2
Q: what can be deduced from the above gradient and Hessian equation for the minimization problem?
require("numDeriv")
grad(f, x=c(3,5))
[1] 0 0
hessian(f, x=c(3,5))
[,1] [,2]
[1,] 2.000000e+00 -4.993067e-18
[2,] -4.993067e-18 2.000000e+00
The BFGS algorithm is provided in the package optimx
.
We will call the minimization procedure on the function f
as follows:
require("optimx")
fBFGS = optimx(fn = f, # define the objective function
par = c(0, 0), # define the initial first guess at the minimum
method ="BFGS") # define the method of solution
print(fBFGS)
p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
BFGS 3 5 7.949837e-24 11 3 NA 0 TRUE TRUE 0
Here, this quickly finds the true minimum of the function f
rapidly, due to the global convexity.
More generally, it is possible that the minimization procedure will not converge due to structural issues in the problem.
We will more examples of using BFGS and the second derivative test in multiple variables in our activities and homework.
In statistics and optimization problems in which we need to fit a model to data, one of the most common forms of objective functions is known as least-squares.
In least-squares problems, the objective function \( f \) has the following special form:
\[ \begin{align} f:\mathbb{R}^n &\rightarrow \mathbb{R} \\ r_i : \mathbb{R}^n &\rightarrow \mathbb{R}\\ \mathbf{x} :&\rightarrow f(\mathbf{x}) = \sum_{i=1}^m r_i(\mathbf{x})^2 \end{align} \]
In the above expression, each of the \( r_i \) ranging from \( i=1,\cdots, m \) is called a residual where we assume \( m > n \);
These types of problems arise in nearly any field that involve a predictive model and data, and this problem is at the heart of regression and machine learning.
Assume that \( \boldsymbol{\phi} \) represents some prediction of a vector of observed values \( \mathbf{y} \), but the prediction depends on the vector of free, tuneable parameters \( \mathbf{x} \).
The goal then is to find the “best” choice of tuneable parameters \( \mathbf{x} \) that can minimize the difference between a prediction and reality.
This gives an objective function of the form,
\[ \begin{align} f(\mathbf{x}) = \sum_{i=1}^m \frac{1}{2} \vert \boldsymbol{\phi}_i(\mathbf{x}) - \mathbf{y}_i\vert^2 \end{align} \] where:
One very special case is where \( \boldsymbol{\phi}(\mathbf{x}) \) is actually a linear relationship in the values \( \mathbf{x} \), in which case
\[ \boldsymbol{\Phi} \mathbf{x} = \mathbf{y} \] can be represented with the matrix equation in \( \boldsymbol{\Phi} \).
Unlike the typical linear inverse problem, \( \mathbf{\Phi}\in \mathbb{R}^{m \times n} \) is not square (assuming \( m > n \)), and no linear inverse exists in this problem for \( \boldsymbol{\Phi} \).
Using relationships of vector calculus, one can show
\[ \begin{align} \nabla f = \boldsymbol{\Phi}^\mathrm{T} \left(\boldsymbol{\Phi} \mathbf{x} - \mathbf{y}\right) & & \mathbf{H}_f= \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \end{align} \]
We note, \( \mathbf{H}_f= \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \) is a constant-valued, matrix squared and (as long as \( \boldsymbol{\Phi} \) has independent columns) the Hessian has positive eigenvalues;
Similarly, if we take the unique minimizer to be defined as \( \mathbf{x}^\ast \),
\[ \begin{align} & \nabla f (\mathbf{x}^\ast) = 0 \\ \Leftrightarrow & \boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \mathbf{x} = \boldsymbol{\Phi}^\mathrm{T}\mathbf{y} \end{align} \] by definition.
The above equations are known as the normal equations and have a unique solution by the property of the linear inverse problem
\[ \mathbf{x} = \left(\boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\mathrm{T}\mathbf{y} \] as long as the original \( \boldsymbol{\Phi} \) is well-posed.
This is the basis of linear regression, which we will return to after the midterm.
Generally, we may take \( \boldsymbol{\phi} \) to be nonlinear, and a direct approach as above will not work.
We can appeal to BFGS in this case to find a local minimum, but least squares has an extremely special structure that will lead to a more efficient approximation.
Let us recall the general form for \( f \) in terms of the residuals \( r_i \)
\[ \begin{align} f(\mathbf{x}) = \sum_{i=1}^m r_i(\mathbf{x})^2 \end{align} \]
Let us denote a vector of residuals its gradient as
\[ \begin{align} \mathbf{r} = \begin{pmatrix} r_1(\mathbf{x}) \\ \vdots \\ r_m(\mathbf{x}) \end{pmatrix} & & \nabla \mathbf{r} = \begin{pmatrix} \partial_{x_1} r_1(\mathbf{x}) & \cdots & \partial_{x_m} r_1(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \partial_{x_1} r_n(\mathbf{x}) & \cdots & \partial_{x_m} r_n(\mathbf{x}) \end{pmatrix} \end{align} \]
We can thus write,
\[ \begin{align} f(\mathbf{x}) = \mathbf{r}^\mathrm{T}\mathbf{r} \end{align} \] and utilize the special structure with the Taylor expansion.
Specifically, we can write
\[ \begin{align} \nabla f = \left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} & & \mathbf{H}_f = \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} + \sum_{i=1}^m r_i \mathbf{H}_{r_i} \end{align} \]
In most cases, \( \nabla \mathbf{r} \) is easy to calculate;
This leads to a direct approximation of the Newton decent method where we define the second order approximation as
\[ \begin{align} m(\boldsymbol{\delta}_{x_1}) = f(\mathbf{x}_0) + \left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} \boldsymbol{\delta}_{x_1} + \frac{1}{2}\boldsymbol{\delta}_{x_1}^\mathrm{T} \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r}\boldsymbol{\delta}_{x_1} \end{align} \]
Setting the derivative of the second order approximation to zero with respect to the perturbation, we get the Gauss-Newton approximation
\[ \boldsymbol{\delta}_{x_1} = - \left[ \left(\nabla\mathbf{r}\right)^\mathrm{T} \nabla\mathbf{r} \right]^{-1}\left(\nabla\mathbf{r}\right)^\mathrm{T} \mathbf{r} \] which can be computed entirely in terms of the residuals and their gradient.
The Gauss-Newton approximation is the basis of a variety of techniques to minimize a nonlinear least squares objective functions.