A review of vector calculus and concepts in optimization


FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:
    • Concepts in analytical and numerical differentiation
    • Newton’s method in one variable
    • Tangent vectors, tangent spaces and vector fields
    • The Jacobian, the inverse function theorem and Newton’s method in multiple variables
    • Gradients, Hessians, and concepts in optimization

Concepts in analytical differentiation

Tangent line approximation by derivative.

Courtesy of Pbroks13, CC BY-SA 3.0, via Wikimedia Commons

  • The derivative represents the slope of a tangent line to a curve.
  • In the figure to the left, we see the function \( f \) represented by the blue curve.
  • The derivative \( f'(x) \) at a given point gives the infinitesimal rate of change at that point with respect to small changes in \( x \), denoted \( \delta_x \).
  • Suppose we have a point \( x_0 \) and a nearby point \( x_1 \) that differs from it by only a small amount \( \delta_{x_1} \): \[ x_1 = x_0+\delta_{x_1}. \]
  • The approximation \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1} \] is known as the tangent line approximation to the function \( f \) at \( x_0 \).
  • Such an approximation exists when \( f \) is sufficiently smooth and is accurate when \( \delta_{x_1} \) is small, i.e., when \( x_1 \) lies close to the fixed value \( x_0 \).
  • We can see graphically how the approximation becomes worse as we take \( \delta_{x_1} \) too large.

Concepts in analytical differentiation

  • More generally, the tangent line approximation is one kind of general Taylor approximation.

  • Suppose we have a point \( x_0 \) fixed, and define \( x_1 \) as a small perturbation \[ x_1 = x_0+\delta_{x_1}, \]

  • If a function \( f \) has \( k \) continuous derivatives we can write \[ f(x_1) = f(x_0) + f'(x_0)\delta_{x_1} + \frac{f''(x_0)}{2!}\delta_{x_1}^2 + \cdots + \frac{f^{(k)}(x_0)}{k!} \delta_{x_1}^k + \mathcal{O}\left(\delta_{x_1}^{k+1}\right) \]

  • The \( \mathcal{O}\left(\delta_{x_1}^{k+1}\right) \) refers to the remainder term, which grows or shrinks like the size of the perturbation raised to the power \( k+1 \).

    • This is why this approximation works well when \( \delta_{x_1} \) is a small perturbation.
  • Another important practical example of using this Taylor approximation, when the function \( f \) has two continuous derivatives, is \[ f(x_0 + \delta_{x_1}) \approx f(x_0) + f'(x_0)\delta_{x_1} + f''(x_0) \frac{\delta_{x_1}^2}{2} \] which will be used shortly for obtaining solutions to several kinds of equations.

  • Particularly, this is strongly related to our second derivative test from univariate calculus.
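  • As a quick numerical illustration (a minimal sketch, using the assumed example \( f(x) = e^x \), which is not from the slides), we can check how the second order approximation behaves as the perturbation shrinks:

import numpy as np

# Assumed example (not from the slides): f(x) = exp(x), so that
# f'(x) = f''(x) = exp(x) and the second-order Taylor terms are easy to form.
f = np.exp
x0 = 0.0

for delta in (0.5, 0.1, 0.01):
    exact = f(x0 + delta)
    taylor2 = f(x0) + f(x0) * delta + f(x0) * delta**2 / 2
    print(f"delta = {delta:5.2f}, error = {abs(exact - taylor2):.2e}")
# The error shrinks roughly like delta**3, consistent with the
# O(delta**(k+1)) remainder with k = 2.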

An approach to numerical derivation

  • We now consider how Taylor's expansion at first order can be used to approximate the derivative numerically.

  • Recall, we write

    \[ \begin{align} f(x_1) &= f(x_0) + f'(x_0) \delta_{x_1} + \mathcal{O}\left( \delta_{x_1}^2\right) \\ \Leftrightarrow \frac{f(x_1) - f(x_0)}{ \delta_{x_1}} &= f'(x_0) + \mathcal{O}\left( \delta_{x_1}\right) \end{align} \]

  • This says that for a small value of \( \delta_{x_1} \), the difference quotient on the left-hand side approximates \( f'(x_0) \) with an error proportional to the size of \( \delta_{x_1} \).

  • This gives the forward finite difference approximation to the derivative.

  • We can similarly define a backward finite difference approximation by taking \( x_1 := x_0 - \delta_{x_1} \) (see the sketch after this list).

  • In each case, we use the perturbation to parameterize the tangent-line approximation.
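  • A minimal sketch of both differences is given below, using the assumed test function \( f(x)=\sin(x) \) with known derivative \( \cos(x) \):

import numpy as np

# Minimal sketch: forward and backward finite differences for f(x) = sin(x),
# an assumed test function whose exact derivative is cos(x).
f = np.sin
x0 = 1.0
delta = 1e-5

forward = (f(x0 + delta) - f(x0)) / delta     # forward difference
backward = (f(x0) - f(x0 - delta)) / delta    # backward difference
print(forward, backward, np.cos(x0))          # both approximate f'(x0)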

Newton's method in one variable

  • We have seen earlier the basic linear inverse problem,

    \[ \begin{align} \mathbf{A}\pmb{x} = \pmb{b} \end{align} \] where \( \pmb{b} \) is an observed quantity and \( \pmb{x} \) are the unknown variables related to \( \pmb{b} \) by the relationships in \( \mathbf{A} \).

    • We observed that a unique solution exists when the columns of \( \mathbf{A} \) are linearly independent, corresponding to all non-zero eigenvalues.
  • A similar problem exists when the relationship between \( \pmb{x} \) and \( \pmb{b} \) is non-linear, but we still wish to find some such \( \pmb{x} \).

Nonlinear inverse problem (scalar case)
Suppose we know the nonlinear, scalar function \( f \) that gives a relationship \[ \begin{align} f(x^\ast) = b \end{align} \] for an observed \( b \) but an unknown \( x^\ast \). Finding a value of \( x^\ast \) that satisfies \( f(x^\ast)=b \) is known as a nonlinear inverse problem.
  • Define a function \[ \begin{align} \tilde{f}(x) = f(x)-b. \end{align} \]

  • Thus solving the nonlinear inverse problem in one variable is equivalent to finding the appropriate \( x^\ast \) for which \[ \begin{align} \tilde{f}(x^\ast)= 0 . \end{align} \]

  • Finding a zero of a function, or root finding, is thus equivalent to a nonlinear inverse problem.

  • The Newton-Raphson method is one classical approach which has inspired many modern techniques.

Newton's method in one variable

  • We are searching for the point \( x^\ast\in \mathbb{R} \) for which the modified function satisfies \( \tilde{f}\left(x^\ast\right) = 0 \), and we suppose we have a good initial guess \( x_0 \).
  • We define the tangent approximation as, \[ t(\delta_x) = \tilde{f}(x_0) + \tilde{f}'(x_0) \delta_x \] for some small perturbation value of \( \delta_x \).
  • Recall, \( \tilde{f}'(x_0) \) refers to the value of the derivative of \( \tilde{f} \) at the point \( x_0 \) – suppose this value is nonzero.
  • In this case, we will examine where the tangent line intersects zero to find a better approximation of \( x^\ast \).
  • Suppose that for \( \delta_{x_0} \) we have \[ \begin{matrix} t(\delta_{x_0}) = 0 & \Leftrightarrow & 0= \tilde{f}(x_0) + \tilde{f}'(x_0) \delta_{x_0} & \Leftrightarrow &\delta_{x_0} = \frac{-\tilde{f}(x_0)}{\tilde{f}'(x_0)} \end{matrix} \]
  • The above solution makes sense as long as \( \tilde{f}'(x_0) \) is not equal to zero;
    • in that case, the tangent line intersects zero at \( x_1 = x_0 + \delta_{x_0} \), giving a new approximation of \( x^\ast \).
Animation of Newton iterations.

Courtesy of Ralf Pfeifer, CC BY-SA 3.0, via Wikimedia Commons

  • The process of recursively solving for a better approximation of \( x^\ast \) terminates when we reach a certain tolerated level of error in the solution or the process times out, failing to converge.
  • This method has a direct analog in multiple variables, for which we will need to extend our notion of the derivative and Taylor’s theorem to multiple dimensions.
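  • Before turning to the built-in solver in the next example, a minimal hand-written sketch of this recursion might look as follows (the tolerance and iteration cap are illustrative choices, not prescribed by the slides):

# Minimal sketch of the Newton-Raphson recursion in one variable.
def newton_1d(f_tilde, f_tilde_prime, x0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        fx = f_tilde(x)
        if abs(fx) < tol:                 # error tolerance reached
            return x
        x = x - fx / f_tilde_prime(x)     # x_{k+1} = x_k - f(x_k) / f'(x_k)
    raise RuntimeError("Newton iteration failed to converge")

# Example: root of x**3 - 1, starting from the guess 1.5.
print(newton_1d(lambda x: x**3 - 1, lambda x: 3 * x**2, 1.5))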

Newton's method – example

  • As a quick example, let's consider the Newton algorithm built-in to Scipy.

    • Scipy is another standard library like Numpy, but it contains a variety of scientific methods and solvers rather than general-purpose linear algebra routines.
  • Specifically, we will import the built-in newton function from the optimize sub-module of scipy.

from scipy.optimize import newton
  • In the following, we define the cubic function \( f(x):=x^3 \), but we are interested in the value \( x^\ast \) for which \( f\left(x^\ast\right)=1 \)

    • The augmented function \( \tilde{f}(x):= x^3 - 1 \) defines the root-finding problem from the nonlinear inverse problem:
def f(x): return (x**3 - 1)
  • The newton function can be supplied an analytical derivative, if this can be computed, to improve the accuracy versus, e.g., a finite-differences approximation.

    • In the below, we supply this as a simple lambda function in the arguments of newton:
root = newton(f, 1.5, fprime=lambda x: 3 * x**2)
root
1.0

Tangent vectors

  • To expand our discussion to multiple variables, we will review some fundamental concepts of vector calculus.

  • Suppose we have a vector valued function, with a single argument:

    \[ \begin{align} \pmb{x}:&\mathbb{R} \rightarrow \mathbb{R}^{N};\\ \pmb{x}(t) :=& \begin{pmatrix} x_1(t) & \cdots & x_{N}(t)\end{pmatrix}^\top; \end{align} \]

    • prototypically, we will think of \( \pmb{x}(t) \) as a curve in state-space, with its position at each time \( t\in\mathbb{R} \) defined by the equation above.
Tangent vector
Suppose \( \pmb{x}(t) \) is defined as above and that each of the component functions \( x_i(t) \) are differentiable. The tangent vector to the state trajectory \( \pmb{x} \) is defined as \[ \vec{x}:= \frac{\mathrm{d}}{\mathrm{d}t} \pmb{x}:= \begin{pmatrix}\frac{\mathrm{d}}{\mathrm{d}t} x_1(t) & \cdots & \frac{\mathrm{d}}{\mathrm{d}t} x_{N}(t)\end{pmatrix}^\top \]
  • In the above, the interpretation of the derivative defining a tangent line is extended into multiple variables;

    • in this case, the tangent line is embedded in a higher-dimensional space of multiple variables.
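  • For concreteness, a small sketch (with the assumed curve \( \pmb{x}(t) = (\cos t, \sin t)^\top \), the unit circle) compares the analytical tangent vector with a finite-difference estimate:

import numpy as np

# Assumed example curve: x(t) = (cos t, sin t), whose tangent vector
# is (-sin t, cos t).
def x(t):
    return np.array([np.cos(t), np.sin(t)])

t0, dt = 0.3, 1e-6
tangent_exact = np.array([-np.sin(t0), np.cos(t0)])
tangent_fd = (x(t0 + dt) - x(t0)) / dt        # forward difference in t
print(tangent_exact, tangent_fd)              # agree up to O(dt)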

Tangent spaces

  • An important extension of the tangent vector is the notion of the tangent space;
    • this can be defined in terms of all differential perturbations generated at a point:
    Tangent spaces
    Let \( \pmb{x}\in\mathbb{R}^{N} \) and \( \pmb{\gamma}(t) \) be an arbitrary differentiable curve \( \pmb{\gamma}:\mathbb{R}\rightarrow \mathbb{R}^{N} \) such that \( \pmb{\gamma}(0)= \pmb{x} \), with a tangent vector defined as \( \vec{\gamma}(0):= \frac{\mathrm{d}}{\mathrm{d}t}|_0 \pmb{\gamma} \). The tangent space at \( \pmb{x} \), denoted \( T_{\pmb{x}} \), is defined as the linear span of all tangent vectors of such curves through \( \pmb{x} \).
  • In the above, we consider only the simplest definition of the tangent space;
    • in this case the tangent space, \( T_{\pmb{x}} \equiv \mathbb{R}^{N} \), is simply the space of all perturbations to the point \( \pmb{x} \).
    • However, this idea is extended into far greater generality:
Tangent plane representation

Courtesy of TN, Public domain, via Wikimedia Commons

  • A “differentiable manifold” is a space that looks “locally” like \( \mathbb{R}^{N} \), including, e.g., curved hyper-surfaces.
  • The notion of the tangent plane thus is extended by using calculus as usual on \( \mathbb{R}^{N} \) locally on a differentiable manifold.
  • We can imagine, as in the figure above, the tangent plane being defined by all tangent vectors of differentiable curves defined in the manifold.
  • The tangent plane thus gives a general linear approximation to what the manifold looks like (in all directions) up to small perturbations.

Vector fields

  • The tangent space construction gives us the ability to define arbitrary vector fields.
  • Vector fields
    For a smooth manifold \( M \), a vector field (or flow field) is a mapping from \( M \) to the tangent space \( TM \) that assigns to each point \( \pmb{x}\in M \) the tangent vector \( \vec{\gamma} \) of a curve \( \pmb{\gamma} \) that passes through \( \pmb{x} \).
  • In the above, we again consider a simple, intuitive version of this concept;
    • this idea can again be extended into far greater generality.
  • However, this gives a good picture of how we will use vector fields to define equations of motion and state space trajectories.
  • A vector field can for instance define the velocity vector (instantaneous rate of change and direction) for a particle of fluid in the atmosphere.
  • In the figure to the right, we can imagine that the sphere represents the surface of the Earth, with the vector field defining the equations of motion:
Vector field on a sphere

Courtesy of I, Cronholm144, CC BY-SA 3.0, via Wikimedia Commons

  • A curve \( \pmb{\gamma}(t) \) can be defined in time as the evolution of the particle, where at any given point its tangent vector \( \vec{\gamma}(t) \) equals the value of the vector field.
  • More abstractly, we can consider a curve \( \pmb{x}(t):\mathbb{R}\rightarrow \mathbb{R}^{N} \) to represent all time-evolving states in a dynamical model;
    • this can represent, e.g., the temperature, pressure and humidity in the atmosphere at all grid points defined with a discretized sphere.
  • As we move forward, we will be interested in how to “track” the trajectory in the state space, knowing the principles of its evolution, a probability density on its current / past values and having random observations.

The Jacobian

  • Not all functions in multiple variables have a single input – we can instead write

    \[ \begin{align} \pmb{f} : \mathbb{R}^N & \rightarrow \mathbb{R}^M\\ \pmb{x} &\rightarrow \begin{pmatrix} f_1\left(\pmb{x}\right) \\ \vdots \\ f_M(\pmb{x})\end{pmatrix} \end{align} \]

  • In the above, each of the component-functions \( f_i \) for \( i=1,\cdots , M \) is a function defined

    \[ \begin{align} f_i :\mathbb{R}^N &\rightarrow \mathbb{R} \\ \pmb{x} &\rightarrow f_i (\pmb{x}) \end{align} \]

The Jacobian
For a vector-valued, continuously differentiable function \( \pmb{f} \) as above, the Jacobian is defined as the matrix of first partial derivatives \[ \begin{align} \nabla \pmb{f} &= \begin{pmatrix} \partial_{x_1} f_1 & \partial_{x_2} f_1 & \cdots & \partial_{x_N} f_1 \\ \partial_{x_1} f_2 & \partial_{x_2} f_2 & \cdots & \partial_{x_N} f_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_1} f_M & \partial_{x_2} f_M & \cdots & \partial_{x_N} f_M \end{pmatrix}\in\mathbb{R}^{M\times N} \end{align} \] where the partial derivatives are with respect to the components of \( \pmb{x} \).

The Jacobian

  • The Jacobian is also a tangent-linear approximation, taking into account perturbations in all directions of \( \pmb{x} \).

  • This also gives a version of Taylor's theorem where we can write a tangent-linear approximation of a mapping.

Tangent-linear approximation
Let \( \pmb{f}:\mathbb{R}^N \rightarrow \mathbb{R}^M \) be a continuously differentiable mapping and \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1} \) be a perturbation within a sufficiently small neighborhood. Then, the tangent-linear approximation for \( \pmb{f}(\pmb{x}_1) \) is given as \[ \begin{align} \pmb{f} \left(\pmb{x}_1\right) & = \pmb{f} \left(\pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1}\right) \\ & \approx \pmb{f}(\pmb{x}_0) + \nabla \pmb{f}|_{\pmb{x}_0} \pmb{\delta}_{\pmb{x}_1}. \end{align} \]
  • The tangent-linear approximation above gives an approximation of the image of \( \pmb{f} \) with the tangent space based at \( \pmb{f}(\pmb{x}_0) \), parameterized by the perturbation \( \pmb{\delta}_{\pmb{x}_1} \).

  • In particular, the Jacobian is a mapping defined

    \[ \begin{align} \nabla \pmb{f}|_{\pmb{x}} :T_{\pmb{x}} \rightarrow T_{\pmb{f}(\pmb{x})} \end{align} \]

    between the tangent space at the input and the tangent space at the output.

  • The tangent space is also a linear space by construction, thus giving the “linear” approximation.

  • This is also generalized in a local sense with mappings between differentiable manifolds and their tangent spaces.
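  • As a minimal sketch (the helper name, test map and step size below are illustrative assumptions, not from the slides), the Jacobian can be approximated column-by-column with forward differences:

import numpy as np

# Minimal sketch of a finite-difference Jacobian of a map f: R^N -> R^M.
def jacobian_fd(f, x, delta=1e-6):
    x = np.asarray(x, dtype=float)
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = delta
        J[:, j] = (f(x + e) - fx) / delta   # column j: partials w.r.t. x_j
    return J

# Assumed example map from R^2 to R^2 with known Jacobian [[2x, 1], [y, x]].
f = lambda v: np.array([v[0]**2 + v[1], v[0] * v[1]])
print(jacobian_fd(f, [1.0, 2.0]))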

Inverse function theorem

  • The Jacobian furthermore gives the extension of the nonlinear inverse problem to multiple dimensions.
Nonlinear inverse problem (multivariate case)
Suppose we know the nonlinear, multivariate function \( \pmb{f} \) that gives a relationship \[ \begin{align} \pmb{f}(\pmb{x}^\ast) = \pmb{b} \end{align} \] for an observed \( \pmb{b} \) but an unknown \( \pmb{x}^\ast \). Finding a value of \( \pmb{x}^\ast \) that satisfies \( \pmb{f}(\pmb{x}^\ast)=\pmb{b} \) is known as a nonlinear inverse problem.
  • An extremely important result from vector calculus establishes a local notion of invertibility for nonlinear inverse problems as above:
The inverse function theorem
Let \( \pmb{f}:\mathbb{R}^N \rightarrow \mathbb{R}^N \) be a nonlinear function such that the Jacobian \( \nabla \pmb{f}|_{\pmb{x}^\ast}\in\mathbb{R}^{N \times N} \) exists, is an invertible matrix, and is continuous at \( \pmb{x}^\ast \). Then there exists a neighborhood \( \mathcal{N} \) containing the image \( \pmb{f}\left(\pmb{x}^\ast\right) \) for which any \( \pmb{p}\in\mathcal{N} \) has a unique inverse value \( \pmb{q} \) where \[ \begin{align} \pmb{p}:= \pmb{f}(\pmb{q}). \end{align} \] I.e., \( \pmb{f}^{-1} \) exists on \( \mathcal{N} \) such that \( \pmb{f}^{-1}\circ \pmb{f}= \mathbf{I}_N \) within the properly defined domain.
  • In particular, the inverse function theorem motivates the extension of Newton's method to multiple variables.

Multivariate Newton

  • The Newton-Raphson method can now be restated in terms of multiple variables as follows.

  • Suppose that we have a nonlinear inverse problem stated as follows:

\[ \begin{align} \pmb{f} :\mathbb{R}^N &\rightarrow \mathbb{R}^N \\ \pmb{x} & \rightarrow \pmb{f}(\pmb{x}) \\ \pmb{f}\left(\pmb{x}^\ast\right)& = \pmb{b} \end{align} \]

  • We redefine this in terms of the adjusted function \( \tilde{\pmb{f}}\left(\pmb{x}^\ast\right) = \pmb{f}\left(\pmb{x}^\ast\right) - \pmb{b} = \pmb{0} \), and we wish to make the same first order approximation as before.

  • Supposing we have a good initial guess \( \pmb{x}_0 \) for \( \pmb{x}^\ast \), we look for the point where the tangent-linear approximation equals zero, i.e., \[ \begin{align} \pmb{0} &= \tilde{\pmb{f}}\left(\pmb{x}_0\right) + \nabla \tilde{\pmb{f}}\left(\pmb{x}_0\right) \pmb{\delta}_{\pmb{x}_1} \\ \Leftrightarrow \pmb{\delta}_{\pmb{x}_1} &= -\left(\nabla \tilde{\pmb{f}}\left(\pmb{x}_0\right) \right)^{-1} \tilde{\pmb{f}}\left(\pmb{x}_0\right), \end{align} \] where \( \nabla \tilde{\pmb{f}} = \nabla \pmb{f} \) since \( \pmb{b} \) is constant.

  • The above makes sense as an approximation as long as \( \left(\nabla \pmb{f}\left(\pmb{x}_0\right) \right)^{-1} \) exists, i.e., as long as the Jacobian has no zero eigenvalues.

  • We can thus once again update the approximation recursively so that \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1} \), as long as the inverse exists.

  • This update continues until an error tolerance is reached or the optimization times out.
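  • A minimal sketch of this recursion, assuming the Jacobian is available analytically, is given below; the example system and the tolerance are illustrative choices, not from the slides.

import numpy as np

# Minimal sketch of multivariate Newton-Raphson; the Newton step solves
# the linear system (grad f) delta = -f(x_k) rather than forming an inverse.
def newton_nd(f, jac, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:            # error tolerance reached
            return x
        delta = np.linalg.solve(jac(x), -fx)    # Newton step
        x = x + delta                           # x_{k+1} = x_k + delta
    raise RuntimeError("Newton iteration failed to converge")

# Assumed example system: x^2 + y^2 = 2 and x = y, with solution (1, 1).
f = lambda v: np.array([v[0]**2 + v[1]**2 - 2.0, v[0] - v[1]])
jac = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]], [1.0, -1.0]])
print(newton_nd(f, jac, [2.0, 0.5]))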

Concepts in optimization

  • A related notion to the inverse problem is the maximization or minimization (optimization) of functions.

  • Optimization problems contain two components:

    1. an objective function \( f(\pmb{x}) \); and
    2. constraints \( g(\pmb{x}) \).
  • E.g., we may wish to optimize factory output \( f(x) \) as a function of hours \( x \) in a week, with a measure of our active machine-hours \( g(x) \) not exceeding a pre-specified limitation \( g(x)\leq C \).

  • Optimization problems can thus be classified into two categories:

    • If there are constraints \( g(\pmb{x}) \) affiliated with the objective function \( f(\pmb{x}) \), then it is a constrained optimization problem; otherwise, it is an unconstrained optimization problem.
  • We will focus on the simpler unconstrained optimization; this is formulated as the following problem,

    \[ \begin{align} f: \mathbb{R}^N \rightarrow \mathbb{R}& & \pmb{x} \rightarrow f(\pmb{x}) & & f(\pmb{x}^\ast) = \mathrm{max}_{\pmb{x} \in \mathcal{D}} f \end{align} \]

  • We note that the above problem is equivalent to a minimization problem by a substitution of \( \tilde{f} = -f \), i.e.,

    \[ \begin{align} \tilde{f}: \mathbb{R}^N \rightarrow \mathbb{R}& & \pmb{x} \rightarrow -f(\pmb{x})& & f(\pmb{x}^\ast) = \mathrm{max}_{\pmb{x} \in \mathcal{D}} f & & \tilde{f}(\pmb{x}^\ast)= \mathrm{min}_{\pmb{x}\in \mathcal{D}} \tilde{f} \end{align} \]

Concepts in optimization

  • Because these problems are equivalent, we focus on the minimization of functions as they are traditionally phrased in optimization.
  • The same techniques will apply for maximization by a simple change of variables.
  • We will need to identify a few key concepts: global and local minimizers.
  • Suppose we are trying to minimize an objective function \( f \).
  • We would ideally find a global minimizer of \( f \), a point where the function attains its least value over all possible values under consideration:
    Global minimizer
    A point \( \pmb{x}^\ast \) is a global minimizer if \( f(\pmb{x}^\ast) \leq f(\pmb{x}) \) for all other possible \( \pmb{x} \) in the domain of consideration \( D\subset \mathbb{R}^n \).
  • A global minimizer can be difficult to find, because our knowledge of \( f \) is usually only local;
A difficult global minimization.

Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.

    • i.e., we only can approximate the behavior of the function \( f \) within small perturbations \( \pmb{\delta}_{x} \) of values \( \pmb{x} \) where we already know \( f(\pmb{x}) \).
  • Because an efficient algorithm should not need to evaluate \( f \) at many points, we usually do not have a good picture of the overall shape of \( f \);
    • generally, we can never be sure that the function does not take a sharp dip in some region that has not been sampled by the algorithm.

Local minima and convexity

  • The difficulty of finding a global minimum means that we will generally need to handle local minima:
  • Local minimizer
    A point \( \pmb{x}^\ast \) is a local minimizer if there exists some neighborhood \( \mathcal{N}\subset D \) containing \( \pmb{x}^\ast \) such that \( f(\pmb{x}^\ast) \leq f(\pmb{x}) \) for all other possible \( \pmb{x} \in \mathcal{N} \).
  • Throughout mathematics, the notion of convexity is a powerful tool, often used in optimization for understanding local minima.
  • A function is convex if and only if the region above its graph is a convex set.
Image of the upper region enclosed by a convex function.

Courtesy of: Oleg Alexandrov. Public domain, via Wikimedia Commons.

  • The convexity of the full epigraph set means that the function attains a global minimum over its entire domain.
  • A non-convex function can still have regions in which its graph is locally convex.
  • For such regions, we can find local minimizers as defined above.
  • In a single variable, this is phrased in terms of the second derivative test.
  • Second derivative test
    For the function of one variable \( f(x) \) we say that \( x^\ast \) is a local minimizer if \( f'(x^\ast)=0 \) and \( f''(x^\ast)> 0 \).
  • There is a direct analogy for a function of multiple variables, but this needs to be rephrased slightly.
  • We introduce the tools as follows.
  • The gradient

    • We suppose that the objective function takes multivariate inputs and gives a scalar output:

      \[ \begin{align} f:&\mathbb{R}^N \rightarrow \mathbb{R}\\ &\pmb{x}\mapsto f(\pmb{x}) \end{align} \]

    • Formally we will write the gradient as follows, using the same \( \nabla \) notation

    The gradient
    Suppose \( f \) is a continuously differentiable objective function defined as above; then the gradient is given as \[ \begin{align} \nabla f = \begin{pmatrix} \partial_{x_1}f & \partial_{x_2} f & \cdots & \partial_{x_N} f \end{pmatrix}^\top \in \mathbb{R}^N \end{align} \] where the above partial derivatives are with respect to the components of \( \pmb{x}\in\mathbb{R}^N \).
    • An important property of the gradient is that it gives the tangent vector for a curve in the direction and velocity of greatest ascent for the output of \( f \).

    • We also use the gradient, as in previous cases, to form a linear approximation of the multivariate-input, scalar-output function \( f(\pmb{x}) \).

    • Let \( \pmb{x}_0\in \mathbb{R}^n \) be some vector and \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{x_1} \), where \( \pmb{\delta}_{x_1} \) is now a vector of small perturbations.

    • At first order, the Taylor series is given as

      \[ f(\pmb{x}_1) = f(\pmb{x}_0) + \left(\nabla f(\pmb{x}_0)\right)^\top \pmb{\delta}_{x_1} + \mathcal{O}\left(\parallel \pmb{\delta}_{x_1}\parallel^2\right); \]

      • note, the scalar output is given by the inner product of the gradient with the perturbation.
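    • As a quick sketch (the quadratic below is an assumed example), the analytical gradient can be compared against scipy.optimize.approx_fprime, which estimates it by finite differences:

import numpy as np
from scipy.optimize import approx_fprime

# Assumed example: f(x) = x_1^2 + 3 x_2^2 with gradient (2 x_1, 6 x_2).
f = lambda x: x[0]**2 + 3.0 * x[1]**2
x0 = np.array([1.0, -2.0])

grad_exact = np.array([2.0 * x0[0], 6.0 * x0[1]])
grad_fd = approx_fprime(x0, f, 1e-7)          # finite-difference estimate
print(grad_exact, grad_fd)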

    The Hessian

    • A similar second order approximation to the one we developed for univariate function \( f(x) \), \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1} + f''(x_0)\frac{\delta_{x_1}^2}{2}, \] can be developed for the multivariate-input case.

    • In order to do so, we will need to introduce the idea of the Hessian as the second multivariate derivative of \( f \).

    The Hessian
    Suppose that \( f \) is a scalar-output function of multiple variables, \[ \begin{align} f: \mathbb{R}^n & \rightarrow \mathbb{R} \\ \pmb{x} &\rightarrow f(\pmb{x}) \end{align} \] with continuous second derivatives. The Hessian matrix for the function \( f \) is defined as \( \mathbf{H}_f \) \[ \begin{align} \mathbf{H}_{f} = \begin{pmatrix} \partial_{x_1}^2 f & \partial_{x_1}\partial_{x_2} f & \cdots & \partial_{x_1}\partial_{x_n}f \\ \partial_{x_2}\partial_{x_1} f & \partial_{x_2}^2 f & \cdots & \partial_{x_2} \partial_{x_n}f \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_n}\partial_{x_1} f & \partial_{x_n}\partial_{x_2} f & \cdots & \partial_{x_n}^2 f \end{pmatrix}. \end{align} \]
    • For short, this is often written as \( \mathbf{H}_f = \left\{ \partial_{x_i}\partial_{x_j} f\right\}_{i,j=1}^n \) where this refers to the \( i \)-th row and \( j \)-th column.

    • \( \mathbf{H}_f \) can be evaluated at a particular point \( \pmb{x}_0 \), and notice that \( \mathbf{H}_f \) is always symmetric.

    The Hessian

    • As before, let \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1} \) be given as a small perturbation of \( \pmb{x}_0 \).

    • Using the Hessian as defined on the last slide, if \( f \) has continuous second order partial derivatives, the Taylor series is given at second order as \[ \begin{align} f(\pmb{x}_1) = f(\pmb{x}_0) + \left(\nabla f(\pmb{x}_0)\right)^\top \pmb{\delta}_{x_1} + \frac{1}{2} \pmb{\delta}_{x_1}^\top\mathbf{H}_f (\pmb{x}_0) \pmb{\delta}_{x_1} + \mathcal{O}\left(\parallel \pmb{\delta}_{x_1}\parallel^3\right) \end{align} \]

    • Similarly, our second order approximation is defined as follows.

    Second order objective function approximation
    Let \( f \) be a multi-input, scalar-output function with second order continuous derivatives. We define the second order approximation as \[ \begin{align} f(\pmb{x}_1) \approx f(\pmb{x}_0) + \left(\nabla f(\pmb{x}_0)\right)^\mathrm{T} \pmb{\delta}_{x_1}+ \frac{1}{2} \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}_0) \pmb{\delta}_{x_1} \end{align} \]
    • This approximation above is accurate when the size of the perturbation \( \parallel \pmb{\delta}_{x_1}\parallel \) is small.

    The second derivative test with the Hessian

    • For a real-valued function of multiple variables, \[ f:\mathbb{R}^n \rightarrow \mathbb{R}, \] we define the second derivative test in terms of the Hessian of \( f \), \[ \begin{align} \mathbf{H}_{f} = \begin{pmatrix} \partial_{x_1}^2 f & \cdots & \partial_{x_1}\partial_{x_n}f \\ \vdots &\ddots & \vdots \\ \partial_{x_n}\partial_{x_1} f & \cdots & \partial_{x_n}^2 f \end{pmatrix} \end{align} \]
    • Particularly, the spectral theorem says the Hessian is diagonalizable by an orthogonal change of basis.
    Different critical points for different Hessian spectrum.

    Courtesy of: Ag2gaeh, CC BY-SA 4.0, via Wikimedia Commons.

    • The different combinations of eigenvalues of the Hessian will determine the local curvature of the graph of \( f \) with examples in two dimensions pictured above:
      1. Left: we see a convex function (a simple paraboloid) – around this critical point, the eigenvalues of \( \mathbf{H}_f \) will all be strictly positive.
      2. Middle: we see a function with no unique minimizer, but infinitely many local minimizers – all points \( (x,y) \) with \( x=0 \) are critical points, but the function is convex only in the \( x \) direction, and the \( y \) direction corresponds to a zero eigenvalue of \( \mathbf{H}_f \).
      3. Right: there is a critical saddle point at \( (0,0) \), but this is not even a local minimizer – here the Hessian \( \mathbf{H}_f \) will have one positive and one negative eigenvalue.
    • These examples extend into higher dimensions, and generally we say that the function is locally convex at \( \pmb{x}^\ast \) when the Hessian has only positive eigenvalues at \( \pmb{x}^\ast \).
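    • A short sketch of the three cases (using the assumed functions \( x^2+y^2 \), \( x^2 \), and \( x^2-y^2 \) for the panels) checks their constant Hessians with numpy:

import numpy as np

# Assumed example functions for the three panels; their Hessians are constant.
H_paraboloid = np.array([[2.0, 0.0], [0.0, 2.0]])    # f = x^2 + y^2
H_trough     = np.array([[2.0, 0.0], [0.0, 0.0]])    # f = x^2
H_saddle     = np.array([[2.0, 0.0], [0.0, -2.0]])   # f = x^2 - y^2

for name, H in [("paraboloid", H_paraboloid),
                ("trough", H_trough),
                ("saddle", H_saddle)]:
    print(name, np.linalg.eigvalsh(H))   # eigenvalues of the symmetric Hessian
# Output: strictly positive / one zero / mixed signs, matching the three cases.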

    The second derivative test with the Hessian

    • We can understand the second derivative test with the Hessian using our second order Taylor approximation as follows.

    • Let \( \pmb{x}^\ast \) be a critical point and \( \pmb{x}_1 = \pmb{x}^\ast + \pmb{\delta}_{x_1} \) be a perturbation of this in the neighborhood \( \mathcal{N} \).

    • At a critical point, the gradient is zero, just as the first derivative is zero in one variable.

    • Suppose that the Hessian has only positive eigenvalues at \( \pmb{x}^\ast \); then our second order approximation becomes \[ \begin{align} f(\pmb{x}_1) &\approx f(\pmb{x}^\ast) + \left(\nabla f(\pmb{x}^\ast)\right)^\mathrm{T} \pmb{\delta}_{x_1}+ \frac{1}{2} \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}^\ast) \pmb{\delta}_{x_1} \\ &= f(\pmb{x}^\ast) + \frac{1}{2} \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}^\ast) \pmb{\delta}_{x_1} \end{align} \]

    • Provided \( \mathcal{N} \) is a small enough neighborhood, the \( \mathcal{O}\left(\parallel \pmb{\delta}_{x_1}\parallel^3\right) \) remainder will remain very small.

    • However, \( \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}^\ast) \pmb{\delta}_{x_1} \) must be positive by the positive eigenvalues of the Hessian.

    • This says that for a sufficiently small radius \( \parallel \pmb{\delta}_{x_1}\parallel \) and any perturbation of the point \( \pmb{x}^\ast \) defined as \( \pmb{x}_1 = \pmb{x}^\ast + \pmb{\delta}_{x_1} \), we have \[ f(\pmb{x}_1) \geq f(\pmb{x}^\ast). \]

    • Therefore, we can identify a local minimizer whenever the gradient is zero and the Hessian has positive eigenvalues, due to the local convexity.

    Gradient descent vs Newton's descent

    • We noted before that the gradient \( \nabla f \) is the direction and the velocity of the greatest rate of increase of the function \( f \).
    • There are thus good reasons to consider following the direction \( -\nabla f \) to find a local minimizer.
    • However, it is possible that we will overshoot the local minimum if we take too long of a step along this direction.
    • We may not know a good choice of step length; for example, consider the following figure.
    Gradient descent.

    Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.

    • The contours represent fixed values for the function \( f \); i.e.,
    • if \( \mathcal{C} \) is a contour, then for all \( \mathbf{c}_0 \in \mathcal{C} \), \[ f(\mathbf{c}_0 ) = C \] for a fixed value \( C \).
    • For \( t > 0 \) we define a perturbation along the gradient vector by \( \pmb{x}_k - t \nabla f \).
    • For most of this direction, \[ f(\pmb{x}_k - t \nabla f) \leq f(\pmb{x}_k) \] as this moves to the inner contours around \( \pmb{x}^\ast \).
    • However, for \( t \) large enough, this is not true, and we do not know by default what size of \( t \) is appropriate, or whether \( f \) is extremely sensitive to the step size; a short sketch of this sensitivity follows.
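    • In the sketch below, the quadratic objective, the step sizes, and the iteration count are illustrative assumptions; a fixed step \( t \) that is too large causes the iteration to diverge.

import numpy as np

# Assumed example: f(x) = x_1^2 + 10 x_2^2, whose gradient is easy to write.
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

for t in (0.01, 0.09, 0.11):              # illustrative step sizes
    x = np.array([1.0, 1.0])
    for _ in range(100):
        x = x - t * grad(x)               # x_{k+1} = x_k - t * grad f(x_k)
    print(f"t = {t:4.2f}, distance from minimizer = {np.linalg.norm(x):.3e}")
# The last step size overshoots repeatedly and the iteration diverges.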

    Gradient descent vs Newton's descent

    • Let’s recall Newton’s method to find the appropriate direction and step length:
      • When we were looking for the zero of a function, we used the first order approximation and found where this function takes the value zero.
    • We will consider a similar idea using the second order approximation, \[ \begin{align} m(\pmb{\delta}_{x_1}) = f(\pmb{x}_0) + \left(\nabla f(\pmb{x}_0)\right)^\mathrm{T} \pmb{\delta}_{x_1}+\frac{1}{2} \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}_0) \pmb{\delta}_{x_1}. \end{align} \]
    • Setting the derivative of \( m \) with respect to the perturbation equal to zero, we solve for \( \pmb{\delta}_{x_1} \): \[ \begin{align} & 0 = \nabla f(\pmb{x}_0) + \mathbf{H}_f(\pmb{x}_0) \pmb{\delta}_{x_1} \\ \Leftrightarrow & \pmb{\delta}_{x_1} = -\left(\mathbf{H}_f(\pmb{x}_0)\right)^{-1} \nabla f(\pmb{x}_0) \end{align} \]
    • If \( \mathbf{H}_f \) has an inverse at \( \pmb{x}_0 \), and if \( \mathbf{H}_f(\pmb{x}_0) \) has positive eigenvalues, this gives a descent direction in \( f \).
    Newton descent.

    Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.

    • Particularly, our new choice for the estimated minimum will be given by \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{x_1} \) for which \( f(\pmb{x}_0)\geq f(\pmb{x}_1) \).
    • Moreover, the second order approximation with Taylor’s expansion gives a second order rate of convergence to the minimum.

    Newton's descent

    • In Newton's descent, for each \( k \) we repeat the step to define the \( k+1 \) approximation, \[ \begin{align} & 0 = \nabla f(\pmb{x}_k) + \mathbf{H}_f(\pmb{x}_k) \pmb{\delta}_{x_{k+1}} \\ \Leftrightarrow & \pmb{\delta}_{x_{k+1}} = -\left(\mathbf{H}_f(\pmb{x}_k)\right)^{-1} \nabla f(\pmb{x}_k)\\ &\pmb{x}_{k+1} = \pmb{x}_k + \pmb{\delta}_{x_{k+1}}. \end{align} \]

    • This process will continue until:

      1. the approximation reaches an error tolerance (when we have a good initial guess \( \pmb{x}_0 \)); or
      2. it terminates after timing out, failing to converge if we are not in an appropriate neighborhood \( \mathcal{N} \) of a minimizer \( \pmb{x}^\ast \).
    • When we are in a locally convex neighborhood \( \mathcal{N} \) containing \( \pmb{x}^\ast \), we can be assured that \( \mathbf{H}_f \) will be invertible, and that it will have strictly positive eigenvalues, making the initial choice very important in producing a result.

    • Unlike the gradient vector alone, this gives a step choice derived from the local geometry, and this converges at second order as long as the initial choice \( \pmb{x}_0 \) is in the neighborhood \( \mathcal{N} \).

    • However, this method does not know if there is a better minimizing solution \( \pmb{x}_0^\ast \) that lies in a different neighborhood \( \mathcal{N}_0 \).

    • The biggest issue with Newton's method is that calculating the Hessian may not be realistic for a large number of inputs.

    • If \( N \) is large, then the Hessian has \( N^2 \) entries, and there may not be any exact expression for \( \mathbf{H}_f \) at all.

    • Newton's descent is therefore a basis for a wide class of “quasi-Newton” methods which typically approximate the Hessian in some form.
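    • A minimal sketch of the full Newton descent loop is given below, with an assumed quadratic objective and analytically supplied gradient and Hessian; quasi-Newton alternatives such as scipy.optimize.minimize with method="BFGS" instead build up an approximation of the Hessian from gradient evaluations.

import numpy as np

# Minimal sketch of Newton's descent; the objective, tolerance, and
# iteration cap are illustrative choices.
def newton_descent(grad, hess, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                # error tolerance reached
            return x
        x = x + np.linalg.solve(hess(x), -g)       # Newton step: solve H d = -g
    raise RuntimeError("Newton descent failed to converge")

# Assumed example: f(x) = x_1^2 + 10 x_2^2, minimized at the origin.
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_descent(grad, hess, [1.0, 1.0]))      # converges in one step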