Courtesy of Pbroks13, CC BY-SA 3.0, via Wikimedia Commons
More generally, the tangent-line approximation is one instance of a Taylor approximation.
Suppose we have a fixed point \( x_0 \), and define \( x_1 \) as a small perturbation of it, \[ x_1 = x_0+\delta_{x_1}. \]
If a function \( f \) has \( k \) continuous derivatives we can write \[ f(x_1) = f(x_0) + f'(x_0)\delta_{x_1} + \frac{f''(x_0)}{2!}\delta_{x_1}^2 + \cdots + \frac{f^{(k)}(x_0)}{k!} \delta_{x_1}^k + \mathcal{O}\left(\delta_{x_1}^{k+1}\right) \]
The \( \mathcal{O}\left(\delta_{x_1}^{k+1}\right) \) refers to the remainder term, which grows or shrinks like the size of the perturbation raised to the power \( k+1 \).
Another important practical example of using this Taylor approximation, when the function \( f \) has two continuous derivatives, is \[ f(x_0 + \delta_{x_1}) \approx f(x_0) + f'(x_0)\delta_{x_1} + f''(x_0) \frac{\delta_{x_1}^2}{2} \] which will be used shortly for obtaining solutions to several kinds of equations.
In particular, this is closely related to the second derivative test from univariate calculus.
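To make this concrete, here is a quick numerical check of the second-order approximation above, using the illustrative choice \( f(x) = e^x \) (an assumption for this example only), for which \( f \), \( f' \), and \( f'' \) all coincide.

import numpy as np

# check the second-order approximation with the illustrative choice f(x) = exp(x),
# for which f, f', and f'' all equal exp
x0, delta = 0.0, 0.1

approx = np.exp(x0) + np.exp(x0) * delta + np.exp(x0) * delta**2 / 2
exact = np.exp(x0 + delta)
print(exact, approx, abs(exact - approx))  # error is roughly of size delta**3 / 6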
For the moment, we consider how the first-order Taylor expansion can again be used to approximate the derivative.
Recall, we write
\[ \begin{align} f(x_1) &= f(x_0) + f'(x_0) \delta_{x_1} + \mathcal{O}\left( \delta_{x_1}^2\right) \\ \Leftrightarrow \frac{f(x_1) - f(x_0)}{ \delta_{x_1}} &= f'(x_0) + \mathcal{O}\left( \delta_{x_1}\right) \end{align} \]
This says that, for a small value of \( \delta_{x_1} \), the difference quotient on the left-hand side gives a numerical approximation of \( f'(x_0) \) whose error is proportional to \( \delta_{x_1} \) itself, i.e., the approximation is first-order accurate.
This gives the forward finite-difference approximation of the derivative.
We can similarly define the backward finite-difference approximation with \( x_1 := x_0 - \delta_{x_1} \).
In each case, we use the perturbation to parameterize the tangent-line approximation.
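As a quick sketch, the following computes the forward and backward finite-difference approximations for the hypothetical test function \( f(x) = \sin(x) \); the function and step size are chosen here only for illustration.

import numpy as np

def f(x):
    # illustrative test function; any differentiable f would do
    return np.sin(x)

x0 = 1.0
delta = 1e-5

# forward difference: error is O(delta)
forward = (f(x0 + delta) - f(x0)) / delta

# backward difference: also O(delta) accurate
backward = (f(x0) - f(x0 - delta)) / delta

print(forward, backward, np.cos(x0))  # both approximate the exact derivative cos(1.0)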
We have seen earlier the basic linear inverse problem,
\[ \begin{align} \mathbf{A}\pmb{x} = \pmb{b} \end{align} \] where \( \pmb{b} \) is an observed quantity and \( \pmb{x} \) are the unknown variables related to \( \pmb{b} \) by the relationships in \( \mathbf{A} \).
A similar problem exists when the relationship between \( \pmb{x} \) and \( \pmb{b} \) is non-linear, but we still wish to find some such \( \pmb{x} \).
Nonlinear inverse problem (scalar case)
Suppose we know the nonlinear, scalar function \( f \) that gives a relationship \[ \begin{align} f(x^\ast) = b \end{align} \] for an observed \( b \) but an unknown \( x^\ast \). Finding a value of \( x^\ast \) that satisfies \( f(x^\ast)=b \) is known as a nonlinear inverse problem.
Define a function \[ \begin{align} \tilde{f}(x) = f(x)-b. \end{align} \]
Thus solving the nonlinear inverse problem in one variable is equivalent to finding the appropriate \( x^\ast \) for which \[ \begin{align} \tilde{f}(x^\ast)= 0 . \end{align} \]
Finding a zero of a function, or root finding, is thus equivalent to a nonlinear inverse problem.
The Newton-Raphson method is one classical approach which has inspired many modern techniques.
Courtesy of Ralf Pfeifer, CC BY-SA 3.0, via Wikimedia Commons
As a quick example, let's consider the Newton algorithm built-in to Scipy.
Specifically, we will import the built-in newton function from the optimize sub-module of scipy.
from scipy.optimize import newton
In the following, we are interested in the value \( x^\ast \) for which the cubic function \( f(x):=x^3 \) satisfies \( f\left(x^\ast\right)=1 \); in code, we therefore work with the adjusted function \( \tilde{f}(x) = x^3 - 1 \), whose root is \( x^\ast \).
def f(x): return (x**3 - 1)
The newton function can be supplied an analytical derivative, if this can be computed, to improve the accuracy versus, e.g., a finite-differences approximation. We pass the derivative via the fprime argument of newton:
root = newton(f, 1.5, fprime=lambda x: 3 * x**2)
root
1.0
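For intuition about what the scipy call does, here is a minimal sketch of the Newton-Raphson iteration itself; it is not scipy's implementation, and the tolerance and iteration cap are illustrative choices.

def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=50):
    # basic Newton-Raphson iteration: x_{k+1} = x_k - f(x_k) / f'(x_k)
    x = x0
    for _ in range(max_iter):
        step = -f(x) / fprime(x)
        x = x + step
        if abs(step) < tol:  # stop when the update falls below tolerance
            break
    return x

# the same problem as above: a root of x**3 - 1, starting from 1.5
approx_root = newton_raphson(lambda x: x**3 - 1, lambda x: 3 * x**2, 1.5)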
To expand our discussion to multiple variables, we will review some fundamental concepts of vector calculus.
Suppose we have a vector-valued function with a single argument:
\[ \begin{align} \pmb{x}:&\mathbb{R} \rightarrow \mathbb{R}^{N};\\ \pmb{x}(t) :=& \begin{pmatrix} x_1(t) & \cdots & x_{N}(t)\end{pmatrix}^\top; \end{align} \]
Tangent vector
Suppose \( \pmb{x}(t) \) is defined as above and that each of the component functions \( x_i(t) \) are differentiable. The tangent vector to the state trajectory \( \pmb{x} \) is defined as \[ \vec{x}:= \frac{\mathrm{d}}{\mathrm{d}t} \pmb{x}:= \begin{pmatrix}\frac{\mathrm{d}}{\mathrm{d}t} x_1(t) & \cdots & \frac{\mathrm{d}}{\mathrm{d}t} x_{N}(t)\end{pmatrix}^\top \]
In the above, the interpretation of the derivative as defining a tangent line is extended to multiple variables.
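As a small example, the tangent vector of the hypothetical curve \( \pmb{x}(t) = (\cos t, \sin t)^\top \) can be approximated componentwise with a forward difference; the curve and step size are assumptions made only for this sketch.

import numpy as np

def x_curve(t):
    # hypothetical curve tracing the unit circle
    return np.array([np.cos(t), np.sin(t)])

def tangent_vector(curve, t, eps=1e-6):
    # forward-difference approximation of d/dt of each component
    return (curve(t + eps) - curve(t)) / eps

t0 = 0.5
print(tangent_vector(x_curve, t0))            # numerical tangent vector
print(np.array([-np.sin(t0), np.cos(t0)]))    # exact tangent vector for comparison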
Tangent spaces
Let \( \pmb{x}\in\mathbb{R}^{N} \) and let \( \pmb{\gamma}:\mathbb{R}\rightarrow \mathbb{R}^{N} \) be an arbitrary differentiable curve such that \( \pmb{\gamma}(0)= \pmb{x} \), with tangent vector \( \vec{\gamma}(0):= \frac{\mathrm{d}}{\mathrm{d}t}\big|_0 \pmb{\gamma} \). The tangent space at \( \pmb{x} \), denoted \( T_{\pmb{x}} \), is defined as the linear span of the tangent vectors of all such curves through \( \pmb{x} \).
Courtesy of TN, Public domain, via Wikimedia Commons
Vector fields
For a smooth manifold \( M \), a vector field (or flow field) is a mapping from \( M \) to the tangent bundle \( TM \) such that each point \( \pmb{x}\in M \) is assigned the tangent vector \( \vec{\gamma} \) of a curve \( \pmb{\gamma} \) that passes through \( \pmb{x} \).
Courtesy of I, Cronholm144, CC BY-SA 3.0, via Wikimedia Commons
Not all functions in multiple variables have a single input; more generally, we can write
\[ \begin{align} \pmb{f} : \mathbb{R}^N & \rightarrow \mathbb{R}^M\\ \pmb{x} &\rightarrow \begin{pmatrix} f_1\left(\pmb{x}\right) \\ \vdots \\ f_M(\pmb{x})\end{pmatrix} \end{align} \]
In the above, each of the component functions \( f_i \), for \( i=1,\cdots , M \), is a scalar-valued function defined as
\[ \begin{align} f_i :\mathbb{R}^N &\rightarrow \mathbb{R} \\ \pmb{x} &\rightarrow f_i (\pmb{x}) \end{align} \]
The Jacobian
For a vector-valued, continuously differentiable function \( \pmb{f} \) as above, the Jacobian is defined as the matrix of first partial derivatives \[ \begin{align} \nabla \pmb{f} &= \begin{pmatrix} \partial_{x_1} f_1 & \partial_{x_2} f_1 & \cdots & \partial_{x_N} f_1 \\ \partial_{x_1} f_2 & \partial_{x_2} f_2 & \cdots & \partial_{x_N} f_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_1} f_M & \partial_{x_2} f_M & \cdots & \partial_{x_N} f_M \end{pmatrix}\in\mathbb{R}^{M\times N} \end{align} \] where the partial derivatives are with respect to the components of \( \pmb{x} \).
The Jacobian is also a tangent-linear approximation, taking into account perturbations in all directions of \( \pmb{x} \).
This also gives a version of Taylor's theorem where we can write a tangent-linear approximation of a mapping.
Tangent-linear approximation
Let \( \pmb{f}:\mathbb{R}^N \rightarrow \mathbb{R}^M \) be a continuously differentiable mapping and \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1} \) be a perturbation within a sufficiently small neighborhood. Then, the tangent-linear approximation for \( \pmb{f}(\pmb{x}_1) \) is given as \[ \begin{align} \pmb{f} \left(\pmb{x}_1\right) & = \pmb{f} \left(\pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1}\right) \\ & \approx \pmb{f}(\pmb{x}_0) + \nabla \pmb{f}|_{\pmb{x}_0} \pmb{\delta}_{\pmb{x}_1}. \end{align} \]
The tangent-linear approximation above approximates the image of \( \pmb{f} \) within the tangent space based at \( \pmb{f}(\pmb{x}_0) \), parameterized by the perturbation \( \pmb{\delta}_{\pmb{x}_1} \).
In particular, the Jacobian is a mapping defined
\[ \begin{align} \nabla \pmb{f}|_{\pmb{x}} :T_{\pmb{x}} \rightarrow T_{\pmb{f}(\pmb{x})} \end{align} \]
between the tangent space at the input and the tangent space at the output.
The tangent space is also a linear space by construction, thus giving the “linear” approximation.
This is also generalized in a local sense to mappings between differentiable manifolds and their tangent spaces.
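As a sketch of these ideas, the following approximates the Jacobian column-by-column with forward differences and uses it in the tangent-linear approximation; the mapping, base point, and perturbation are hypothetical choices for illustration.

import numpy as np

def F(x):
    # hypothetical mapping R^2 -> R^2, chosen only for illustration
    return np.array([x[0]**2 + x[1], np.sin(x[0]) * x[1]])

def jacobian_fd(F, x, eps=1e-6):
    # approximate the Jacobian of F at x, one column per input direction
    fx = F(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (F(x + e) - fx) / eps
    return J

x0 = np.array([1.0, 2.0])
delta = np.array([1e-3, -2e-3])
J = jacobian_fd(F, x0)

# tangent-linear approximation: F(x0 + delta) is close to F(x0) + J @ delta
print(F(x0 + delta))
print(F(x0) + J @ delta)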
Nonlinear inverse problem (multivariate case)
Suppose we know the nonlinear, multivariate function \( \pmb{f} \) that gives a relationship \[ \begin{align} \pmb{f}(\pmb{x}^\ast) = \pmb{b} \end{align} \] for an observed \( \pmb{b} \) but an unknown \( \pmb{x}^\ast \). Finding a value of \( \pmb{x}^\ast \) that satisfies \( \pmb{f}(\pmb{x}^\ast)=\pmb{b} \) is known as a nonlinear inverse problem.
The inverse function theorem
Let \( \pmb{f}:\mathbb{R}^N \rightarrow \mathbb{R}^N \) be a nonlinear function such that the Jacobian \( \nabla \pmb{f}|_{\pmb{x}^\ast}\in\mathbb{R}^{N \times N} \) exists, is an invertible matrix, and is a continuous function of \( \pmb{x} \) at \( \pmb{x}^\ast \). Then there exists a neighborhood \( \mathcal{N} \) containing the image \( \pmb{f}\left(\pmb{x}^\ast\right) \) for which any \( \pmb{p}\in\mathcal{N} \) has a unique inverse value \( \pmb{q} \) where \[ \begin{align} \pmb{p}:= \pmb{f}(\pmb{q}). \end{align} \] I.e., \( \pmb{f}^{-1} \) exists on \( \mathcal{N} \) such that \( \pmb{f}^{-1}\circ \pmb{f}= \mathbf{I}_N \) within the properly defined domain.
The Newton-Raphson method can now be restated in terms of multiple variables as follows.
Suppose that we have a nonlinear inverse problem stated as follows:
\[ \begin{align} \pmb{f} :\mathbb{R}^N &\rightarrow \mathbb{R}^N \\ \pmb{x} & \rightarrow \pmb{f}(\pmb{x}) \\ \pmb{f}\left(\pmb{x}^\ast\right)& = \pmb{b} \end{align} \]
We redefine this in terms of the adjusted function \( \tilde{\pmb{f}}\left(\pmb{x}^\ast\right) = \pmb{f}\left(\pmb{x}^\ast\right) - \pmb{b} = \pmb{0} \), and we wish to make the same first order approximation as before.
Supposing we have a good initial guess \( \pmb{x}_0 \) for \( \pmb{x}^\ast \), we look for the point where the tangent-linear approximation of \( \tilde{\pmb{f}} \) equals zero, i.e., \[ \begin{align} \pmb{0} &= \tilde{\pmb{f}}\left(\pmb{x}_0\right) + \nabla \tilde{\pmb{f}}\left(\pmb{x}_0\right) \pmb{\delta}_{\pmb{x}_1} \\ \Leftrightarrow \pmb{\delta}_{\pmb{x}_1} &= -\left(\nabla \tilde{\pmb{f}}\left(\pmb{x}_0\right) \right)^{-1} \tilde{\pmb{f}}\left(\pmb{x}_0\right), \end{align} \] where \( \nabla \tilde{\pmb{f}} = \nabla \pmb{f} \) since \( \pmb{b} \) is constant.
The above makes sense as an approximation as long as \( \left(\nabla \pmb{f}\left(\pmb{x}_0\right) \right)^{-1} \) exists, i.e., as long as the Jacobian has no zero eigenvalues.
We can thus once again update the approximation recursively so that \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1} \), as long as the inverse exists.
This update continues until an error tolerance is reached or the optimization times out.
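A minimal sketch of this multivariate Newton-Raphson update follows, written for the zero-finding form \( \tilde{\pmb{f}}(\pmb{x}) = \pmb{0} \) and using numpy to solve the linear system at each step; the example system, starting point, and tolerance are illustrative assumptions.

import numpy as np

def newton_system(F, jacobian, x0, tol=1e-10, max_iter=50):
    # multivariate Newton-Raphson: solve  J(x_k) delta = -F(x_k),  then x_{k+1} = x_k + delta
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        delta = np.linalg.solve(jacobian(x), -F(x))
        x = x + delta
        if np.linalg.norm(delta) < tol:  # stop when the step is below tolerance
            break
    return x

# hypothetical system: x**2 + y**2 = 4 and x * y = 1
F = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[0] * v[1] - 1.0])
J = lambda v: np.array([[2 * v[0], 2 * v[1]], [v[1], v[0]]])
solution = newton_system(F, J, x0=[2.0, 0.5])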
A related notion to the inverse problem is the maximization or minimization (optimization) of functions.
Optimization problems contain two components: an objective function to be maximized or minimized, and, possibly, constraints that a solution must satisfy.
E.g., we may wish to optimize factory output \( f(x) \) as a function of hours \( x \) in a week, with a measure of our active machine-hours \( g(x) \) not exceeding a pre-specified limitation \( g(x)\leq C \).
Optimization problems can thus be classified into two categories: constrained and unconstrained optimization.
We will focus on the simpler unconstrained optimization; this is formulated as the following problem,
\[ \begin{align} f: \mathbb{R}^N \rightarrow \mathbb{R}& & \pmb{x} \rightarrow f(\pmb{x}) & & f(\pmb{x}^\ast) = \mathrm{max}_{\pmb{x} \in \mathcal{D}} f \end{align} \]
We note that the above problem is equivalent to a minimization problem by a substitution of \( \tilde{f} = -f \), i.e.,
\[ \begin{align} \tilde{f}: \mathbb{R}^N \rightarrow \mathbb{R}& & \pmb{x} \rightarrow -f(\pmb{x})& & f(\pmb{x}^\ast) = \mathrm{max}_{\pmb{x} \in \mathcal{D}} f & & \tilde{f}(\pmb{x}^\ast)= \mathrm{min}_{\pmb{x}\in \mathcal{D}} \tilde{f} \end{align} \]
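In code, this substitution amounts to negating the objective before calling a minimizer; the sketch below uses scipy.optimize.minimize with a hypothetical concave objective chosen only for illustration.

import numpy as np
from scipy.optimize import minimize

def f(x):
    # hypothetical concave objective with its maximum at (1, -2)
    return -(x[0] - 1.0)**2 - (x[1] + 2.0)**2

# maximize f by minimizing f_tilde = -f
result = minimize(lambda x: -f(x), x0=np.zeros(2))
print(result.x)  # should be close to [1.0, -2.0]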
Global minimizer
A point \( \pmb{x}^\ast \) is a global minimizer if \( f(\pmb{x}^\ast) \leq f(\pmb{x}) \) for all other possible \( \pmb{x} \) in the domain of consideration \( D\subset \mathbb{R}^n \).
Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
Local minimizer
A point \( \pmb{x}^\ast \) is a local minimizer if there exists some neighborhood \( \mathcal{N}\subset D \) containing \( \pmb{x}^\ast \) such that \( f(\pmb{x}^\ast) \leq f(\pmb{x}) \) for all other possible \( \pmb{x} \in \mathcal{N} \).
Courtesy of: Oleg Alexandrov. Public domain, via Wikimedia Commons.
Second derivative test
For a function of one variable \( f(x) \), the second derivative test says that \( x^\ast \) is a local minimizer if \( f'(x^\ast)=0 \) and \( f''(x^\ast)> 0 \).
We suppose that the objective function takes multivariate inputs and gives a scalar output:
\[ \begin{align} f:&\mathbb{R}^N \rightarrow \mathbb{R}\\ &\pmb{x}\mapsto f(\pmb{x}) \end{align} \]
Formally, we will write the gradient as follows, using the same \( \nabla \) notation as for the Jacobian.
The gradient
Suppose \( f \) is a continuously differentiable objective function defined as above; then the gradient is given as \[ \begin{align} \nabla f = \begin{pmatrix} \partial_{x_1}f & \partial_{x_2} f & \cdots & \partial_{x_N} f \end{pmatrix}^\top \in \mathbb{R}^N \end{align} \] where the above partial derivatives are with respect to the components of \( \pmb{x}\in\mathbb{R}^N \).
An important property of the gradient is that it gives the tangent vector for a curve in the direction and rate of greatest ascent for the output of \( f \).
As in the previous cases, we also use the gradient as a linear approximation for the multivariate-input, scalar-output function \( f(\pmb{x}) \).
Let \( \pmb{x}_0\in \mathbb{R}^n \) be some vector and \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{x_1} \), where \( \pmb{\delta}_{x_1} \) is now a vector of small perturbations.
At first order, the Taylor series is given as
\[ f(\pmb{x}_1) = f(\pmb{x}_0) + \left(\nabla f(\pmb{x}_0)\right)^\top \pmb{\delta}_{x_1} + \mathcal{O}\left(\parallel \pmb{\delta}_{x_1}\parallel^2\right); \]
A similar second order approximation to the one we developed for univariate function \( f(x) \), \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1} + f''(x_0)\frac{\delta_{x_1}^2}{2}, \] can be developed for the multivariate-input case.
In order to do so, we will need to introduce the Hessian as the second multivariate derivative of \( f \).
The Hessian
Suppose that \( f \) is a scalar-output function of multiple variables, \[ \begin{align} f: \mathbb{R}^n & \rightarrow \mathbb{R} \\ \pmb{x} &\rightarrow f(\pmb{x}) \end{align} \] with continuous second derivatives. The Hessian matrix for the function \( f \) is defined as \( \mathbf{H}_f \) \[ \begin{align} \mathbf{H}_{f} = \begin{pmatrix} \partial_{x_1}^2 f & \partial_{x_1}\partial_{x_2} f & \cdots & \partial_{x_1}\partial_{x_n}f \\ \partial_{x_2}\partial_{x_1} f & \partial_{x_2}^2 f & \cdots & \partial_{x_2} \partial_{x_n}f \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_n}\partial_{x_1} f & \partial_{x_n}\partial_{x_2} f & \cdots & \partial_{x_n}^2 f \end{pmatrix}. \end{align} \]
For short, this is often written as \( \mathbf{H}_f = \left\{ \partial_{x_i}\partial_{x_j} f\right\}_{i,j=1}^n \) where this refers to the \( i \)-th row and \( j \)-th column.
\( \mathbf{H}_f \) can be evaluated at a particular point \( \pmb{x}_0 \), and notice that \( \mathbf{H}_f \) is always symmetric.
As before, let \( \pmb{x}_1 = \pmb{x}_0 + \pmb{\delta}_{\pmb{x}_1} \) be given as a small perturbation of \( \pmb{x}_0 \).
Using the Hessian as defined on the last slide, if \( f \) has continuous second order partial derivatives, the Taylor series is given at second order as \[ \begin{align} f(\pmb{x}_1) = f(\pmb{x}_0) + \left(\nabla f(\pmb{x}_0)\right)^\top \pmb{\delta}_{x_1} + \frac{1}{2} \pmb{\delta}_{x_1}^\top\mathbf{H}_f (\pmb{x}_0) \pmb{\delta}_{x_1} + \mathcal{O}\left(\parallel \pmb{\delta}_{x_1}\parallel^3\right) \end{align} \]
Similarly, our second order approximation is defined as follows.
Second order objective function approximation
Let \( f \) be a multivariate-input, scalar-output function with continuous second order derivatives. We define the second order approximation as \[ \begin{align} f(\pmb{x}_1) \approx f(\pmb{x}_0) + \left(\nabla f(\pmb{x}_0)\right)^\mathrm{T} \pmb{\delta}_{x_1}+ \frac{1}{2} \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}_0) \pmb{\delta}_{x_1} \end{align} \]
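As a numerical sanity check, the sketch below compares the second order approximation with the true value for a hypothetical function whose gradient and Hessian are written out by hand; the function, base point, and perturbation are illustrative assumptions.

import numpy as np

def f(x):
    # hypothetical scalar-output function f(x, y) = x**2 * y + y**3
    return x[0]**2 * x[1] + x[1]**3

def grad_f(x):
    return np.array([2 * x[0] * x[1], x[0]**2 + 3 * x[1]**2])

def hess_f(x):
    return np.array([[2 * x[1], 2 * x[0]],
                     [2 * x[0], 6 * x[1]]])

x0 = np.array([1.0, 2.0])
delta = np.array([0.01, -0.02])

second_order = f(x0) + grad_f(x0) @ delta + 0.5 * delta @ hess_f(x0) @ delta
print(f(x0 + delta), second_order)  # the two values agree up to roughly O(||delta||**3)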
Courtesy of: Ag2gaeh, CC BY-SA 4.0, via Wikimedia Commons.
We can understand the second derivative test with the Hessian using our second order Taylor approximation as follows.
Let \( \pmb{x}^\ast \) be a critical point and \( \pmb{x}_1 = \pmb{x}^\ast + \pmb{\delta}_{x_1} \) be a perturbation of this in the neighborhood \( \mathcal{N} \).
Because \( \pmb{x}^\ast \) is a critical point, the gradient vanishes there, just as the first derivative vanishes at a critical point in the univariate case.
Suppose that the Hessian has only positive eigenvalues at \( \pmb{x}^\ast \); then our second order approximation becomes \[ \begin{align} f(\pmb{x}_1) &\approx f(\pmb{x}^\ast) + \left(\nabla f(\pmb{x}^\ast)\right)^\mathrm{T} \pmb{\delta}_{x_1}+ \frac{1}{2} \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}^\ast) \pmb{\delta}_{x_1} \\ &= f(\pmb{x}^\ast) + \frac{1}{2} \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}^\ast) \pmb{\delta}_{x_1} \end{align} \]
Provided \( \mathcal{N} \) is a small enough neighborhood, the \( \mathcal{O}\left(\parallel \pmb{\delta}_{x_1}\parallel^3\right) \) remainder term will remain very small.
However, \( \pmb{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\pmb{x}^\ast) \pmb{\delta}_{x_1} \) must be positive for any nonzero perturbation, because the Hessian has only positive eigenvalues.
This says for a radius sufficiently small, \( \parallel \pmb{\delta}_{x_1}\parallel \), and any perturbation of the point \( \pmb{x}^\ast \) defined as \( \pmb{x}_1 = \pmb{x}^\ast + \pmb{\delta}_{x_1} \), we have \[ f(\pmb{x}_1) \geq f(\pmb{x}^\ast). \]
Therefore, we can identify a local minimizer whenever the gradient is zero and the Hessian has positive eigenvalues, due to the local convexity.
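Numerically, this test amounts to checking the eigenvalues of the Hessian at a critical point; the small sketch below uses the hypothetical function \( f(x, y) = x^2 + 3y^2 \), whose only critical point is the origin.

import numpy as np

# Hessian of the hypothetical function f(x, y) = x**2 + 3 * y**2 at its critical point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, 6.0]])

eigenvalues = np.linalg.eigvalsh(H)     # eigvalsh applies since the Hessian is symmetric
is_local_min = bool(np.all(eigenvalues > 0))
print(eigenvalues, is_local_min)        # all eigenvalues positive, so (0, 0) is a local minimizer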
Courtesy of: J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
In Newton's descent, for each \( k \) we repeat the following step to define the \( k+1 \) approximation, \[ \begin{align} & 0 = \nabla f(\pmb{x}_k) + \mathbf{H}_f(\pmb{x}_k) \pmb{\delta}_{x_{k+1}} \\ \Leftrightarrow & \pmb{\delta}_{x_{k+1}} = -\left(\mathbf{H}_f(\pmb{x}_k)\right)^{-1} \nabla f(\pmb{x}_k)\\ &\pmb{x}_{k+1} = \pmb{x}_k + \pmb{\delta}_{x_{k+1}}. \end{align} \]
This process continues until an error tolerance is reached or a maximum number of iterations is exceeded.
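The following is a minimal sketch of the full Newton's descent loop; the objective, its hand-computed gradient and Hessian, and the stopping parameters are all assumptions made for illustration.

import numpy as np

def newton_descent(grad, hess, x0, tol=1e-10, max_iter=100):
    # Newton's descent: solve  H_f(x_k) delta = -grad f(x_k),  then x_{k+1} = x_k + delta
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        delta = np.linalg.solve(hess(x), -grad(x))
        x = x + delta
        if np.linalg.norm(delta) < tol:  # error tolerance reached
            break
    return x

# hypothetical objective f(x, y) = (x - 1)**2 + (y + 2)**4 + y**2
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)**3 + 2 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 12 * (v[1] + 2)**2 + 2.0]])
x_star = newton_descent(grad, hess, x0=[0.0, 0.0])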
When we are in a locally convex neighborhood \( \mathcal{N} \) containing \( \pmb{x}^\ast \), we can be assured that \( \mathbf{H}_f \) is invertible with strictly positive eigenvalues; this makes the choice of the initial point very important for producing a good result.
Unlike a step based on the gradient vector alone, this gives a step derived from the local geometry, and the iteration converges at second order as long as the initial choice \( \pmb{x}_0 \) lies in the neighborhood \( \mathcal{N} \).
However, this method does not know if there is a better minimizing solution \( \pmb{x}_0^\ast \) that lies in a different neighborhood \( \mathcal{N}_0 \).
The biggest issue with Newton's method is that calculating the Hessian may not be realistic for a large number of inputs.
If \( N \) is large, then the Hessian has \( N^2 \) entries, and there may not be any exact expression for \( \mathbf{H}_f \) at all.
Newton's descent is therefore a basis for a wide class of “quasi-Newton” methods which typically approximate the Hessian in some form.
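As one example, the BFGS method available through scipy.optimize.minimize builds up an approximation of the (inverse) Hessian from gradient information alone, avoiding the exact \( N \times N \) Hessian entirely; the objective and dimension below are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def f(x):
    # hypothetical smooth objective in N = 50 variables
    return np.sum((x - 1.0)**2) + 0.1 * np.sum(x**4)

# BFGS is a quasi-Newton method: it never forms the exact N x N Hessian
result = minimize(f, x0=np.zeros(50), method="BFGS")
print(result.x[:5])  # each component is close to the common minimizer of the 1D problem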