Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. This disclaimer falls under section 107 of the Copyright Act of 1976, under which allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.
Using the rules of calculus, e.g., the product rule \[ \frac{\mathrm{d}}{\mathrm{d}x}\left[ u(x) v(x) \right] = u'(x) v(x) + u(x) v'(x), \]
we can compute derivatives of complicated functions analytically.
However, if the function is only given in an implicit form, e.g., as the output of an algorithm, we still need approximate ways to compute such derivatives.
We will begin by introducing concepts in analytical differentiation and then discuss one approach to their approximation.
We will follow this with one of the primary uses of such an approach, solving systems of nonlinear equations.
Recall, R has the ability to recognize mathematical expressions as objects.
f <- expression(x^3 * cos(x))
R also has a means to differentiate such expressions automatically;
D(expression, variable_name)
Q: suppose we use f as the expression and "x" as the variable name, what will be the result?
D(f, "x")
3 * x^2 * cos(x) - x^3 * sin(x)
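The output of D is itself an unevaluated R expression (a call), so it can also be evaluated at a numeric value; a minimal sketch, where the evaluation point x = 1 is an arbitrary illustrative choice:
# evaluate the symbolic derivative at the point x = 1
eval(D(f, "x"), list(x = 1))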
Courtesy of Pbroks13, CC BY-SA 3.0, via Wikimedia Commons
More generally, the tangent line approximation is the first-order case of the more general Taylor approximation.
Suppose we have a point \( x_0 \) fixed, and define \( x_1 \) as a small perturbation of it, \[ x_1 = x_0+\delta_{x_1}. \]
If a function \( f \) has \( k \) continuous derivatives we can write \[ f(x_1) = f(x_0) + f'(x_0)\delta_{x_1} + \frac{f''(x_0)}{2!}\delta_{x_1}^2 + \cdots + \frac{f^{(k)}(x_0)}{k!} \delta_{x_1}^k + \mathcal{o}\left(\delta_{x_1}^{k+1}\right) \]
The \( \mathcal{o}\left(\delta_{x_1}^{k+1}\right) \) refers to the remainder terms, which grow or shrink like the size of the perturbation to the power \( k+1 \).
Another important practical example of using this Taylor approximation, when the function \( f \) has two continuous derivatives, is \[ f(x_0 + \delta_x) \approx f(x_0) + f'(x_0)\delta_x + f''(x_0) \frac{\delta_x^2}{2} \] which will be used shortly for obtaining solutions to several kinds of equations.
In particular, this is strongly related to the second derivative test from univariate calculus.
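As a quick numerical sanity check of this quadratic approximation, consider a minimal R sketch; the choice of \( f(x)=\cos(x) \), the base point, and the perturbation below are illustrative only:
# second order Taylor approximation of cos(x) about x0, evaluated at x0 + dx
x0 <- 1.0
dx <- 0.1

# exact value versus the quadratic approximation
cos(x0 + dx)
cos(x0) - sin(x0) * dx - cos(x0) * dx^2 / 2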
For now, we consider how the Taylor expansion at first order can again be used to approximate the derivative.
Recall, we write
\[ \begin{align} f(x_1) &= f(x_0) + f'(x_0) \delta_{x_1} + o\left( \delta_{x_1}^2\right) \\ \Leftrightarrow \frac{f(x_1) - f(x_0)}{ \delta_{x_1}} &= f'(x_0) + o\left( \delta_{x_1}\right) \end{align} \]
This says that for a small value of \( \delta_{x_1} \), the difference quotient on the left-hand side approximates \( f'(x_0) \) with an error on the order of \( \delta_{x_1} \).
This simple approach is the basic version of a finite difference approximation to the derivative.
In simple cases this can be sufficiently accurate; variations on this scheme can give better approximations.
We will return to how to compute such numerical derivatives in R when we introduce this in the full generality of multiple variables.
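As a preview in the univariate case, here is a minimal sketch of the basic forward difference in R; the test function, evaluation point, and step size are illustrative choices:
# forward difference approximation of the derivative of x^3 * cos(x)
f_test <- function(x){
  return(x^3 * cos(x))
}
x0 <- 1.0
h  <- 1e-6

# compare against the analytic derivative 3 * x^2 * cos(x) - x^3 * sin(x)
(f_test(x0 + h) - f_test(x0)) / h
3 * x0^2 * cos(x0) - x0^3 * sin(x0)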
We have seen earlier the basic linear inverse problem,
\[ \begin{align} \mathbf{A}\mathbf{x} = \mathbf{b} \end{align} \] where \( \mathbf{b} \) is an observed quantity and \( \mathbf{x} \) are the unknown variables related to \( \mathbf{b} \) by the relationships in \( \mathbf{A} \).
A similar problem exists when the relationship between \( \mathbf{x} \) and \( \mathbf{b} \) is non-linear, but we still wish to find some such \( \mathbf{x} \).
Suppose we know the nonlinear function \( f \) that gives a relationship in one variable as \[ \begin{align} f(x^\ast) = b \end{align} \] for an observed \( b \) but an unknown \( x^\ast \).
We will start by re-writing the equation into a more general form, defining a function \[ \begin{align} \tilde{f}(x) = f(x)-b. \end{align} \]
Thus solving the nonlinear inverse problem in one variable is equivalent to finding the appropriate \( x^\ast \) for which \[ \begin{align} \tilde{f}(x^\ast)= 0 . \end{align} \]
The means of finding one such \( x^\ast \) is known as root finding.
The Newton-Raphson method (often just Newton's method) is one classical approach that has inspired many modern techniques for complex systems of equations; we will introduce the main concepts here.
Courtesy of Ralf Pfeifer, CC BY-SA 3.0, via Wikimedia Commons
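As a reminder, in one variable the Newton-Raphson iteration repeatedly sets the tangent line approximation to zero, giving the update \( x_{n+1} = x_n - \tilde{f}(x_n) / \tilde{f}'(x_n) \). A minimal R sketch for \( \tilde{f}(x) = x^2 - 4 \), where the starting guess and the number of iterations are illustrative choices:
# Newton-Raphson iteration for f_tilde(x) = x^2 - 4, with derivative 2 * x
f_tilde  <- function(x){x^2 - 4}
df_tilde <- function(x){2 * x}

x <- 3
for(i in 1:6){
  x <- x - f_tilde(x) / df_tilde(x)
}
x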
R has a built-in root-finding function, uniroot, with syntax uniroot(function_to_root_find, interval_to_search_for_roots)
f <- function(x){
  return(x^2 - 4)
}
uniroot(f, c(-3, 0))
$root
[1] -2.000001
$f.root
[1] 3.223832e-06
$iter
[1] 6
$init.it
[1] NA
$estim.prec
[1] 6.103516e-05
uniroot(f, c(0, 3))
$root
[1] 2.000001
$f.root
[1] 3.223832e-06
$iter
[1] 6
$init.it
[1] NA
$estim.prec
[1] 6.103516e-05
uniroot(f,c(-3,3))
Error in uniroot(f, c(-3, 3)): f() values at end points not of opposite sign
We get an error message.
This is because the algorithm needs a well-chosen starting interval: uniroot requires the function values at the end points to have opposite signs, and here \( f(-3) \) and \( f(3) \) are of the same sign.
Recall our expression f,
f <- expression(x^3 * cos(x))
Q: suppose we differentiate this expression with respect to y, what will be the answer?
The expression has no dependence on y, so that
D(f, "y")
[1] 0
Variables other than the one of differentiation, such as y here, are held as constants when the derivative is taken in the chosen variable.
We can redefine f as,
f <- expression(x^3 * cos(x) * y + y)
Q: what will the derivative of the above expression evaluate to when taken with respect to y? What about with respect to x?
D(f, "y")
x^3 * cos(x) + 1
D(f, "x")
(3 * x^2 * cos(x) - x^3 * sin(x)) * y
We can extend this to arbitrary functions of multiple variables,
\[ \begin{align} f:\mathbb{R}^n& \rightarrow \mathbb{R} \\ (x_1, x_2, \cdots, x_n) & \rightarrow f(x_1, x_2, \cdots, x_n) \\ \mathbf{x} & \rightarrow f(\mathbf{x}) \end{align} \]
The notation \( \partial_{x_i} \) refers to the partial derivative with respect to the variable \( x_i \), in the same sense as discussed in the D(f, "x") and D(f, "y") examples above.
Courtesy of MartinThoma, CC0, via Wikimedia Commons
Formally we will write the gradient as follows.
Let \( f \) be a real valued function of multiple arguments, i.e.,
\[ \begin{align} f:\mathbb{R}^n& \rightarrow \mathbb{R}\\ \mathbf{x} & \rightarrow f(\mathbf{x}) \end{align} \]
We define the nabla symbol, \( \nabla \), algebraically to be the gradient operator.
When applied as follows, \[ \nabla f \] we define \[ \begin{align} \nabla f = \begin{pmatrix} \partial_{x_1}f \\ \partial_{x_2} f \\ \vdots \\ \partial_{x_n} f \end{pmatrix} \in \mathbb{R}^n \end{align} \] which is called the gradient of \( f \).
The partial derivatives above can be evaluated at a given point \( \mathbf{x}_0\in\mathbb{R}^n \), and this then gives the direction and rate of greatest ascent in the value of \( f \).
Consider again the expression f,
f
expression(x^3 * cos(x) * y + y)
Q: using the definition of the gradient on the last slide, what will this vector be?
We can compute its components by applying the D function as,
c(D(f, "x"), D(f, "y"))
[[1]]
(3 * x^2 * cos(x) - x^3 * sin(x)) * y
[[2]]
x^3 * cos(x) + 1
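These symbolic components can also be evaluated at a particular point to give a numeric gradient vector; a minimal sketch, where the point x = 1, y = 2 is an arbitrary illustrative choice:
# evaluate the gradient of f at the point x = 1, y = 2
c(eval(D(f, "x"), list(x = 1, y = 2)),
  eval(D(f, "y"), list(x = 1, y = 2)))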
We developed the first order approximation of the univariate function \( f(x) \) as \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1}. \]
We also use the gradient similar to the above as a linear approximation for the multivariate function \( f(\mathbf{x}) \).
Let \( \mathbf{x}_0\in \mathbb{R}^n \) be some vector and \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \), where \( \boldsymbol{\delta}_{x_1} \) is now a vector of small perturbations.
At first order, the Taylor series is given as
\[ f(\mathbf{x}_1) = f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel^2\right). \]
The approximation above holds when \( \boldsymbol{\delta}_{x_1} \) is a small perturbation of the vector \( \mathbf{x}_0 \), that is, when the Euclidean norm \( \parallel \boldsymbol{\delta}_{x_1}\parallel \) is small.
A similar statement to the second order approximation we developed for univariate function \( f(x) \), \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1} + f''(x_0)\frac{\delta_{x_1}^2}{2}, \] can be developed for the multivariate case.
In order to do so, we will need to introduce the idea of the Hessian as the second multivariate derivative of \( f \).
We suppose again that \( f \) is a scalar-valued function of multiple variables, \[ \begin{align} f: \mathbb{R}^n & \rightarrow \mathbb{R} \\ \mathbf{x} &\rightarrow f(\mathbf{x}) \end{align} \]
We will define the Hessian matrix for the function \( f \) as \( \mathbf{H}_f \) with, \[ \begin{align} \mathbf{H}_{f} = \begin{pmatrix} \partial_{x_1}^2 f & \partial_{x_1}\partial_{x_2} f & \cdots & \partial_{x_1}\partial_{x_n}f \\ \partial_{x_2}\partial_{x_1} f & \partial_{x_2}^2 f & \cdots & \partial_{x_2} \partial_{x_n}f \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_n}\partial_{x_1} f & \partial_{x_n}\partial_{x_2} f & \cdots & \partial_{x_n}^2 f \end{pmatrix} \end{align} \]
For short, this is often written as \( \mathbf{H}_f = \left\{ \partial_{x_i}\partial_{x_j} f\right\}_{i,j=1}^n \) where this refers to the \( i \)-th row and \( j \)-th column.
\( \mathbf{H}_f \) can be evaluated at a particular point \( \mathbf{x}_0 \), and notice that \( \mathbf{H}_f \) is symmetric whenever the second partial derivatives are continuous.
As before, let \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \) be given as a small perturbation of \( \mathbf{x}_0 \).
Using the Hessian as defined on the last slide, if \( f \) has continuous second order partial derivatives, the Taylor series is given at second order as \[ \begin{align} f(\mathbf{x}_1) = f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1} + \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel^3\right) \end{align} \]
Similarly, our second order approximation is defined by
\[ \begin{align} f(\mathbf{x}_1) \approx f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1}+ \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \end{align} \] which is accurate when the size of the perturbation \( \parallel \boldsymbol{\delta}_{x_1}\parallel \) is small.
We will consider how to use this approximation directly in our next lecture.
For now, we will look at an example of how we can compute this.
Courtesy of Babak. K. Shandiz, CC BY-SA 4.0, via Wikimedia Commons
For most functions it is not a trivial step to compute the gradient and the Hessian, and we may need to compute these numerically instead of analytically.
The package numDeriv has tools for computing these by forms of the finite difference approximations discussed earlier.
require(numDeriv)
f <- function(x){
  return(sum(x^2))
}
grad(f, x=c(0,0))
[1] 0 0
hessian(f, x=c(0,0))
[,1] [,2]
[1,] 2 0
[2,] 0 2
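The same functions can be evaluated at points where the derivatives are non-zero; a minimal check against the analytic gradient \( 2\mathbf{x} \) and Hessian \( 2\mathbf{I} \) of this particular \( f \), where the evaluation point is an arbitrary illustrative choice:
# at the point (1, 3) the analytic gradient is (2, 6) and the Hessian is 2 * I
grad(f, x=c(1, 3))
hessian(f, x=c(1, 3))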
Not all functions give a single value as an output; for example, we can instead write an expression as,
\[ \begin{align} \mathbf{f} : \mathbb{R}^n & \rightarrow \mathbb{R}^m\\ \mathbf{x} &\rightarrow \begin{pmatrix} f_1\left(\mathbf{x}\right) \\ \vdots \\ f_m(\mathbf{x})\end{pmatrix} \end{align} \]
In the above, each of the component-functions \( f_i \) for \( i=1,\cdots , m \) is a function like we saw before,
\[ \begin{align} f_i :\mathbb{R}^n &\rightarrow \mathbb{R} \\ \mathbf{x} &\rightarrow f_i (\mathbf{x}) \end{align} \]
For example, in dimensions \( \mathbb{R}^2 \rightarrow \mathbb{R}^2 \), this could look like
\[ \begin{align} \mathbf{f}(x, y) = \begin{pmatrix} x^2 \\ y^2 \end{pmatrix} \end{align} \] where the component functions are \( f_1(x,y) = x^2 \) and \( f_2(x,y) = y^2 \).
For a vector-valued function such as \( \mathbf{f} \) we can define the action of the gradient once again, but this now defines a matrix of first derivatives called the Jacobian.
This is given as, \[ \begin{align} \nabla \mathbf{f} &= \begin{pmatrix} \partial_{x_1} f_1 & \partial_{x_2} f_1 & \cdots & \partial_{x_n} f_1 \\ \partial_{x_1} f_2 & \partial_{x_2} f_2 & \cdots & \partial_{x_n} f_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_1} f_m & \partial_{x_2} f_m & \cdots & \partial_{x_n} f_m \end{pmatrix}\in\mathbb{R}^{m\times n}. \end{align} \]
The Jacobian has a similar interpretation to the gradient, but the meaning is more subtle due to the multiple inputs and outputs of \( \mathbf{f} \) simultaneously.
However, we can easily restate the first order Taylor expansion as,
\[ \begin{align} \mathbf{f}\left(\mathbf{x}_1\right) = \mathbf{f}\left(\mathbf{x}_0\right) + \nabla \mathbf{f}\left(\mathbf{x}_0\right) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1} \parallel^2\right) \end{align} \]
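The numDeriv package can also approximate the Jacobian of a vector-valued function with its jacobian function; a minimal sketch for the example map above, where the evaluation point is an arbitrary illustrative choice:
require(numDeriv)

# the example map f(x, y) = (x^2, y^2) as a vector-valued R function
f_vec <- function(x){
  return(c(x[1]^2, x[2]^2))
}

# numerical Jacobian at the point (1, 3); analytically this is diag(2, 6)
jacobian(f_vec, x=c(1, 3))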
The Newton-Raphson method can now be restated in terms of multiple variables as follows.
Suppose that we have a nonlinear inverse problem stated as follows:
\[ \begin{align} \mathbf{f} :\mathbb{R}^n &\rightarrow \mathbb{R}^n \\ \mathbf{x} & \rightarrow \mathbf{f}(\mathbf{x}) \\ \mathbf{f}\left(\mathbf{x}^\ast\right)& = \mathbf{b} \end{align} \]
We redefine this in terms of the adjusted function \( \tilde{\mathbf{f}}\left(\mathbf{x}^\ast\right) = \mathbf{f}\left(\mathbf{x}^\ast\right) - \mathbf{b} = \boldsymbol{0} \), and we wish to make the same first order approximation as before.
Supposing we have a good initial guess \( \mathbf{x}_0 \) for \( \mathbf{x}^\ast \), we look for the point where the linear approximation equals zero, i.e., \[ \begin{align} 0 &= \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) + \nabla \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) \boldsymbol{\delta}_{x_1} \\ \Leftrightarrow \boldsymbol{\delta}_{x_1} &= -\left(\nabla \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) \right)^{-1} \tilde{\mathbf{f}}\left(\mathbf{x}_0\right). \end{align} \]
The above makes sense as an approximation as long as \( \left(\nabla \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) \right)^{-1} \) exists, i.e., as long as the Jacobian has a non-zero determinant (no zero eigenvalues).
We can thus once again update the approximation recursively so that \( \mathbf{x}_1 = \mathbf{x}_0 -\left(\nabla \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) \right)^{-1} \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) \), as long as the inverse exists.
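A minimal sketch of this multivariate iteration in R, using numDeriv to approximate the Jacobian numerically; the example system, starting guess, and number of iterations below are illustrative choices rather than part of the original example:
require(numDeriv)

# example system: find x with f_tilde(x) = 0, where
# f_tilde(x, y) = (x^2 - 2, x * y - 3)
f_tilde <- function(x){
  return(c(x[1]^2 - 2, x[1] * x[2] - 3))
}

# Newton-Raphson iteration from an initial guess
x <- c(1, 1)
for(i in 1:10){
  # solve the linear system for the update rather than forming the inverse
  x <- x - solve(jacobian(f_tilde, x), f_tilde(x))
}
x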
We will continue this idea in the next lecture, looking at how to minimize or maximize functions.