Numerical differentiation in R

Outline

  • The following topics will be covered in this lecture:
    • Concepts in analytical differentiation
    • An approach to numerical differentiation
    • Newton’s method in one variable
    • Functions of multiple variables
    • Gradient and Hessians
    • Multivariate Taylor approximation
    • Jacobians
    • Newton’s method in multiple variables

Concepts in analytical differentiation

  • Using the rules of calculus, e.g.,

    1. the product rule;
    2. the power rule;
    3. the chain rule;
  • we can compute derivatives of complex functions analytically.

  • However, if the function is only given in an implicit form, e.g., as the output of an algorithm, we still need approximate ways to compute such derivatives.

  • We will begin by introducing concepts in analytical differentiation and then discuss one approach to approximating derivatives numerically.

    • This will include some examples on how to compute such analytical derivatives and approximations in R.
  • We will follow this with one of the primary uses of such an approach, solving systems of nonlinear equations.

Concepts in analytical differentiation

  • Recall, R has the ability to recognize mathematical expressions as objects.

    • E.g., let us create the following expression:
f <- expression(x^3 * cos(x))
  • R also has a means to differentiate such expressions automatically;

    • This is done with the function 'D', which takes syntax of the form
D(expression, variable_name)
  • Q: suppose we use f as the expression and x as the variable name, what will be the result?

    • A: we can compute the above derivative analytically using a combination of the product and power rules to obtain:
D(f, "x")
3 * x^2 * cos(x) - x^3 * sin(x)
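  • As a quick additional sketch, the expression returned by D can be evaluated at a numeric value of x by supplying an environment to eval (the name d_f below is illustrative):
d_f <- D(f, "x")
eval(d_f, list(x = 1))  # 3 * cos(1) - sin(1), approximately 0.779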

Concepts in analytical differentiation

Tangent line approximation by derivative.

Courtesy of Pbroks13, CC BY-SA 3.0, via Wikimedia Commons

  • The derivative represents the slope of a tangent line to a curve.
  • In the figure to the left, we see the function \( f \) represented by the blue curve.
  • The derivative \( f'(x) \) at a given point gives the infinitesimal rate of change at that point with respect to small changes in \( x \), denoted \( \delta_x \).
  • Suppose we have a point \( x_0 \) and a nearby point \( x_1 \) that differs from it by only a small amount \( \delta_{x_1} \), \[ x_1 = x_0+\delta_{x_1}. \]
  • The approximation \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1} \] is what is known as the tangent line approximation to the function \( f \).
  • Such an approximation exists when \( f \) is sufficiently smooth and is accurate when \( \delta_{x_1} \) is small, so that the difference of \( x_1 \) from the fixed value \( x_0 \) is small.
  • We can see graphically how the approximation becomes worse as we take \( \delta_{x_1} \) too large.
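  • The same effect can be checked numerically; the following is a small sketch reusing the expression f defined earlier in the lecture, comparing the tangent line approximation with the exact value for increasing perturbation sizes:
f <- expression(x^3 * cos(x))
df <- D(f, "x")
x0 <- 1
for (delta in c(0.01, 0.1, 1)) {
  tangent <- eval(f, list(x = x0)) + eval(df, list(x = x0)) * delta  # tangent line approximation
  exact <- eval(f, list(x = x0 + delta))                             # exact value of f at x0 + delta
  cat("delta =", delta, " approximation error =", abs(exact - tangent), "\n")
}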

Concepts in analytical differentiation

  • More generally, the tangent line approximation is one kind of general Taylor approximation.

  • Suppose we have a point \( x_0 \) fixed, and define \( x_1 \) as a small perturbation \[ x_1 = x_0+\delta_{x_1}, \]

  • If a function \( f \) has \( k \) continuous derivatives we can write \[ f(x_1) = f(x_0) + f'(x_0)\delta_{x_1} + \frac{f''(x_0)}{2!}\delta_{x_1}^2 + \cdots + \frac{f^{(k)}(x_0)}{k!} \delta_{x_1}^k + \mathcal{o}\left(\delta_{x_1}^{k+1}\right) \]

  • The \( \mathcal{o}\left(\delta_{x_1}^{k+1}\right) \) refers to the remainder terms, which grow or shrink like the size of the perturbation to the power \( k+1 \).

    • This is why this approximation works well when \( \delta_{x_1} \) is a small perturbation.
  • Another important practical example of using this Taylor approximation, when the function \( f \) has two continuous derivatives, is \[ f(x_0 + \delta_x) \approx f(x_0) + f'(x_0)\delta_x + f''(x_0) \frac{\delta_x^2}{2} \] which will be used shortly for obtaining solutions to several kinds of equations.

  • Particularly, this is strongly related to our second derivative test from univariate calculus.
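  • As a brief sketch (again reusing the expression f from earlier), this second order approximation can be formed in R by applying D twice:
f <- expression(x^3 * cos(x))
df <- D(f, "x")    # first derivative
d2f <- D(df, "x")  # second derivative
x0 <- 1
delta <- 0.1
approx <- eval(f, list(x = x0)) + eval(df, list(x = x0)) * delta +
  eval(d2f, list(x = x0)) * delta^2 / 2
exact <- eval(f, list(x = x0 + delta))
c(approx = approx, exact = exact)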

An approach to numerical differentiation

  • We now consider how the Taylor expansion at first order can be used to approximate the derivative numerically.

  • Recall, we write

    \[ \begin{align} f(x_1) &= f(x_0) + f'(x_0) \delta_{x_1} + o\left( \delta_{x_1}^2\right) \\ \Leftrightarrow \frac{f(x_1) - f(x_0)}{ \delta_{x_1}} &= f'(x_0) + o\left( \delta_{x_1}\right) \end{align} \]

  • This says that for a small value of \( \delta_{x_1} \), the difference quotient on the left hand side gives a numerical approximation of \( f'(x_0) \) with an error on the order of \( \delta_{x_1} \).

  • This simple approach is the basic version of a finite difference approximation to the derivative.

  • In simple cases this can be sufficiently accurate; variations on this scheme can give better approximations.

  • We will return to computing such numerical derivatives in R when we introduce the approach in its full generality for multiple variables.
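  • As a minimal sketch of the forward difference above, applied to \( f(x) = x^3\cos(x) \) at \( x_0 = 1 \):
f <- function(x){
  return (x^3 * cos(x))
}
delta <- 1e-6
(f(1 + delta) - f(1)) / delta  # compare with the exact derivative 3 * cos(1) - sin(1)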

Newton's method in one variable

  • We have seen earlier the basic linear inverse problem,

    \[ \begin{align} \mathbf{A}\mathbf{x} = \mathbf{b} \end{align} \] where \( \mathbf{b} \) is an observed quantity and \( \mathbf{x} \) are the unknown variables related to \( \mathbf{b} \) by the relationships in \( \mathbf{A} \).

    • We observed that a unique solution exists when \( \mathrm{det}\left(\mathbf{A}\right)\neq 0 \), i.e., when the columns of \( \mathbf{A} \) are linearly independent (equivalently, no eigenvalue equals zero).
  • A similar problem exists when the relationship between \( \mathbf{x} \) and \( \mathbf{b} \) is non-linear, but we still wish to find some such \( \mathbf{x} \).

  • Suppose we know the nonlinear function \( f \) that gives a relationship in one variable as \[ \begin{align} f(x^\ast) = b \end{align} \] for an observed \( b \) but an unknown \( x^\ast \).

  • We will start by re-writing the equation into a more general form, defining a function \[ \begin{align} \tilde{f}(x) = f(x)-b. \end{align} \]

  • Thus solving the nonlinear inverse problem in one variable is equivalent to finding the appropriate \( x^\ast \) for which \[ \begin{align} \tilde{f}(x^\ast)= 0 . \end{align} \]

  • The means of finding one such \( x^\ast \) is known as root finding.

  • The Newton-Raphson method (often Newton's for short) is one classical approach which has inspired many modern techniques for complex systems of equations – we will introduce the main concepts here.

Newton's method in one variable

  • We are searching for the point \( x^\ast\in \mathbb{R} \) for which the modified equation \( \tilde{f}\left(x^\ast\right) = 0 \), and we suppose we have a good initial guess \( x_0 \).
  • We define the tangent approximation as, \[ t(\delta_x) = \tilde{f}(x_0) + \tilde{f}'(x_0) \delta_x \] for some small perturbation value of \( \delta_x \).
  • Recall, \( \tilde{f}'(x_0) \) refers to the value of the derivative of \( \tilde{f} \) at the point \( x_0 \) – suppose this value is nonzero.
  • In this case, we will examine where the tangent line intersects zero to find a better approximation of \( x^\ast \).
  • Suppose that for \( \delta_{x_0} \) we have \[ \begin{matrix} t(\delta_{x_0}) = 0 & \Leftrightarrow & 0= \tilde{f}(x_0) + \tilde{f}'(x_0) \delta_{x_0} & \Leftrightarrow &\delta_{x_0} = \frac{-\tilde{f}(x_0)}{\tilde{f}'(x_0)} \end{matrix} \]
  • The above solution makes sense as long as \( \tilde{f}'(x_0) \) is not equal to zero;
  • in that case, this says that the tangent line intersects zero at \( x_1 = x_0 + \delta_{x_0} = x_0 - \frac{\tilde{f}(x_0)}{\tilde{f}'(x_0)} \), giving a new approximation of \( x^\ast \).
Animation of Newton iterations.

Courtesy of Ralf Pfeifer, CC BY-SA 3.0, via Wikimedia Commons

  • The process of recursively solving for a better approximation of \( x^\ast \) terminates when we reach a certain tolerated level of error in the solution or the process times out, failing to converge.
  • This method has a direct analog in multiple variables, for which we will need to return to the concept of the matrix inverse.
  • We will return to this at the end of the lecture and for now consider a simple example.
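  • Before that example, the recursion just described can be sketched directly in R; the following is a minimal illustration (an illustrative helper, not a built-in solver), assuming the function and its derivative are supplied as R functions:
newton_1d <- function(f_tilde, df_tilde, x0, tol = 1e-8, max_iter = 100){
  x <- x0
  for (i in 1:max_iter){
    fx <- f_tilde(x)
    if (abs(fx) < tol) break     # stop at the tolerated level of error
    x <- x - fx / df_tilde(x)    # x_new = x_old - f_tilde(x) / f_tilde'(x)
  }
  return (x)
}
newton_1d(function(x){x^2 - 4}, function(x){2 * x}, x0 = 3)  # converges to the root 2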

Newton's method in one variable – example

  • In R, root finding in a single variable is implemented by the function uniroot, with syntax of the form
uniroot(function_to_root_find, interval_to_search_for_roots)
  • We will consider the polynomial \( x^2-4 = (x+2)(x-2) \) which clearly has roots at \( \pm 2 \),
f <- function(x){
  return (x^2 - 4)
}
uniroot(f, c(-3, 0))
$root
[1] -2.000001

$f.root
[1] 3.223832e-06

$iter
[1] 6

$init.it
[1] NA

$estim.prec
[1] 6.103516e-05
  • Notice this solves for the root \( -2 \) in the interval.

Newton's method in one variable – example

  • Now consider,
uniroot(f, c(0, 3))
$root
[1] 2.000001

$f.root
[1] 3.223832e-06

$iter
[1] 6

$init.it
[1] NA

$estim.prec
[1] 6.103516e-05

Newton's method in one variable – example

  • But if we try the following interval,
uniroot(f,c(-3,3))
Error in uniroot(f, c(-3, 3)): f() values at end points not of opposite sign
  • we get an error message.

  • This is because the solver requires a bracketing interval: the values of \( f(-3) \) and \( f(3) \) must be of opposite sign so that a root is guaranteed to lie between them.

    • In this case, the solver does not have enough information to begin its search, and the interval should be narrowed around one of the roots, as illustrated below.
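  • For instance (a quick additional illustration), narrowing the interval so that the endpoint values have opposite signs allows the search to proceed:
uniroot(f, c(-3, 1))$root  # f(-3) = 5 and f(1) = -3 have opposite signs; the root near -2 is found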

Multiple variables

  • Recall our earlier expression f
f <- expression(x^3 * cos(x))
  • Q: suppose we differentiate the expression with respect to y, what will be the answer?

    • A: all values in the above expression are constant with respect to the value y so that,
D(f,"y")
[1] 0
  • This is the basic principle of partial derivatives: terms of an expression that do not involve the variable of differentiation, e.g., y, are treated as constants when differentiating with respect to the other variable.

Multiple variables

  • Suppose we redefine our expression f as,
f <- expression(x^3 * cos(x) * y + y)
  • Q: what will the derivative of the above expression evaluate to when taken with respect to y? What about with respect to x?

    • A: we find that,
D(f, "y")
x^3 * cos(x) + 1
D(f, "x")
(3 * x^2 * cos(x) - x^3 * sin(x)) * y
  • We can extend this to arbitrary functions,

    \[ \begin{align} f:\mathbb{R}^n& \rightarrow \mathbb{R} \\ (x_1, x_2, \cdots, x_n) & \rightarrow f(x_1, x_2, \cdots, x_n) \\ \mathbf{x} & \rightarrow f(\mathbf{x}) \end{align} \]

  • The notation \( \partial_{x_i} \) refers to the partial derivative with respect to the variable \( x_i \), in the same sense as discussed in the D(f, "x") and D(f, "y") examples above.

The gradient

Surface of a function of two variables with tangent (gradient) vectors.

Courtesy of MartinThoma, CC0, via Wikimedia Commons

  • As with the linear approximation in one variable, \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1} \] the derivative in multiple variables can be viewed as similar to the slope of a tangent line, but for a higher dimensional function.
  • Instead of giving just a slope, in higher dimensions we are looking at a tangent-vector;
    • this gives the instantaneous direction and velocity of the change in the function \( f \) with respect to a small perturbation of, e.g., \( \left(\delta_{x_1}, \delta_{y_1}\right) \).
  • In the picture to the left, we see various tangent directions and velocities represented by the direction and the size of arrows.
    • The arrows depend on the particular value in the \( (x,y) \)-plane below the surface, and the surface represents the values of \[ f(x, y) = -\left(\cos(x)^2 + \sin(y)^2\right)^2 \] according to each value of \( (x,y) \) below.
  • These particular tangent vectors give the direction and velocity where the change is most rapid along the surface above.
  • This particular choice of tangent vector pictured above has a special name: the gradient vector.
    • The gradient vector is the direction and velocity of the greatest rate of increase in the output of \( f \) in a perturbation of \( (\delta_{x_1},\delta_{y_1}) \) at a given point \( (x_0, y_0) \).

The gradient

  • Formally we will write the gradient as follows.

  • Let \( f \) be a real valued function of multiple arguments, i.e.,

    \[ \begin{align} f:\mathbb{R}^n& \rightarrow \mathbb{R}\\ \mathbf{x} & \rightarrow f(\mathbf{x}) \end{align} \]

  • We define the symbol nabla, \( \nabla \), algebraically to be the gradient operation.

  • When applied as follows, \[ \nabla f \] we define \[ \begin{align} \nabla f = \begin{pmatrix} \partial_{x_1}f \\ \partial_{x_2} f \\ \vdots \\ \partial_{x_n} f \end{pmatrix} \in \mathbb{R}^n \end{align} \] which is called the gradient of \( f \).

  • The partial derivatives above can be evaluated at a given point \( \mathbf{x}_0\in\mathbb{R}^n \) and this will then give the direction and velocity of greatest ascent in the value of \( f \).

The gradient

  • Recall our function f
f
expression(x^3 * cos(x) * y + y)
  • Q: using the definition of the gradient on the last slide, what will this vector be?

    • A: this can be computed with the D function as,
c(D(f,"x"), D(f, "y"))
[[1]]
(3 * x^2 * cos(x) - x^3 * sin(x)) * y

[[2]]
x^3 * cos(x) + 1
  • I.e., \[ \nabla f = \begin{pmatrix}\partial_x f \\ \partial_y f \end{pmatrix}. \]
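  • As a small sketch, the symbolic gradient can also be evaluated at a particular point, say \( (x, y) = (1, 2) \), by evaluating each component expression (the name grad_f below is illustrative):
grad_f <- c(D(f, "x"), D(f, "y"))
sapply(grad_f, eval, envir = list(x = 1, y = 2))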

The gradient

  • We developed the first order approximation of the univariate function \( f(x) \) as \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1}. \]

  • We also use the gradient, in a similar way to the above, as a linear approximation for the multivariate function \( f(\mathbf{x}) \).

  • Let \( \mathbf{x}_0\in \mathbb{R}^n \) be some vector and \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \), where \( \boldsymbol{\delta}_{x_1} \) is now a vector of small perturbations.

  • At first order, the Taylor series is given as

    \[ f(\mathbf{x}_1) = f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel^2\right). \]

  • The approximation above is accurate when \( \boldsymbol{\delta}_{x_1} \) is a small perturbation of the vector \( \mathbf{x}_0 \), that is, when the Euclidean norm \( \parallel \boldsymbol{\delta}_{x_1}\parallel \) is small.

The Hessian

  • A similar statement to the second order approximation we developed for univariate function \( f(x) \), \[ f(x_1) \approx f(x_0) + f'(x_0)\delta_{x_1} + f''(x_0)\frac{\delta_{x_1}^2}{2}, \] can be developed for the multivariate case.

  • In order to do so, we will need to introduce the idea of the Hessian as the second derivative of \( f \) in multiple variables.

  • We suppose again that \( f \) is a scalar-valued function of multiple variables, \[ \begin{align} f: \mathbb{R}^n & \rightarrow \mathbb{R} \\ \mathbf{x} &\rightarrow f(\mathbf{x}) \end{align} \]

  • We will define the Hessian matrix for the function \( f \) as \( \mathbf{H}_f \) with, \[ \begin{align} \mathbf{H}_{f} = \begin{pmatrix} \partial_{x_1}^2 f & \partial_{x_1}\partial_{x_2} f & \cdots & \partial_{x_1}\partial_{x_n}f \\ \partial_{x_2}\partial_{x_1} f & \partial_{x_2}^2 f & \cdots & \partial_{x_2} \partial_{x_n}f \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_n}\partial_{x_1} f & \partial_{x_n}\partial_{x_2} f & \cdots & \partial_{x_n}^2 f \end{pmatrix} \end{align} \]

  • For short, this is often written as \( \mathbf{H}_f = \left\{ \partial_{x_i}\partial_{x_j} f\right\}_{i,j=1}^n \) where this refers to the \( i \)-th row and \( j \)-th column.

  • \( \mathbf{H}_f \) can be evaluated at a particular point \( \mathbf{x}_0 \), and notice that \( \mathbf{H}_f \) is symmetric whenever the second partial derivatives are continuous, as illustrated below.
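  • As a brief sketch of this symmetry, the mixed second partial derivatives of the earlier expression f can be computed by nesting calls to D, and they agree:
f <- expression(x^3 * cos(x) * y + y)
D(D(f, "x"), "y")  # differentiate first in x, then in y
D(D(f, "y"), "x")  # differentiate first in y, then in x; same result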

The Hessian

  • As before, let \( \mathbf{x}_1 = \mathbf{x}_0 + \boldsymbol{\delta}_{x_1} \) be given as a small perturbation of \( \mathbf{x}_0 \).

  • Using the Hessian as defined on the last slide, if \( f \) has continuous second order partial derivatives, the Taylor series is given at second order as \[ \begin{align} f(\mathbf{x}_1) = f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1} + \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}_0) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1}\parallel^3\right) \end{align} \]

  • Similarly, our second order approximation is defined by

\[ \begin{align} f(\mathbf{x}_1) \approx f(\mathbf{x}_0) + \left(\nabla f(\mathbf{x}_0)\right)^\mathrm{T} \boldsymbol{\delta}_{x_1}+ \frac{1}{2} \boldsymbol{\delta}_{x_1}^\mathrm{T}\mathbf{H}_f (\mathbf{x}_0) \boldsymbol{\delta}_{x_1} \end{align} \] which is accurate when the size of the perturbation \( \parallel \boldsymbol{\delta}_{x_1}\parallel \) is small.

  • We will consider how to use this approximation directly in our next lecture.

  • For now, we will look at an example of how we can compute this.

An example of computing the gradient and the Hessian

Graph of simple paraboloid.

Courtesy of Babak. K. Shandiz, CC BY-SA 4.0, via Wikimedia Commons

  • Consider the function for the simple paraboloid in two variables \( f(x_1,x_2) = x_1^2 + x_2^2 \).
  • We will consider how to compute both the analytical gradient and Hessian, and how to perform this numerically.
  • Q: using the definitions, what are the gradient vector and Hessian matrix for \( f \) defined above?
    • A: we recall, \[ \begin{align} \nabla f &= \begin{pmatrix} \partial_{x_1}f \\ \partial_{x_2}f \end{pmatrix} = \begin{pmatrix} 2x_1 \\ 2x_2 \end{pmatrix} \\ \mathbf{H}_f& = \begin{pmatrix} \partial_{x_1}^2 f & \partial_{x_1}\partial_{x_2} f \\ \partial_{x_2}\partial_{x_1} f & \partial_{x_2}^2 f \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \end{align} \]
  • Very similar to the notion of a critical point in one variable, we have that \( \nabla f = \boldsymbol{0} \), the zero vector, at a critical point for \( f \).
  • Q: in the graph, what is the critical point pictured?
  • A: the critical point is the unique minimum at \( (0,0) \).
  • Q: what is special about the Hessian above?
  • A: the matrix is already diagonal, so its eigenvalues can be read directly from the diagonal, and they are all positive – we will return to this next lecture.

An example of computing the gradient and the Hessian

  • For most functions it is not trivial to compute the gradient and the Hessian analytically, and we may need to compute them numerically instead.

  • The package numDeriv has tools for computing these by forms of the finite difference approximations discussed earlier.

require(numDeriv)
f <- function(x){
  return (sum(x^2))
}
  • Now, we consider the evaluation of the gradient and the Hessian at the critical point to find:
grad(f, x=c(0,0))
[1] 0 0
hessian(f, x=c(0,0))
     [,1] [,2]
[1,]    2    0
[2,]    0    2
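  • As a further sketch, these numerical values can be used to check the second order approximation from the previous slides at a non-critical point, say \( \mathbf{x}_0 = (1, 2)^\mathrm{T} \); for this quadratic function the approximation is in fact exact:
x0 <- c(1, 2)
delta <- c(0.1, -0.05)
approx <- f(x0) + sum(grad(f, x = x0) * delta) +
  0.5 * t(delta) %*% hessian(f, x = x0) %*% delta
c(approx = as.numeric(approx), exact = f(x0 + delta))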

The Jacobian

  • Not all functions give a single value as an output; for example, we can instead write a vector-valued function as,

    \[ \begin{align} \mathbf{f} : \mathbb{R}^n & \rightarrow \mathbb{R}^m\\ \mathbf{x} &\rightarrow \begin{pmatrix} f_1\left(\mathbf{x}\right) \\ \vdots \\ f_m(\mathbf{x})\end{pmatrix} \end{align} \]

  • In the above, each of the component-functions \( f_i \) for \( i=1,\cdots , m \) is a function like we saw before,

    \[ \begin{align} f_i :\mathbb{R}^n &\rightarrow \mathbb{R} \\ \mathbf{x} &\rightarrow f_i (\mathbf{x}) \end{align} \]

  • For example, in dimensions \( \mathbb{R}^2 \rightarrow \mathbb{R}^2 \) this could look like

    \[ \begin{align} \mathbf{f}(x, y) = \begin{pmatrix} x^2 \\ y^2 \end{pmatrix} \end{align} \] where

    • \( f_1(x,y) = x^2 \), and
    • \( f_2(x,y) = y^2 \).

The Jacobian

  • For a vector-valued function such as \( \mathbf{f} \), we can apply the gradient once again, but now it defines a matrix of first derivatives called the Jacobian.

  • This is given as, \[ \begin{align} \nabla \mathbf{f} &= \begin{pmatrix} \partial_{x_1} f_1 & \partial_{x_2} f_1 & \cdots & \partial_{x_n} f_1 \\ \partial_{x_1} f_2 & \partial_{x_2} f_2 & \cdots & \partial_{x_n} f_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{x_1} f_m & \partial_{x_2} f_m & \cdots & \partial_{x_n} f_m \end{pmatrix}\in\mathbb{R}^{m\times n}. \end{align} \]

  • The Jacobian has a similar interpretation to the gradient, but the meaning is more subtle due to the multiple inputs and outputs of \( \mathbf{f} \) simultaneously.

  • However, we can easily restate the first order Taylor expansion as,

\[ \begin{align} \mathbf{f}\left(\mathbf{x}_1\right) = \mathbf{f}\left(\mathbf{x}_0\right) + \nabla \mathbf{f}\left(\mathbf{x}_0\right) \boldsymbol{\delta}_{x_1} + \mathcal{o}\left(\parallel \boldsymbol{\delta}_{x_1} \parallel^2\right) \end{align} \]

  • This again gives a linear approximation, but in higher dimensions in terms of a vector tangent direction \( \nabla \mathbf{f}\left(\mathbf{x}_0\right) \boldsymbol{\delta}_{x_1} \) away from the value \( \mathbf{f}(\mathbf{x}_0) \) approximating \( \mathbf{f}\left(\mathbf{x}_1\right) \).
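  • As a sketch, the Jacobian can also be approximated numerically with the jacobian function from the numDeriv package loaded earlier, using the example \( \mathbf{f}(x, y) = (x^2, y^2)^\mathrm{T} \) from the previous slide (the name f_vec is illustrative):
f_vec <- function(x){
  return (c(x[1]^2, x[2]^2))
}
jacobian(f_vec, x = c(1, 3))  # approximately the analytical Jacobian diag(c(2, 6))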

Multivariate Newton

  • The Newton-Raphson method can now be restated in terms of multiple variables as follows.

  • Suppose that we have a nonlinear inverse problem stated as follows:

\[ \begin{align} \mathbf{f} :\mathbb{R}^n &\rightarrow \mathbb{R}^n \\ \mathbf{x} & \rightarrow \mathbf{f}(\mathbf{x}) \\ \mathbf{f}\left(\mathbf{x}^\ast\right)& = \mathbf{b} \end{align} \]

  • We redefine this in terms of the adjusted function \( \tilde{\mathbf{f}}\left(\mathbf{x}^\ast\right) = \mathbf{f}\left(\mathbf{x}^\ast\right) - \mathbf{b} = \boldsymbol{0} \), and we wish to make the same first order approximation as before.

  • Supposing we have a good initial guess \( \mathbf{x}_0 \) for \( \mathbf{x}^\ast \), we look for the point where the linear approximation equals zero, i.e., \[ \begin{align} 0 &= \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) + \nabla \mathbf{f}\left(\mathbf{x}_0\right) \boldsymbol{\delta}_{x_1} \\ \Leftrightarrow \boldsymbol{\delta}_{x_1} &= -\left(\nabla \mathbf{f}\left(\mathbf{x}_0\right) \right)^{-1} \tilde{\mathbf{f}}\left(\mathbf{x}_0\right), \end{align} \] noting that \( \nabla \tilde{\mathbf{f}} = \nabla \mathbf{f} \) since \( \mathbf{b} \) is a constant.

  • The above makes sense as an approximation as long as \( \left(\nabla \mathbf{f}\left(\mathbf{x}_0\right) \right)^{-1} \) exists, i.e., as long as it has a non-zero determinant (no zero eigenvalues).

  • We can thus once again update the approximation recursively so that \( \mathbf{x}_1 = \mathbf{x}_0 -\left(\nabla \mathbf{f}\left(\mathbf{x}_0\right) \right)^{-1} \tilde{\mathbf{f}}\left(\mathbf{x}_0\right) \), as long as the inverse exists; a sketch of this recursion in R is given below.

  • We will continue this idea in the next lecture, looking at how to minimize or maximize functions.
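  • As a minimal sketch of the multivariate recursion above (an illustrative helper, assuming the numDeriv package loaded earlier for the numerical Jacobian):
newton_multi <- function(f_tilde, x0, tol = 1e-8, max_iter = 100){
  x <- x0
  for (i in 1:max_iter){
    fx <- f_tilde(x)
    if (sqrt(sum(fx^2)) < tol) break           # stop at the tolerated level of error
    x <- x - solve(jacobian(f_tilde, x), fx)   # Newton update with the numerical Jacobian
  }
  return (x)
}
newton_multi(function(x){c(x[1]^2 - 4, x[2]^2 - 9)}, x0 = c(1, 1))  # converges to (2, 3)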