
SEBASTIAN RASCHKA

Introduction to
Artificial Neural Networks
and Deep Learning
with Applications in Python

Sebastian Raschka

DRAFT
Last updated: February 12, 2019

This book will be available at http://leanpub.com/ann-and-deeplearning.

Please visit https://github.com/rasbt/deep-learning-book for more information, supporting material, and code examples.

© 2016-2018 Sebastian Raschka


Contents

D Calculus and Differentiation Primer
    D.1 Intuition
    D.2 Derivatives of Common Functions
    D.3 Common Differentiation Rules
    D.4 The Chain Rule – Computing the Derivative of a Composition of Functions
        D.4.1 A Chain Rule Example
    D.5 Arbitrarily Long Function Compositions
    D.6 When a Function is Not Differentiable
    D.7 Partial Derivatives and Gradients
    D.8 Second Order Partial Derivatives
    D.9 The Multivariable Chain Rule
    D.10 The Multivariable Chain Rule in Vector Form
    D.11 The Hessian Matrix
    D.12 The Laplacian Operator

Website

Please visit the GitHub repository to download the code examples accompanying this book and other supplementary material.
If you like the content, please consider supporting the work by buying a copy of the book on Leanpub. Also, I would appreciate hearing your opinion and feedback about the book, and if you have any questions about the contents, please don't hesitate to get in touch with me via mail@sebastianraschka.com. Happy learning!

Sebastian Raschka

About the Author

Sebastian Raschka received his doctorate from Michigan State University, where he developed novel computational methods in the field of computational biology. In summer 2018, he joined the University of Wisconsin–Madison as Assistant Professor of Statistics. Among other topics, his research activities include the development of new deep learning architectures to solve problems in the field of biometrics. Among his other works is his book "Python Machine Learning," a bestselling title at Packt and on Amazon.com, which received the ACM Best of Computing award in 2016 and was translated into many different languages, including German, Korean, Italian, traditional Chinese, simplified Chinese, Russian, Polish, and Japanese.
Sebastian is also an avid open-source contributor and likes to contribute to the scientific Python ecosystem in his free time. If you would like to find out more about what Sebastian is currently up to or would like to get in touch, you can find his personal website at https://sebastianraschka.com.

Acknowledgements

I would like to give my special thanks to the readers, who provided feedback, caught various typos and errors, and offered suggestions for clarifying my writing.

• Appendix A: Artem Sobolev, Ryan Sun

• Appendix B: Brett Miller, Ryan Sun

• Appendix D: Marcel Blattner, Ignacio Campabadal, Ryan Sun, Denis Parra Santander

• Appendix F: Guillermo Monecchi, Ged Ridgway, Ryan Sun, Patric Hindenberger

• Appendix H: Brett Miller, Ryan Sun, Nicolas Palopoli, Kevin Zakka

Appendix D

Calculus and Differentiation Primer

Calculus is a discipline of mathematics that provides us with tools to analyze rates of change, decay, or motion. Both Isaac Newton and Gottfried Leibniz developed the foundations of calculus independently in the 17th century. Although we recognize Newton and Leibniz as the founding fathers of calculus, the field has a very long list of contributors, dating back to the ancient period and including Archimedes, Galileo, Plato, and Pythagoras, just to name a few [Boyer, 1970].
In this appendix, we will only concentrate on the subfield of calculus that is of most relevance to machine and deep learning: differential calculus. In simple terms, differential calculus is concerned with instantaneous rates of change, or computing the slope of a function. We will review the basic concepts of computing the derivatives of functions that take one or more arguments. Also, we will refresh the concept of the chain rule, a rule that we use to compute the derivatives of composite functions, which we so often deal with in machine learning.

D.1 Intuition
So, what is the derivative of a function? In simple terms, the derivative of a function is the function's instantaneous rate of change. Now, let us start this section with a visual explanation, where we consider the function

f(x) = 2x,    (D.1)


shown in the graph in Figure D.1.

Figure D.1: Graph of the linear function f(x) = 2x.

Given the linear function in Equation D.1, we can interpret the "rate of change" as the slope of this function. To compute the slope of a function, we take an arbitrary x-axis value, say a, and plug it into the function: f(a). Then, we take another value on the x-axis, let us call it b = a + ∆a, where ∆a is the change between a and b. Now, to compute the slope of this linear function, we divide the change in the function's output, f(a + ∆a) − f(a), by the change in the function's input, (a + ∆a) − a:

Slope = [f(a + ∆a) − f(a)] / [(a + ∆a) − a].    (D.2)

In other words, the slope is simply the ratio of the change in the function's output to the change in its input:

Slope = [f(a + ∆a) − f(a)] / [(a + ∆a) − a] = [f(a + ∆a) − f(a)] / ∆a.    (D.3)
Now, let’s take this intuition, the slope of a linear function, and formulate
the general definition of the derivative of a continuous function f(x):


f'(x) = df/dx = lim_{∆x→0} [f(x + ∆x) − f(x)] / ∆x,    (D.4)

where lim_{∆x→0} means "as the change in x becomes infinitely small (that is, as ∆x approaches zero)." Since this appendix is merely a refresher rather than a comprehensive calculus resource, we have to skip over some important concepts such as limit theory. So, if this is the first time you encounter calculus, I recommend consulting additional resources such as "Calculus I, II, and III" by Jerrold E. Marsden and Alan Weinstein¹.

Infobox D.1.1 Derivative Notations

The two different notations df/dx and f'(x) both refer to the derivative of a function f(x). The former is the "Leibniz notation," and the latter is the "Lagrange notation." In Leibniz notation, df/dx is sometimes also written as (d/dx) f(x), and d/dx is an operator that we read as "differentiation with respect to x." Although the Leibniz notation looks a bit verbose at first, it plays nicely into our intuition by regarding df as a small change in the output of a function f and dx as a small change of its input x. Hence, we can interpret the ratio df/dx as the slope of the function graph at a given point.

Based on the linear function introduced at the beginning of this section (Equation D.1), let us use these concepts to compute the derivative of this function from basic principles. Given the function f(x) = 2x, we have

f(x + ∆x) = 2(x + ∆x) = 2x + 2∆x,    (D.5)


so that

df/dx = lim_{∆x→0} [f(x + ∆x) − f(x)] / ∆x
      = lim_{∆x→0} (2x + 2∆x − 2x) / ∆x
      = lim_{∆x→0} 2∆x / ∆x
      = lim_{∆x→0} 2.    (D.6)
¹ http://www.cds.caltech.edu/~marsden/volume/Calculus/


We conclude that the derivative of f(x) = 2x is simply a constant, namely f'(x) = 2.
Applying these same principles, let us take a look at a slightly more interesting example, a quadratic function,

f(x) = x²,    (D.7)

as illustrated in Figure D.2.

Figure D.2: Graph of a quadratic function, f(x) = x².

As we can see in Figure D.2, this quadratic function (Equation D.7) does not have a constant slope, in contrast to a linear function. Geometrically, we can interpret the derivative of a function as the slope of a tangent to the function graph at any given point. And we can approximate the slope of a tangent at a given point by a secant connecting this point to a second point that is infinitely close, which is where the lim_{∆x→0} notation comes from. (In the case of a linear function, the tangent is equal to the secant between two points.)
Now, to compute the derivative of the quadratic function f(x) = x², we can, again, apply the basic concepts we used earlier, using the fact that


f(x + ∆x) = (x + ∆x)² = x² + 2x∆x + (∆x)².    (D.8)


Now, computing the derivative, we get

df/dx = lim_{∆x→0} [f(x + ∆x) − f(x)] / ∆x
      = lim_{∆x→0} [x² + 2x∆x + (∆x)² − x²] / ∆x
      = lim_{∆x→0} [2x∆x + (∆x)²] / ∆x
      = lim_{∆x→0} (2x + ∆x).    (D.9)

And since ∆x approaches zero due to the limit, we arrive at f'(x) = 2x, which is the derivative of f(x) = x².
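The limit definition in Equation D.4 also suggests a simple numerical sanity check: for a small but finite ∆x, the difference quotient should be close to the analytical derivative. The following short Python sketch (an illustrative addition, not part of the original text; the function names are arbitrary) approximates the derivative of f(x) = x² at a few points and compares it to f'(x) = 2x:

```python
# Numerical sanity check of the limit definition of the derivative (Equation D.4).
# Illustrative sketch; names such as `difference_quotient` are arbitrary choices.

def difference_quotient(f, x, delta_x=1e-6):
    """Approximate f'(x) by the difference quotient with a small, finite delta_x."""
    return (f(x + delta_x) - f(x)) / delta_x

def f(x):
    return x**2  # the quadratic function from Equation D.7

for x in [-3.0, 0.5, 2.0]:
    approx = difference_quotient(f, x)
    exact = 2 * x  # analytical derivative of x^2
    print(f"x = {x:5.2f}   approx = {approx: .6f}   exact = {exact: .6f}")
```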

D.2 Derivatives of Common Functions


After we gained some intuition in the previous section, this section provides tables and lists of the basic rules for computing function derivatives for our convenience – as an exercise, readers are encouraged to apply the basic principles to derive these rules.
The following table, Table D.1, lists the derivatives of commonly used functions; the intention is that we can use it as a quick look-up table. As mentioned earlier, we can obtain these derivatives using the basic principles we discussed at the beginning of this appendix. For instance, we just used these basic principles to compute the derivative of a linear function (Table D.1, row 3) and a quadratic function (Table D.1, row 4) earlier on.


      Function f(x)     Derivative with respect to x

 1    a                 0
 2    x                 1
 3    ax                a
 4    x²                2x
 5    x^a               a·x^(a−1)
 6    a^x               log(a)·a^x
 7    log(x)            1/x
 8    log_a(x)          1/(x·log(a))
 9    sin(x)            cos(x)
10    cos(x)            −sin(x)
11    tan(x)            sec²(x)

Table D.1: Derivatives of common functions.
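If in doubt about an entry, a symbolic math library can serve as a quick cross-check. The sketch below uses SymPy (not referenced in the original text; shown here only as a convenience) to verify a few rows of Table D.1:

```python
import sympy as sp

x, a = sp.symbols('x a', positive=True)

# (function, expected derivative) pairs taken from Table D.1
rows = [
    (x**a, a * x**(a - 1)),      # row 5: power rule
    (a**x, sp.log(a) * a**x),    # row 6: exponential
    (sp.log(x), 1 / x),          # row 7: natural logarithm
    (sp.sin(x), sp.cos(x)),      # row 9: sine
]

for func, expected in rows:
    assert sp.simplify(sp.diff(func, x) - expected) == 0

print("All checked entries of Table D.1 match.")
```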

D.3 Common Differentiation Rules


In addition to the constant rule (Table D.1, row 1) and the power rule (Ta-
ble D.1, row 5), the following table lists the most common differentiation
rules that we often encounter in practice. Although we will not go over the
derivations of these rules, it is highly recommended to memorize and prac-
tice them. Most machine learning concepts heavily rely on applications of
these rules, and in the following sections, we will pay special attention to
the last rule in this list, the chain rule.

                  Function          Derivative

Sum Rule          f(x) + g(x)       f'(x) + g'(x)
Difference Rule   f(x) − g(x)       f'(x) − g'(x)
Product Rule      f(x)g(x)          f'(x)g(x) + f(x)g'(x)
Quotient Rule     f(x)/g(x)         [g(x)f'(x) − f(x)g'(x)] / [g(x)]²
Reciprocal Rule   1/f(x)            −f'(x) / [f(x)]²
Chain Rule        f(g(x))           f'(g(x))·g'(x)

Table D.2: Common differentiation rules.
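As with Table D.1, these rules can be verified symbolically for concrete choices of f and g. The following sketch (again using SymPy, which is an assumption of this example rather than part of the text) checks the product and quotient rules for f(x) = sin(x) and g(x) = exp(x):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)
g = sp.exp(x)

# Product rule: (f*g)' = f'*g + f*g'
assert sp.simplify(sp.diff(f * g, x) - (sp.diff(f, x) * g + f * sp.diff(g, x))) == 0

# Quotient rule: (f/g)' = (g*f' - f*g') / g^2
assert sp.simplify(sp.diff(f / g, x) - (g * sp.diff(f, x) - f * sp.diff(g, x)) / g**2) == 0

print("Product and quotient rules hold for sin(x) and exp(x).")
```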


D.4 The Chain Rule – Computing the Derivative of a Composition of Functions
The chain rule is essential to understanding backpropagation; thus, let us
discuss it in more detail. In its essence, the chain rule is just a mental crutch
that we use to differentiate composite functions, functions that are nested
within each other. For example,

F(x) = f(g(x)).    (D.10)

To differentiate such a function F, we can use the chain rule, which we can break down into a three-step procedure. First, we compute the derivative of the outer function (f') with respect to the inner function (g). Second, we compute the derivative of the inner function (g') with respect to its function argument (x). Third, we multiply the outcomes of step 1 and step 2:

F'(x) = f'(g(x)) g'(x).    (D.11)


Since this notation may look quite daunting, let us use a more visual ap-
proach, breaking down the function F into individual steps as illustrated
in Figure D.3: We take the argument x, feed it to g, then, we take the out-
come of g(x) and feed it to f .

Figure D.3: Visual decomposition of a function

Using the chain rule, Figure D.4 illustrates how we can differentiate F(x) via two parallel steps: we compute the derivative of the inner function g (i.e., g'(x)) and multiply it by the outer derivative f'(g(x)).


Figure D.4: Concept of the chain rule

Now, for the rest of the section, let us use the Leibniz notation, which makes
these concepts easier to follow:

d/dx [f(g(x))] = df/dg · dg/dx.    (D.12)

(Remember that the equation above is equivalent to writing F'(x) = f'(g(x)) g'(x).)

D.4.1 A Chain Rule Example


Let us now walk through an application of the chain rule, working through the differentiation of the following function:

f(x) = log(√x).    (D.13)

Step 0: Organization

First, we identify the innermost function:

g(x) = √x.    (D.14)

Using the definition of the inner function, we can now express the outer function in terms of g(x):

f(x) = log(g(x)).    (D.15)


But before we start executing the chain rule, let us substitute our definitions into the familiar framework, differentiating the function f with respect to the inner function g, multiplied by the derivative of g with respect to the function argument:

df/dx = df/dg · dg/dx,    (D.16)

which lets us arrive at

df/dx = d/dg [log(g)] · d/dx [√x].    (D.17)

Step 1: Derivative of the outer function

Now that we have set up everything nicely to apply the chain rule, let us compute the derivative of the outer function with respect to the inner function:

d/dg [log(g)] = 1/g = 1/√x.    (D.18)

Step 2: Derivative of the inner function

To find the derivative of the inner function with respect to x, let us rewrite g(x) as

g(x) = √x = x^(1/2).    (D.19)

Then, we can use the power rule (Table D.1, row 5) to arrive at

d/dx [x^(1/2)] = (1/2) x^(−1/2) = 1/(2√x).    (D.20)

Step 3: Multiplying the inner and outer derivatives

Finally, we multiply the derivatives of the outer function (step 1) and the inner function (step 2) to get the derivative of the function f(x) = log(√x):

df/dx = 1/√x · 1/(2√x) = 1/(2x).    (D.21)
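We can double-check this result symbolically. The short sketch below (an illustrative addition using SymPy, not part of the original text) differentiates f(x) = log(√x) directly and confirms Equation D.21:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
f = sp.log(sp.sqrt(x))

derivative = sp.diff(f, x)
print(derivative)                                # 1/(2*x)
assert sp.simplify(derivative - 1 / (2 * x)) == 0
```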


D.5 Arbitrarily Long Function Compositions


In the previous sections, we introduced the chain rule in the context of two nested functions. However, the chain rule can also be used for arbitrarily long function compositions. For example, suppose we have five different functions, f(x), g(x), h(x), u(x), and v(x), and let F be the function composition:

F(x) = f(g(h(u(v(x))))).    (D.22)


Then, we compute the derivative as

dF/dx = d/dx F(x) = d/dx f(g(h(u(v(x)))))
      = df/dg · dg/dh · dh/du · du/dv · dv/dx.    (D.23)

As we can see in Equation D.23, differentiating a composition of multiple functions is similar to the previous two-function example; we create a chain of derivatives of functions with respect to their inner function until we arrive at the innermost function, which we then differentiate with respect to the function argument x.
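Symbolic and automatic differentiation tools apply exactly this repeated chain rule through every level of nesting, which is also what backpropagation does in a neural network. The following SymPy sketch (an illustrative composition chosen for this example, not taken from the text) differentiates a five-function composition in one call:

```python
import sympy as sp

x = sp.symbols('x')

# An illustrative five-level composition F(x) = f(g(h(u(v(x)))))
v = x**2            # innermost function
u = sp.sin(v)
h = u + 1
g = sp.exp(h)
F = sp.log(g)       # outermost function; mathematically F(x) = sin(x**2) + 1

# SymPy applies the chain rule through all nesting levels automatically.
print(sp.simplify(sp.diff(F, x)))   # 2*x*cos(x**2)
```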

D.6 When a Function is Not Differentiable


A function is only differentiable if the derivative exists for each value in the function's domain (that is, at every point). Non-differentiable functions may be a bit cumbersome to deal with mathematically; however, they can still be useful in practical contexts such as deep learning. A popular example of a non-differentiable function that is widely used in deep learning is the Rectified Linear Unit (ReLU) function. The ReLU function f(x) is not differentiable because its derivative does not exist at x = 0, but more about that later in this section.
One criterion for the derivative to exist at a given point is continuity at that point. However, continuity alone is not sufficient for the derivative to exist. For the derivative to exist, we require the left-hand and the right-hand limits to exist and to be equal.
Remember that conceptually, the derivative at a given point is defined
as the slope of a tangent to the function graph at that point. Or in other
words, we approximate the function graph at a given point with a straight


line as shown in Figure D.5. (Intuitively, we can say that a curve, when
closely observed, resembles a straight line.)


Figure D.5: Graph of the function f(x) = x³ with a tangent line approximating the derivative at the point x = −3 (left), and the derivative at each point on the function graph, f'(x) = 3x² (right).

Now, if there are breaks or gaps at a given point, we cannot draw a


straight line or tangent approximating the function at that point, because –
in intuitive terms – we would not know how to place the tangent. Other
common scenarios where derivatives do not exist are sharp turns or cor-
ners in a function graph since it is not clear how to place the tangent if
we compute the limit from the left or the right side. Finally, any point on
a function graph that results in a vertical tangent (parallel to the vertical
axis) is not differentiable – note that a vertical line is not a function due to
the one-to-many mapping condition.
The reason why the derivative at sharp turns or corners (that is, points on a function graph that are not "smooth") does not exist is that the limits from the left and the right side differ and do not agree. To illustrate this, let us take a look at a simple example, the absolute value function shown in Figure D.6.



Figure D.6: Graph of the "sharp turn"-containing function f(x) = |x|.

We will now show that the derivative of f(x) = |x| does not exist at the sharp turn at x = 0. Recall the definition of the derivative of a continuous function f(x) that was introduced in Section D.1:

f'(x) = lim_{∆x→0} [f(x + ∆x) − f(x)] / ∆x.    (D.24)
If we substitute f(x) by the absolute value function, |x|, we obtain

f'(x) = lim_{∆x→0} (|x + ∆x| − |x|) / ∆x.
Next, let us set x = 0, the point at which we want to evaluate the derivative:

f'(0) = lim_{∆x→0} (|0 + ∆x| − |0|) / ∆x.
If the derivative f'(0) exists, it should not matter whether we approach the limit from the left or the right side². So, let us compute the left-side limit first (here, ∆x represents an infinitely small, negative number):

f'(0) = lim_{∆x→0⁻} (|0 + ∆x| − |0|) / ∆x = lim_{∆x→0⁻} |∆x| / ∆x = −1.

As shown above, the left-hand limit evaluates to −1 because dividing a positive number by a negative number yields a negative number. We can now do the same calculation by approaching the limit from the right, where ∆x is an infinitely small, positive number:
² Here, "left" and "right" refer to the position of a number on the number line with respect to 0.


f'(0) = lim_{∆x→0⁺} (|0 + ∆x| − |0|) / ∆x = lim_{∆x→0⁺} |∆x| / ∆x = 1.

We can see that the limits are not equal (1 ≠ −1), and because they do not agree, we have no formal notion of how to draw the tangent line to the function graph at the point x = 0. Hence, we say that the derivative of the function f(x) = |x| does not exist (DNE) at the point x = 0:

f'(0) = DNE.
A widely-used function in deep learning applications that is not differentiable at a point³ is the ReLU function, which was introduced at the beginning of this section. To provide another example of a non-differentiable function, we now apply the concepts of left- and right-hand limits to the piece-wise defined ReLU function (Figure D.7).


Figure D.7: Graph of the ReLU function.

The ReLU function is commonly defined as

f(x) = max(0, x)

or

f(x) = { 0  if x < 0
       { x  if x ≥ 0.

³ Coincidentally, the point where the ReLU function is not differentiable is also x = 0.


(These two function definitions are equivalent.) If we substitute the ReLU equation into Equation D.24, we then obtain

f'(x) = lim_{∆x→0} [max(0, x + ∆x) − max(0, x)] / ∆x.
Next, let us compute the left- and right-side limits. Starting from the left side, where ∆x is an infinitely small, negative number, we get

f'(0) = lim_{∆x→0⁻} (0 − 0) / ∆x = 0.

And for the right-hand limit, where ∆x is an infinitely small, positive number, we get

f'(0) = lim_{∆x→0⁺} (0 + ∆x − 0) / ∆x = 1.
Again, the left- and right-hand limits are not equal at x = 0; hence, the derivative of the ReLU function at x = 0 is not defined.
For completeness' sake, the derivative of the ReLU function for x > 0 is

f'(x) = lim_{∆x→0} (x + ∆x − x) / ∆x = lim_{∆x→0} ∆x / ∆x = 1.

And for x < 0, the ReLU derivative is

f'(x) = lim_{∆x→0} (0 − 0) / ∆x = 0.

To summarize, the derivative of the ReLU function is defined as follows:

f'(x) = { 0    if x < 0
        { 1    if x > 0
        { DNE  if x = 0.

Infobox D.6.1 ReLU Derivative in Deep Learning

In practical deep learning applications, the ReLU derivative for x = 0 is typically set to 0, 1, or 0.5. However, it is extremely rare that x is exactly zero, which is why the decision whether we set the ReLU derivative to 0, 1, or 0.5 has little impact on the parameterization of a neural network with ReLU activation functions.
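To make this convention concrete, the following NumPy sketch (an illustrative implementation, not code from the book; the argument name value_at_zero is a hypothetical choice) implements the ReLU function and its piece-wise derivative, with a configurable value for the undefined point x = 0:

```python
import numpy as np

def relu(x):
    """ReLU activation, f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_derivative(x, value_at_zero=0.0):
    """Piece-wise ReLU derivative: 0 for x < 0, 1 for x > 0.
    The undefined point x = 0 is assigned `value_at_zero` (commonly 0, 1, or 0.5)."""
    grad = np.where(x > 0, 1.0, 0.0)
    return np.where(x == 0, value_at_zero, grad)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))              # [0. 0. 3.]
print(relu_derivative(x))   # [0. 0. 1.]
```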


D.7 Partial Derivatives and Gradients


Throughout the previous sections, we only looked at univariate functions,
functions that only take one input variable, for example, f (x). In this sec-
tion, we will compute the derivatives of multivariable functions f (x, y, z, ...).
Note that we still consider scalar-valued functions, which return a scalar or
single value.
While the derivative of a univariate function is a scalar, the derivative
of a multivariable function is a vector, the so-called gradient. We denote
the derivative of a multivariable function f using the gradient symbol ∇
(pronounced "nabla" or "del"):
 
∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z, ...]ᵀ.    (D.25)
As we can see, the gradient is simply a vector listing the derivatives of a function with respect to each argument of the function. In Leibniz notation, we use the symbol ∂ instead of d to distinguish partial from ordinary derivatives. The adjective "partial" is based on the idea that a partial derivative with respect to a function argument does not tell the whole story about a function f. For instance, given a function f(x, y), the partial derivative ∂f/∂x only considers the change in f if x changes while treating y as a constant.
To illustrate the concept of partial derivatives, let us walk through a concrete example, where we will compute the gradient of the function

f(x, y) = x²y + y.    (D.26)

The plot in Figure D.8 shows a graph of this function for different values of x and y.


Figure D.8: Graph of the function f(x, y) = x²y + y.

The subfigures shown in Figure D.9 illustrate what the function looks like if we treat either x or y as a constant.


Figure D.9: Graph of the function f(x, y) = x²y + y when treating y (left) or x (right) as a constant.

Intuitively, we can think of the two graphs in Figure D.9 as slices of the multivariable function graph shown in Figure D.8. And computing the partial derivative of a multivariable function – with respect to one of the function's arguments – means that we compute the slope of the corresponding slice of the multivariable function graph.
Now, to compute the gradient of f, we compute the two partial derivatives of that function as follows:

∇f(x, y) = [∂f/∂x, ∂f/∂y]ᵀ,    (D.27)
where

∂f/∂x = ∂/∂x (x²y + y) = 2xy    (D.28)

(via the power rule and the constant rule), and

∂f/∂y = ∂/∂y (x²y + y) = x² + 1.    (D.29)

So, the gradient of the function f is defined as

∇f(x, y) = [2xy, x² + 1]ᵀ.    (D.30)
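The two partial derivatives, and hence the gradient, can also be obtained symbolically. The following sketch (illustrative, using SymPy; not part of the original text) reproduces Equation D.30:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y

gradient = [sp.diff(f, var) for var in (x, y)]
print(gradient)   # [2*x*y, x**2 + 1], matching Equation D.30
```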


D.8 Second Order Partial Derivatives


Let us briefly go over the notation of second order partial derivatives, since the notation may look a bit strange at first. In a nutshell, the second order partial derivative of a function is the partial derivative of the partial derivative. For instance, we write the second derivative of a function f with respect to x as

∂/∂x (∂f/∂x) = ∂²f/∂x².    (D.31)
For example, we compute the second partial derivative of the function f(x, y) = x²y + y with respect to x as follows:

∂²f/∂x² = ∂/∂x (∂/∂x (x²y + y)) = ∂/∂x (2xy) = 2y.    (D.32)
Note that in the initial definition (Equation D.31) and the example (Equation D.32), both the first and second order partial derivatives were computed with respect to the same input argument, x. However, depending on what measurement we are interested in, the second order partial derivative can involve a different input argument. For instance, given a multivariable function with two input arguments, we can in fact compute four distinct second order partial derivatives:

∂²f/∂x²,  ∂²f/∂y²,  ∂²f/∂x∂y,  and  ∂²f/∂y∂x,    (D.33)

where, for example, ∂²f/∂y∂x is defined as

∂²f/∂y∂x = ∂/∂y (∂f/∂x).    (D.34)
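For the example function f(x, y) = x²y + y, all four second order partial derivatives can be computed in a few lines. The sketch below (an illustrative SymPy example, not from the original text) also shows that the two mixed partial derivatives agree here:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y

print(sp.diff(f, x, x))   # 2*y   (second derivative with respect to x)
print(sp.diff(f, y, y))   # 0     (second derivative with respect to y)
print(sp.diff(f, x, y))   # 2*x   (first x, then y)
print(sp.diff(f, y, x))   # 2*x   (first y, then x)
```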

D.9 The Multivariable Chain Rule


In this section, we will take a look at how to apply the chain rule to functions that take multiple arguments. For instance, let us consider the following function:

f(g, h) = g²h + h,    (D.35)

where g(x) = 3x and h(x) = x². So, as it turns out, our function is a composition of two functions:



f(g(x), h(x)).    (D.36)

Previously, in Section D.4, we defined the chain rule for the univariate case as follows:

d/dx [f(g(x))] = df/dg · dg/dx.    (D.37)

To apply this concept to multivariable functions, we simply extend the notation above using the product rule. Hence, we can define the multivariable chain rule as follows:

d/dx [f(g(x), h(x))] = ∂f/∂g · dg/dx + ∂f/∂h · dh/dx.    (D.38)
Applying the multivariable chain rule to our multivariable function example f(g, h) = g²h + h, let us start with the partial derivatives:

∂f/∂g = 2gh    (D.39)

and

∂f/∂h = g² + 1.    (D.40)
Next, we take the ordinary derivatives of the two functions g and h:

dg/dx = d/dx (3x) = 3,    (D.41)

dh/dx = d/dx (x²) = 2x.    (D.42)

And finally, plugging everything into our multivariable chain rule definition, we arrive at

d/dx [f(g(x), h(x))] = [2gh · 3] + [(g² + 1) · 2x] = 6gh + 2x(g² + 1).    (D.43)
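Since g(x) = 3x and h(x) = x², we can substitute them into Equation D.43 and cross-check the result by differentiating the composition directly. The sketch below (illustrative SymPy code, not part of the original text) confirms that both routes give 36x³ + 2x:

```python
import sympy as sp

x = sp.symbols('x')
g = 3 * x
h = x**2
f = g**2 * h + h                                             # f(g(x), h(x)) = g^2*h + h

direct = sp.expand(sp.diff(f, x))                            # differentiate the composition directly
via_chain_rule = sp.expand(6 * g * h + 2 * x * (g**2 + 1))   # Equation D.43

print(direct)                                                # 36*x**3 + 2*x
assert direct == via_chain_rule
```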

D.10 The Multivariable Chain Rule in Vector Form


Now that we have introduced the general concept of the multivariable chain rule, let us look at a more compact notation that is often preferred in practice: the multivariable chain rule in vector form.


Infobox D.10.1 Dot Products

As we remember from the linear algebra appendix, we compute the dot product between two vectors, a and b, as follows:

a · b = [a, b]ᵀ · [x, y]ᵀ = ax + by.

In vector form, we write the multivariable chain rule

d/dx [f(g(x), h(x))] = ∂f/∂g · dg/dx + ∂f/∂h · dh/dx    (D.44)

as follows:

d/dx [f(g(x), h(x))] = ∇f · v'(x).    (D.45)

Here, v is a vector listing the function arguments:

v(x) = [g(x), h(x)]ᵀ.    (D.46)

And the derivative v' ("v prime" in Lagrange notation) is defined as follows:

v'(x) = d/dx [g(x), h(x)]ᵀ = [dg/dx, dh/dx]ᵀ.    (D.47)

So, putting everything together, we have

∇f · v'(x) = [∂f/∂g, ∂f/∂h]ᵀ · [dg/dx, dh/dx]ᵀ = ∂f/∂g · dg/dx + ∂f/∂h · dh/dx.    (D.48)
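For the running example f(g, h) = g²h + h with g(x) = 3x and h(x) = x², the dot product ∇f · v'(x) reproduces the result from Section D.9. The following sketch (illustrative SymPy code, not from the original text) evaluates Equation D.48 explicitly:

```python
import sympy as sp

x, g, h = sp.symbols('x g h')

f = g**2 * h + h
v = sp.Matrix([3 * x, x**2])                        # v(x) = [g(x), h(x)]

grad_f = sp.Matrix([sp.diff(f, g), sp.diff(f, h)])  # [df/dg, df/dh] = [2gh, g^2 + 1]
grad_f = grad_f.subs({g: v[0], h: v[1]})            # evaluate the gradient along v(x)
v_prime = v.diff(x)                                 # [dg/dx, dh/dx] = [3, 2x]

result = sp.expand((grad_f.T * v_prime)[0])         # dot product from Equation D.48
print(result)                                       # 36*x**3 + 2*x
```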

D.11 The Hessian Matrix


As mentioned earlier in Section D.8, Second Order Partial Derivatives, we can compute four distinct second order partial derivatives for a two-variable function

f(x, y).    (D.49)

The Hessian matrix is simply a matrix that packages them up:

H_f = [ ∂²f/∂x²    ∂²f/∂x∂y ]
      [ ∂²f/∂y∂x   ∂²f/∂y²  ].    (D.50)
To formulate the Hessian for a multivariable function that takes n arguments,

f(x_1, x_2, ..., x_n),    (D.51)

we write the Hessian as

H_f = [ ∂²f/∂x_1∂x_1   ∂²f/∂x_1∂x_2   ...   ∂²f/∂x_1∂x_n ]
      [ ∂²f/∂x_2∂x_1   ∂²f/∂x_2∂x_2   ...   ∂²f/∂x_2∂x_n ]
      [      ...            ...        ...       ...      ]
      [ ∂²f/∂x_n∂x_1   ∂²f/∂x_n∂x_2   ...   ∂²f/∂x_n∂x_n ].    (D.52)
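For the earlier example f(x, y) = x²y + y, the Hessian can be assembled from the second order partial derivatives computed in Section D.8. The sketch below (illustrative; SymPy's hessian helper is used here as a convenience, not as part of the original text) builds it directly:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y

H = sp.hessian(f, (x, y))
print(H)   # Matrix([[2*y, 2*x], [2*x, 0]])
```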

D.12 The Laplacian Operator


At its core, the Laplacian operator (∆) is an operator that takes in a function and returns another function. In particular, it is the divergence of the gradient of a function f – a kind of second order partial derivative (recall that the gradient points in the direction in which the function increases most rapidly):

∆f(g(x), h(x)) = ∇ · ∇f.    (D.53)


Remember, we compute the gradient of a function f(g, h) as follows:

∇f(g, h) = [∂f/∂g, ∂f/∂h]ᵀ.    (D.54)

Plugging it into the definition of the Laplacian, we arrive at

∆f(g(x), h(x)) = ∇ · ∇f = [∂/∂g, ∂/∂h]ᵀ · [∂f/∂g, ∂f/∂h]ᵀ = ∂²f/∂g² + ∂²f/∂h².    (D.55)

And in more general terms, we can define the Laplacian of a function

f(x_1, x_2, ..., x_n)    (D.56)

as

∆f = Σ_{i=1}^{n} ∂²f/∂x_i².    (D.57)
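For the two-variable example f(x, y) = x²y + y, the Laplacian is the sum of the two unmixed second order partial derivatives. The following sketch (an illustrative SymPy example, not part of the original text) computes it according to Equation D.57:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y

laplacian = sp.diff(f, x, 2) + sp.diff(f, y, 2)   # sum of d^2f/dx^2 and d^2f/dy^2
print(laplacian)   # 2*y
```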

Bibliography

[Boyer, 1970] Boyer, C. B. (1970). The history of the calculus. The Two-Year
College Mathematics Journal, 1(1):60–86.

Abbreviations and Terms

AMI [Amazon Machine Image]
API [Application Programming Interface]
CNN [Convolutional Neural Network]
DNE [Does Not Exist]

