Appendix D Calculus
Introduction to
Artificial Neural Networks
and Deep Learning
with Applications in Python
Sebastian Raschka
DRAFT
Last updated: February 12, 2019
Website
Please visit the GitHub repository to download the code examples accompanying
this book and other supplementary material.
If you like the content, please consider supporting the work by buying
a copy of the book on Leanpub. Also, I would appreciate hearing
your opinion and feedback about the book, and if you have any questions
about the contents, please don’t hesitate to get in touch with me via
mail@sebastianraschka.com. Happy learning!
Sebastian Raschka
About the Author
Acknowledgements
I would like to give my special thanks to the readers, who provided feedback,
caught various typos and errors, and offered suggestions for clarifying
my writing.
Appendix D
D.1 Intuition
So, what is the derivative of a function? In simple terms, the derivative of a
function is the function’s instantaneous rate of change. Now, let us start this
section with a visual explanation, where we consider the function
\[ f(x) = 2x. \tag{D.1} \]
APPENDIX D. CALCULUS AND DIFFERENTIATION PRIMER 5
Given the linear function in Equation D.1, we can interpret the "rate of
change" as the slope of this function. To compute the slope of a function,
we take an arbitrary x-axis value, say a, and plug it into this function: f(a).
Then, we take another value on the x-axis, let us call it b = a + ∆a, where
∆a is the change between a and b. Now, to compute the slope of this linear
function, we divide the change in the function’s output, f(a + ∆a) − f(a), by the
change in the function’s input, (a + ∆a) − a:
\[ \text{Slope} = \frac{f(a + \Delta a) - f(a)}{a + \Delta a - a}. \tag{D.2} \]
In other words, the slope is simply the ratio of the change in the function’s
output to the change in a:
\[ \text{Slope} = \frac{f(a + \Delta a) - f(a)}{\Delta a}. \]
\[ f'(x) = \frac{df}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}, \tag{D.4} \]
where $\lim_{\Delta x \to 0}$ means "as the change in x becomes infinitely small (that
is, ∆x approaches zero)." Since this appendix is merely a refresher
rather than a comprehensive calculus resource, we have to skip over some
important concepts such as limit theory. So, if this is the first time you
encounter calculus, I recommend consulting additional resources such as
"Calculus I, II, and III" by Jerrold E. Marsden and Alan Weinstein [1].
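\[
\begin{aligned}
\frac{df}{dx} &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2x + 2\Delta x - 2x}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2\Delta x}{\Delta x} \\
&= \lim_{\Delta x \to 0} 2.
\end{aligned}
\tag{D.6}
\]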
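As a quick numerical sanity check, the following minimal sketch evaluates the difference quotient of f(x) = 2x for a few step sizes. The helper name `difference_quotient` and the sample point x = 3 are illustrative choices, not part of the text above.

```python
def f(x):
    """Linear function f(x) = 2x from Equation D.1."""
    return 2 * x


def difference_quotient(func, x, dx):
    """Slope of the secant between x and x + dx (hypothetical helper)."""
    return (func(x + dx) - func(x)) / dx


# For a linear function, the quotient does not depend on the step size.
slopes = [difference_quotient(f, 3.0, dx) for dx in (1.0, 0.5, 0.25)]
print(slopes)  # [2.0, 2.0, 2.0]
```

Because the slope of a linear function is constant, shrinking the step size does not change the result, which matches the last line of the derivation.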
[1] http://www.cds.caltech.edu/~marsden/volume/Calculus/
\[ f(x) = x^2, \tag{D.7} \]
as illustrated in Figure D.2.
As we can see in Figure D.2, this quadratic function (Equation D.7) does
not have a constant slope, in contrast to a linear function. Geometrically,
we can interpret the derivative of a function as the slope of a tangent to a
function graph at any given point. And we can approximate the slope of a
tangent at a given point by a secant connecting this point to a second point
that is infinitely close, which is where the $\lim_{\Delta x \to 0}$ notation comes from.
(In the case of a linear function, the tangent is equal to the secant between
two points.)
Now, to compute the derivative of the quadratic function $f(x) = x^2$, we
can, again, apply the basic concepts we used earlier, using the fact that
$(x + \Delta x)^2 = x^2 + 2x\Delta x + (\Delta x)^2$:
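\[
\begin{aligned}
\frac{df}{dx} &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{x^2 + 2x\Delta x + (\Delta x)^2 - x^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2x\Delta x + (\Delta x)^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} (2x + \Delta x).
\end{aligned}
\tag{D.9}
\]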
And since ∆x approaches zero due to the limit, we arrive at $f'(x) = 2x$,
which is the derivative of $f(x) = x^2$.
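To see the limit at work numerically, here is a small sketch that evaluates the difference quotient of f(x) = x^2 for shrinking step sizes. The helper `difference_quotient` and the evaluation point x = 3 are assumptions made for illustration only.

```python
def f(x):
    """Quadratic function f(x) = x^2 from Equation D.7."""
    return x ** 2


def difference_quotient(func, x, dx):
    """Slope of the secant between x and x + dx (hypothetical helper)."""
    return (func(x + dx) - func(x)) / dx


x = 3.0
# Analytically, the quotient equals 2x + dx, so it approaches 2x = 6 as dx shrinks.
for dx in (1.0, 0.1, 0.001):
    print(dx, difference_quotient(f, x, dx))

approx = difference_quotient(f, x, 1e-6)
```

The printed values move toward 6 as the step size decreases, in line with f'(x) = 2x evaluated at x = 3.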
Rule              Function       Derivative
Sum Rule          f(x) + g(x)    f'(x) + g'(x)
Difference Rule   f(x) − g(x)    f'(x) − g'(x)
Product Rule      f(x)g(x)       f'(x)g(x) + f(x)g'(x)
Quotient Rule     f(x)/g(x)      [g(x)f'(x) − f(x)g'(x)]/[g(x)]^2
Reciprocal Rule   1/f(x)         −[f'(x)]/[f(x)]^2
Chain Rule        f(g(x))        f'(g(x))g'(x)
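The rules in the table can be spot-checked numerically. The sketch below compares the product, quotient, and chain rules against a central-difference approximation. The example functions f(x) = sin(x) and g(x) = x^2 + 1, the point x = 0.7, and the helper `numeric_derivative` are arbitrary choices for illustration.

```python
import math


def numeric_derivative(func, x, dx=1e-6):
    """Central-difference approximation of func'(x) (hypothetical helper)."""
    return (func(x + dx) - func(x - dx)) / (2 * dx)


# Example functions chosen only for illustration.
f, f_prime = math.sin, math.cos      # f(x) = sin(x), f'(x) = cos(x)


def g(x):
    return x ** 2 + 1                # g(x) = x^2 + 1


def g_prime(x):
    return 2 * x                     # g'(x) = 2x


x = 0.7
# Right-hand sides of the rules from the table.
product_rule = f_prime(x) * g(x) + f(x) * g_prime(x)
quotient_rule = (g(x) * f_prime(x) - f(x) * g_prime(x)) / g(x) ** 2
chain_rule = f_prime(g(x)) * g_prime(x)

# Numerical derivatives of the corresponding combined functions.
product_numeric = numeric_derivative(lambda t: f(t) * g(t), x)
quotient_numeric = numeric_derivative(lambda t: f(t) / g(t), x)
chain_numeric = numeric_derivative(lambda t: f(g(t)), x)

print(product_rule, product_numeric)
print(quotient_rule, quotient_numeric)
print(chain_rule, chain_numeric)
```

Each pair of printed numbers should agree to several decimal places. Small differences come from the finite step size of the numerical approximation.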
Using the chain rule, Figure D.4 illustrates how we can differentiate the
composite function F(x) = f(g(x)) via two parallel steps: we compute the
derivative of the inner function g (i.e., g'(x)) and multiply it by the outer
derivative f'(g(x)).
Now, for the rest of the section, let us use the Leibniz notation, which makes
these concepts easier to follow:
\[ \frac{d}{dx} f(g(x)) = \frac{df}{dg} \cdot \frac{dg}{dx}. \tag{D.12} \]
(Remember that the equation above is equivalent to writing $F'(x) = f'(g(x))g'(x)$.)
Step 0: Organization
First, we identify the innermost function:
\[ g(x) = \sqrt{x}. \tag{D.14} \]
Using the definition of the inner function, we can now express the outer
function in terms of g(x):
\[ f(g) = \log(g). \]
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}, \tag{D.16} \]
which lets us arrive at
\[ \frac{df}{dx} = \frac{d}{dg} \log(g) \cdot \frac{d}{dx} \sqrt{x}. \tag{D.17} \]
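The derivative of the outer function is
\[ \frac{d}{dg} \log(g) = \frac{1}{g} = \frac{1}{\sqrt{x}}. \tag{D.18} \]
The derivative of the inner function is
\[ \frac{d}{dx} x^{1/2} = \frac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}}. \tag{D.20} \]
Multiplying the two results gives
\[ \frac{df}{dx} = \frac{1}{\sqrt{x}} \cdot \frac{1}{2\sqrt{x}} = \frac{1}{2x}. \tag{D.21} \]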
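The worked example can be verified with a few lines of code. This sketch multiplies the outer and inner derivatives from the steps above and compares the product with a central-difference estimate. The function names and the test point x = 4 are illustrative assumptions, and `math.log` is the natural logarithm, which is what the derivative 1/g implies.

```python
import math


def F(x):
    """Composite function log(sqrt(x))."""
    return math.log(math.sqrt(x))


def chain_rule_derivative(x):
    """Product of outer and inner derivative, following the steps above."""
    g = math.sqrt(x)
    df_dg = 1.0 / g                       # d/dg log(g) = 1/g
    dg_dx = 1.0 / (2.0 * math.sqrt(x))    # d/dx sqrt(x) = 1/(2 sqrt(x))
    return df_dg * dg_dx


x = 4.0
dx = 1e-6
numeric = (F(x + dx) - F(x - dx)) / (2 * dx)  # central difference

print(chain_rule_derivative(x))  # 0.125, i.e., 1 / (2x) at x = 4
print(numeric)
```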
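\[
\begin{aligned}
\frac{dF}{dx} = \frac{d}{dx} F(x) &= \frac{d}{dx} f(g(h(u(v(x))))) \\
&= \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx}.
\end{aligned}
\tag{D.23}
\]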
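The product of local derivatives in the nested chain rule can be sketched as a simple loop. The five functions below are hypothetical stand-ins for v, u, h, g, and f, chosen only so that the example is runnable. The text above does not specify them.

```python
import math

# Hypothetical (function, derivative) pairs, listed from innermost to outermost.
layers = [
    (lambda x: 3 * x, lambda x: 3.0),      # v(x) = 3x
    (lambda v: v + 1, lambda v: 1.0),      # u(v) = v + 1
    (lambda u: u ** 2, lambda u: 2 * u),   # h(u) = u^2
    (math.sin, math.cos),                  # g(h) = sin(h)
    (math.exp, math.exp),                  # f(g) = exp(g)
]


def forward_and_derivative(x):
    """Return F(x) and dF/dx by multiplying the local derivatives."""
    value, derivative = x, 1.0
    for func, func_prime in layers:
        derivative *= func_prime(value)  # local derivative at the current input
        value = func(value)              # pass the output on to the next function
    return value, derivative


value, derivative = forward_and_derivative(0.2)
print(value, derivative)
```

Each factor is evaluated at the output of the previous function, which is why the loop updates the derivative before it updates the value.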
line as shown in Figure D.5. (Intuitively, we can say that a curve, when
closely observed, resembles a straight line.)
[Figure D.5: two panels, each titled f(x) = x^3, with x on the horizontal axis and f(x) on the vertical axis]
Figure D.5: Graph of the function $f(x) = x^3$ with a tangent line to approximate the
derivative at point x = −3 (left) and the derivative at each point on the function
graph (right).
[Figure: graph of f(x) = |x|, with x on the horizontal axis and f(x) on the vertical axis]
We will now show that the derivative for f (x) = |x| does not exist at the
sharp turn at x = 0. Recall the definition of the derivative of a continuous
function f (x) that was introduced in Section D.1:
\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}. \tag{D.24} \]
If we replace f(x) with the absolute value function, |x|, we obtain
\[ f'(x) = \lim_{\Delta x \to 0} \frac{|x + \Delta x| - |x|}{\Delta x}. \]
Next, let us set x = 0, the point at which we want to evaluate the equation:
\[ f'(0) = \lim_{\Delta x \to 0} \frac{|0 + \Delta x| - |0|}{\Delta x}. \]
If the derivative f'(0) exists, it should not matter whether we approach the
limit from the left or the right side [2]. So, let us compute the left-side limit
first (here, ∆x represents an infinitely small, negative number):
We can see that the limits are not equal (1 ≠ −1), and because they do
not agree, we have no formal notion of how to draw the tangent line to the
function graph at the point x = 0. Hence, we say that the derivative of the
function f(x) = |x| does not exist (DNE) at point x = 0:
\[ f'(0) = \text{DNE}. \]
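The disagreement between the two one-sided limits is easy to reproduce numerically. The following sketch evaluates the difference quotient of |x| at x = 0 for small negative and small positive steps. The helper name `one_sided_quotient` and the step sizes are illustrative assumptions.

```python
def one_sided_quotient(func, x, dx):
    """Difference quotient; a negative dx approaches x from the left."""
    return (func(x + dx) - func(x)) / dx


steps = (0.1, 0.01, 0.001)
left = [one_sided_quotient(abs, 0.0, -dx) for dx in steps]
right = [one_sided_quotient(abs, 0.0, dx) for dx in steps]

print(left)   # [-1.0, -1.0, -1.0]
print(right)  # [1.0, 1.0, 1.0]
```

No matter how small the step becomes, the left-side quotient stays at −1 and the right-side quotient stays at 1, so there is no single value that both approach.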
A widely used function in deep learning applications that is not differentiable
at a point [3] is the ReLU function, which was introduced at the beginning
of this section. To provide another example of a non-differentiable
function, we now apply the concepts of left- and right-hand limits to the
piecewise-defined ReLU function (Figure D.7).
[Figure D.7: graph of the ReLU function, with x on the horizontal axis and f(x) on the vertical axis]
\[ f(x) = \max(0, x) \]
or
\[ f(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases} \]
[3] Coincidentally, the point where the ReLU function is not differentiable is also x = 0.
\[ f'(0) = \lim_{\Delta x \to 0^+} \frac{0 + \Delta x - 0}{\Delta x} = 1. \]
Again, the left- and right-hand limits are not equal at x = 0; hence, the
derivative of the ReLU function at x = 0 is not defined.
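For completeness’ sake, the derivative of the ReLU function for x > 0 is
\[ f'(x) = \lim_{\Delta x \to 0} \frac{x + \Delta x - x}{\Delta x} = \lim_{\Delta x \to 0} \frac{\Delta x}{\Delta x} = 1. \]
And for x < 0, the ReLU derivative is
\[ f'(x) = \lim_{\Delta x \to 0} \frac{0 - 0}{\Delta x} = 0. \]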
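A minimal sketch of the ReLU function and its piecewise derivative might look as follows. Since the derivative at x = 0 does not exist, the code has to pick some value there. The parameter `value_at_zero` and its default of 0 are a convention assumed for this sketch, not a mathematical result.

```python
def relu(x):
    """ReLU function f(x) = max(0, x)."""
    return max(0.0, x)


def relu_derivative(x, value_at_zero=0.0):
    """Piecewise ReLU derivative.

    value_at_zero is an assumed convention for x = 0, where the true
    derivative is undefined.
    """
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return value_at_zero


# One-sided difference quotients at x = 0 disagree, as derived above.
dx = 1e-3
left = (relu(0.0 - dx) - relu(0.0)) / -dx
right = (relu(0.0 + dx) - relu(0.0)) / dx

print(left, right)
print(relu_derivative(2.0), relu_derivative(-2.0), relu_derivative(0.0))
```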
\[ f(x, y) = x^2 y + y. \tag{D.26} \]
The plot in Figure D.8 shows a graph of this function for different values of
x and y.
The subfigures shown in Figure D.9 illustrate what the function looks
like if we treat either x or y as a constant.
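\[ \frac{\partial f}{\partial x} = \frac{\partial}{\partial x} \left( x^2 y + y \right) = 2xy \tag{D.28} \]
(via the power rule and constant rule), and
\[ \frac{\partial f}{\partial y} = \frac{\partial}{\partial y} \left( x^2 y + y \right) = x^2 + 1. \tag{D.29} \]
So, the gradient of the function f is defined as
\[ \nabla f(x, y) = \begin{bmatrix} 2xy \\ x^2 + 1 \end{bmatrix}. \tag{D.30} \]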
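The gradient in Equation D.30 can be checked against finite differences, where one argument is nudged while the other is held constant. This is a sketch under the assumption of a central-difference scheme. The helper `numeric_gradient` and the point (2, 3) are illustrative choices.

```python
def f(x, y):
    """Multivariable function from Equation D.26."""
    return x ** 2 * y + y


def gradient(x, y):
    """Analytic gradient from Equation D.30."""
    return [2 * x * y, x ** 2 + 1]


def numeric_gradient(func, x, y, eps=1e-6):
    """Central differences: vary one argument, hold the other constant."""
    df_dx = (func(x + eps, y) - func(x - eps, y)) / (2 * eps)
    df_dy = (func(x, y + eps) - func(x, y - eps)) / (2 * eps)
    return [df_dx, df_dy]


analytic = gradient(2.0, 3.0)
numeric = numeric_gradient(f, 2.0, 3.0)

print(analytic)  # [12.0, 5.0]
print(numeric)
```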
\[ \frac{\partial}{\partial x} \left( \frac{\partial f}{\partial x} \right) = \frac{\partial^2 f}{\partial x^2}. \tag{D.31} \]
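For example, we compute the second partial derivative of a function
$f(x, y) = x^2 y + y$ as follows:
\[ \frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x} \left( \frac{\partial}{\partial x} \left( x^2 y + y \right) \right) = \frac{\partial}{\partial x} 2xy = 2y. \tag{D.32} \]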
Note that in the initial definition (Equation D.31) and the example
(Equation D.32), both the first- and second-order partial derivatives were
computed with respect to the same input argument, x. However, depending on
what measurement we are interested in, the second-order partial derivative
can involve a different input argument. For instance, given a multivariable
function with two input arguments, we can in fact compute four distinct
second-order partial derivatives:
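\[ \frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y} \left( \frac{\partial f}{\partial x} \right). \tag{D.34} \]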
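To make the four second-order partial derivatives concrete, the sketch below estimates them numerically for the running example f(x, y) = x^2 y + y. The helper `second_partial`, the central-difference scheme, and the evaluation point (2, 3) are assumptions made for illustration.

```python
def f(x, y):
    """Running example f(x, y) = x^2 y + y."""
    return x ** 2 * y + y


def second_partial(func, point, i, j, eps=1e-4):
    """Central-difference estimate of d/dx_i (d func / dx_j) at point."""
    def shifted(di, dj):
        p = list(point)
        p[i] += di
        p[j] += dj
        return func(*p)

    return (shifted(eps, eps) - shifted(eps, -eps)
            - shifted(-eps, eps) + shifted(-eps, -eps)) / (4 * eps ** 2)


point = (2.0, 3.0)
f_xx = second_partial(f, point, 0, 0)  # analytically 2y
f_xy = second_partial(f, point, 0, 1)  # analytically 2x
f_yx = second_partial(f, point, 1, 0)  # analytically 2x
f_yy = second_partial(f, point, 1, 1)  # analytically 0

print(f_xx, f_xy, f_yx, f_yy)
```

For this function, the two mixed derivatives agree with each other, and the estimate for the x-direction matches the result 2y from Equation D.32.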
\[ f(g, h) = g^2 h + h, \tag{D.35} \]
where $g(x) = 3x$, and $h(x) = x^2$. So, as it turns out, our function is a
composition of two functions:
\[ f(g(x), h(x)). \tag{D.36} \]
Previously, in Section D.4, we defined the chain rule for the univariate case
as follows:
\[ \frac{d}{dx} f(g(x)) = \frac{df}{dg} \cdot \frac{dg}{dx}. \tag{D.37} \]
To apply this concept to multivariable functions, we extend the notation
above by summing one such product per function argument. Hence, we can
define the multivariable chain rule as follows:
\[ \frac{d}{dx} f(g(x), h(x)) = \frac{\partial f}{\partial g} \cdot \frac{dg}{dx} + \frac{\partial f}{\partial h} \cdot \frac{dh}{dx}. \tag{D.38} \]
Applying the multivariable chain rule to our multivariable function example
$f(g, h) = g^2 h + h$, let us start with the partial derivatives:
\[ \frac{\partial f}{\partial g} = 2gh \tag{D.39} \]
and
\[ \frac{\partial f}{\partial h} = g^2 + 1. \tag{D.40} \]
Next, we take the ordinary derivatives of the two functions g and h:
\[ \frac{dg}{dx} = \frac{d}{dx} 3x = 3 \tag{D.41} \]
\[ \frac{dh}{dx} = \frac{d}{dx} x^2 = 2x. \tag{D.42} \]
And finally, plugging everything into our multivariable chain rule definition,
we arrive at
\[ \frac{d}{dx} f(g(x), h(x)) = [2gh \cdot 3] + [(g^2 + 1) \cdot 2x] = 2x g^2 + 6gh + 2x. \tag{D.43} \]
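As a cross-check of the result, we can substitute g(x) = 3x and h(x) = x^2 directly, which gives f = 9x^4 + x^2 and therefore a derivative of 36x^3 + 2x. The sketch below compares this with the sum of products from the multivariable chain rule. The function names are illustrative, not taken from the text.

```python
def g(x):
    return 3 * x


def h(x):
    return x ** 2


def chain_rule_derivative(x):
    """Multivariable chain rule (Equation D.38) for f(g, h) = g^2 h + h."""
    g_val, h_val = g(x), h(x)
    df_dg = 2 * g_val * h_val   # Equation D.39
    df_dh = g_val ** 2 + 1      # Equation D.40
    dg_dx = 3                   # Equation D.41
    dh_dx = 2 * x               # Equation D.42
    return df_dg * dg_dx + df_dh * dh_dx


def direct_derivative(x):
    """Derivative of 9x^4 + x^2, obtained by substituting g and h first."""
    return 36 * x ** 3 + 2 * x


print(chain_rule_derivative(2.0), direct_derivative(2.0))  # 292.0 292.0
```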
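\[ \frac{d}{dx} f(g(x), h(x)) = \frac{\partial f}{\partial g} \cdot \frac{dg}{dx} + \frac{\partial f}{\partial h} \cdot \frac{dh}{dx} \tag{D.44} \]
as follows:
\[ \frac{d}{dx} f(g(x), h(x)) = \nabla f \cdot \mathbf{v}'(x). \tag{D.45} \]
Here, v is a vector listing the function arguments:
\[ \mathbf{v}(x) = \begin{bmatrix} g(x) \\ h(x) \end{bmatrix}. \tag{D.46} \]
And the derivative ("v-prime" in Lagrange notation) is defined as follows:
\[ \mathbf{v}'(x) = \frac{d}{dx} \begin{bmatrix} g(x) \\ h(x) \end{bmatrix} = \begin{bmatrix} dg/dx \\ dh/dx \end{bmatrix}. \tag{D.47} \]
So, putting everything together, we have
\[ \nabla f \cdot \mathbf{v}'(x) = \begin{bmatrix} \partial f/\partial g \\ \partial f/\partial h \end{bmatrix} \cdot \begin{bmatrix} dg/dx \\ dh/dx \end{bmatrix} = \frac{\partial f}{\partial g} \cdot \frac{dg}{dx} + \frac{\partial f}{\partial h} \cdot \frac{dh}{dx}. \tag{D.48} \]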
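The vector form translates almost literally into code: one list holds the gradient, another holds v'(x), and a dot product combines them. This sketch reuses the example g(x) = 3x and h(x) = x^2 from earlier in the section. The helper names (`grad_f`, `v_prime`, `dot`, `total_derivative`) are hypothetical.

```python
def f(g_val, h_val):
    """Example function f(g, h) = g^2 h + h."""
    return g_val ** 2 * h_val + h_val


def grad_f(g_val, h_val):
    """Gradient of f with respect to its arguments g and h."""
    return [2 * g_val * h_val, g_val ** 2 + 1]


def v(x):
    """Vector of function arguments [g(x), h(x)]."""
    return [3 * x, x ** 2]


def v_prime(x):
    """Elementwise derivative [dg/dx, dh/dx]."""
    return [3.0, 2 * x]


def dot(a, b):
    """Dot product of two equally long lists."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def total_derivative(x):
    """d/dx f(g(x), h(x)) computed as the dot product in Equation D.45."""
    return dot(grad_f(*v(x)), v_prime(x))


x = 2.0
eps = 1e-6
numeric = (f(*v(x + eps)) - f(*v(x - eps))) / (2 * eps)

print(total_derivative(x))  # 292.0
print(numeric)
```

Writing the rule as a dot product scales naturally to more than two arguments, because only the lengths of the two lists change.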
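\[ Hf = \begin{bmatrix} \partial^2 f/\partial x^2 & \partial^2 f/\partial x \partial y \\ \partial^2 f/\partial y \partial x & \partial^2 f/\partial y^2 \end{bmatrix}. \tag{D.50} \]
To formulate the Hessian for a multivariable function that takes n arguments,

\[ \Delta f(g(x), h(x)) = \begin{bmatrix} \partial/\partial g \\ \partial/\partial h \end{bmatrix} \cdot \begin{bmatrix} \partial/\partial g \\ \partial/\partial h \end{bmatrix} f = \frac{\partial^2 f}{\partial g^2} + \frac{\partial^2 f}{\partial h^2}. \tag{D.55} \]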
\[ \Delta f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}. \tag{D.57} \]
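The sum in Equation D.57 can be approximated numerically by adding up one second-order central difference per input argument. The sketch below does this for the running example f(x, y) = x^2 y + y, whose Laplacian is 2y + 0 = 2y. The helper `laplacian`, the step size, and the test point are assumptions for illustration.

```python
def f(x, y):
    """Running example f(x, y) = x^2 y + y."""
    return x ** 2 * y + y


def laplacian(func, point, eps=1e-4):
    """Sum of unmixed second partial derivatives via central differences."""
    total = 0.0
    center = func(*point)
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += eps
        minus[i] -= eps
        total += (func(*plus) - 2 * center + func(*minus)) / eps ** 2
    return total


result = laplacian(f, (2.0, 3.0))
print(result)  # close to 2y = 6 at y = 3
```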
Abbreviations and Terms
Index