Chapter 1: Introduction and Mathematical Preliminaries
Evy Kersalé
Motivation
Most of the mathematical problems you have encountered so far can be solved analytically. However, in real life, analytic solutions are rather rare, and we must therefore devise ways of approximating the solutions.
For example, while $\int_1^2 e^x\,dx$ has a well-known analytic solution, $\int_1^2 e^{x^2}\,dx$ can only be solved in terms of special functions and $\int_1^2 e^{x^3}\,dx$ has no analytic solution.
Definition
Traditionally, numerical algorithms are built upon the simplest arithmetic operations (+, −, × and ÷).
Computer arithmetic: Floating point numbers.
Computers can store integers exactly but not real numbers in general. Instead, they approximate them as floating point numbers.
A decimal floating point (or machine number) is a number of the form
$\pm\,\underbrace{0.d_1 d_2 \ldots d_k}_{m} \times 10^{\pm n}, \qquad 0 \le d_i \le 9, \quad d_1 \ne 0,$
where the significand or mantissa m (i.e. the fractional part) and the exponent n are fixed-length integers. (m cannot start with a zero.)
In fact, computers use binary numbers (base 2) rather than decimal numbers (base 10), but the same principle applies (see handout).
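Not part of the original notes: as a rough illustration, the Python sketch below splits a number into the normalised decimal mantissa and exponent of the form above. The function name decimal_float and the use of Python's decimal module are choices made here for the example only.

```python
from decimal import Decimal

def decimal_float(x, k):
    """Return (sign, digits, n) so that x ~ sign * 0.d1...dk * 10**n, with d1 != 0."""
    if x == 0:
        return 1, "0" * k, 0
    sign = 1 if x > 0 else -1
    t = Decimal(str(abs(x))).normalize().as_tuple()
    digits = "".join(map(str, t.digits))
    mantissa = digits[:k].ljust(k, "0")   # keep k digits (chopping)
    n = len(t.digits) + t.exponent        # exponent such that the mantissa reads 0.d1...dk
    return sign, mantissa, n

print(decimal_float(3.14159265, 5))   # (1, '31415', 1),  i.e. +0.31415 x 10^1
print(decimal_float(0.00123, 3))      # (1, '123', -2),   i.e. +0.123   x 10^-2
```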
Computer arithmetic: Machine ε.
Consider a simple computer where m is 3 digits long and n is one digit long. The smallest positive number this computer can store is 0.1 × 10⁻⁹ and the largest is 0.999 × 10⁹.
Thus, the length of the exponent determines the range of numbers that can be stored.
However, not all values in the range can be distinguished: numbers can only be recorded to a certain relative accuracy ε.
For example, on our simple computer, the next floating point number after 1 = 0.1 × 10¹ is 0.101 × 10¹ = 1.01. The quantity ε_machine = 0.01 (machine ε) is the worst relative uncertainty in the floating point representation of a number.
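As an aside (not in the original slides), the same idea can be checked on an actual computer, which works in binary double precision: the gap between 1.0 and the next representable number is the machine epsilon, about 2.2 × 10⁻¹⁶. The sketch assumes Python 3.9+ for math.nextafter; the last line mimics the toy 3-digit decimal machine.

```python
import math
import sys

# Gap between 1.0 and the next representable double = machine epsilon (binary, 52-bit mantissa).
print(math.nextafter(1.0, 2.0) - 1.0)   # ~2.220446049250313e-16
print(sys.float_info.epsilon)           # the same value, as reported by the runtime

# The toy 3-digit decimal machine of the slide: the number after 0.100 x 10^1 is 0.101 x 10^1,
# so its machine epsilon is 0.01.
print(0.101e1 - 0.100e1)                # ~0.01 (up to binary round-off)
```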
Chopping and rounding
There are two ways of terminating the mantissa of the k-digit decimal machine number approximating $0.d_1 d_2 \ldots d_k d_{k+1} d_{k+2} \ldots \times 10^n$, $0 \le d_i \le 9$, $d_1 \ne 0$:
▶ Chopping: chop off the digits $d_{k+1}, d_{k+2}, \ldots$ to get $0.d_1 d_2 \ldots d_k \times 10^n$.
▶ Rounding: add $5 \times 10^{n-(k+1)}$ and chop off the digits $d_{k+1}, d_{k+2}, \ldots$ (If $d_{k+1} \ge 5$ we add 1 to $d_k$ before chopping.)
Rounding is more accurate than chopping.
Example
The five-digit floating-point form of π = 3.14159265359… is 0.31415 × 10 using chopping and 0.31416 × 10 using rounding.
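A possible sketch of k-digit chopping and rounding, reproducing the π example above (not from the notes; terminate is a hypothetical helper name, and Python's decimal module is used so the decimal digits are handled exactly).

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP
import math

def terminate(x, k, mode):
    """k-digit decimal floating point form of x > 0, obtained by chopping or rounding."""
    d = Decimal(str(x))
    n = math.floor(math.log10(x)) + 1            # exponent putting the mantissa in [0.1, 1)
    mantissa = (d / Decimal(10) ** n).quantize(
        Decimal(1).scaleb(-k),                   # keep k digits after the decimal point
        rounding=ROUND_DOWN if mode == "chop" else ROUND_HALF_UP)
    return mantissa, n

print(terminate(math.pi, 5, "chop"))    # (Decimal('0.31415'), 1)
print(terminate(math.pi, 5, "round"))   # (Decimal('0.31416'), 1)
```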
Measure of the error
Let p* be the result of a numerical calculation and p the exact answer (i.e. p* is an approximation to p). We define two measures of the error:
▶ Absolute error: E = |p − p*|
▶ Relative error: Er = |p − p*|/|p| (provided p ≠ 0), which takes into consideration the size of the value.
Example
If p = 2 and p* = 2.1, the absolute error is E = 10⁻¹; if p = 2 × 10⁻³ and p* = 2.1 × 10⁻³, E = 10⁻⁴ is smaller; and if p = 2 × 10³ and p* = 2.1 × 10³, E = 10² is larger; but in all three cases the relative error remains the same, Er = 5 × 10⁻².
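A tiny numerical check of this example, assuming nothing beyond the definitions above (abs_err and rel_err are illustrative names, not from the notes):

```python
def abs_err(p, p_star):
    return abs(p - p_star)

def rel_err(p, p_star):
    return abs(p - p_star) / abs(p)     # only defined for p != 0

for p, p_star in [(2, 2.1), (2e-3, 2.1e-3), (2e3, 2.1e3)]:
    print(abs_err(p, p_star), rel_err(p, p_star))
# The absolute errors (0.1, 1e-4, 100) differ widely,
# but the relative error is about 0.05 in all three cases (up to binary round-off).
```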
Round-off errors
Example
The 4-digit representation of x = √2 = 1.4142136… is x* = 1.414 = 0.1414 × 10. Using 4-digit arithmetic, we can evaluate x*² = 1.999 ≠ 2, due to round-off errors.
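One way to reproduce the 4-digit arithmetic of this example (an emulation chosen for the sketch, not part of the notes) is Python's decimal module with the precision set to 4 significant digits:

```python
from decimal import Decimal, getcontext

getcontext().prec = 4          # emulate 4-significant-digit arithmetic
x_star = Decimal(2).sqrt()     # the 4-digit representation of sqrt(2)
print(x_star)                  # 1.414
print(x_star * x_star)         # 1.999, not 2, because of round-off
```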
Magnification of the error.
Consider the sum x + y computed from the stored values x* and y*, each carrying a relative round-off error of at most ε. The absolute error of the result is then the sum of the errors in x and y, E = ε(|x*| + |y*|), but the relative error of the answer is
$$E_r = \varepsilon\,\frac{|x^*| + |y^*|}{|x^* + y^*|}.$$
If x* and y* both have the same sign the relative accuracy remains equal to ε, but if they have opposite signs the relative error will be larger.
This magnification becomes particularly significant when two very close numbers are subtracted.
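A small illustration of this magnification (the particular numbers are chosen here for the sketch and are not from the notes): each value is stored to about 7 significant figures, yet the relative error of the sum is orders of magnitude larger because the leading digits cancel.

```python
# Each value is stored to about 7 significant figures (relative error ~ 1e-7).
x = 1.2345678901234
y = -1.2345

x_star = float(f"{x:.7g}")    # stored value: 1.234568
y_star = float(f"{y:.7g}")    # stored value: -1.2345 (already short, no extra error)

exact = x + y
computed = x_star + y_star
print(exact, computed)
print(abs(computed - exact) / abs(exact))   # ~1e-3: far larger than the ~1e-7 storage error
```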
Magnification of the error: Example
The example is the forward-difference approximation of a derivative, f'(x0) ≈ (f(x0 + h) − f(x0))/h, which requires subtracting two nearly equal numbers when h is small.
Truncation errors: Example continued
Using a finite value of h leads to a truncation error of size ≃ (h/2)|f''(x0)|.
The absolute round-off error in f(x0 + h) − f(x0) is 2ε|f(x0)| and that in the derivative (f(x0 + h) − f(x0))/h is (2ε/h)|f(x0)|.
So, the round-off error increases with decreasing h.
The total relative error of the computed derivative is therefore
$$E_r = \frac{h}{2}\,\frac{|f''|}{|f'|} + \frac{2\varepsilon}{h}\,\frac{|f|}{|f'|},$$
which has a minimum, $\min(E_r) = 2\sqrt{\varepsilon}\,\sqrt{|f f''|}\,/\,|f'|$, for $h_m = 2\sqrt{\varepsilon}\,\sqrt{|f/f''|}$.
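A sketch of this trade-off, using f(x) = eˣ at x0 = 1 (a choice made for the illustration, not from the notes); since f = f' = f'' for the exponential, the predicted optimal step is h_m ≈ 2√ε ≈ 3 × 10⁻⁸.

```python
import math

f = math.exp                  # for f(x) = e**x we have f = f' = f'', so h_m = 2*sqrt(eps)
x0 = 1.0
exact = math.exp(x0)          # the exact derivative f'(x0)

eps = 2.0 ** -52              # machine epsilon of double precision
h_m = 2.0 * math.sqrt(eps)    # predicted optimal step, about 3e-8

for h in (1e-2, 1e-4, 1e-6, h_m, 1e-10, 1e-12):
    approx = (f(x0 + h) - f(x0)) / h                 # forward difference
    print(f"h = {h:.1e}   relative error = {abs(approx - exact) / exact:.1e}")
# The error first falls with h (truncation ~ h/2) and then rises again (round-off ~ 2*eps/h),
# with the smallest error near h = h_m.
```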