Rounding and Machine Arithmetic
Related Matters
x ∈ R iff x = ±(b_n · 2^n + b_{n−1} · 2^{n−1} + . . . + b_0 + b_{−1} · 2^{−1} + b_{−2} · 2^{−2} + . . .), (1)
or
x = ±(b_n b_{n−1} . . . b_0 . b_{−1} b_{−2} . . .)_2, (2)
where n ≥ 0 is some integer and the "binary digits" b_i satisfy
b_i = 0 or b_i = 1, for all i.
Note:
1.2 Machine numbers:
Floating-point representation (recall: scientific notation): a machine number x in the system R(t, s) has the form
x = ±f · 2^e,
where
f : the mantissa of x,
e : the integer exponent of x,
2 : the base,
t : the number of binary digits of the fractional part (t-digit precision),
s : the number of binary digits of the exponent.
2^{−1} ≤ |f| ≤ 2^{−1} + 2^{−2} + . . . + 2^{−t} = 1 − 2^{−t};
2^{−(1+2+...+2^{s−1})} = 2^{−(2^s − 1)} ≤ 2^e ≤ 2^{1+2+...+2^{s−1}} = 2^{(2^s − 1)}.
In order to increase the precision, one can use two machine registers to
represent a machine number. In effect, one then embeds R(t, s) ⊂ R(2t, s)
and calls x ∈ R(2t, s) a double-precision number.
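As an illustration of the bounds above, the following sketch evaluates the extreme magnitudes of a small R(t, s) system; the parameters t = 4 and s = 3 are assumed for illustration only and are not from the text.

```python
# Illustrative sketch of the R(t, s) system described above.
# Assumed parameters (not from the text): t = 4 mantissa bits, s = 3 exponent bits.
t, s = 4, 3

e_max = 2**s - 1             # largest exponent magnitude: 1 + 2 + ... + 2^(s-1) = 2^s - 1
f_min = 2**-1                # smallest normalized mantissa: 1/2
f_max = 1 - 2**-t            # largest mantissa: 2^-1 + 2^-2 + ... + 2^-t

smallest = f_min * 2**-e_max   # smallest positive normalized number
largest = f_max * 2**e_max     # largest representable number

print(smallest)   # 0.5 * 2^-7 = 0.00390625
print(largest)    # 0.9375 * 2^7 = 120.0
```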
2 Rounding
If a real number is too long for the machine register, its tail end is
cut off; if it is too short, it is padded with zeros at the end. More
specifically, let
x ∈ R iff x = ±(∑_{k=1}^{∞} b_{−k} 2^{−k}) 2^e, (7)
be the “exact” real number (in normalized floating-point form)
and
x* ∈ R iff x* = ±(∑_{k=1}^{t} b*_{−k} 2^{−k}) 2^e, (8)
be the rounded number.
There are two commonly used ways of translating a given "exact"
real number x into a t-digit, base-β (for instance, β = 2 in the binary
system) floating-point number fl(x): chopping and symmetric rounding.
Definition:
• Absolute error= |approximate value - true value|
2.1 Chopping
One takes b*_{−k} = b_{−k} for k = 1, . . . , t, i.e., the tail of the expansion is simply discarded; the relative error is then bounded in magnitude by
|x* − x|/|x| ≤ 2 · 2^{−t}. (16)
H.W.1:
1. Prove that (x* − x)/x ≤ 0;
2. Prove that (x* − x)/x ≥ −2 · 2^{−t}.
and
fl(−838) = −(0.84) · 10^3 with symmetric rounding, −(0.83) · 10^3 with chopping. (20)
1 H.W. means that you should work these problems at home; there is no need to submit
them. However, these problems may appear in a quiz or on the midterm/final.
Remark: Set
(rd(x) − x)/x = ϵ, |ϵ| ≤ eps; (24)
then we have
rd(x) = x(1 + ϵ), |ϵ| ≤ eps, (25)
and one defers dealing with the inequality (for ϵ) to the very end.
3. The rounding error: it is widespread, but causes little harm unless
there are some weak spots in the computation.
Assumption:
1. Arithmetic operations are carried out exactly, and the operands x, y are
rounded and represented by
x* = x(1 + ϵ_x), (26)
y* = y(1 + ϵ_y). (27)
3.2.1 Multiplication
fl(x · y) = x* · y* = x(1 + ϵ_x) · y(1 + ϵ_y)
= x · y(1 + ϵ_x + ϵ_y + ϵ_x ϵ_y)
≈ x · y(1 + ϵ_x + ϵ_y), (the term ϵ_x ϵ_y is second order!) (28)
3.2.2 Division
fl(x/y) = x*/y* = x(1 + ϵ_x) / (y(1 + ϵ_y))
= x(1 + ϵ_x)(1 − ϵ_y) / (y(1 − ϵ_y^2))
≈ (x/y)(1 + ϵ_x − ϵ_y − ϵ_x ϵ_y)
≈ (x/y)(1 + ϵ_x − ϵ_y), (the term ϵ_x ϵ_y is second order!) (30)
∴ the relative error is:
ϵ_{x/y} = (fl(x/y) − x/y)/(x/y) = ϵ_x − ϵ_y. (31)
Division is also a benign operation!
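A quick numerical sketch of Eqs. (28)-(31): for operands carrying relative errors ϵ_x and ϵ_y, the product's relative error is ≈ ϵ_x + ϵ_y and the quotient's is ≈ ϵ_x − ϵ_y. The values of x, y, ϵ_x, ϵ_y below are arbitrary illustrations.

```python
# Model of Eqs. (26)-(31): operands with known relative errors eps_x, eps_y.
x, y = 3.0, 7.0
eps_x, eps_y = 1e-8, -2e-8

xs = x * (1 + eps_x)          # rounded operand x*
ys = y * (1 + eps_y)          # rounded operand y*

rel_mul = (xs * ys - x * y) / (x * y)   # ~ eps_x + eps_y (Eq. 28)
rel_div = (xs / ys - x / y) / (x / y)   # ~ eps_x - eps_y (Eq. 31)

print(rel_mul)   # ~ -1e-8
print(rel_div)   # ~  3e-8
```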
fl(x + y) = x* + y* = x(1 + ϵ_x) + y(1 + ϵ_y)
= x + xϵ_x + y + yϵ_y
= x + y + xϵ_x + yϵ_y,
assuming x + y ≠ 0. Therefore,
fl(x + y) = (x + y)(1 + (xϵ_x + yϵ_y)/(x + y))
= (x + y)(1 + (x/(x + y)) ϵ_x + (y/(x + y)) ϵ_y). (32)
∴ the relative error is:
ϵ_{x+y} = (fl(x + y) − (x + y))/(x + y) = (x/(x + y)) ϵ_x + (y/(x + y)) ϵ_y. (33)
Note:
• The error of addition or subtraction in the result is a linear combi-
nation of the errors in the data, but the coefficients are no longer ±1
and can become arbitrarily large.
• Two cases to be discussed in (33):

x     = (1 0 1 1 0 0 1 0 1 b  b  g g g g) · 2^e
y     = (1 0 1 1 0 0 1 0 1 b′ b′ g g g g) · 2^e
x − y = (0 0 0 0 0 0 0 0 0 b″ b″ g g g g) · 2^e
      = (b″ b″ g g g g ? ? ? ? ? ? ? ? ?) · 2^{e−9}

Notation:
1. b, b′ and b″: reliable binary digits
e^{−x} = 1 − x + x^2/2! − x^3/3! + . . . , x > 0,
may give disastrous results because of catastrophic cancellation. (Hint:
use e^{−x} = 1/e^x instead.)
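The hint above can be checked numerically: the following sketch sums the alternating series directly and via e^{−x} = 1/e^x. The choice x = 20 and the 200-term cutoff are illustrative assumptions.

```python
import math

def exp_series(x, nterms=200):
    """Partial sum of the Taylor series for exp(x)."""
    s, term = 1.0, 1.0
    for k in range(1, nterms):
        term *= x / k           # term = x^k / k!
        s += term
    return s

x = 20.0
naive = exp_series(-x)          # alternating series: catastrophic cancellation
stable = 1.0 / exp_series(x)    # all-positive series: benign

print(naive, stable, math.exp(-x))
```

The alternating sum passes through terms as large as 20^20/20! ≈ 4·10^7, so the rounding noise alone swamps the tiny true value e^{−20} ≈ 2·10^{−9}, while the all-positive sum keeps full accuracy.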
Examples:
(1) Algebraic operations [Stewart(1996)]:
On a 5-decimal-digit computer:
The result agrees with the true r1. To calculate r2, if we use
r2 = c/r1,
we will have r2 = 5.6558 · 10^{−4}, which is as accurate as we expect.
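The recipe r2 = c/r1 can be sketched for a generic monic quadratic x^2 + bx + c = 0 with real roots: compute the large-magnitude root from the quadratic formula with the non-cancelling sign, then the small root from Vieta's relation r1 · r2 = c. The coefficients below are hypothetical, not the text's actual example.

```python
import math

def roots_stable(b, c):
    """Real roots of x^2 + b*x + c = 0, avoiding cancellation in the small root."""
    d = math.sqrt(b * b - 4.0 * c)       # assumes b*b >= 4*c (real roots)
    # Choose the sign so that -b and +/-d add instead of cancel:
    r1 = (-b - d) / 2.0 if b >= 0.0 else (-b + d) / 2.0   # large-magnitude root
    r2 = c / r1                                           # Vieta: r1 * r2 = c
    return r1, r2

r1, r2 = roots_stable(-1e8, 1.0)    # roots ~ 1e8 and ~ 1e-8
print(r1, r2)
```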
(3) Compute y = √(x + δ) − √x, x > 0 and |δ| ≪ 1.
Writing instead
y = δ/(√(x + δ) + √x)
completely removes the cancellation error!
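A sketch comparing the two forms of example (3); the values x = 10^8 and δ = 10^{−4} are illustrative choices.

```python
import math

x, delta = 1.0e8, 1.0e-4     # illustrative values with |delta| << x

naive = math.sqrt(x + delta) - math.sqrt(x)        # cancellation: few correct digits
rewritten = delta / (math.sqrt(x + delta) + math.sqrt(x))

print(naive, rewritten)      # both ~ 5e-9, but only `rewritten` is fully accurate
```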
(4) Compute y = cos(x + δ) − cos(x), |δ| ≪ 1.
Write y in the equivalent form [Gautschi(1997)]
y = −2 sin(δ/2) sin(x + δ/2).
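A sketch of example (4); the values x = 1.2 and δ = 10^{−9} are illustrative.

```python
import math

x, delta = 1.2, 1.0e-9       # illustrative values

naive = math.cos(x + delta) - math.cos(x)                   # cancellation
identity = -2.0 * math.sin(delta / 2.0) * math.sin(x + delta / 2.0)

print(naive, identity)       # the identity form keeps full accuracy
```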
(5) Compute y = f(x + δ) − f(x), |δ| ≪ 1, where f is a function.
If f is smooth enough around x, we have [Gautschi(1997), Heath(1997)]:
y = f′(x)δ + (1/2) f″(x)δ^2 + . . . . (36)
The terms in this series decrease rapidly, so the cancellation is no longer
a problem.
Consider the finite-difference approximation to the first derivative,
f′(x) ≈ (f(x + δ) − f(x))/δ. (37)
From Eq. (36), there is a truncation error E_T = (1/2)|f″(x′)|δ in
Eq. (37), and with smaller δ the truncation error becomes smaller. But it
is also known that, for f(x + δ) − f(x), there is some cancellation
error E_c, and to reduce the cancellation error one needs a larger δ. The total
error in computing the first derivative by Eq. (37) is
E = E_T + E_c. (38)
Thus, there is a trade-off between truncation error and rounding error in
choosing the size of δ. How should one choose the optimal δ?
Assume that, for all x, both |f(x)| and |f″(x)| are bounded by M, i.e.,
|f(x)| ≤ M, |f″(x)| ≤ M. (39)
Then E_T ≤ (1/2)Mδ. The cancellation error E_c can be estimated as
E_c = |f′(x) − f′(x)(1 + ϵ)| (40)
= |f′(x)| ϵ (41)
= |(f(x + δ) − f(x))/δ| ϵ (42)
≤ ((|f(x + δ)| + |f(x)|)/δ) ϵ (43)
≤ 2Mϵ/δ. (44)
The total error can therefore be bounded by (1/2)Mδ + 2Mϵ/δ, which is minimized when
δ = 2√ϵ. (45)
If we also assume that the function values are accurate to machine
precision, we have
δ ≈ 2√eps. (46)
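The trade-off of Eqs. (37)-(46) can be observed directly; the sketch below uses f = exp at x = 1 as an illustrative choice (exact derivative e) and probes three regimes of δ.

```python
import math

# f = exp at x = 1 (illustrative choice); the exact derivative is e.
f, fprime, x = math.exp, math.exp, 1.0

def fd_error(delta):
    """Absolute error of the forward difference (f(x+delta) - f(x)) / delta."""
    return abs((f(x + delta) - f(x)) / delta - fprime(x))

err_big = fd_error(1e-2)     # truncation-dominated regime
err_opt = fd_error(2e-8)     # near the optimum delta ~ 2*sqrt(eps) ~ 3e-8
err_tiny = fd_error(1e-14)   # rounding/cancellation-dominated regime

print(err_big, err_opt, err_tiny)
```

The error first decreases with δ (truncation shrinks), then increases again as the subtraction f(x+δ) − f(x) loses digits, in line with Eq. (46).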
4 The Condition of a Problem
A problem, viewed as a black box, typically has an input and an output, as
shown in Fig. 2: the box P accepts some input x and then solves the problem
for this input to produce the output y.
f : R^m → R^n, y = f(x)
For this problem, if the input coefficients a_i, i = 0, . . . , 4, are rounded to six
digits, the polynomial becomes
Then one finds p(10.11) = 0.0015, while pe(10.11) = −0.0481. The er-
roneous value is more than 30 times larger in magnitude than the correct 0.0015. The
present problem is therefore very sensitive to small changes in the
inputs. How can one quantify and measure this sensitivity?
Ex. 2 [Vandenberghe and Boyd(2011)]:
Problem (P): solve the linear algebraic equation Ax = b, where
A = (1/2) ( 1            1
            1 + 10^{−10}   1 − 10^{−10} )
and
b = ( 1
      1 ).
Input (x): A, b;
Output (y): x.
Therefore, a small △b can lead to an extremely large △x. How can one quantify
and measure this sensitivity?
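A sketch of this example in plain Python, solving the 2×2 system by Cramer's rule with the matrix as reconstructed above; the perturbation size 10^{−8} in b is an illustrative choice.

```python
# A = (1/2) * [[1, 1], [1 + eps, 1 - eps]] is nearly singular for eps = 1e-10,
# so a tiny change in b produces a huge change in the solution x.
eps = 1e-10
a11, a12 = 0.5, 0.5
a21, a22 = 0.5 * (1 + eps), 0.5 * (1 - eps)

def solve(b1, b2):
    """Cramer's rule for the 2x2 system A x = b."""
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

x = solve(1.0, 1.0)              # unperturbed: x = (1, 1)
x_pert = solve(1.0, 1.0 + 1e-8)  # tiny change in b

print(x)        # ~ (1, 1)
print(x_pert)   # ~ (101, -99): a 1e-8 change in b moved x by ~100
```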
3. x ≠ 0, y ≠ 0
Problem:
x* = x + △x,
y* = f(x*),
△y = y* − y = f(x*) − f(x) = f(x + △x) − f(x) ≈ f′(x)△x.
Comparing the relative changes △y/y and △x/x: how sensitive is y to x at some point x?
Condition number:
∴ (Cond f)(x) := lim_{△x→0} (△y/y)/(△x/x) = x f′(x)/f(x). (48)
Remark:
1. If x = 0 and y ≠ 0, then the condition number can be given as
(Cond f)(x) := lim_{△x→0} (△y/y)/△x = f′(x)/f(x). (49)
3. If x = 0 and y = 0, then the condition number will simply be
(Cond f)(x) := lim_{△x→0} △y/△x = f′(x). (51)
Note that:
Example:
Solution:
I_n = ∫_0^1 t^n/(t + 5) dt
= ∫_0^1 t^{n−1} · t/(t + 5) dt
= ∫_0^1 t^{n−1} (1 − 5/(t + 5)) dt
= ∫_0^1 t^{n−1} dt − 5 ∫_0^1 t^{n−1}/(t + 5) dt
= [t^n/n]_0^1 − 5 I_{n−1}
= 1/n − 5 I_{n−1}. (52)
12
Therefore, by unrolling this recurrence, we have:
I_n = −5 I_{n−1} + 1/n, (53)
−5 I_{n−1} = (−5)^2 I_{n−2} + (−5) · 1/(n − 1), (54)
⋮ (55)
(−5)^{n−1} I_1 = (−5)^n I_0 + (−5)^{n−1}. (56)
Summing these equations yields
I_n = (−5)^n I_0 + p_n,
where p_n is some number independent of I_0 (= ∫_0^1 t^0/(t + 5) dt = ∫_0^1 1/(t + 5) dt = ln(6/5)).
In this problem:
Because t ∈ (0, 1), I_n decreases monotonically in n, so
(Cond f)(I_{n+1}) = I_{n+1}/(5 I_n) < I_n/(5 I_n) = 1/5 < 1. (61)
∴ recursing from I_{n+1} to I_n is well-conditioned.
This tells us that, by reversing the recurrence, i.e., running it from ν down to
n with ν > n, the problem may become well-conditioned. By unrolling
Eq. (59), we have:
I_n = (−1/5) I_{n+1} + (1/5) · 1/(n + 1), (62)
(−1/5) I_{n+1} = (−1/5)^2 I_{n+2} + (−1/5)(1/5) · 1/(n + 2), (63)
⋮ (64)
(−1/5)^{ν−n−1} I_{ν−1} = (−1/5)^{ν−n} I_ν + (−1/5)^{ν−n−1} (1/5)(1/ν). (65)
Summation of these equations yields:
I_n = (−1/5)^{ν−n} I_ν + p̂_n, (66)
where p̂_n is some number independent of I_ν.
In this problem:
(Cond f)(I_ν) = I_ν (1/5)^{ν−n}/I_n < I_n (1/5)^{ν−n}/I_n = (1/5)^{ν−n}. (68)
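The two directions can be contrasted numerically. The sketch below runs the forward recursion I_n = 1/n − 5 I_{n−1} with a deliberately injected 10^{−10} error in I_0 (mimicking its rounding), and the backward recursion I_n = (1/(n+1) − I_{n+1})/5 from the crude guess I_40 ≈ 0; the target index 20 and starting index 40 are illustrative choices.

```python
import math

I0 = math.log(6.0 / 5.0)          # I_0 = ln(6/5)

# Forward recursion I_n = 1/n - 5*I_{n-1}: an initial error is amplified by 5^n.
# We inject a tiny perturbation 1e-10 into I_0 to make the amplification visible.
I = I0 + 1e-10
for n in range(1, 21):
    I = 1.0 / n - 5.0 * I
forward_I20 = I                   # off by roughly 5**20 * 1e-10 ~ 1e4

# Backward recursion I_n = (1/(n+1) - I_{n+1})/5, started from the crude guess
# I_40 ~ 0: the starting error is damped by (1/5)**20.
I = 0.0
for n in range(39, 19, -1):
    I = (1.0 / (n + 1) - I) / 5.0
backward_I20 = I                  # accurate: I_20 ~ 0.008

print(forward_I20, backward_I20)
```

Since 1/(6(n+1)) < I_n < 1/(5(n+1)), the backward result lands in the valid bracket for I_20 while the forward result is off by orders of magnitude.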
2. Calculate the condition number of f(x) = √x, x ≥ 0. [Conte and de Boor(1980)]
∴ f′(x) = 1/(2√x). Hence, the condition number of f is
(Cond f)(x) = |x f′(x)/f(x)| = x · (1/(2√x))/√x = 1/2, (69)
which indicates that taking a square root is a well-conditioned process.
3. Calculate the condition number of f(x) = √(x + 1) − √x, x ≥ 0. [Conte and de Boor(1980)]
∴ f′(x) = (1/2)(1/√(x + 1) − 1/√x). Hence, the condition number of f is
(Cond f)(x) = |x f′(x)/f(x)|
= |x · (1/2)(1/√(x + 1) − 1/√x)/(√(x + 1) − √x)|
= (1/2) · x/(√(x + 1) √x)
= (1/2) √(x/(x + 1)). (70)
∥x∥_p := (∑_{i=1}^{n} |x_i|^p)^{1/p}. (73)
• 1-norm:
∥x∥_1 = ∑_{i=1}^{n} |x_i|. (74)
• 2-norm:
∥x∥_2 = (∑_{i=1}^{n} |x_i|^2)^{1/2}. (75)
• ∞-norm:
∥x∥_∞ = max_{1≤i≤n} |x_i|. (76)
(a) ∥x∥ > 0, if x ≠ 0;
(b) ∥γx∥ = |γ| · ∥x∥, for any scalar γ;
(c) ∥x + y∥ ≤ ∥x∥ + ∥y∥;
(d) ∥x − y∥ ≥ ∥x∥ − ∥y∥;
2. Matrix Norms
∥A∥ := max_{x≠0} ∥Ax∥/∥x∥. (77)
Remark (important special cases):
• 2-norm:
∥A∥_2 = max_{x≠0} ∥Ax∥_2/∥x∥_2 = max{√λ : λ is an eigenvalue of A^T A}. (79)
• ∞-norm: the maximum absolute row sum of the matrix,
∥A∥_∞ = max_i ∑_{j=1}^{n} |a_{ij}|. (80)
(a) ∥A∥ > 0, if A ≠ 0;
(b) ∥γA∥ = |γ| · ∥A∥, for any scalar γ;
(c) ∥A + B∥ ≤ ∥A∥ + ∥B∥;
(d) ∥AB∥ ≤ ∥A∥ · ∥B∥;
(e) ∥Ax∥ ≤ ∥A∥ · ∥x∥, for any vector x;
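The norm definitions (74)-(80) translate directly into plain Python; the vector and matrix below are arbitrary illustrations.

```python
def norm1(x):
    """1-norm, Eq. (74): sum of absolute values."""
    return sum(abs(xi) for xi in x)

def norm2(x):
    """2-norm, Eq. (75): Euclidean length."""
    return sum(xi * xi for xi in x) ** 0.5

def norm_inf(x):
    """infinity-norm, Eq. (76): largest absolute entry."""
    return max(abs(xi) for xi in x)

def mat_norm_inf(A):
    """Matrix infinity-norm, Eq. (80): maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

x = [3.0, -4.0]
A = [[1.0, -2.0], [3.0, 4.0]]

print(norm1(x), norm2(x), norm_inf(x))   # 7.0 5.0 4.0
print(mat_norm_inf(A))                   # 7.0
```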
4.2.2 Condition number of a vector function
Generally, for arbitrary m, n, namely,
Example:
The inverse of A = ( 1.01 0.99; 0.99 1.01 ) is
A^{−1} = ( a b; c d )^{−1} = (1/(ad − bc)) ( d −b; −c a )
= (1/0.04) ( 1.01 −0.99; −0.99 1.01 ) = ( 25.25 −24.75; −24.75 25.25 ). (86)
∴ ∥A∥_∞ = 1.01 + 0.99 = 2, and ∥A^{−1}∥_∞ = 25.25 + 24.75 = 50. Hence,
the condition number of A is
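A numerical check of this example: ∥A∥_∞ · ∥A^{−1}∥_∞ should come out to 2 · 50 = 100.

```python
# A = [[1.01, 0.99], [0.99, 1.01]] from the example above.
a, b, c, d = 1.01, 0.99, 0.99, 1.01

det = a * d - b * c                       # 1.01^2 - 0.99^2 = 0.04
Ainv = [[d / det, -b / det], [-c / det, a / det]]   # 2x2 inverse, Eq. (86)

norm_A = max(abs(a) + abs(b), abs(c) + abs(d))      # max absolute row sum
norm_Ainv = max(abs(Ainv[0][0]) + abs(Ainv[0][1]),
                abs(Ainv[1][0]) + abs(Ainv[1][1]))

print(norm_A * norm_Ainv)   # ~ 100
```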
i.e.,
H_n = ( 1     1/2      . . .  1/n
        1/2   1/3      . . .  1/(n+1)
        ·     ·        . . .  ·
        1/n   1/(n+1)  . . .  1/(2n−1) ) ∈ R^{n×n}. (88)
This matrix is symmetric and positive definite (i.e., H_n = H_n^T, and x^T H_n x >
0 for all x ≠ 0 in R^n). The condition numbers of Hilbert matrices are shown in
Table 1.
Table 1: The condition number of Hilbert matrices
n    Cond_2 H_n
10   1.60 × 10^13
20   2.45 × 10^28
40   7.65 × 10^58
For n = 10, the system cannot be solved with any reliability in single
precision on a 14-decimal-digit computer and, for n = 20, even double
precision is exhausted. Indeed, it can be shown that
Cond_2 H_n ∼ (√2 + 1)^{4n+4} / (2^{15/4} √(πn)) as n → ∞. (89)
t_ν = 1 − 2(ν − 1)/(n − 1), ν = 1, 2, . . . , n; (91)
then
Cond_∞ V_n ∼ (1/π) e^{−π/4} e^{n(π/4 + (1/2) ln 2)} as n → ∞. (92)
For different n, the condition numbers of the Vandermonde matrices are
shown in Table 2. They do not grow as fast as those of the Hilbert matrix.
Table 2: The condition number of Vandermonde matrices
n    Cond_∞ V_n
10   1.36 × 10^4
20   1.05 × 10^9
40   6.93 × 10^18
80   3.15 × 10^38
Note:
1. The Hilbert matrix is very useful for the least-squares problem in Chapter 2.
2. The Vandermonde matrix is very important for the interpolation problem in Chapter 2.
5 Review questions
1. Review the floating-point representation in a binary computer:
a) write down the definition of the R(t, s) system;
b) write down the definition of normalization and give some reason
why we need do normalization;
c) Derive the largest and smallest magnitudes of a (normalized)
floating-point number in the R(t, s) system;
d) What is the meaning of overflow and underflow? In floating-point
arithmetic, which is generally more harmful, underflow or
overflow? Why?
2. Review the concept of rounding:
a) Explain the difference between the rounding rules "chopping (round
toward zero)" and "symmetric rounding (round to nearest)" in
a floating-point system.
b) Which of these two rounding rules is more accurate?
c) write down the definition of the machine precision and explain
why the machine precision is machine dependent.
3. Review the definition of the model of machine arithmetic and error
propagation:
a) True or false (why? Give an example): If two real numbers are
exactly representable as floating-point numbers, then the result
of an arithmetic operation O(+, −, ×, /) on them will also be
representable as a floating-point number.
b) True or false (give examples): Floating-point addition is associa-
tive but not commutative.
c) In floating-point arithmetic, which of the following operations on
two positive floating-point operands can be benign operations:
• Multiplication
• Division
• Addition
• Subtraction
d) Explain why the cancellation that occurs when two numbers of
similar magnitude are subtracted is often bad even though the
result may be exactly correct for the actual operands involved.
(Hint: some significant digits and information are lost.)
e) True or false (why? Give an example): Computing a small quan-
tity as a difference of large quantities is generally a bad idea.
4. Review the concept of the condition of a problem:
a) What are the definitions of well-conditioned and ill-conditioned
problems?
b) True or false: A problem is ill-conditioned if its solution is highly
sensitive to small changes in the problem data.
c) True or false: Using higher-precision arithmetic will make an
ill-conditioned problem better conditioned.
d) True or false: The conditioning of a problem depends on the al-
gorithm used to solve it.
e) True or false: A good algorithm will produce an accurate solution
regardless of the condition of the problem being solved.
References
[Gautschi(1997)] W. Gautschi, Numerical Analysis: An Introduction, 1st
ed., Birkhäuser Boston, 1997.