Rounding and Machine Arithmetic
Related Matters
x ∈ R iff x = ±(b_n · 2^n + b_{n−1} · 2^{n−1} + . . . + b_0 + b_{−1} · 2^{−1} + b_{−2} · 2^{−2} + . . .), (1)
or
x = ±(b_n b_{n−1} . . . b_0 . b_{−1} b_{−2} . . .)_2, (2)
where n ≥ 0 is some integer and the "binary digits" b_i satisfy
b_i = 0 or b_i = 1, for all i.
Note:
1.2 Machine numbers:
Floating-point representation (recall: scientific notation): a machine number x in the system R(t, s) has the form
x = ±f · 2^e,
where
f : the mantissa of x,
e : the integer exponent of x,
2 : the base,
t : the number of binary digits of the fractional part (t-digit precision),
s : the number of binary digits of the exponent.
2^{−1} ≤ |f| ≤ 2^{−1} + 2^{−2} + . . . + 2^{−t} = 1 − 2^{−t};
2^{−(1+2+...+2^{s−1})} = 2^{−(2^s − 1)} ≤ 2^e ≤ 2^{1+2+...+2^{s−1}} = 2^{(2^s − 1)}.
In order to increase the precision, one can use two machine registers to
represent a machine number. In effect, one then embeds R(t, s) ⊂ R(2t, s)
and calls x ∈ R(2t, s) a double-precision number.
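As an illustration of the bounds above, the following sketch evaluates the extreme magnitudes of a small R(t, s) system; the parameters t = 4 and s = 3 are assumed for illustration only and are not from the text.

```python
# Illustrative sketch of the R(t, s) system described above.
# Assumed parameters (not from the text): t = 4 mantissa bits, s = 3 exponent bits.
t, s = 4, 3

e_max = 2**s - 1             # largest exponent magnitude: 1 + 2 + ... + 2^(s-1) = 2^s - 1
f_min = 2**-1                # smallest normalized mantissa: 1/2
f_max = 1 - 2**-t            # largest mantissa: 2^-1 + 2^-2 + ... + 2^-t

smallest = f_min * 2**-e_max   # smallest positive normalized number
largest = f_max * 2**e_max     # largest representable number

print(smallest)   # 0.5 * 2^-7 = 0.00390625
print(largest)    # 0.9375 * 2^7 = 120.0
```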
2 Rounding
If a real number is too long for the machine register, its tail end is
cut off; if it is too short, it is padded with zeros at the end. More
specifically, let
x ∈ R iff x = ±(∑_{k=1}^{∞} b_{−k} 2^{−k}) 2^e, (7)
be the “exact” real number (in normalized floating-point form)
and
x* ∈ R iff x* = ±(∑_{k=1}^{t} b*_{−k} 2^{−k}) 2^e, (8)
be the rounded number.
There are two commonly used ways of translating a given "exact"
real number x into a t-digit, base-β (for instance, β = 2 in the binary
system) floating-point number fl(x): chopping and symmetric rounding.
Definition:
• Absolute error= |approximate value - true value|
2.1 Chopping
One takes b*_{−k} = b_{−k} for k = 1, . . . , t, i.e., the tail of the expansion is simply discarded; the relative error is then bounded in magnitude by
|x* − x|/|x| ≤ 2 · 2^{−t}. (16)
H.W.1:
1. Prove that (x* − x)/x ≤ 0;
2. Prove that (x* − x)/x ≥ −2 · 2^{−t}.
and
fl(−838) = −(0.84) · 10^3 with symmetric rounding, −(0.83) · 10^3 with chopping. (20)
1 H.W. means that you should work these problems at home; there is no need to submit
them. However, these problems may appear in a quiz or on the midterm/final.
Remark: Set
(rd(x) − x)/x = ϵ, |ϵ| ≤ eps; (24)
then we have
rd(x) = x(1 + ϵ), |ϵ| ≤ eps, (25)
and one defers dealing with the inequality (for ϵ) to the very end.
3. The rounding error: it is widespread, but causes little harm unless
there are some weak spots in the computation.
Assumption:
1. Arithmetic operations are carried out exactly, and the operands x, y are
rounded and represented by
x* = x(1 + ϵ_x), (26)
y* = y(1 + ϵ_y). (27)
3.2.1 Multiplication
fl(x · y) = x* · y* = x(1 + ϵ_x) · y(1 + ϵ_y)
= x · y(1 + ϵ_x + ϵ_y + ϵ_x ϵ_y)
≈ x · y(1 + ϵ_x + ϵ_y), (the term ϵ_x ϵ_y is second order!) (28)
3.2.2 Division
fl(x/y) = x*/y* = x(1 + ϵ_x) / (y(1 + ϵ_y))
= x(1 + ϵ_x)(1 − ϵ_y) / (y(1 − ϵ_y^2))
≈ (x/y)(1 + ϵ_x − ϵ_y − ϵ_x ϵ_y)
≈ (x/y)(1 + ϵ_x − ϵ_y), (the term ϵ_x ϵ_y is second order!) (30)
∴ the relative error is:
ϵ_{x/y} = (fl(x/y) − x/y)/(x/y) = ϵ_x − ϵ_y. (31)
Division is also a benign operation!
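A quick numerical sketch of Eqs. (28)-(31): for operands carrying relative errors ϵ_x and ϵ_y, the product's relative error is ≈ ϵ_x + ϵ_y and the quotient's is ≈ ϵ_x − ϵ_y. The values of x, y, ϵ_x, ϵ_y below are arbitrary illustrations.

```python
# Model of Eqs. (26)-(31): operands with known relative errors eps_x, eps_y.
x, y = 3.0, 7.0
eps_x, eps_y = 1e-8, -2e-8

xs = x * (1 + eps_x)          # rounded operand x*
ys = y * (1 + eps_y)          # rounded operand y*

rel_mul = (xs * ys - x * y) / (x * y)   # ~ eps_x + eps_y (Eq. 28)
rel_div = (xs / ys - x / y) / (x / y)   # ~ eps_x - eps_y (Eq. 31)

print(rel_mul)   # ~ -1e-8
print(rel_div)   # ~  3e-8
```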
fl(x + y) = x* + y* = x(1 + ϵ_x) + y(1 + ϵ_y)
= x + xϵ_x + y + yϵ_y
= x + y + xϵ_x + yϵ_y,
assuming x + y ≠ 0. Therefore,
fl(x + y) = (x + y)(1 + (xϵ_x + yϵ_y)/(x + y))
= (x + y)(1 + (x/(x + y)) ϵ_x + (y/(x + y)) ϵ_y). (32)
∴ the relative error is:
ϵ_{x+y} = (fl(x + y) − (x + y))/(x + y) = (x/(x + y)) ϵ_x + (y/(x + y)) ϵ_y. (33)
Note:
• The error of addition or subtraction in the result is a linear combi-
nation of the errors in the data, but the coefficients are no longer ±1
and can become arbitrarily large.
• Two cases to be discussed in (33):

x     = (1 0 1 1 0 0 1 0 1 b  b  g g g g) · 2^e
y     = (1 0 1 1 0 0 1 0 1 b′ b′ g g g g) · 2^e
x − y = (0 0 0 0 0 0 0 0 0 b″ b″ g g g g) · 2^e
      = (b″ b″ g g g g ? ? ? ? ? ? ? ? ?) · 2^{e−9}

Notation:
1. b, b′ and b″: reliable binary digits
e^{−x} = 1 − x + x^2/2! − x^3/3! + . . . , x > 0,
may give disastrous results because of catastrophic cancellation. (Hint:
use e^{−x} = 1/e^x instead.)
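The hint above can be checked numerically: the following sketch sums the alternating series directly and via e^{−x} = 1/e^x. The choice x = 20 and the 200-term cutoff are illustrative assumptions.

```python
import math

def exp_series(x, nterms=200):
    """Partial sum of the Taylor series for exp(x)."""
    s, term = 1.0, 1.0
    for k in range(1, nterms):
        term *= x / k           # term = x^k / k!
        s += term
    return s

x = 20.0
naive = exp_series(-x)          # alternating series: catastrophic cancellation
stable = 1.0 / exp_series(x)    # all-positive series: benign

print(naive, stable, math.exp(-x))
```

The alternating sum passes through terms as large as 20^20/20! ≈ 4·10^7, so the rounding noise alone swamps the tiny true value e^{−20} ≈ 2·10^{−9}, while the all-positive sum keeps full accuracy.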
Examples:
(1) Algebraic operations [Stewart(1996)]:
On a 5-decimal-digit computer:
The result agrees with the true r1. To calculate r2, if we use
r2 = c/r1,
we will have r2 = 5.6558 · 10^{−4}, which is as accurate as we expect.
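The recipe r2 = c/r1 can be sketched for a generic monic quadratic x^2 + bx + c = 0 with real roots: compute the large-magnitude root from the quadratic formula with the non-cancelling sign, then the small root from Vieta's relation r1 · r2 = c. The coefficients below are hypothetical, not the text's actual example.

```python
import math

def roots_stable(b, c):
    """Real roots of x^2 + b*x + c = 0, avoiding cancellation in the small root."""
    d = math.sqrt(b * b - 4.0 * c)       # assumes b*b >= 4*c (real roots)
    # Choose the sign so that -b and +/-d add instead of cancel:
    r1 = (-b - d) / 2.0 if b >= 0.0 else (-b + d) / 2.0   # large-magnitude root
    r2 = c / r1                                           # Vieta: r1 * r2 = c
    return r1, r2

r1, r2 = roots_stable(-1e8, 1.0)    # roots ~ 1e8 and ~ 1e-8
print(r1, r2)
```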
(3) Compute y = √(x + δ) − √x, x > 0 and |δ| ≪ 1.
Writing instead
y = δ/(√(x + δ) + √x)
completely removes the cancellation error!
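A sketch comparing the two forms of example (3); the values x = 10^8 and δ = 10^{−4} are illustrative choices.

```python
import math

x, delta = 1.0e8, 1.0e-4     # illustrative values with |delta| << x

naive = math.sqrt(x + delta) - math.sqrt(x)        # cancellation: few correct digits
rewritten = delta / (math.sqrt(x + delta) + math.sqrt(x))

print(naive, rewritten)      # both ~ 5e-9, but only `rewritten` is fully accurate
```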
(4) Compute y = cos(x + δ) − cos(x), |δ| ≪ 1.
Write y in the equivalent form [Gautschi(1997)]
y = −2 sin(δ/2) sin(x + δ/2).
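A sketch of example (4); the values x = 1.2 and δ = 10^{−9} are illustrative.

```python
import math

x, delta = 1.2, 1.0e-9       # illustrative values

naive = math.cos(x + delta) - math.cos(x)                   # cancellation
identity = -2.0 * math.sin(delta / 2.0) * math.sin(x + delta / 2.0)

print(naive, identity)       # the identity form keeps full accuracy
```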
(5) Compute y = f(x + δ) − f(x), |δ| ≪ 1, where f is a function.
If f is smooth enough around x, we have [Gautschi(1997), Heath(1997)]:
y = f′(x)δ + (1/2) f″(x)δ^2 + . . . . (36)
The terms in this series decrease rapidly, so the cancellation is no longer
a problem.
Consider the finite-difference approximation to the first derivative,
f′(x) ≈ (f(x + δ) − f(x))/δ. (37)
From Eq. (36), there is a truncation error E_T = (1/2)|f″(x′)|δ in
Eq. (37), and with smaller δ the truncation error becomes smaller. But it
is also known that, for f(x + δ) − f(x), there is some cancellation
error E_c, and to reduce the cancellation error one needs a larger δ. The total
error in computing the first derivative by Eq. (37) is
E = E_T + E_c. (38)
Thus, there is a trade-off between truncation error and rounding error in
choosing the size of δ. How should one choose the optimal δ?
Assume that, for all x, both |f(x)| and |f″(x)| are bounded by M, i.e.,
|f(x)| ≤ M, |f″(x)| ≤ M. (39)
Then E_T ≤ (1/2)Mδ. The cancellation error E_c can be estimated as
E_c = |f′(x) − f′(x)(1 + ϵ)| (40)
= |f′(x)| ϵ (41)
= |(f(x + δ) − f(x))/δ| ϵ (42)
≤ ((|f(x + δ)| + |f(x)|)/δ) ϵ (43)
≤ 2Mϵ/δ. (44)
The total error can therefore be bounded by (1/2)Mδ + 2Mϵ/δ, which is minimized when
δ = 2√ϵ. (45)
If we also assume that the function values are accurate to machine
precision, we have
δ ≈ 2√eps. (46)
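The trade-off of Eqs. (37)-(46) can be observed directly; the sketch below uses f = exp at x = 1 as an illustrative choice (exact derivative e) and probes three regimes of δ.

```python
import math

# f = exp at x = 1 (illustrative choice); the exact derivative is e.
f, fprime, x = math.exp, math.exp, 1.0

def fd_error(delta):
    """Absolute error of the forward difference (f(x+delta) - f(x)) / delta."""
    return abs((f(x + delta) - f(x)) / delta - fprime(x))

err_big = fd_error(1e-2)     # truncation-dominated regime
err_opt = fd_error(2e-8)     # near the optimum delta ~ 2*sqrt(eps) ~ 3e-8
err_tiny = fd_error(1e-14)   # rounding/cancellation-dominated regime

print(err_big, err_opt, err_tiny)
```

The error first decreases with δ (truncation shrinks), then increases again as the subtraction f(x+δ) − f(x) loses digits, in line with Eq. (46).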
4 The Condition of a Problem
A problem, viewed as a black box, typically has an input and an output, as
shown in Fig. 2: the box P accepts some input x and then solves the problem
for this input to produce the output y.
f : R^m → R^n, y = f(x)
For this problem, if the input coefficients a_i, i = 0, . . . , 4, are rounded to six
digits, the polynomial becomes
Then one finds p(10.11) = 0.0015, while pe(10.11) = −0.0481. The er-
roneous value is more than 30 times larger in magnitude than the correct 0.0015. The
present problem is therefore very sensitive to small changes in the
inputs. How can one quantify and measure this sensitivity?
Ex. 2 [Vandenberghe and Boyd(2011)]:
Problem (P): solve the linear algebraic equation Ax = b, where
A = (1/2) ( 1            1
            1 + 10^{−10}   1 − 10^{−10} )
and
b = ( 1
      1 ).
Input (x): A, b;
Output (y): x.
Therefore, a small △b can lead to an extremely large △x. How can one quantify
and measure this sensitivity?
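A sketch of this example in plain Python, solving the 2×2 system by Cramer's rule with the matrix as reconstructed above; the perturbation size 10^{−8} in b is an illustrative choice.

```python
# A = (1/2) * [[1, 1], [1 + eps, 1 - eps]] is nearly singular for eps = 1e-10,
# so a tiny change in b produces a huge change in the solution x.
eps = 1e-10
a11, a12 = 0.5, 0.5
a21, a22 = 0.5 * (1 + eps), 0.5 * (1 - eps)

def solve(b1, b2):
    """Cramer's rule for the 2x2 system A x = b."""
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

x = solve(1.0, 1.0)              # unperturbed: x = (1, 1)
x_pert = solve(1.0, 1.0 + 1e-8)  # tiny change in b

print(x)        # ~ (1, 1)
print(x_pert)   # ~ (101, -99): a 1e-8 change in b moved x by ~100
```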
3. x ≠ 0, y ≠ 0
Problem:
x* = x + △x,
y* = f(x*),
△y = y* − y = f(x*) − f(x) = f(x + △x) − f(x) ≈ f′(x)△x.
Comparing the relative changes △y/y and △x/x: how sensitive is y to x at some point x?
Condition number:
∴ (Cond f)(x) := lim_{△x→0} (△y/y)/(△x/x) = x f′(x)/f(x). (48)
Remark:
1. If x = 0 and y ≠ 0, then the condition number can be given as
(Cond f)(x) := lim_{△x→0} (△y/y)/△x = f′(x)/f(x). (49)
3. If x = 0 and y = 0, then the condition number will simply be
(Cond f)(x) := lim_{△x→0} △y/△x = f′(x). (51)
Note that:
Example:
Solution:
I_n = ∫_0^1 t^n/(t + 5) dt
= ∫_0^1 t^{n−1} · t/(t + 5) dt
= ∫_0^1 t^{n−1} (1 − 5/(t + 5)) dt
= ∫_0^1 t^{n−1} dt − 5 ∫_0^1 t^{n−1}/(t + 5) dt
= [t^n/n]_0^1 − 5 I_{n−1}
= 1/n − 5 I_{n−1}. (52)
12
Therefore, by unrolling this recurrence, we have:
I_n = −5 I_{n−1} + 1/n, (53)
−5 I_{n−1} = (−5)^2 I_{n−2} + (−5) · 1/(n − 1), (54)
⋮ (55)
(−5)^{n−1} I_1 = (−5)^n I_0 + (−5)^{n−1}. (56)
Summing these equations yields
I_n = (−5)^n I_0 + p_n,
where p_n is some number independent of I_0 (= ∫_0^1 t^0/(t + 5) dt = ∫_0^1 1/(t + 5) dt = ln(6/5)).
In this problem:
Because t ∈ (0, 1), I_n decreases monotonically in n, so
(Cond f)(I_{n+1}) = I_{n+1}/(5 I_n) < I_n/(5 I_n) = 1/5 < 1. (61)
∴ recursing from I_{n+1} to I_n is well-conditioned.
This tells us that, by reversing the recurrence, i.e., running it from ν down to
n with ν > n, the problem may become well-conditioned. By unrolling
Eq. (59), we have:
I_n = (−1/5) I_{n+1} + (1/5) · 1/(n + 1), (62)
(−1/5) I_{n+1} = (−1/5)^2 I_{n+2} + (−1/5)(1/5) · 1/(n + 2), (63)
⋮ (64)
(−1/5)^{ν−n−1} I_{ν−1} = (−1/5)^{ν−n} I_ν + (−1/5)^{ν−n−1} (1/5)(1/ν). (65)
Summation of these equations yields:
I_n = (−1/5)^{ν−n} I_ν + p̂_n, (66)
where p̂_n is some number independent of I_ν.
In this problem:
(Cond f)(I_ν) = I_ν (1/5)^{ν−n}/I_n < I_n (1/5)^{ν−n}/I_n = (1/5)^{ν−n}. (68)
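The two directions can be contrasted numerically. The sketch below runs the forward recursion I_n = 1/n − 5 I_{n−1} with a deliberately injected 10^{−10} error in I_0 (mimicking its rounding), and the backward recursion I_n = (1/(n+1) − I_{n+1})/5 from the crude guess I_40 ≈ 0; the target index 20 and starting index 40 are illustrative choices.

```python
import math

I0 = math.log(6.0 / 5.0)          # I_0 = ln(6/5)

# Forward recursion I_n = 1/n - 5*I_{n-1}: an initial error is amplified by 5^n.
# We inject a tiny perturbation 1e-10 into I_0 to make the amplification visible.
I = I0 + 1e-10
for n in range(1, 21):
    I = 1.0 / n - 5.0 * I
forward_I20 = I                   # off by roughly 5**20 * 1e-10 ~ 1e4

# Backward recursion I_n = (1/(n+1) - I_{n+1})/5, started from the crude guess
# I_40 ~ 0: the starting error is damped by (1/5)**20.
I = 0.0
for n in range(39, 19, -1):
    I = (1.0 / (n + 1) - I) / 5.0
backward_I20 = I                  # accurate: I_20 ~ 0.008

print(forward_I20, backward_I20)
```

Since 1/(6(n+1)) < I_n < 1/(5(n+1)), the backward result lands in the valid bracket for I_20 while the forward result is off by orders of magnitude.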
2. Calculate the condition number of f(x) = √x, x ≥ 0. [Conte and de Boor(1980)]
∴ f′(x) = 1/(2√x). Hence, the condition number of f is
(Cond f)(x) = |x f′(x)/f(x)| = x · (1/(2√x))/√x = 1/2, (69)
which indicates that taking a square root is a well-conditioned process.
3. Calculate the condition number of f(x) = √(x + 1) − √x, x ≥ 0. [Conte and de Boor(1980)]
∴ f′(x) = (1/2)(1/√(x + 1) − 1/√x). Hence, the condition number of f is
(Cond f)(x) = |x f′(x)/f(x)|
= |x · (1/2)(1/√(x + 1) − 1/√x)/(√(x + 1) − √x)|
= (1/2) · x/(√(x + 1) √x)
= (1/2) √(x/(x + 1)). (70)
∥x∥_p := (∑_{i=1}^{n} |x_i|^p)^{1/p}. (73)
• 1-norm:
∥x∥_1 = ∑_{i=1}^{n} |x_i|. (74)
• 2-norm:
∥x∥_2 = (∑_{i=1}^{n} |x_i|^2)^{1/2}. (75)
• ∞-norm:
∥x∥_∞ = max_{1≤i≤n} |x_i|. (76)
(a) ∥x∥ > 0, if x ≠ 0;
(b) ∥γx∥ = |γ| · ∥x∥, for any scalar γ;
(c) ∥x + y∥ ≤ ∥x∥ + ∥y∥;
(d) ∥x − y∥ ≥ ∥x∥ − ∥y∥;
2. Matrix Norms
∥A∥ := max_{x≠0} ∥Ax∥/∥x∥. (77)
Remark (important special cases):
• 2-norm:
∥A∥_2 = max_{x≠0} ∥Ax∥_2/∥x∥_2 = max{√λ : λ is an eigenvalue of A^T A}. (79)
• ∞-norm: the maximum absolute row sum of the matrix,
∥A∥_∞ = max_i ∑_{j=1}^{n} |a_{ij}|. (80)
(a) ∥A∥ > 0, if A ≠ 0;
(b) ∥γA∥ = |γ| · ∥A∥, for any scalar γ;
(c) ∥A + B∥ ≤ ∥A∥ + ∥B∥;
(d) ∥AB∥ ≤ ∥A∥ · ∥B∥;
(e) ∥Ax∥ ≤ ∥A∥ · ∥x∥, for any vector x;
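The norm definitions (74)-(80) translate directly into plain Python; the vector and matrix below are arbitrary illustrations.

```python
def norm1(x):
    """1-norm, Eq. (74): sum of absolute values."""
    return sum(abs(xi) for xi in x)

def norm2(x):
    """2-norm, Eq. (75): Euclidean length."""
    return sum(xi * xi for xi in x) ** 0.5

def norm_inf(x):
    """infinity-norm, Eq. (76): largest absolute entry."""
    return max(abs(xi) for xi in x)

def mat_norm_inf(A):
    """Matrix infinity-norm, Eq. (80): maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

x = [3.0, -4.0]
A = [[1.0, -2.0], [3.0, 4.0]]

print(norm1(x), norm2(x), norm_inf(x))   # 7.0 5.0 4.0
print(mat_norm_inf(A))                   # 7.0
```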
4.2.2 Condition number of a vector function
Generally, for arbitrary m, n, namely,
Example:
The inverse of A = ( 1.01 0.99; 0.99 1.01 ) is
A^{−1} = ( a b; c d )^{−1} = (1/(ad − bc)) ( d −b; −c a )
= (1/0.04) ( 1.01 −0.99; −0.99 1.01 ) = ( 25.25 −24.75; −24.75 25.25 ). (86)
∴ ∥A∥_∞ = 1.01 + 0.99 = 2, and ∥A^{−1}∥_∞ = 25.25 + 24.75 = 50. Hence,
the condition number of A is
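A numerical check of this example: ∥A∥_∞ · ∥A^{−1}∥_∞ should come out to 2 · 50 = 100.

```python
# A = [[1.01, 0.99], [0.99, 1.01]] from the example above.
a, b, c, d = 1.01, 0.99, 0.99, 1.01

det = a * d - b * c                       # 1.01^2 - 0.99^2 = 0.04
Ainv = [[d / det, -b / det], [-c / det, a / det]]   # 2x2 inverse, Eq. (86)

norm_A = max(abs(a) + abs(b), abs(c) + abs(d))      # max absolute row sum
norm_Ainv = max(abs(Ainv[0][0]) + abs(Ainv[0][1]),
                abs(Ainv[1][0]) + abs(Ainv[1][1]))

print(norm_A * norm_Ainv)   # ~ 100
```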
i.e.,
H_n = ( 1     1/2      . . .  1/n
        1/2   1/3      . . .  1/(n+1)
        ·     ·        . . .  ·
        1/n   1/(n+1)  . . .  1/(2n−1) ) ∈ R^{n×n}. (88)
This matrix is symmetric and positive definite (i.e., H_n = H_n^T, and x^T H_n x >
0 for all x ≠ 0 in R^n). The condition numbers of Hilbert matrices are shown in
Table 1.
Table 1: The condition number of Hilbert matrices
n    Cond_2 H_n
10   1.60 × 10^13
20   2.45 × 10^28
40   7.65 × 10^58
For n = 10, the system cannot be solved with any reliability in single
precision on a 14-decimal-digit computer and, for n = 20, even double
precision is exhausted. Indeed, it can be shown that
Cond_2 H_n ∼ (√2 + 1)^{4n+4} / (2^{15/4} √(πn)) as n → ∞. (89)
t_ν = 1 − 2(ν − 1)/(n − 1), ν = 1, 2, . . . , n; (91)
then
Cond_∞ V_n ∼ (1/π) e^{−π/4} e^{n(π/4 + (1/2) ln 2)} as n → ∞. (92)
For different n, the condition numbers of the Vandermonde matrices are
shown in Table 2. They do not grow as fast as those of the Hilbert matrix.
Table 2: The condition number of Vandermonde matrices
n    Cond_∞ V_n
10   1.36 × 10^4
20   1.05 × 10^9
40   6.93 × 10^18
80   3.15 × 10^38
Note:
1. The Hilbert matrix is very useful for the least-squares problem in Chapter 2.
2. The Vandermonde matrix is very important for the interpolation problem in Chapter 2.
5 Review questions
1. Review the floating-point representation in a binary computer:
a) write down the definition of the R(t, s) system;
b) write down the definition of normalization and give some reason
why we need do normalization;
c) Derive the largest and smallest magnitudes of a (normalized)
floating-point number in the R(t, s) system;
d) What is the meaning of overflow and underflow? In floating-point
arithmetic, which is generally more harmful, underflow or
overflow? Why?
2. Review the concept of rounding:
a) Explain the difference between the rounding rules "chopping (round
toward zero)" and "symmetric rounding (round to nearest)" in
a floating-point system.
b) Which of these two rounding rules is more accurate?
c) write down the definition of the machine precision and explain
why the machine precision is machine dependent.
3. Review the definition of the model of machine arithmetic and error
propagation:
a) True or false (why? Give an example): If two real numbers are
exactly representable as floating-point numbers, then the result
of an arithmetic operation O(+, −, ×, /) on them will also be
representable as a floating-point number.
b) True or false (give examples): Floating-point addition is associa-
tive but not commutative.
c) In floating-point arithmetic, which of the following operations on
two positive floating-point operands can be benign operations:
• Multiplication
• Division
• Addition
• Subtraction
d) Explain why the cancellation that occurs when two numbers of
similar magnitude are subtracted is often bad even though the
result may be exactly correct for the actual operands involved.
(Hint: some significant digits and information are lost.)
e) True or false (why? Give an example): Computing a small quan-
tity as a difference of large quantities is generally a bad idea.
4. Review the concept of the condition of a problem:
a) What are the definitions of well-conditioned and ill-conditioned
problems?
b) True or false: A problem is ill-conditioned if its solution is highly
sensitive to small changes in the problem data.
c) True or false: Using higher-precision arithmetic will make an
ill-conditioned problem better conditioned.
d) True or false: The conditioning of a problem depends on the al-
gorithm used to solve it.
e) True or false: A good algorithm will produce an accurate solution
regardless of the condition of the problem being solved.
References
[Gautschi(1997)] W. Gautschi, Numerical Analysis: An Introduction, 1st
ed., Birkhäuser Boston, 1997.