
Chapter 1 Machine Arithmetic and Related Matters

1 Real Numbers and Machine Numbers


1.1 Real numbers (note: the "exact" numbers)
The binary number system:

x ∈ R iff x = ±(b_n·2^n + b_{n−1}·2^{n−1} + … + b_0 + b_{−1}·2^{−1} + b_{−2}·2^{−2} + …),  (1)

or

x = ±(b_n b_{n−1} … b_0 . b_{−1} b_{−2} …)_2,  (2)

where n ≥ 0 is some integer and the "binary digits" b_i satisfy

b_i = 0 or b_i = 1, for all i.

Note:

• In general, representing a real number may require infinitely many binary digits;

• The representation in Eq. (2) is not unique; for instance, (.0111…)_2 = (.1)_2;

• If we always insist on a finite representation, whenever one exists, we regain uniqueness, e.g., (10011.01)_2 = (19.25)_{10};

• A finite decimal number may correspond to an infinite binary representation, e.g., (1/5)_{10} = (.0011 0011 0011 …)_2.

Algorithm 1: Algorithm for determining the fraction digits

Input: x ∈ (0, 1), integer β ≥ 2, and c_0 = x;
i = 0;
while c_i ≠ 0 .and. i ≤ Imax do
    i = i + 1;
    b_i = (β c_{i−1})_I ;  (integer part)
    c_i = (β c_{i−1})_F ;  (fractional part)
end while
Output: x ≈ ( . b_1 b_2 b_3 … b_i )_β = ∑_{k=1}^{i} b_k β^{−k}.
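A minimal Python sketch of Algorithm 1 (ours, not part of the original notes; the function name and the digit cap imax are our choices), taking (β c)_I and (β c)_F with math.floor and subtraction:

import math

def fraction_digits(x, beta=2, imax=32):
    # Expand x in (0, 1) into base-beta fraction digits (Algorithm 1).
    assert 0 < x < 1 and beta >= 2
    digits = []
    c, i = x, 0                    # c_0 = x
    while c != 0 and i < imax:     # stop on exact termination or after imax digits
        i += 1
        b = math.floor(beta * c)   # b_i = integer part of beta * c_{i-1}
        c = beta * c - b           # c_i = fractional part of beta * c_{i-1}
        digits.append(b)
    return digits

print(fraction_digits(0.25))           # [0, 1] : (0.25)_10 = (.01)_2 terminates
print(fraction_digits(0.2, imax=12))   # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]

Note that the second call actually expands the double-precision number nearest to 1/5, whose leading digits agree with the infinite expansion (.0011 0011 …)_2.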

1.2 Machine numbers
Floating-point representation (recall: scientific notation). Let
t : the number of binary digits of the fractional part,
s : the number of binary digits in the exponent.

x ∈ R(t, s) iff x = f · 2^e,  (3)

where

f = ±( . b_{−1} b_{−2} … b_{−t} )_2 and e = ±( c_{s−1} c_{s−2} … c_0 )_2.  (4)

f : the mantissa of x,
e : the exponent of x,
2 : the base,
t : the precision (t digits).

Normalized: the representation is normalized if, in the fraction f, we have b_{−1} = 1.

(Why normalize? Hint: 1. uniqueness of the representation; 2. to keep as many significant digits in the mantissa as possible within the t bits.)

Figure 1: Packing of a floating point number in a register.

Note that (since b_i = 0 or b_i = 1 for all i):

1/2 ≤ |f| ≤ (2^{−1} + 2^{−2} + … + 2^{−t}) = 1 − 2^{−t};

2^{−(1+2+…+2^{s−1})} = 2^{−(2^s−1)} ≤ 2^e ≤ 2^{1+2+…+2^{s−1}} = 2^{2^s−1}.

The largest magnitude of a (normalized) floating-point number:

max_{x∈R(t,s)} |x| = (1 − 2^{−t}) · 2^{2^s−1}.  (5)

The smallest magnitude of a (normalized) floating-point number:

min_{x∈R(t,s)} |x| = 2^{−1} · 2^{−(2^s−1)} = 2^{−2^s}.  (6)

Overflow: occurs for a real number whose modulus is not in the range determined by (5) and (6) because it is larger than max_{x∈R(t,s)} |x|;

Underflow: occurs for a real number whose modulus is not in the range determined by (5) and (6) because it is smaller than min_{x∈R(t,s)} |x|.

In order to increase the precision, one can use two machine registers to represent a machine number. In effect, one then embeds R(t, s) ⊂ R(2t, s) and calls x ∈ R(2t, s) a double-precision number.

2 Rounding
When a real number is placed in a machine register, if it is too long its tail end is cut off, and if it is too short it is padded with zeros at the end. More specifically, let

x ∈ R, x = ±( ∑_{k=1}^{∞} b_{−k} 2^{−k} ) · 2^e,  (7)

be the "exact" real number (in normalized floating-point form), and

x* = ±( ∑_{k=1}^{t} b*_{−k} 2^{−k} ) · 2^e,  (8)

be the rounded number.
There are two commonly used ways of translating a given "exact" real number x into a t-digit base-β (for instance, β = 2 in the binary system) floating-point number fl(x): chopping and symmetric rounding.

Definition:

• absolute error = |approximate value − true value|;

• relative error = (approximate value − true value) / true value.

2.1 Chopping
One takes

x* = chop(x): e* = e and b*_{−k} = b_{−k} for k = 1, 2, …, t.  (9)

Since chop(x) is the next floating-point number from x towards zero, chopping is also called rounding towards zero.

Error incurred in chopping:

|x* − x| = |±( ∑_{k=t+1}^{∞} b_{−k} 2^{−k} )| · 2^e  (10)
         ≤ ( ∑_{k=t+1}^{∞} 2^{−k} ) · 2^e  (∵ b_{−k} = 1 in the worst case)  (11)
         = 2^{−(t+1)} · 2^e · (1 + 2^{−1} + 2^{−2} + …)  (12)
         = 2^e · 2^{−(t+1)} · 1/(1 − 1/2) = 2^e · 2^{−t}.  (13)

The absolute error depends on e (i.e., on the magnitude of x). We may prefer the relative error:

|(x* − x)/x| ≤ (2^e · 2^{−t}) / ( |±( ∑_{k=1}^{∞} b_{−k} 2^{−k} )| · 2^e )  (14)
            ≤ (2^e · 2^{−t}) / ( (1/2) · 2^e )  (∵ b_{−1} = 1 and, in the worst case, b_{−k} = 0 for all k ≥ 2)  (15)
            = 2 · 2^{−t}.  (16)

H.W:¹
1. Prove: (x* − x)/x ≤ 0;
2. Prove: (x* − x)/x ≥ −2 · 2^{−t}.

The number 2^{−t} is an important, machine-dependent quantity called the machine precision,

eps = 2^{−t},  (17)

which determines the level of precision of any large-scale floating-point computation. On the SUN SPARC workstation, where t = 23, we have eps ≈ 1.19 × 10^{−7}, corresponding to a precision of 6 or 7 significant decimal digits.
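As a quick check (our own illustration, assuming IEEE double-precision arithmetic, where t = 52 fraction bits), one can recover eps = 2^{−t} by halving until 1 + eps/2 is no longer distinguishable from 1:

eps = 1.0
while 1.0 + eps / 2 > 1.0:    # halve until 1 + eps/2 rounds back to 1
    eps /= 2
print(eps)                    # 2.220446049250313e-16 = 2**(-52)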

2.2 Symmetric Rounding

In binary arithmetic, if b_{−t−1} = 1 the number is rounded up, and if b_{−t−1} = 0 it is rounded down, i.e.,

x* = rd(x), rd(x) = chop(x + (1/2) · 2^{−t} · 2^e).  (18)

The procedure can be illustrated as:

       x →  ( . b_{−1} b_{−2} … b_{−t} | b_{−t−1} … )_2 · 2^e
         +  ( . 0      0      … 0      | 1          )_2 · 2^e
       chop ( . b′_{−1} b′_{−2} … b′_{−t} | b′_{−t−1} … )_2 · 2^e

rd(x) is the nearest floating-point number to x; in case of a tie, we use the floating-point number whose last stored digit is even. Because of the latter property, this rule is also sometimes called round to even.

Example: if two-decimal-digit floating-point numbers are used, then

fl(2/3) = (0.67)·10^0 (symmetric rounding) or (0.66)·10^0 (chopping),  (19)

and

fl(−838) = −(0.84)·10^3 (symmetric rounding) or −(0.83)·10^3 (chopping).  (20)

Error incurred in symmetric rounding:

|(rd(x) − x)/x| ≤ 2^{−t}.  (21)

H.W: Prove (21).
Note that:

if b_{−t−1} = 0, then rd(x) = chop(x);  (22)

if b_{−t−1} = 1, then rd(x) = chop(x) + 2^{−t} · 2^e for x ≥ 0, and rd(x) = chop(x) − 2^{−t} · 2^e for x < 0.  (23)

¹ H.W. means that you should work these problems at home; there is no need to submit them. However, these problems may appear in a quiz or in the midterm/final.

Remark: Set

(rd(x) − x)/x = ϵ, |ϵ| ≤ eps;  (24)

then we have

rd(x) = x(1 + ϵ), |ϵ| ≤ eps,  (25)

and one defers dealing with the inequality (for ϵ) to the very end.

3 Error propagation in arithmetic operations

3.1 Model of Machine Arithmetic
Each arithmetic operation O (= +, −, ×, /) may produce a result that is no longer representable on the computer. However, barring overflow and underflow, we may assume a model of machine arithmetic in which each arithmetic operation produces a correctly rounded result.

If x, y ∈ R(t, s) are floating-point numbers, then

fl(x O y) = (x O y)(1 + ϵ), |ϵ| ≤ eps,

denotes the machine-produced result of the arithmetic operation x O y.

Note:
x O y : the exact result of the arithmetic operation O;
(1 + ϵ) : the imperfect execution, ϵ being a small perturbation introduced by the arithmetic operation O.
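A quick Python illustration of this model (ours, assuming IEEE doubles): each operation returns the exact result perturbed by a relative error of at most eps.

x, y = 0.1, 0.2
s = x + y                   # fl(x + y)
print(s == 0.3)             # False: operands and result are each rounded
print(abs(s - 0.3) / 0.3)   # ~ 1.85e-16, i.e., on the order of eps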

3.2 Error Propagation and Cancellation

Note:
1. A single arithmetic operation introduces a small error that can usually be neglected;

2. A succession of arithmetic operations may result in a significant error, owing to error propagation;

3. Rounding errors are widespread, but cause little harm unless there are weak spots in the computation.

Assumption:
1. Arithmetic operations are carried out exactly, and the operands x, y are rounded and represented by

x* = x(1 + ϵ_x),  (26)
y* = y(1 + ϵ_y).  (27)

2. ϵ_x, ϵ_y are so small (|ϵ_x|, |ϵ_y| ≤ eps) that quantities of second order (ϵ_x^2, ϵ_x ϵ_y, ϵ_y^2) and higher can be neglected, i.e., |ϵ_x^2|, |ϵ_x ϵ_y|, |ϵ_y^2| ≪ eps.

Problem: what is the relative error in the operations x · y, x/y, x + y and x − y?

3.2.1 Multiplication

fl(x · y) = x* · y* = x(1 + ϵ_x) · y(1 + ϵ_y)
          = x · y (1 + ϵ_x + ϵ_y + ϵ_x ϵ_y)
          ≈ x · y (1 + ϵ_x + ϵ_y).  (the term ϵ_x ϵ_y is of second order!)  (28)

∴ the relative error in the product is

ϵ_{x·y} = (fl(x · y) − x · y) / (x · y) = ϵ_x + ϵ_y.  (29)

Multiplication is a benign operation!

3.2.2 Division

fl(x/y) = x*/y* = x(1 + ϵ_x) / (y(1 + ϵ_y))
        = x(1 + ϵ_x)(1 − ϵ_y) / (y(1 − ϵ_y^2))
        ≈ (x/y)(1 + ϵ_x − ϵ_y − ϵ_x ϵ_y)
        ≈ (x/y)(1 + ϵ_x − ϵ_y).  (the term ϵ_x ϵ_y is of second order!)  (30)

∴ the relative error is

ϵ_{x/y} = (fl(x/y) − x/y) / (x/y) = ϵ_x − ϵ_y.  (31)

Division is also a benign operation!

3.2.3 Addition and Subtraction

fl(x + y) = x* + y* = x(1 + ϵ_x) + y(1 + ϵ_y)
          = x + x ϵ_x + y + y ϵ_y
          = x + y + x ϵ_x + y ϵ_y,

assuming x + y ≠ 0. Therefore,

fl(x + y) = (x + y)(1 + (x ϵ_x + y ϵ_y)/(x + y))
          = (x + y)(1 + (x/(x + y)) ϵ_x + (y/(x + y)) ϵ_y).  (32)

∴ the relative error is

ϵ_{x+y} = (fl(x + y) − (x + y)) / (x + y) = (x/(x + y)) ϵ_x + (y/(x + y)) ϵ_y.  (33)

Note:
• The error of addition or subtraction is a linear combination of the errors in the data, but the coefficients are no longer ±1; they can be arbitrarily large.

• Two cases are to be discussed in (33):

1. If x · y > 0, the weights in (33) satisfy

0 < x/(x + y) < 1 and 0 < y/(x + y) < 1,  (34)

∴ |ϵ_{x+y}| ≤ |ϵ_x| + |ϵ_y|.

So, like multiplication and division, addition of numbers of like sign is a benign operation.

2. If x · y < 0, the weights in (33) can be arbitrarily large in absolute value, since |x + y| can be arbitrarily small compared to |x| and |y|, in particular if |x| ≈ |y|. The magnification of error then occurring in (32) is referred to as cancellation error (the only weakness, the Achilles heel, of floating-point arithmetic).

• Cancellation error may appear not only in a single devastating amount, but also repeatedly over a long period of time involving "small doses" of cancellation!

Illustration of the cancellation phenomenon [Gautschi(1997)]:

x     = 1 0 1 1 0 0 1 0 1 b  b  g g g g   (exponent e)
y     = 1 0 1 1 0 0 1 0 1 b′ b′ g g g g   (exponent e)
x − y = 0 0 0 0 0 0 0 0 0 b″ b″ g g g g   (exponent e)
      = b″ b″ g g g g ? ? ? ? ? ? ? ? ?   (exponent e − 9)

Notation:
1. b, b′ and b″ : reliable binary digits;
2. g : binary digits contaminated by error (also called "garbage" digits);
3. ? : unknown digits shifted in when the result is renormalized.

The digits lost to cancellation are the most significant, leading digits, whereas the digits lost in rounding are the least significant, trailing digits. Because of this effect, computing a small quantity as a difference of large quantities is generally a bad idea, for rounding error is likely to dominate the result. For example, summing an alternating series, such as [Conte and de Boor(1980)]

e^{−x} = 1 − x + x^2/2! − x^3/3! + …, x > 0,

may give disastrous results because of catastrophic cancellation. (Hint: use e^{−x} = 1/e^x instead.)
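A short Python experiment (ours, illustrating the hint) comparing the alternating series with the reformulation 1/e^x at a moderately large argument:

import math

def exp_neg_series(x, terms=200):
    # Sum 1 - x + x^2/2! - x^3/3! + ... directly (prone to cancellation).
    s, term = 0.0, 1.0
    for k in range(terms):
        s += term
        term *= -x / (k + 1)    # next term: (-x)^(k+1) / (k+1)!
    return s

x = 30.0
print(exp_neg_series(x))   # rounding noise of order 1e-5: useless here
print(1.0 / math.exp(x))   # ~ 9.36e-14, the reliable value
print(math.exp(-x))        # reference

The intermediate terms reach about 8 · 10^{11} in magnitude, so even tiny relative rounding errors swamp the 10^{−13}-sized answer.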

Examples:

(1) Algebraic operations [Stewart(1996)]:
On a 5-decimal-digit computer, let

a = 37654, b = 25.874 and c = 37679.

Using symmetric rounding to evaluate (a + b − c), we have

fl(a + b − c) = fl(fl(a + b) − c),
fl(a + b) = 37680,
fl(a + b − c) = 1.0000,

instead of the true result

a + b − c = 37654 + 25.874 − 37679 = 37679.874 − 37679 = 0.87400.

If we change the sequence of operations, i.e.,

fl(a − c + b) = fl(fl(a − c) + b),
fl(a − c) = −25.000,
fl(a − c + b) = 0.87400.
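The reordering effect is easy to reproduce with a 5-digit decimal context in Python (our sketch of Stewart's example):

from decimal import Decimal, Context, ROUND_HALF_EVEN

c5 = Context(prec=5, rounding=ROUND_HALF_EVEN)   # a 5-decimal-digit "computer"
a, b, c = Decimal("37654"), Decimal("25.874"), Decimal("37679")

print(c5.subtract(c5.add(a, b), c))   # 1      : fl(fl(a+b) - c), i.e., 1.0000
print(c5.add(c5.subtract(a, c), b))   # 0.874  : fl(fl(a-c) + b), the true 0.87400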

(2) Quadratic equation [Stewart(1996)]: find the roots of x^2 − bx + c = 0 in five-digit arithmetic, with

b = 3.6778 and c = 0.0020798.

With the root formula

r = (b ± √(b^2 − 4c)) / 2,

the roots should be

r_1 = 3.67723441190… and r_2 = 0.00056558890….  (35)

On a computer with 5 decimal digits, the sequence of operations used in calculating the smaller root is as follows:

1. b^2 : 1.3526 · 10^{+1}  (true value: 1.352621284 · 10^{+1})

2. 4c : 8.3192 · 10^{−3}  (true value: 8.3192 · 10^{−3})

3. b^2 − 4c : 1.3518 · 10^{+1}  (true value: 1.351789364 · 10^{+1})

4. √(b^2 − 4c) : 3.6767 · 10^{+0}  (true value: 3.67666882313208 · 10^{+0})

5. b − √(b^2 − 4c) : 1.1000 · 10^{−3}  (true value: 1.131176867918 · 10^{−3})

6. (b − √(b^2 − 4c))/2 : 5.5000 · 10^{−4}  (true value: 5.655880933958864 · 10^{−4})

The computed value 5.5000 · 10^{−4} differs from the true value 0.00056558890… already in the second significant digit. If instead we compute

b + √(b^2 − 4c) = 7.3545 · 10^{+0},
r_1 = (b + √(b^2 − 4c))/2 = 3.6773 · 10^{+0}.

The result agrees with the true r_1 to all five digits. To calculate r_2, if we use

r_2 = c / r_1,

we obtain r_2 = 5.6558 · 10^{−4}, which is as accurate as we can expect.
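A hedged Python sketch of this stable recipe (the function name is ours): compute the root that adds quantities of like sign with the root formula, and recover the other from the product r_1 r_2 = c.

import math

def roots_of_monic(b, c):
    # Roots of x^2 - b*x + c = 0, assuming real roots (b*b >= 4*c),
    # computed without subtractive cancellation.
    d = math.sqrt(b * b - 4.0 * c)
    r1 = (b + d) / 2.0 if b >= 0 else (b - d) / 2.0   # add like signs only
    r2 = c / r1                                       # Vieta: r1 * r2 = c
    return r1, r2

print(roots_of_monic(3.6778, 0.0020798))
# ~ (3.6772344119, 0.0005655889), matching (35); the naive (b - d)/2 loses digits.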

(3) Compute y = √(x + δ) − √x, for x > 0 and |δ| ≪ 1.
Writing instead

y = δ / (√(x + δ) + √x)

completely removes the cancellation error!
(4) Compute y = cos(x + δ) − cos(x), |δ| ≪ 1.
Write y in the equivalent form [Gautschi(1997)]

y = −2 sin(δ/2) sin(x + δ/2).
(5) Compute y = f(x + δ) − f(x), |δ| ≪ 1, where f is a function.
If f is smooth enough around x, we have [Gautschi(1997), Heath(1997)]

y = f′(x)δ + (1/2) f″(x)δ^2 + ….  (36)

The terms in this series decrease rapidly, so cancellation is no longer a problem.
Consider the finite-difference approximation to the first derivative,

f′(x) ≈ (f(x + δ) − f(x)) / δ.  (37)

From Eq. (36), there is a truncation error E_T = (1/2)|f″(x′)| δ in Eq. (37), and with smaller δ the truncation error becomes smaller. But it is also known that, for (f(x + δ) − f(x)), there is some cancellation (rounding) error E_c, and to reduce the cancellation error one needs δ larger. The total error in using Eq. (37) to calculate the first derivative is given by

E = E_c + E_T.  (38)

Thus, there is a trade-off between truncation error and rounding error in choosing the size of δ. How do we choose the optimal δ?

Assume that, for all x, both |f(x)| and |f″(x)| are bounded by M, i.e.,

|f(x)| ≤ M, |f″(x)| ≤ M.  (39)

Therefore, E_T ≤ (1/2) M δ. The cancellation error E_c can be estimated as

E_c = |f′(x) − f′(x)(1 + ϵ)|  (40)
    = |f′(x)| ϵ  (41)
    = |(f(x + δ) − f(x)) / δ| ϵ  (42)
    ≤ ((|f(x + δ)| + |f(x)|) / δ) ϵ  (43)
    ≤ 2Mϵ/δ.  (44)

The total error can be bounded by (1/2)Mδ + 2Mϵ/δ, which is minimized when

δ = 2√ϵ.  (45)

If we also assume that the function values are accurate to machine precision, we have

δ ≈ 2√eps.  (46)
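A small Python scan (our own sketch) of the total error of Eq. (37) for f(x) = sin x at x = 1 shows the predicted behaviour: the error shrinks until δ ≈ 2√eps ≈ 3 × 10^{−8} and then grows again as cancellation takes over.

import math

f, fprime, x = math.sin, math.cos, 1.0
for k in range(1, 16):
    delta = 10.0 ** (-k)
    approx = (f(x + delta) - f(x)) / delta   # forward difference, Eq. (37)
    print(f"delta=1e-{k:02d}  error={abs(approx - fprime(x)):.2e}")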

4 The Condition of a Problem
A problem, viewed as a black box, typically has an input and an output, as shown in Fig. 2: box P accepts some input x and then solves the problem for this input to produce the output y.

Figure 2: Black-box representation of a problem. (Courtesy of [Gautschi(1997)])

The problem can also be taken as a map f, given by

f : R^m → R^n, y = f(x).

We are interested in:

1. How sensitive is the map f at some point x to a small perturbation of x?

2. How much bigger (or smaller) is the perturbation in y compared to the perturbation in x? In other words, with x* = x + δ and y* = f(x*), |δ| ≤ eps, how sensitive is the relative change (y* − y)/y to the relative change (x* − x)/x?

3. How do we quantify and measure the degree of this sensitivity? (Answer: the condition number.)

Ex. 1 [Dahlquist and Bjorck(2003)]:

Problem (P): The polynomial

p(x) = (x − 10)^4 + 0.200(x − 10)^3 + 0.0500(x − 10)^2 − 0.00500(x − 10) + 0.00100  (47)

can be written as

p(x) = ∑_{i=0}^{4} a_i x^i.

Input (x): a_i, i = 0, …, 4;
Output (y): p(10.11).

For this problem, if the input coefficients a_i, i = 0, …, 4 are rounded to six digits, the polynomial becomes

p̃(x) = 0.100000×10^1 x^4 − 0.398000×10^2 x^3 + 0.594050×10^3 x^2 − 0.394100×10^4 x + 0.980505×10^4.

One then finds p(10.11) = 0.0015, while p̃(10.11) = −0.0481. The erroneous value is more than 30 times larger in magnitude than the correct 0.0015. The present problem, therefore, is very sensitive to small changes in the inputs. How do we quantify and measure the degree of this sensitivity?
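One way to see the sensitivity numerically (our own sketch, not from the notes): p(10.11) is a tiny residue left after massive cancellation among terms of size 10^4 to 6·10^4, so relative coefficient errors of order 10^{−6} are amplified about 10^8-fold.

a = [1.0, -39.8, 594.05, -3941.005, 9805.051]   # exact coefficients a_4 ... a_0
x = 10.11
terms = [c * x ** (4 - i) for i, c in enumerate(a)]
p = sum(terms)
print(p)                                    # ~ 0.00147
print(sum(abs(t) for t in terms) / abs(p))  # ~ 1.1e8: the amplification factor
# Six-digit coefficient rounding (relative errors ~ 5e-7) can therefore move
# p(10.11) by roughly 5e-7 * 1.6e5 = 0.08, swamping the true value 0.0015.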

Ex. 2 [Vandenberghe and Boyd(2011)]:
Problem (P): solve a linear algebraic system Ax = b, where

A = (1/2) [ 1             1             ]
          [ 1 + 10^{−10}  1 − 10^{−10}  ]

and

b = [ 1 ]
    [ 1 ].

Input (x): A, b;
Output (y): x.

In this problem, if we perturb b by △b, one gets

△x = A^{−1} △b = [ △b_1 − 10^{10}(△b_1 − △b_2) ]
                 [ △b_1 + 10^{10}(△b_1 − △b_2) ].

Therefore, a small △b can lead to an extremely large △x. How do we quantify and measure the degree of this sensitivity?
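A NumPy check of this blow-up (our own sketch, with the matrix as reconstructed above):

import numpy as np

eps10 = 1e-10
A = 0.5 * np.array([[1.0, 1.0],
                    [1.0 + eps10, 1.0 - eps10]])
b = np.array([1.0, 1.0])

x = np.linalg.solve(A, b)      # exact solution is [1, 1]
db = np.array([1e-8, 0.0])     # a tiny perturbation of b
dx = np.linalg.solve(A, b + db) - x
print(dx)                      # components of size ~ 1e2: the 1e-8 change
                               # in b is amplified by a factor ~ 1e10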

4.1 Condition Number of a Function

Assumption:
1. m = n = 1, y = f(x);

2. △x is a small perturbation, i.e., |△x| ≪ 1;

3. x ≠ 0, y ≠ 0.
Problem:

x* = x + △x,
y* = f(x*),
△y = y* − y = f(x*) − f(x) = f(x + △x) − f(x) ≈ f′(x)△x;

how sensitive is △y/y to △x/x at some point x?

Condition number:

∴ (Cond f)(x) := lim_{△x→0} (△y/y)/(△x/x) = x f′(x)/f(x).  (48)
Remark:
1. If x = 0 and y ≠ 0, the condition number can be given as

(Cond f)(x) := lim_{△x→0} (△y/y)/△x = f′(x)/f(x).  (49)

2. If x ≠ 0 and y = 0, the condition number can be given as

(Cond f)(x) := lim_{△x→0} △y/(△x/x) = x f′(x).  (50)

3. If x = 0 and y = 0, the condition number is simply

(Cond f)(x) := lim_{△x→0} △y/△x = f′(x).  (51)

Note that:

• The condition of f is an inherent property of the map f and does not depend on any algorithmic considerations concerning its implementation.

• Well conditioned: if small changes in the input parameters x lead to small changes in the output y, the map (or the problem) f is well conditioned. In other words, the solution of a well-conditioned problem is insensitive to changes in the input parameters.

• Ill conditioned: if small changes in the input parameters x can cause large changes in the output y, the map (or the problem) f is ill conditioned. In other words, the solution of an ill-conditioned problem is very sensitive to changes in the input parameters. I.e., if (Cond f)(x) ≫ 1, the map (or the problem) f is ill conditioned.

Example:

1. For n ≥ 1, compute the condition number of the calculation of the integral

I_n = ∫_0^1 t^n/(t + 5) dt, n ≥ 1.

Solution:

I_n = ∫_0^1 t^n/(t + 5) dt
    = ∫_0^1 t^{n−1} · t/(t + 5) dt
    = ∫_0^1 t^{n−1} (1 − 5/(t + 5)) dt
    = ∫_0^1 t^{n−1} dt − 5 ∫_0^1 t^{n−1}/(t + 5) dt
    = [t^n/n]_0^1 − 5 ∫_0^1 t^{n−1}/(t + 5) dt
    = 1/n − 5 I_{n−1}.  (52)

Therefore, by recursing this relation, we have:

I_n = −5 I_{n−1} + 1/n,  (53)
−5 I_{n−1} = (−5)^2 I_{n−2} + (−5) · 1/(n−1),  (54)
⋮  (55)
(−5)^{n−1} I_1 = (−5)^n I_0 + (−5)^{n−1} · 1.  (56)

The summation of all these equations yields

I_n = (−5)^n I_0 + p_n,

where p_n is some number independent of I_0 (= ∫_0^1 1/(t + 5) dt = ln(6/5)).

In this problem:

• Problem: I_n = (−5)^n I_0 + p_n, i.e., I_n = f_n(I_0);
• Input: x = I_0;
• Output: y = I_n.

So the condition number can be given as

(Cond f_n)(I_0) = |x f_n′(x)/f_n(x)| = |I_0 (−5)^n / I_n| = I_0 · 5^n / I_n.  (57)

Because t ∈ (0, 1), I_n decreases monotonically in n. Therefore,

(Cond f_n)(I_0) = I_0 · 5^n / I_n > I_0 · 5^n / I_0 = 5^n.  (58)

∵ as n → ∞, 5^n ≫ 1,
∴ (Cond f_n)(I_0) ≫ 1, and f_n(I_0) is severely ill-conditioned for large n.
How can we avoid the ill-conditioning?

From Eq. (53) (written with n replaced by n + 1 and solved for I_n), one finds

I_n = (−1/5) I_{n+1} + (1/5) · 1/(n+1),  (59)

which suggests recursing from I_{n+1} down to I_n. In this problem:

• Problem: I_n = (−1/5) I_{n+1} + (1/5) · 1/(n+1), i.e., I_n = f(I_{n+1});
• Input: x = I_{n+1};
• Output: y = I_n;

and the condition number can be given as

(Cond f)(I_{n+1}) = |x f′(x)/f(x)| = |I_{n+1} (−1/5) / I_n| = I_{n+1} / (5 I_n).  (60)

Because t ∈ (0, 1), I_n decreases monotonically in n, so

(Cond f)(I_{n+1}) = I_{n+1}/(5 I_n) < I_n/(5 I_n) = 1/5 < 1.  (61)

∴ recursing from I_{n+1} to I_n is well-conditioned.
This tells us that, by reversing the recurrence, i.e., recursing from some ν down to n, ν > n, the problem may become well-conditioned. By recursing Eq. (59), we have:
I_n = (−1/5) I_{n+1} + (1/5) · 1/(n+1),  (62)
(−1/5) I_{n+1} = (−1/5)^2 I_{n+2} + (−1/5)(1/5) · 1/(n+2),  (63)
⋮  (64)
(−1/5)^{ν−n−1} I_{ν−1} = (−1/5)^{ν−n} I_ν + (−1/5)^{ν−n−1} (1/5) · 1/ν.  (65)

Summation of these equations yields

I_n = (−1/5)^{ν−n} I_ν + p̂_n,  (66)

where p̂_n is some number independent of I_ν.
In this problem:

• Problem: I_n = (−1/5)^{ν−n} I_ν + p̂_n, i.e., I_n = f(I_ν);
• Input: x = I_ν;
• Output: y = I_n.

So the condition number can be given as

(Cond f)(I_ν) = |x f′(x)/f(x)| = |I_ν (−1/5)^{ν−n} / I_n| = I_ν · (1/5)^{ν−n} / I_n.  (67)

Because t ∈ (0, 1), I_n decreases monotonically in n. Therefore,

(Cond f)(I_ν) = I_ν · (1/5)^{ν−n} / I_n < I_n · (1/5)^{ν−n} / I_n = (1/5)^{ν−n}.  (68)

∵ ν ≫ n, (1/5)^{ν−n} ≪ 1,
∴ (Cond f)(I_ν) ≪ 1, and f(I_ν) is well-conditioned.
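A Python sketch (ours) contrasting the two recursions at n = 25: the forward recursion (53) amplifies the initial rounding of I_0 = ln(6/5) by 5^n, while the backward recursion (59) even forgives the crude starting guess I_ν = 0.

import math

n = 25

# Forward: I_k = 1/k - 5*I_{k-1}, from the rounded seed I_0 = ln(6/5).
I = math.log(6.0 / 5.0)
for k in range(1, n + 1):
    I = 1.0 / k - 5.0 * I
print(I)          # garbage of order 10, while the true I_25 is about 6.5e-3

# Backward: I_k = (1/(k+1) - I_{k+1}) / 5, from the guess I_nu = 0;
# the guess error is damped by (1/5)^(nu - n).
nu = n + 30
Ib = 0.0
for k in range(nu - 1, n - 1, -1):
    Ib = (1.0 / (k + 1) - Ib) / 5.0
print(Ib)         # ~ 6.5e-3, correct to machine precision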

2. Calculate the condition number of f(x) = √x, x ≥ 0. [Conte and de Boor(1980)]

Solution: In this problem:

• Problem: f(x) = √x, x ≥ 0;
• Input: x;
• Output: f(x).

∴ f′(x) = 1/(2√x). Hence, the condition number of f is

(Cond f)(x) = |x f′(x)/f(x)| = (x · 1/(2√x)) / √x = 1/2,  (69)

which indicates that taking a square root is a well-conditioned process.

3. Calculate the condition number of f(x) = √(x+1) − √x, x ≥ 0. [Conte and de Boor(1980)]

Solution: In this problem:

• Problem: f(x) = √(x+1) − √x, x ≥ 0;
• Input: x;
• Output: f(x).

∴ f′(x) = (1/2)(1/√(x+1) − 1/√x). Hence, the condition number of f is

(Cond f)(x) = |x f′(x)/f(x)|
            = |x · (1/2)(1/√(x+1) − 1/√x)| / (√(x+1) − √x)
            = (1/2) · x / (√(x+1) √x)
            = (1/2) √(x/(x+1)).  (70)

For "large" x, such as x ≈ 10^4, the condition number of f becomes

(Cond f)(x) = (1/2) √(x/(x+1)) ≈ 1/2.  (71)

This says that, for "large" x, the function f is well-conditioned. However, we know that for "large" x, f(x) = √(x+1) − √x suffers a large cancellation error, which means that even with exact x as input, there will still be a large cancellation error in the output. Therefore, though the function f is well-conditioned, the present algorithm is unstable. To reconcile the two, we can change the algorithm to

f(x) = √(x+1) − √x = 1 / (√(x+1) + √x).  (72)

4.2 Condition Number of a Matrix (Optional)

4.2.1 Preliminaries
1. Vector norms (p-norms):

∥x∥_p := ( ∑_{i=1}^{n} |x_i|^p )^{1/p}.  (73)

Remark (important special cases):

• 1-norm:

∥x∥_1 = ∑_{i=1}^{n} |x_i|.  (74)

• 2-norm:

∥x∥_2 = ( ∑_{i=1}^{n} |x_i|^2 )^{1/2}.  (75)

• ∞-norm:

∥x∥_∞ = max_{1≤i≤n} |x_i|.  (76)

Properties (where x and y are any vectors):

(a) ∥x∥ ≥ 0, if x ≠ 0;
(b) ∥γx∥ = |γ| · ∥x∥, for any scalar γ;
(c) ∥x + y∥ ≤ ∥x∥ + ∥y∥;
(d) ∥x − y∥ ≥ ∥x∥ − ∥y∥.

2. Matrix norms:

∥A∥ := max_{x≠0} ∥Ax∥/∥x∥.  (77)

Remark (important special cases):

• 1-norm: the maximum absolute column sum of the matrix,

∥A∥_1 = max_j ∑_{i=1}^{n} |a_{ij}|.  (78)

• 2-norm:

∥A∥_2 = max_{x≠0} ∥Ax∥_2/∥x∥_2 = max{ √λ : λ is an eigenvalue of A^T A }.  (79)

• ∞-norm: the maximum absolute row sum of the matrix,

∥A∥_∞ = max_i ∑_{j=1}^{n} |a_{ij}|.  (80)

Properties (where A and B are any matrices):

(a) ∥A∥ ≥ 0, if A ≠ 0;
(b) ∥γA∥ = |γ| · ∥A∥, for any scalar γ;
(c) ∥A + B∥ ≤ ∥A∥ + ∥B∥;
(d) ∥AB∥ ≤ ∥A∥ · ∥B∥;
(e) ∥Ax∥ ≤ ∥A∥ · ∥x∥, for any vector x.

4.2.2 Condition number of a vector function
Generally, for arbitrary m, n, namely

x = [x_1, x_2, …, x_m]^T ∈ R^m, y = [y_1, y_2, …, y_n]^T ∈ R^n,  (81)

with the map f given by y = f(x), the condition number is

(Cond f)(x) := ∥x∥_∞ ∥f′(x)∥_∞ / ∥f(x)∥_∞,  (82)

where f′(x) is the Jacobian matrix

∂f/∂x := [ ∂f_1/∂x_1  ∂f_1/∂x_2  …  ∂f_1/∂x_m ]
         [ ∂f_2/∂x_1  ∂f_2/∂x_2  …  ∂f_2/∂x_m ]
         [     ⋮           ⋮      ⋱      ⋮     ]
         [ ∂f_n/∂x_1  ∂f_n/∂x_2  …  ∂f_n/∂x_m ]  ∈ R^{n×m}.  (83)

Remark: As an application of the condition number of a vector function, the condition number of a matrix A can be given as

Cond A := ∥A∥ · ∥A^{−1}∥.  (84)

Example:

(1) Compute the condition number of the matrix

A = [ 1.01  0.99 ]
    [ 0.99  1.01 ].  (85)

The inverse of A is

A^{−1} = [ a  b ]^{−1} = (1/(ad − bc)) [ d  −b ] = (1/0.04) [ 1.01  −0.99 ] = [ 25.25  −24.75 ]
         [ c  d ]                      [ −c  a ]            [ −0.99   1.01 ]   [ −24.75  25.25 ],  (86)

∴ ∥A∥_∞ = 1.01 + 0.99 = 2, and ∥A^{−1}∥_∞ = 25.25 + 24.75 = 50. Hence, the condition number of A is

Cond_∞ A := ∥A∥_∞ · ∥A^{−1}∥_∞ = 2 × 50 = 100.  (87)
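These numbers are easy to verify in Python (our sketch; numpy.linalg.cond with p=np.inf computes Eq. (84) in the ∞-norm):

import numpy as np

A = np.array([[1.01, 0.99],
              [0.99, 1.01]])
print(np.linalg.inv(A))              # [[ 25.25 -24.75], [-24.75  25.25]]
print(np.linalg.cond(A, p=np.inf))   # 100.0 = ||A||_inf * ||A^-1||_inf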

(2) Hilbert matrix:

The elements of the order-n Hilbert matrix are given by

h_{ij} = 1/(i + j − 1), i, j = 1, 2, …, n,

i.e.,

H_n = [ 1     1/2      …  1/n      ]
      [ 1/2   1/3      …  1/(n+1)  ]
      [  ⋮     ⋮       ⋱    ⋮      ]
      [ 1/n   1/(n+1)  …  1/(2n−1) ]  ∈ R^{n×n}.  (88)

This matrix is symmetric and positive definite (i.e., H_n = H_n^T, and x^T H_n x > 0 for all x ≠ 0 in R^n). The condition numbers of Hilbert matrices are shown in Table 1.

Table 1: The condition number of Hilbert matrices

n    Cond_2 H_n
10   1.60 × 10^13
20   2.45 × 10^28
40   7.65 × 10^58

For n = 10, the system can no longer be solved with any reliability in single precision on a 14-decimal computer, and for n = 20 even double precision is exhausted. Indeed, it can be shown that

Cond_2 H_n ∼ (√2 + 1)^{4n+4} / (2^{15/4} √(πn)) as n → ∞.  (89)

(3) Vandermonde matrix:

V_n = [ 1          1          …  1          ]
      [ t_1        t_2        …  t_n        ]
      [  ⋮          ⋮         ⋱    ⋮        ]
      [ t_1^{n−1}  t_2^{n−1}  …  t_n^{n−1}  ]  ∈ R^{n×n},  (90)

where t_1, t_2, …, t_n are parameters and assumed real. If the parameters are equally spaced in [−1, 1], that is,

t_ν = 1 − 2(ν − 1)/(n − 1), ν = 1, 2, …, n,  (91)

then

Cond_∞ V_n ∼ (1/π) e^{−π/4} e^{n(π/4 + (1/2) ln 2)} as n → ∞.  (92)
The condition numbers of the Vandermonde matrices for different n are shown in Table 2. They do not grow as fast as those of the Hilbert matrix, but still exponentially fast.

Table 2: The condition number of Vandermonde matrices

n    Cond_∞ V_n
10   1.36 × 10^4
20   1.05 × 10^9
40   6.93 × 10^18
80   3.15 × 10^38

H.W: write a Matlab program to calculate the condition numbers of the Hilbert matrix and the Vandermonde matrix.

Note:
1. The Hilbert matrix is very useful for the least-squares problem in Chapter 2.
2. The Vandermonde matrix is very important for the interpolation problem in Chapter 2.

5 Review questions
1. Review the floating-point representation in a binary computer:
a) Write down the definition of the R(t, s) system;
b) Write down the definition of normalization and give some reasons why we need to normalize;
c) Derive the largest and smallest magnitudes of a (normalized) floating-point number in the R(t, s) system;
d) What is the meaning of overflow and underflow? In floating-point arithmetic, which is generally more harmful, underflow or overflow? Why?
2. Review the concept of rounding:
a) Explain the difference between the rounding rules "chopping (round towards zero)" and "symmetric rounding (round to nearest)" in a floating-point system.
b) Which of these two rounding rules is more accurate?
c) Write down the definition of the machine precision and explain why the machine precision is machine dependent.
3. Review the definition of the model of machine arithmetic and error propagation:
a) True or false (why? Give an example): If two real numbers are exactly representable as floating-point numbers, then the result of an arithmetic operation O (+, −, ×, /) on them will also be representable as a floating-point number.
b) True or false (give examples): Floating-point addition is associative but not commutative.
c) In floating-point arithmetic, which of the following operations on two positive floating-point operands can be benign operations:
• Multiplication
• Division
• Addition
• Subtraction
d) Explain why the cancellation that occurs when two numbers of similar magnitude are subtracted is often bad even though the result may be exactly correct for the actual operands involved. (Hint: significant digits and information are lost.)
e) True or false (why? Give an example): Computing a small quantity as a difference of large quantities is generally a bad idea.

4. Review the concept of the condition of a problem:
a) What are the definitions of well-conditioned and ill-conditioned problems?
b) True or false: A problem is ill conditioned if its solution is highly sensitive to small changes in the problem data.
c) True or false: Using higher-precision arithmetic will make an ill-conditioned problem better conditioned.
d) True or false: The conditioning of a problem depends on the algorithm used to solve it.
e) True or false: A good algorithm will produce an accurate solution regardless of the condition of the problem being solved.

References
[Gautschi(1997)] W. Gautschi, Numerical Analysis: An Introduction, 1st ed., Birkhäuser Boston, 1997.

[Conte and de Boor(1980)] S. D. Conte, C. de Boor, Elementary Numerical Analysis: An Algorithmic Approach, 3rd ed., McGraw-Hill Book Company, New York, 1980.

[Stewart(1996)] G. W. Stewart, Afternotes on Numerical Analysis, SIAM, Philadelphia, 1996.

[Heath(1997)] M. T. Heath, Scientific Computing: An Introductory Survey, 1st ed., McGraw-Hill Book Company, New York, 1997.

[Dahlquist and Bjorck(2003)] G. Dahlquist, A. Bjorck, Numerical Methods, Dover Publications, Inc., New York, 2003.

[Vandenberghe and Boyd(2011)] L. Vandenberghe, S. Boyd, Applied Numerical Computing, UCLA Fall Quarter 2011-12 (lecture notes), 2011.
