
1.2 Round-off Errors and Computer Arithmetic 17

gives the probability that any one of a series of trials will lie within x units of the mean, assuming that
the trials have a normal distribution with mean 0 and standard deviation √2/2. This integral cannot
be evaluated in terms of elementary functions, so an approximating technique must be used.
a. Integrate the Maclaurin series for e^(−x²) to show that

        erf(x) = (2/√π) ∑_{k=0}^∞ (−1)^k x^(2k+1) / ((2k+1) k!).

b. The error function can also be expressed in the form

        erf(x) = (2/√π) e^(−x²) ∑_{k=0}^∞ 2^k x^(2k+1) / (1 · 3 · 5 · · · (2k+1)).

   Verify that the two series agree for k = 1, 2, 3, and 4. [Hint: Use the Maclaurin series for e^(−x²).]

c. Use the series in part (a) to approximate erf(1) to within 10^(−7).


d. Use the same number of terms as in part (c) to approximate erf(1) with the series in part (b).
e. Explain why difficulties occur using the series in part (b) to approximate erf(x).
27. A function f : [a, b] → R is said to satisfy a Lipschitz condition with Lipschitz constant L on [a, b]
if, for every x, y ∈ [a, b], we have |f (x) − f (y)| ≤ L|x − y|.
a. Show that if f satisfies a Lipschitz condition with Lipschitz constant L on an interval [a, b], then
f ∈ C[a, b].
b. Show that if f has a derivative that is bounded on [a, b] by L, then f satisfies a Lipschitz condition
with Lipschitz constant L on [a, b].
c. Give an example of a function that is continuous on a closed interval but does not satisfy a
Lipschitz condition on the interval.
28. Suppose f ∈ C[a, b] and that x1 and x2 are in [a, b].
a. Show that a number ξ exists between x1 and x2 with

        f(ξ) = (f(x1) + f(x2))/2 = (1/2) f(x1) + (1/2) f(x2).

b. Suppose that c1 and c2 are positive constants. Show that a number ξ exists between x1 and x2
with

        f(ξ) = (c1 f(x1) + c2 f(x2)) / (c1 + c2).

c. Give an example to show that the result in part b. does not necessarily hold when c1 and c2 have
opposite signs with c1 ≠ −c2.
29. Let f ∈ C[a, b], and let p be in the open interval (a, b).
a. Suppose f(p) ≠ 0. Show that a δ > 0 exists with f(x) ≠ 0, for all x in [p − δ, p + δ], with
[p − δ, p + δ] a subset of [a, b].
b. Suppose f (p) = 0 and k > 0 is given. Show that a δ > 0 exists with |f (x)| ≤ k, for all x in
[p − δ, p + δ], with [p − δ, p + δ] a subset of [a, b].

1.2 Round-off Errors and Computer Arithmetic


The arithmetic performed by a calculator or computer is different from the arithmetic in
algebra and calculus courses. You would likely expect that we always have as true statements
things such as 2 + 2 = 4, 4 · 8 = 32, and (√3)² = 3. However, with computer arithmetic we
expect exact results for 2 + 2 = 4 and 4 · 8 = 32, but we will not have precisely (√3)² = 3.
To understand why this is true we must explore the world of finite-digit arithmetic.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
18 CHAPTER 1 Mathematical Preliminaries and Error Analysis

In our traditional mathematical world we permit numbers with an infinite number of
digits. The arithmetic we use in this world defines √3 as that unique positive number that
when multiplied by itself produces the integer 3. In the computational world, however, each
representable number has only a fixed and finite number of digits. This means, for example,
that only rational numbers—and not even all of these—can be represented exactly. Since
√3 is not rational, it is given an approximate representation, one whose square will not
be precisely 3, although it will likely be sufficiently close to 3 to be acceptable in most
situations. In most cases, then, this machine arithmetic is satisfactory and passes without
notice or concern, but at times problems arise because of this discrepancy.
Error due to rounding should be expected whenever computations are performed using
numbers that are not powers of 2. Keeping this error under control is extremely important
when the number of calculations is large.

The error that is produced when a calculator or computer is used to perform real-number
calculations is called round-off error. It occurs because the arithmetic performed in a
machine involves numbers with only a finite number of digits, with the result that
calculations are performed with only approximate representations of the actual numbers.
In a computer, only a relatively small subset of the real number system is used for the
representation of all the real numbers. This subset contains only rational numbers, both
positive and negative, and stores the fractional part, together with an exponential part.

Binary Machine Numbers


In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called
Binary Floating Point Arithmetic Standard 754–1985. An updated version was published
in 2008 as IEEE 754-2008. This provides standards for binary and decimal floating point
numbers, formats for data interchange, algorithms for rounding arithmetic operations, and
for the handling of exceptions. Formats are specified for single, double, and extended
precisions, and these standards are generally followed by all microcomputer manufacturers
using floating-point hardware.
A 64-bit (binary digit) representation is used for a real number. The first bit is a sign
indicator, denoted s. This is followed by an 11-bit exponent, c, called the characteristic,
and a 52-bit binary fraction, f , called the mantissa. The base for the exponent is 2.
Since 52 binary digits correspond to between 16 and 17 decimal digits, we can assume
that a number represented in this system has at least 16 decimal digits of precision. The
exponent of 11 binary digits gives a range of 0 to 2^11 − 1 = 2047. However, using only
positive integers for the exponent would not permit an adequate representation of numbers
with small magnitude. To ensure that numbers with small magnitude are equally representable,
1023 is subtracted from the characteristic, so the range of the exponent is actually from
−1023 to 1024.
To save storage and provide a unique representation for each floating-point number, a
normalization is imposed. Using this system gives a floating-point number of the form

        (−1)^s 2^(c−1023) (1 + f).

Illustration Consider the machine number

0 10000000011 1011100100010000000000000000000000000000000000000000.

The leftmost bit is s = 0, which indicates that the number is positive. The next 11 bits,
10000000011, give the characteristic and are equivalent to the decimal number

        c = 1 · 2^10 + 0 · 2^9 + · · · + 0 · 2^2 + 1 · 2^1 + 1 · 2^0 = 1024 + 2 + 1 = 1027.


The exponential part of the number is, therefore, 2^(1027−1023) = 2^4. The final 52 bits specify
that the mantissa is

        f = 1 · (1/2)^1 + 1 · (1/2)^3 + 1 · (1/2)^4 + 1 · (1/2)^5 + 1 · (1/2)^8 + 1 · (1/2)^12.

As a consequence, this machine number precisely represents the decimal number

        (−1)^s 2^(c−1023) (1 + f) = (−1)^0 · 2^(1027−1023) (1 + 1/2 + 1/8 + 1/16 + 1/32 + 1/256 + 1/4096)
                                  = 27.56640625.

However, the next smallest machine number is

0 10000000011 1011100100001111111111111111111111111111111111111111,

and the next largest machine number is

0 10000000011 1011100100010000000000000000000000000000000000000001.

This means that our original machine number represents not only 27.56640625, but also half
of the real numbers that are between 27.56640625 and the next smallest machine number,
as well as half the numbers between 27.56640625 and the next largest machine number. To
be precise, it represents any real number in the interval

[27.5664062499999982236431605997495353221893310546875,
27.5664062500000017763568394002504646778106689453125).
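The decoding steps in this Illustration are easy to check in code. The sketch below (the
function name decode_double is ours, not from the text) reads a 64-bit string in the
sign–characteristic–mantissa layout and evaluates (−1)^s 2^(c−1023)(1 + f); it handles only
normalized numbers, ignoring the special cases c = 0 and c = 2047.

```python
def decode_double(bits: str) -> float:
    """Decode a 64-bit string (sign | 11-bit characteristic | 52-bit mantissa)
    as the normalized value (-1)^s * 2^(c-1023) * (1 + f).
    Special cases c = 0 (subnormals/zero) and c = 2047 (inf/NaN) are ignored."""
    bits = bits.replace(" ", "")
    assert len(bits) == 64
    s = int(bits[0])                      # sign bit
    c = int(bits[1:12], 2)                # characteristic as an integer
    f = sum(int(b) * 2.0 ** -(i + 1)      # binary fraction: bit i weighs 2^-(i+1)
            for i, b in enumerate(bits[12:]))
    return (-1) ** s * 2.0 ** (c - 1023) * (1 + f)

word = "0 10000000011 1011100100010000000000000000000000000000000000000000"
print(decode_double(word))  # 27.56640625
```

Running this on the machine number above reproduces 27.56640625 exactly, since both the
mantissa sum and the product by 2^4 are exact in binary.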

The smallest normalized positive number that can be represented has s = 0, c = 1, and
f = 0 and is equivalent to

        2^(−1022) · (1 + 0) ≈ 0.22251 × 10^(−307),

and the largest has s = 0, c = 2046, and f = 1 − 2^(−52) and is equivalent to

        2^1023 · (2 − 2^(−52)) ≈ 0.17977 × 10^309.

Numbers occurring in calculations that have a magnitude less than

        2^(−1022) · (1 + 0)

result in underflow and are generally set to zero. Numbers greater than

        2^1023 · (2 − 2^(−52))

result in overflow and typically cause the computations to stop (unless the program has
been designed to detect this occurrence). Note that there are two representations for the
number zero: a positive 0 when s = 0, c = 0, and f = 0, and a negative 0 when s = 1,
c = 0, and f = 0.
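Python floats are IEEE 754 double-precision numbers, so these two extreme values can be
checked directly against the sys.float_info record:

```python
import sys

# Largest normalized double: 2^1023 * (2 - 2^-52); both factors are exact in binary.
largest = 2.0 ** 1023 * (2 - 2.0 ** -52)
# Smallest normalized positive double: 2^-1022.
smallest = 2.0 ** -1022

print(largest == sys.float_info.max)   # True
print(smallest == sys.float_info.min)  # True
print(largest)    # ≈ 1.7977e308, i.e. 0.17977 × 10^309
print(smallest)   # ≈ 2.2251e-308, i.e. 0.22251 × 10^-307
```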


Decimal Machine Numbers


The use of binary digits tends to conceal the computational difficulties that occur when a
finite collection of machine numbers is used to represent all the real numbers. To examine
these problems, we will use more familiar decimal numbers instead of binary representation.
Specifically, we assume that machine numbers are represented in the normalized decimal
floating-point form
        ±0.d1 d2 . . . dk × 10^n,   1 ≤ d1 ≤ 9, and 0 ≤ di ≤ 9,
for each i = 2, . . . , k. Numbers of this form are called k-digit decimal machine numbers.
Any positive real number within the numerical range of the machine can be normalized
to the form
        y = 0.d1 d2 . . . dk dk+1 dk+2 . . . × 10^n.
The error that results from replacing a number with its floating-point form is called
round-off error regardless of whether the rounding or chopping method is used.

The floating-point form of y, denoted fl(y), is obtained by terminating the mantissa of
y at k decimal digits. There are two common ways of performing this termination. One
method, called chopping, is to simply chop off the digits dk+1 dk+2 . . . . This produces the
floating-point form

        fl(y) = 0.d1 d2 . . . dk × 10^n.

The other method, called rounding, adds 5 × 10^(n−(k+1)) to y and then chops the result to
obtain a number of the form

        fl(y) = 0.δ1 δ2 . . . δk × 10^n.

For rounding, when dk+1 ≥ 5, we add 1 to dk to obtain fl(y); that is, we round up. When
dk+1 < 5, we simply chop off all but the first k digits; so we round down. If we round down,
then δi = di, for each i = 1, 2, . . . , k. However, if we round up, the digits (and even the
exponent) might change.
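Both terminations can be sketched in a few lines. The helper names below are ours, and one
caveat applies: the helpers themselves run in binary floating point, so a mantissa sitting
exactly on a digit boundary can occasionally be perturbed.

```python
import math

def fl_chop(y: float, k: int) -> float:
    """k-digit chopping: normalize to 0.d1...dk x 10^n and drop digits past k."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1   # exponent of the 0.d1d2... form
    mantissa = abs(y) / 10.0 ** n            # lies in [0.1, 1)
    chopped = math.floor(mantissa * 10 ** k) / 10 ** k
    return math.copysign(chopped * 10.0 ** n, y)

def fl_round(y: float, k: int) -> float:
    """k-digit rounding: add 5 x 10^(n-(k+1)) to y, then chop (as in the text)."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1
    return math.copysign(fl_chop(abs(y) + 5 * 10.0 ** (n - (k + 1)), k), y)

print(fl_chop(3.14159265, 5))   # ≈ 3.1415 (Example 1a)
print(fl_round(3.14159265, 5))  # ≈ 3.1416 (Example 1b)
```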

Example 1   Determine the five-digit (a) chopping and (b) rounding values of the irrational number π.

Solution   The number π has an infinite decimal expansion of the form π = 3.14159265. . . .
Written in normalized decimal form, we have

        π = 0.314159265 . . . × 10^1.

(a) The floating-point form of π using five-digit chopping is

        fl(π) = 0.31415 × 10^1 = 3.1415.

(b) The sixth digit of the decimal expansion of π is a 9, so the floating-point form of
π using five-digit rounding is

        fl(π) = (0.31415 + 0.00001) × 10^1 = 3.1416.

The relative error is generally a better measure of accuracy than the absolute error
because it takes into consideration the size of the number being approximated.

The following definition describes two methods for measuring approximation errors.

Definition 1.15   Suppose that p∗ is an approximation to p. The absolute error is |p − p∗|, and the relative
error is |p − p∗| / |p|, provided that p ≠ 0.

Consider the absolute and relative errors in representing p by p∗ in the following example.


Example 2   Determine the absolute and relative errors when approximating p by p∗ when
(a) p = 0.3000 × 10^1 and p∗ = 0.3100 × 10^1;
(b) p = 0.3000 × 10^(−3) and p∗ = 0.3100 × 10^(−3);
(c) p = 0.3000 × 10^4 and p∗ = 0.3100 × 10^4.

Solution

(a) For p = 0.3000 × 10^1 and p∗ = 0.3100 × 10^1 the absolute error is 0.1, and the
relative error is 0.3333 × 10^(−1).
(b) For p = 0.3000 × 10^(−3) and p∗ = 0.3100 × 10^(−3) the absolute error is 0.1 × 10^(−4),
and the relative error is 0.3333 × 10^(−1).
(c) For p = 0.3000 × 10^4 and p∗ = 0.3100 × 10^4, the absolute error is 0.1 × 10^3, and
the relative error is again 0.3333 × 10^(−1).

We often cannot find an accurate value for the true error in an approximation. Instead
we find a bound for the error, which gives us a "worst-case" error.

This example shows that the same relative error, 0.3333 × 10^(−1), occurs for widely varying
absolute errors. As a measure of accuracy, the absolute error can be misleading and the
relative error more meaningful, because the relative error takes into consideration the size
of the value.
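Definition 1.15 translates directly into code. A sketch reproducing the three cases of
Example 2 (the helper name is ours):

```python
def abs_rel_errors(p: float, p_star: float):
    """Absolute and relative errors of Definition 1.15 (requires p != 0)."""
    abs_err = abs(p - p_star)
    return abs_err, abs_err / abs(p)

cases = [(0.3000e1, 0.3100e1), (0.3000e-3, 0.3100e-3), (0.3000e4, 0.3100e4)]
for p, ps in cases:
    a, r = abs_rel_errors(p, ps)
    print(f"abs = {a:.4e}, rel = {r:.4e}")
```

The absolute errors span eight orders of magnitude, while the relative error is
≈ 3.3333 × 10^(−2) in every case, exactly as the example shows.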

The following definition uses relative error to give a measure of significant digits of
accuracy for an approximation.

The term significant digits is often used to loosely describe the number of decimal digits
that appear to be accurate. The definition is more precise, and provides a continuous concept.

Definition 1.16   The number p∗ is said to approximate p to t significant digits (or figures) if t is the largest
nonnegative integer for which

        |p − p∗| / |p| ≤ 5 × 10^(−t).

Table 1.1 illustrates the continuous nature of significant digits by listing, for the various
values of p, the least upper bound of |p − p∗|, denoted max |p − p∗|, when p∗ agrees with p
to four significant digits.

Table 1.1
        p                0.1       0.5       100    1000   5000   9990    10000
        max |p − p∗|     0.00005   0.00025   0.05   0.5    2.5    4.995   5.
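Definition 1.16 can be evaluated mechanically: increase t while the relative error stays
within 5 × 10^(−t). A sketch (the function name is ours; the cap guards against an exact
match p∗ = p looping forever):

```python
def significant_digits(p: float, p_star: float, cap: int = 16) -> int:
    """Largest nonnegative t with |p - p*|/|p| <= 5 * 10^-t (Definition 1.16),
    capped at `cap` since doubles carry only ~16 decimal digits."""
    rel = abs(p - p_star) / abs(p)
    t = 0
    while t < cap and rel <= 5 * 10.0 ** -(t + 1):
        t += 1
    return t

print(significant_digits(100, 99.99))   # 4
print(significant_digits(1000, 996))    # 3
```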

Returning to the machine representation of numbers, we see that the floating-point
representation fl(y) for the number y has the relative error

        |(y − fl(y)) / y|.

If k decimal digits and chopping are used for the machine representation of

        y = 0.d1 d2 . . . dk dk+1 . . . × 10^n,


then

        |(y − fl(y))/y| = |(0.d1 d2 . . . dk dk+1 . . . × 10^n − 0.d1 d2 . . . dk × 10^n) / (0.d1 d2 . . . × 10^n)|
                        = |(0.dk+1 dk+2 . . . × 10^(n−k)) / (0.d1 d2 . . . × 10^n)|
                        = |0.dk+1 dk+2 . . . / 0.d1 d2 . . .| × 10^(−k).

Since d1 ≠ 0, the minimal value of the denominator is 0.1. The numerator is bounded above
by 1. As a consequence,

        |(y − fl(y))/y| ≤ (1/0.1) × 10^(−k) = 10^(−k+1).

In a similar manner, a bound for the relative error when using k-digit rounding arithmetic
is 0.5 × 10^(−k+1). (See Exercise 24.)
Note that the bounds for the relative error using k-digit arithmetic are independent of the
number being represented. This result is due to the manner in which the machine numbers
are distributed along the real line. Because of the exponential form of the characteristic,
the same number of decimal machine numbers is used to represent each of the intervals
[0.1, 1], [1, 10], and [10, 100]. In fact, within the limits of the machine, the number of
decimal machine numbers in [10^n, 10^(n+1)] is constant for all integers n.

Finite-Digit Arithmetic
In addition to inaccurate representation of numbers, the arithmetic performed in a computer
is not exact. The arithmetic involves manipulating binary digits by various shifting, or
logical, operations. Since the actual mechanics of these operations are not pertinent to this
presentation, we shall devise our own approximation to computer arithmetic. Although our
arithmetic will not give the exact picture, it suffices to explain the problems that occur. (For
an explanation of the manipulations actually involved, the reader is urged to consult more
technically oriented computer science texts, such as [Ma], Computer System Architecture.)
Assume that the floating-point representations fl(x) and fl(y) are given for the real
numbers x and y and that the symbols ⊕, ⊖, ⊗, ⊘ represent machine addition, subtraction,
multiplication, and division operations, respectively. We will assume a finite-digit arithmetic
given by

        x ⊕ y = fl(fl(x) + fl(y)),    x ⊗ y = fl(fl(x) × fl(y)),
        x ⊖ y = fl(fl(x) − fl(y)),    x ⊘ y = fl(fl(x) ÷ fl(y)).

This arithmetic corresponds to performing exact arithmetic on the floating-point repre-


sentations of x and y and then converting the exact result to its finite-digit floating-point
representation.
Rounding arithmetic is easily implemented in Maple. For example, the command

        Digits := 5

causes all arithmetic to be rounded to 5 digits. To ensure that Maple uses approximate rather
than exact arithmetic we use the evalf command. For example, if x = π and y = √2, then

        evalf(x); evalf(y)

produces 3.1416 and 1.4142, respectively. Then fl(fl(x) + fl(y)) is performed using
5-digit rounding arithmetic with

        evalf(evalf(x) + evalf(y))


which gives 4.5558. Implementing finite-digit chopping arithmetic is more difficult and
requires a sequence of steps or a procedure. Exercise 27 explores this problem.
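In place of Maple, the same fl(fl(x) + fl(y)) arithmetic can be sketched in Python. The
helper names are ours, and note one assumption: Python's built-in round uses
round-half-to-even, which differs from the text's add-then-chop rounding only at exact ties,
so irrational inputs like π and √2 are unaffected.

```python
import math

def fl(y: float, k: int = 5) -> float:
    """k-digit rounding: normalize to 0.d1...dk x 10^n, round the mantissa, rebuild.
    (round() is half-to-even, differing from the text's rule only at exact ties.)"""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1
    return math.copysign(round(abs(y) / 10.0 ** n, k) * 10.0 ** n, y)

def op(a: float, b: float, symbol: str, k: int = 5) -> float:
    """Finite-digit arithmetic: exact operation on fl(a), fl(b), then fl the result."""
    exact = {"+": lambda u, v: u + v, "-": lambda u, v: u - v,
             "*": lambda u, v: u * v, "/": lambda u, v: u / v}[symbol]
    return fl(exact(fl(a, k), fl(b, k)), k)

print(op(math.pi, math.sqrt(2), "+"))  # ≈ 4.5558, matching the Maple result above
```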

Example 3   Suppose that x = 5/7 and y = 1/3. Use five-digit chopping for calculating x + y, x − y,
x × y, and x ÷ y.

Solution   Note that

        x = 5/7 = 0.714285…   and   y = 1/3 = 0.333333…

implies that the five-digit chopping values of x and y are

        fl(x) = 0.71428 × 10^0   and   fl(y) = 0.33333 × 10^0.

Thus

        x ⊕ y = fl(fl(x) + fl(y)) = fl(0.71428 × 10^0 + 0.33333 × 10^0)
              = fl(1.04761 × 10^0) = 0.10476 × 10^1.

The true value is x + y = 5/7 + 1/3 = 22/21, so we have

        Absolute Error = |22/21 − 0.10476 × 10^1| = 0.190 × 10^(−4)

and

        Relative Error = (0.190 × 10^(−4)) / (22/21) = 0.182 × 10^(−4).

Table 1.2 lists the values of this and the other calculations.

Table 1.2
        Operation   Result           Actual value   Absolute error    Relative error
        x ⊕ y       0.10476 × 10^1   22/21          0.190 × 10^(−4)   0.182 × 10^(−4)
        x ⊖ y       0.38095 × 10^0   8/21           0.238 × 10^(−5)   0.625 × 10^(−5)
        x ⊗ y       0.23809 × 10^0   5/21           0.524 × 10^(−5)   0.220 × 10^(−4)
        x ⊘ y       0.21428 × 10^1   15/7           0.571 × 10^(−4)   0.267 × 10^(−4)
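The first row of Table 1.2 can be reproduced with a small chopping helper (the names are
ours, and the helper itself runs in binary, so boundary mantissas could in principle be
perturbed):

```python
import math

def chop(y: float, k: int = 5) -> float:
    """Five-digit chopping: keep the first k digits of the normalized mantissa."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1
    m = math.floor(abs(y) / 10.0 ** n * 10 ** k) / 10 ** k
    return math.copysign(m * 10.0 ** n, y)

x, y = 5 / 7, 1 / 3
s = chop(chop(x) + chop(y))   # x (+) y: chop inputs, add exactly, chop the sum
print(s)                      # ≈ 1.0476, i.e. 0.10476 × 10^1
print(abs(22 / 21 - s))       # absolute error ≈ 1.90e-5, as in the table
```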

The maximum relative error for the operations in Example 3 is 0.267 × 10^(−4), so the
arithmetic produces satisfactory five-digit results. This is not the case in the following
example.

Example 4   Suppose that in addition to x = 5/7 and y = 1/3 we have

        u = 0.714251,   v = 98765.9,   and   w = 0.111111 × 10^(−4),

so that

        fl(u) = 0.71425 × 10^0,   fl(v) = 0.98765 × 10^5,   and   fl(w) = 0.11111 × 10^(−4).

Determine the five-digit chopping values of x ⊖ u, (x ⊖ u) ⊘ w, (x ⊖ u) ⊗ v, and u ⊕ v.


Solution   These numbers were chosen to illustrate some problems that can arise with
finite-digit arithmetic. Because x and u are nearly the same, their difference is small. The
absolute error for x ⊖ u is

        |(x − u) − (x ⊖ u)| = |(x − u) − (fl(fl(x) − fl(u)))|
                            = |(5/7 − 0.714251) − fl(0.71428 × 10^0 − 0.71425 × 10^0)|
                            = |0.347143 × 10^(−4) − fl(0.00003 × 10^0)| = 0.47143 × 10^(−5).

This approximation has a small absolute error, but a large relative error

        (0.47143 × 10^(−5)) / (0.347143 × 10^(−4)) ≤ 0.136.

The subsequent division by the small number w or multiplication by the large number v
magnifies the absolute error without modifying the relative error. The addition of the large
and small numbers u and v produces large absolute error but not large relative error. These
calculations are shown in Table 1.3.

Table 1.3
        Operation     Result              Actual value        Absolute error    Relative error
        x ⊖ u         0.30000 × 10^(−4)   0.34714 × 10^(−4)   0.471 × 10^(−5)   0.136
        (x ⊖ u) ⊘ w   0.27000 × 10^1      0.31242 × 10^1      0.424             0.136
        (x ⊖ u) ⊗ v   0.29629 × 10^1      0.34285 × 10^1      0.465             0.136
        u ⊕ v         0.98765 × 10^5      0.98766 × 10^5      0.161 × 10^1      0.163 × 10^(−4)

One of the most common error-producing calculations involves the cancelation of
significant digits due to the subtraction of nearly equal numbers. Suppose two nearly equal
numbers x and y, with x > y, have the k-digit representations

        fl(x) = 0.d1 d2 . . . dp αp+1 αp+2 . . . αk × 10^n,

and

        fl(y) = 0.d1 d2 . . . dp βp+1 βp+2 . . . βk × 10^n.

The floating-point form of x − y is

        fl(fl(x) − fl(y)) = 0.σp+1 σp+2 . . . σk × 10^(n−p),

where

        0.σp+1 σp+2 . . . σk = 0.αp+1 αp+2 . . . αk − 0.βp+1 βp+2 . . . βk.

The floating-point number used to represent x − y has at most k − p digits of significance.
However, in most calculation devices, x − y will be assigned k digits, with the last p being
either zero or randomly assigned. Any further calculations involving x − y retain the problem
of having only k − p digits of significance, since a chain of calculations is no more accurate
than its weakest portion.
If a finite-digit representation or calculation introduces an error, further enlargement of
the error occurs when dividing by a number with small magnitude (or, equivalently, when


multiplying by a number with large magnitude). Suppose, for example, that the number z
has the finite-digit approximation z + δ, where the error δ is introduced by representation
or by previous calculation. Now divide by ε = 10^(−n), where n > 0. Then

        z/ε ≈ fl(fl(z)/fl(ε)) = (z + δ) × 10^n.

The absolute error in this approximation, |δ| × 10^n, is the original absolute error, |δ|,
multiplied by the factor 10^n.
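A two-line check of this magnification effect, with values chosen by us for illustration
rather than taken from the text:

```python
# A hypothetical error delta in z is magnified by 10^n when dividing by eps = 10^-n.
z, delta = 1.0, 1e-6     # true value and its representation error
eps = 1e-4               # small divisor: 10^-n with n = 4
approx = (z + delta) / eps
exact = z / eps
print(abs(approx - exact))   # ≈ 0.01 = |delta| × 10^4
```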

Example 5   Let p = 0.54617 and q = 0.54601. Use four-digit arithmetic to approximate p − q and
determine the absolute and relative errors using (a) rounding and (b) chopping.

Solution   The exact value of r = p − q is r = 0.00016.

(a) Suppose the subtraction is performed using four-digit rounding arithmetic. Rounding
p and q to four digits gives p∗ = 0.5462 and q∗ = 0.5460, respectively, and
r∗ = p∗ − q∗ = 0.0002 is the four-digit approximation to r. Since

        |r − r∗| / |r| = |0.00016 − 0.0002| / |0.00016| = 0.25,

the result has only one significant digit, whereas p∗ and q∗ were accurate to four
and five significant digits, respectively.

(b) If chopping is used to obtain the four digits, the four-digit approximations to p, q,
and r are p∗ = 0.5461, q∗ = 0.5460, and r∗ = p∗ − q∗ = 0.0001. This gives

        |r − r∗| / |r| = |0.00016 − 0.0001| / |0.00016| = 0.375,

which also results in only one significant digit of accuracy.

The loss of accuracy due to round-off error can often be avoided by a reformulation of
the calculations, as illustrated in the next example.

Illustration   The quadratic formula states that the roots of ax² + bx + c = 0, when a ≠ 0, are

        x1 = (−b + √(b² − 4ac)) / (2a)   and   x2 = (−b − √(b² − 4ac)) / (2a).   (1.1)

The roots x1 and x2 of a general quadratic equation are related to the coefficients by the
fact that x1 + x2 = −b/a and x1 x2 = c/a. This is a special case of Vièta's Formulas for
the coefficients of polynomials.

Consider this formula applied to the equation x² + 62.10x + 1 = 0, whose roots are
approximately

        x1 = −0.01610723   and   x2 = −62.08390.

We will again use four-digit rounding arithmetic in the calculations to determine the roots.
In this equation, b² is much larger than 4ac, so the numerator in the calculation for x1
involves the subtraction of nearly equal numbers. Because

        √(b² − 4ac) = √((62.10)² − (4.000)(1.000)(1.000)) = √(3856. − 4.000) = √3852. = 62.06,

we have

        fl(x1) = (−62.10 + 62.06)/2.000 = −0.04000/2.000 = −0.02000,


a poor approximation to x1 = −0.01611, with the large relative error

        |−0.01611 + 0.02000| / |−0.01611| ≈ 2.4 × 10^(−1).

On the other hand, the calculation for x2 involves the addition of the nearly equal numbers
−b and −√(b² − 4ac). This presents no problem since

        fl(x2) = (−62.10 − 62.06)/2.000 = −124.2/2.000 = −62.10

has the small relative error

        |−62.08 + 62.10| / |−62.08| ≈ 3.2 × 10^(−4).
To obtain a more accurate four-digit rounding approximation for x1, we change the form of
the quadratic formula by rationalizing the numerator:

        x1 = ((−b + √(b² − 4ac)) / (2a)) · ((−b − √(b² − 4ac)) / (−b − √(b² − 4ac)))
           = (b² − (b² − 4ac)) / (2a(−b − √(b² − 4ac))),

which simplifies to an alternate quadratic formula

        x1 = −2c / (b + √(b² − 4ac)).   (1.2)

Using (1.2) gives

        fl(x1) = −2.000/(62.10 + 62.06) = −2.000/124.2 = −0.01610,

which has the small relative error 6.2 × 10^(−4).

The rationalization technique can also be applied to give the following alternative quadratic
formula for x2:

        x2 = −2c / (b − √(b² − 4ac)).   (1.3)

This is the form to use if b is a negative number. In the Illustration, however, the mistaken use
of this formula for x2 would result in not only the subtraction of nearly equal numbers, but
also the division by the small result of this subtraction. The inaccuracy that this combination
produces,

        fl(x2) = −2c / (b − √(b² − 4ac)) = −2.000/(62.10 − 62.06) = −2.000/0.04000 = −50.00,

has the large relative error 1.9 × 10^(−1).

• The lesson: Think before you compute!
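That lesson can be packaged as a sketch of a cancellation-avoiding quadratic solver: for
each root, choose between the standard formula (1.1) and the rationalized forms (1.2)/(1.3)
so that −b and the radical are never subtracted. The function name and the sign test are
ours; real roots and a ≠ 0 are assumed.

```python
import math

def stable_roots(a: float, b: float, c: float):
    """Roots of ax^2 + bx + c = 0, avoiding subtraction of nearly equal numbers.
    Assumes a != 0 and b^2 >= 4ac (real roots)."""
    disc = math.sqrt(b * b - 4 * a * c)
    if b >= 0:
        x1 = -2 * c / (b + disc)        # rationalized form (1.2)
        x2 = (-b - disc) / (2 * a)      # standard form: same-sign addition
    else:
        x1 = (-b + disc) / (2 * a)      # standard form: same-sign addition
        x2 = -2 * c / (b - disc)        # rationalized form (1.3)
    return x1, x2

x1, x2 = stable_roots(1.0, 62.10, 1.0)
print(x1, x2)   # ≈ -0.01610723 and ≈ -62.08389, as in the Illustration
```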

Nested Arithmetic
Accuracy loss due to round-off error can also be reduced by rearranging calculations, as
shown in the next example.

Example 6   Evaluate f(x) = x³ − 6.1x² + 3.2x + 1.5 at x = 4.71 using three-digit arithmetic.
Solution Table 1.4 gives the intermediate results in the calculations.


Table 1.4
                                  x      x²        x³           6.1x²       3.2x
        Exact                     4.71   22.1841   104.487111   135.32301   15.072
        Three-digit (chopping)    4.71   22.1      104.         134.        15.0
        Three-digit (rounding)    4.71   22.2      105.         135.        15.1

To illustrate the calculations, let us look at those involved with finding x³ using
three-digit rounding arithmetic. First we find

        x² = 4.71² = 22.1841,   which rounds to 22.2.

Then we use this value of x² to find

        x³ = x² · x = 22.2 · 4.71 = 104.562,   which rounds to 105.

Also,

        6.1x² = 6.1(22.2) = 135.42,   which rounds to 135.,

and

        3.2x = 3.2(4.71) = 15.072,   which rounds to 15.1.

The exact result of the evaluation is

        Exact: f(4.71) = 104.487111 − 135.32301 + 15.072 + 1.5 = −14.263899.

Using finite-digit arithmetic, the way in which we add the results can affect the final result.
Suppose that we add left to right. Then for chopping arithmetic we have

        Three-digit (chopping): f(4.71) = ((104. − 134.) + 15.0) + 1.5 = −13.5,

and for rounding arithmetic we have

        Three-digit (rounding): f(4.71) = ((105. − 135.) + 15.1) + 1.5 = −13.4.

(You should carefully verify these results to be sure that your notion of finite-digit arithmetic
is correct.) Note that the three-digit chopping values simply retain the leading three digits,
with no rounding involved, and differ significantly from the three-digit rounding values.
The relative errors for the three-digit methods are

        Chopping: |(−14.263899 + 13.5) / (−14.263899)| ≈ 0.05,   and
        Rounding: |(−14.263899 + 13.4) / (−14.263899)| ≈ 0.06.

Illustration   As an alternative approach, the polynomial f(x) in Example 6 can be written in a nested
manner as

        f(x) = x³ − 6.1x² + 3.2x + 1.5 = ((x − 6.1)x + 3.2)x + 1.5.

Remember that chopping (or rounding) is performed after each calculation.

Using three-digit chopping arithmetic now produces

        f(4.71) = ((4.71 − 6.1)4.71 + 3.2)4.71 + 1.5 = ((−1.39)(4.71) + 3.2)4.71 + 1.5
                = (−6.54 + 3.2)4.71 + 1.5 = (−3.34)4.71 + 1.5 = −15.7 + 1.5 = −14.2.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
28 CHAPTER 1 Mathematical Preliminaries and Error Analysis

In a similar manner, we now obtain a three-digit rounding answer of −14.3. The new relative
errors are

        Three-digit (chopping): |(−14.263899 + 14.2) / (−14.263899)| ≈ 0.0045;

        Three-digit (rounding): |(−14.263899 + 14.3) / (−14.263899)| ≈ 0.0025.

Nesting has reduced the relative error for the chopping approximation to less than 10%
of that obtained initially. For the rounding approximation the improvement has been even
more dramatic; the error in this case has been reduced by more than 95%.

Polynomials should always be expressed in nested form before performing an evaluation,
because this form minimizes the number of arithmetic calculations. The decreased
error in the Illustration is due to the reduction in computations from four multiplications
and three additions to two multiplications and three additions. One way to reduce round-off
error is to reduce the number of computations.
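Nested (Horner) evaluation is a one-loop algorithm. A sketch (the function name is ours;
coefficients are listed from the highest power down):

```python
def horner(coeffs, x: float) -> float:
    """Evaluate a polynomial in nested (Horner) form: one multiply and one add
    per coefficient. coeffs lists coefficients from the highest power down."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

# f(x) = x^3 - 6.1x^2 + 3.2x + 1.5 from Example 6, at full double precision
# (no three-digit chopping here, so the value is essentially the exact one).
print(horner([1.0, -6.1, 3.2, 1.5], 4.71))   # ≈ -14.263899
```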

E X E R C I S E S E T 1.2
1. Compute the absolute error and relative error in approximations of p by p∗.
   a. p = π, p∗ = 22/7                  b. p = π, p∗ = 3.1416
   c. p = e, p∗ = 2.718                 d. p = √2, p∗ = 1.414
   e. p = e^10, p∗ = 22000              f. p = 10^π, p∗ = 1400
   g. p = 8!, p∗ = 39900                h. p = 9!, p∗ = √(18π) (9/e)^9
2. Find the largest interval in which p∗ must lie to approximate p with relative error at most 10^(−4) for
   each value of p.
   a. π          b. e
   c. √2         d. ³√7
3. Suppose p∗ must approximate p with relative error at most 10^(−3). Find the largest interval in which
   p∗ must lie for each value of p.
   a. 150        b. 900
   c. 1500       d. 90
4. Perform the following computations (i) exactly, (ii) using three-digit chopping arithmetic, and (iii)
   using three-digit rounding arithmetic. (iv) Compute the relative errors in parts (ii) and (iii).
   a. 4/5 + 1/3                   b. (4/5) · (1/3)
   c. (1/3 − 3/11) + 3/20         d. (1/3 + 3/11) − 3/20
5. Use three-digit rounding arithmetic to perform the following calculations. Compute the absolute error
   and relative error with the exact value determined to at least five digits.
   a. 133 + 0.921                    b. 133 − 0.499
   c. (121 − 0.327) − 119            d. (121 − 119) − 0.327
   e. (13/14 − 6/7) / (2e − 5.4)     f. −10π + 6e − 3/62
   g. (2/9) · (9/7)                  h. (π − 22/7) / (1/17)
6. Repeat Exercise 5 using four-digit rounding arithmetic.
7. Repeat Exercise 5 using three-digit chopping arithmetic.
8. Repeat Exercise 5 using four-digit chopping arithmetic.
