
1.2 Round-off Errors and Computer Arithmetic 17

gives the probability that any one of a series of trials will lie within x units of the mean, assuming that
the trials have a normal distribution with mean 0 and standard deviation √2/2. This integral cannot
be evaluated in terms of elementary functions, so an approximating technique must be used.
a. Integrate the Maclaurin series for e^(−x²) to show that

        erf(x) = (2/√π) ∑_{k=0}^∞ (−1)^k x^(2k+1) / ((2k+1) k!).

b. The error function can also be expressed in the form

        erf(x) = (2/√π) e^(−x²) ∑_{k=0}^∞ 2^k x^(2k+1) / (1 · 3 · 5 · · · (2k+1)).

   Verify that the two series agree for k = 1, 2, 3, and 4. [Hint: Use the Maclaurin series for e^(−x²).]

c. Use the series in part (a) to approximate erf(1) to within 10^(−7).


d. Use the same number of terms as in part (c) to approximate erf(1) with the series in part (b).
e. Explain why difficulties occur using the series in part (b) to approximate erf(x).
27. A function f : [a, b] → R is said to satisfy a Lipschitz condition with Lipschitz constant L on [a, b]
if, for every x, y ∈ [a, b], we have |f (x) − f (y)| ≤ L|x − y|.
a. Show that if f satisfies a Lipschitz condition with Lipschitz constant L on an interval [a, b], then
f ∈ C[a, b].
b. Show that if f has a derivative that is bounded on [a, b] by L, then f satisfies a Lipschitz condition
with Lipschitz constant L on [a, b].
c. Give an example of a function that is continuous on a closed interval but does not satisfy a
Lipschitz condition on the interval.
28. Suppose f ∈ C[a, b] and that x1 and x2 are in [a, b].
a. Show that a number ξ exists between x1 and x2 with

        f(ξ) = (f(x1) + f(x2))/2 = (1/2) f(x1) + (1/2) f(x2).

b. Suppose that c1 and c2 are positive constants. Show that a number ξ exists between x1 and x2
with

        f(ξ) = (c1 f(x1) + c2 f(x2)) / (c1 + c2).

c. Give an example to show that the result in part b. does not necessarily hold when c1 and c2 have
opposite signs with c1 ≠ −c2.
29. Let f ∈ C[a, b], and let p be in the open interval (a, b).
a. Suppose f(p) ≠ 0. Show that a δ > 0 exists with f(x) ≠ 0, for all x in [p − δ, p + δ], with
[p − δ, p + δ] a subset of [a, b].
b. Suppose f (p) = 0 and k > 0 is given. Show that a δ > 0 exists with |f (x)| ≤ k, for all x in
[p − δ, p + δ], with [p − δ, p + δ] a subset of [a, b].

1.2 Round-off Errors and Computer Arithmetic


The arithmetic performed by a calculator or computer is different from the arithmetic in
algebra and calculus courses. You would likely expect that we always have as true statements
things such as 2 + 2 = 4, 4 · 8 = 32, and (√3)² = 3. However, with computer arithmetic we
expect exact results for 2 + 2 = 4 and 4 · 8 = 32, but we will not have precisely (√3)² = 3.
To understand why this is true we must explore the world of finite-digit arithmetic.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
18 CHAPTER 1 Mathematical Preliminaries and Error Analysis

In our traditional mathematical world we permit numbers with an infinite number of
digits. The arithmetic we use in this world defines √3 as that unique positive number that
when multiplied by itself produces the integer 3. In the computational world, however, each
representable number has only a fixed and finite number of digits. This means, for example,
that only rational numbers—and not even all of these—can be represented exactly. Since
√3 is not rational, it is given an approximate representation, one whose square will not
be precisely 3, although it will likely be sufficiently close to 3 to be acceptable in most
situations. In most cases, then, this machine arithmetic is satisfactory and passes without
notice or concern, but at times problems arise because of this discrepancy.
Error due to rounding should be expected whenever computations are performed using
numbers that are not powers of 2. Keeping this error under control is extremely important
when the number of calculations is large.

The error that is produced when a calculator or computer is used to perform real-number
calculations is called round-off error. It occurs because the arithmetic performed in a
machine involves numbers with only a finite number of digits, with the result that
calculations are performed with only approximate representations of the actual numbers.
In a computer, only a relatively small subset of the real number system is used for the
representation of all the real numbers. This subset contains only rational numbers, both
positive and negative, and stores the fractional part, together with an exponential part.

Binary Machine Numbers


In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called
Binary Floating Point Arithmetic Standard 754–1985. An updated version was published
in 2008 as IEEE 754-2008. This provides standards for binary and decimal floating point
numbers, formats for data interchange, algorithms for rounding arithmetic operations, and
for the handling of exceptions. Formats are specified for single, double, and extended
precisions, and these standards are generally followed by all microcomputer manufacturers
using floating-point hardware.
A 64-bit (binary digit) representation is used for a real number. The first bit is a sign
indicator, denoted s. This is followed by an 11-bit exponent, c, called the characteristic,
and a 52-bit binary fraction, f , called the mantissa. The base for the exponent is 2.
Since 52 binary digits correspond to between 16 and 17 decimal digits, we can assume
that a number represented in this system has at least 16 decimal digits of precision. The
exponent of 11 binary digits gives a range of 0 to 2^11 − 1 = 2047. However, using only
positive integers for the exponent would not permit an adequate representation of numbers
with small magnitude. To ensure that numbers with small magnitude are equally representable,
1023 is subtracted from the characteristic, so the range of the exponent is actually from
−1023 to 1024.
To save storage and provide a unique representation for each floating-point number, a
normalization is imposed. Using this system gives a floating-point number of the form

        (−1)^s 2^(c−1023) (1 + f).

Illustration Consider the machine number

0 10000000011 1011100100010000000000000000000000000000000000000000.

The leftmost bit is s = 0, which indicates that the number is positive. The next 11 bits,
10000000011, give the characteristic and are equivalent to the decimal number

        c = 1 · 2^10 + 0 · 2^9 + · · · + 0 · 2^2 + 1 · 2^1 + 1 · 2^0 = 1024 + 2 + 1 = 1027.


The exponential part of the number is, therefore, 2^(1027−1023) = 2^4. The final 52 bits specify
that the mantissa is

        f = 1 · (1/2)^1 + 1 · (1/2)^3 + 1 · (1/2)^4 + 1 · (1/2)^5 + 1 · (1/2)^8 + 1 · (1/2)^12.

As a consequence, this machine number precisely represents the decimal number

        (−1)^s 2^(c−1023) (1 + f) = (−1)^0 · 2^(1027−1023) (1 + 1/2 + 1/8 + 1/16 + 1/32 + 1/256 + 1/4096)
                                  = 27.56640625.

However, the next smallest machine number is

0 10000000011 1011100100001111111111111111111111111111111111111111,

and the next largest machine number is

0 10000000011 1011100100010000000000000000000000000000000000000001.

This means that our original machine number represents not only 27.56640625, but also half
of the real numbers that are between 27.56640625 and the next smallest machine number,
as well as half the numbers between 27.56640625 and the next largest machine number. To
be precise, it represents any real number in the interval

[27.5664062499999982236431605997495353221893310546875,
27.5664062500000017763568394002504646778106689453125).
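The decoding steps in this Illustration are easy to check in code. The sketch below (the
function name decode_double is ours, not from the text) reads a 64-bit string in the
sign–characteristic–mantissa layout and evaluates (−1)^s 2^(c−1023)(1 + f); it handles only
normalized numbers, ignoring the special cases c = 0 and c = 2047.

```python
def decode_double(bits: str) -> float:
    """Decode a 64-bit string (sign | 11-bit characteristic | 52-bit mantissa)
    as the normalized value (-1)^s * 2^(c-1023) * (1 + f).
    Special cases c = 0 (subnormals/zero) and c = 2047 (inf/NaN) are ignored."""
    bits = bits.replace(" ", "")
    assert len(bits) == 64
    s = int(bits[0])                      # sign bit
    c = int(bits[1:12], 2)                # characteristic as an integer
    f = sum(int(b) * 2.0 ** -(i + 1)      # binary fraction: bit i weighs 2^-(i+1)
            for i, b in enumerate(bits[12:]))
    return (-1) ** s * 2.0 ** (c - 1023) * (1 + f)

word = "0 10000000011 1011100100010000000000000000000000000000000000000000"
print(decode_double(word))  # 27.56640625
```

Running this on the machine number above reproduces 27.56640625 exactly, since both the
mantissa sum and the product by 2^4 are exact in binary.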

The smallest normalized positive number that can be represented has s = 0, c = 1, and
f = 0 and is equivalent to

        2^(−1022) · (1 + 0) ≈ 0.22251 × 10^(−307),

and the largest has s = 0, c = 2046, and f = 1 − 2^(−52) and is equivalent to

        2^1023 · (2 − 2^(−52)) ≈ 0.17977 × 10^309.

Numbers occurring in calculations that have a magnitude less than

        2^(−1022) · (1 + 0)

result in underflow and are generally set to zero. Numbers greater than

        2^1023 · (2 − 2^(−52))

result in overflow and typically cause the computations to stop (unless the program has
been designed to detect this occurrence). Note that there are two representations for the
number zero: a positive 0 when s = 0, c = 0, and f = 0, and a negative 0 when s = 1,
c = 0, and f = 0.
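Python floats are IEEE 754 double-precision numbers, so these two extreme values can be
checked directly against the sys.float_info record:

```python
import sys

# Largest normalized double: 2^1023 * (2 - 2^-52); both factors are exact in binary.
largest = 2.0 ** 1023 * (2 - 2.0 ** -52)
# Smallest normalized positive double: 2^-1022.
smallest = 2.0 ** -1022

print(largest == sys.float_info.max)   # True
print(smallest == sys.float_info.min)  # True
print(largest)    # ≈ 1.7977e308, i.e. 0.17977 × 10^309
print(smallest)   # ≈ 2.2251e-308, i.e. 0.22251 × 10^-307
```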


Decimal Machine Numbers


The use of binary digits tends to conceal the computational difficulties that occur when a
finite collection of machine numbers is used to represent all the real numbers. To examine
these problems, we will use more familiar decimal numbers instead of binary representation.
Specifically, we assume that machine numbers are represented in the normalized decimal
floating-point form
        ±0.d1 d2 . . . dk × 10^n,   1 ≤ d1 ≤ 9, and 0 ≤ di ≤ 9,
for each i = 2, . . . , k. Numbers of this form are called k-digit decimal machine numbers.
Any positive real number within the numerical range of the machine can be normalized
to the form
        y = 0.d1 d2 . . . dk dk+1 dk+2 . . . × 10^n.
The error that results from replacing a number with its floating-point form is called
round-off error regardless of whether the rounding or chopping method is used.

The floating-point form of y, denoted fl(y), is obtained by terminating the mantissa of
y at k decimal digits. There are two common ways of performing this termination. One
method, called chopping, is to simply chop off the digits dk+1 dk+2 . . . . This produces the
floating-point form

        fl(y) = 0.d1 d2 . . . dk × 10^n.

The other method, called rounding, adds 5 × 10^(n−(k+1)) to y and then chops the result to
obtain a number of the form

        fl(y) = 0.δ1 δ2 . . . δk × 10^n.

For rounding, when dk+1 ≥ 5, we add 1 to dk to obtain fl(y); that is, we round up. When
dk+1 < 5, we simply chop off all but the first k digits; so we round down. If we round down,
then δi = di, for each i = 1, 2, . . . , k. However, if we round up, the digits (and even the
exponent) might change.
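Both terminations can be sketched in a few lines. The helper names below are ours, and one
caveat applies: the helpers themselves run in binary floating point, so a mantissa sitting
exactly on a digit boundary can occasionally be perturbed.

```python
import math

def fl_chop(y: float, k: int) -> float:
    """k-digit chopping: normalize to 0.d1...dk x 10^n and drop digits past k."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1   # exponent of the 0.d1d2... form
    mantissa = abs(y) / 10.0 ** n            # lies in [0.1, 1)
    chopped = math.floor(mantissa * 10 ** k) / 10 ** k
    return math.copysign(chopped * 10.0 ** n, y)

def fl_round(y: float, k: int) -> float:
    """k-digit rounding: add 5 x 10^(n-(k+1)) to y, then chop (as in the text)."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1
    return math.copysign(fl_chop(abs(y) + 5 * 10.0 ** (n - (k + 1)), k), y)

print(fl_chop(3.14159265, 5))   # ≈ 3.1415 (Example 1a)
print(fl_round(3.14159265, 5))  # ≈ 3.1416 (Example 1b)
```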

Example 1   Determine the five-digit (a) chopping and (b) rounding values of the irrational number π.

Solution   The number π has an infinite decimal expansion of the form π = 3.14159265. . . .
Written in normalized decimal form, we have

        π = 0.314159265 . . . × 10^1.

(a) The floating-point form of π using five-digit chopping is

        fl(π) = 0.31415 × 10^1 = 3.1415.

(b) The sixth digit of the decimal expansion of π is a 9, so the floating-point form of
π using five-digit rounding is

        fl(π) = (0.31415 + 0.00001) × 10^1 = 3.1416.

The relative error is generally a better measure of accuracy than the absolute error
because it takes into consideration the size of the number being approximated.

The following definition describes two methods for measuring approximation errors.

Definition 1.15   Suppose that p∗ is an approximation to p. The absolute error is |p − p∗|, and the relative
error is |p − p∗| / |p|, provided that p ≠ 0.

Consider the absolute and relative errors in representing p by p∗ in the following example.


Example 2   Determine the absolute and relative errors when approximating p by p∗ when
(a) p = 0.3000 × 10^1 and p∗ = 0.3100 × 10^1;
(b) p = 0.3000 × 10^(−3) and p∗ = 0.3100 × 10^(−3);
(c) p = 0.3000 × 10^4 and p∗ = 0.3100 × 10^4.

Solution

(a) For p = 0.3000 × 10^1 and p∗ = 0.3100 × 10^1 the absolute error is 0.1, and the
relative error is 0.3333 × 10^(−1).
(b) For p = 0.3000 × 10^(−3) and p∗ = 0.3100 × 10^(−3) the absolute error is 0.1 × 10^(−4),
and the relative error is 0.3333 × 10^(−1).
(c) For p = 0.3000 × 10^4 and p∗ = 0.3100 × 10^4, the absolute error is 0.1 × 10^3, and
the relative error is again 0.3333 × 10^(−1).

We often cannot find an accurate value for the true error in an approximation. Instead
we find a bound for the error, which gives us a "worst-case" error.

This example shows that the same relative error, 0.3333 × 10^(−1), occurs for widely varying
absolute errors. As a measure of accuracy, the absolute error can be misleading and the
relative error more meaningful, because the relative error takes into consideration the size
of the value.
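Definition 1.15 translates directly into code. A sketch reproducing the three cases of
Example 2 (the helper name is ours):

```python
def abs_rel_errors(p: float, p_star: float):
    """Absolute and relative errors of Definition 1.15 (requires p != 0)."""
    abs_err = abs(p - p_star)
    return abs_err, abs_err / abs(p)

cases = [(0.3000e1, 0.3100e1), (0.3000e-3, 0.3100e-3), (0.3000e4, 0.3100e4)]
for p, ps in cases:
    a, r = abs_rel_errors(p, ps)
    print(f"abs = {a:.4e}, rel = {r:.4e}")
```

The absolute errors span eight orders of magnitude, while the relative error is
≈ 3.3333 × 10^(−2) in every case, exactly as the example shows.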

The following definition uses relative error to give a measure of significant digits of
accuracy for an approximation.

The term significant digits is often used to loosely describe the number of decimal digits
that appear to be accurate. The definition is more precise, and provides a continuous concept.

Definition 1.16   The number p∗ is said to approximate p to t significant digits (or figures) if t is the largest
nonnegative integer for which

        |p − p∗| / |p| ≤ 5 × 10^(−t).

Table 1.1 illustrates the continuous nature of significant digits by listing, for the various
values of p, the least upper bound of |p − p∗|, denoted max |p − p∗|, when p∗ agrees with p
to four significant digits.

Table 1.1
        p                0.1       0.5       100    1000   5000   9990    10000
        max |p − p∗|     0.00005   0.00025   0.05   0.5    2.5    4.995   5.
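Definition 1.16 can be evaluated mechanically: increase t while the relative error stays
within 5 × 10^(−t). A sketch (the function name is ours; the cap guards against an exact
match p∗ = p looping forever):

```python
def significant_digits(p: float, p_star: float, cap: int = 16) -> int:
    """Largest nonnegative t with |p - p*|/|p| <= 5 * 10^-t (Definition 1.16),
    capped at `cap` since doubles carry only ~16 decimal digits."""
    rel = abs(p - p_star) / abs(p)
    t = 0
    while t < cap and rel <= 5 * 10.0 ** -(t + 1):
        t += 1
    return t

print(significant_digits(100, 99.99))   # 4
print(significant_digits(1000, 996))    # 3
```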

Returning to the machine representation of numbers, we see that the floating-point
representation fl(y) for the number y has the relative error

        |(y − fl(y)) / y|.

If k decimal digits and chopping are used for the machine representation of

        y = 0.d1 d2 . . . dk dk+1 . . . × 10^n,


then

        |(y − fl(y))/y| = |(0.d1 d2 . . . dk dk+1 . . . × 10^n − 0.d1 d2 . . . dk × 10^n) / (0.d1 d2 . . . × 10^n)|
                        = |(0.dk+1 dk+2 . . . × 10^(n−k)) / (0.d1 d2 . . . × 10^n)|
                        = |0.dk+1 dk+2 . . . / 0.d1 d2 . . .| × 10^(−k).

Since d1 ≠ 0, the minimal value of the denominator is 0.1. The numerator is bounded above
by 1. As a consequence,

        |(y − fl(y))/y| ≤ (1/0.1) × 10^(−k) = 10^(−k+1).

In a similar manner, a bound for the relative error when using k-digit rounding arithmetic
is 0.5 × 10^(−k+1). (See Exercise 24.)
Note that the bounds for the relative error using k-digit arithmetic are independent of the
number being represented. This result is due to the manner in which the machine numbers
are distributed along the real line. Because of the exponential form of the characteristic,
the same number of decimal machine numbers is used to represent each of the intervals
[0.1, 1], [1, 10], and [10, 100]. In fact, within the limits of the machine, the number of
decimal machine numbers in [10^n, 10^(n+1)] is constant for all integers n.

Finite-Digit Arithmetic
In addition to inaccurate representation of numbers, the arithmetic performed in a computer
is not exact. The arithmetic involves manipulating binary digits by various shifting, or
logical, operations. Since the actual mechanics of these operations are not pertinent to this
presentation, we shall devise our own approximation to computer arithmetic. Although our
arithmetic will not give the exact picture, it suffices to explain the problems that occur. (For
an explanation of the manipulations actually involved, the reader is urged to consult more
technically oriented computer science texts, such as [Ma], Computer System Architecture.)
Assume that the floating-point representations fl(x) and fl(y) are given for the real
numbers x and y and that the symbols ⊕, ⊖, ⊗, ⊘ represent machine addition, subtraction,
multiplication, and division operations, respectively. We will assume a finite-digit arithmetic
given by

        x ⊕ y = fl(fl(x) + fl(y)),    x ⊗ y = fl(fl(x) × fl(y)),
        x ⊖ y = fl(fl(x) − fl(y)),    x ⊘ y = fl(fl(x) ÷ fl(y)).

This arithmetic corresponds to performing exact arithmetic on the floating-point repre-


sentations of x and y and then converting the exact result to its finite-digit floating-point
representation.
Rounding arithmetic is easily implemented in Maple. For example, the command

        Digits := 5

causes all arithmetic to be rounded to 5 digits. To ensure that Maple uses approximate rather
than exact arithmetic we use the evalf command. For example, if x = π and y = √2, then

        evalf(x); evalf(y)

produces 3.1416 and 1.4142, respectively. Then fl(fl(x) + fl(y)) is performed using
5-digit rounding arithmetic with

        evalf(evalf(x) + evalf(y))


which gives 4.5558. Implementing finite-digit chopping arithmetic is more difficult and
requires a sequence of steps or a procedure. Exercise 27 explores this problem.
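In place of Maple, the same fl(fl(x) + fl(y)) arithmetic can be sketched in Python. The
helper names are ours, and note one assumption: Python's built-in round uses
round-half-to-even, which differs from the text's add-then-chop rounding only at exact ties,
so irrational inputs like π and √2 are unaffected.

```python
import math

def fl(y: float, k: int = 5) -> float:
    """k-digit rounding: normalize to 0.d1...dk x 10^n, round the mantissa, rebuild.
    (round() is half-to-even, differing from the text's rule only at exact ties.)"""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1
    return math.copysign(round(abs(y) / 10.0 ** n, k) * 10.0 ** n, y)

def op(a: float, b: float, symbol: str, k: int = 5) -> float:
    """Finite-digit arithmetic: exact operation on fl(a), fl(b), then fl the result."""
    exact = {"+": lambda u, v: u + v, "-": lambda u, v: u - v,
             "*": lambda u, v: u * v, "/": lambda u, v: u / v}[symbol]
    return fl(exact(fl(a, k), fl(b, k)), k)

print(op(math.pi, math.sqrt(2), "+"))  # ≈ 4.5558, matching the Maple result above
```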

Example 3   Suppose that x = 5/7 and y = 1/3. Use five-digit chopping for calculating x + y, x − y,
x × y, and x ÷ y.

Solution   Note that

        x = 5/7 = 0.714285…   and   y = 1/3 = 0.333333…

implies that the five-digit chopping values of x and y are

        fl(x) = 0.71428 × 10^0   and   fl(y) = 0.33333 × 10^0.

Thus

        x ⊕ y = fl(fl(x) + fl(y)) = fl(0.71428 × 10^0 + 0.33333 × 10^0)
              = fl(1.04761 × 10^0) = 0.10476 × 10^1.

The true value is x + y = 5/7 + 1/3 = 22/21, so we have

        Absolute Error = |22/21 − 0.10476 × 10^1| = 0.190 × 10^(−4)

and

        Relative Error = (0.190 × 10^(−4)) / (22/21) = 0.182 × 10^(−4).

Table 1.2 lists the values of this and the other calculations.

Table 1.2
        Operation   Result           Actual value   Absolute error    Relative error
        x ⊕ y       0.10476 × 10^1   22/21          0.190 × 10^(−4)   0.182 × 10^(−4)
        x ⊖ y       0.38095 × 10^0   8/21           0.238 × 10^(−5)   0.625 × 10^(−5)
        x ⊗ y       0.23809 × 10^0   5/21           0.524 × 10^(−5)   0.220 × 10^(−4)
        x ⊘ y       0.21428 × 10^1   15/7           0.571 × 10^(−4)   0.267 × 10^(−4)
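The first row of Table 1.2 can be reproduced with a small chopping helper (the names are
ours, and the helper itself runs in binary, so boundary mantissas could in principle be
perturbed):

```python
import math

def chop(y: float, k: int = 5) -> float:
    """Five-digit chopping: keep the first k digits of the normalized mantissa."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1
    m = math.floor(abs(y) / 10.0 ** n * 10 ** k) / 10 ** k
    return math.copysign(m * 10.0 ** n, y)

x, y = 5 / 7, 1 / 3
s = chop(chop(x) + chop(y))   # x (+) y: chop inputs, add exactly, chop the sum
print(s)                      # ≈ 1.0476, i.e. 0.10476 × 10^1
print(abs(22 / 21 - s))       # absolute error ≈ 1.90e-5, as in the table
```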

The maximum relative error for the operations in Example 3 is 0.267 × 10^(−4), so the
arithmetic produces satisfactory five-digit results. This is not the case in the following
example.

Example 4   Suppose that in addition to x = 5/7 and y = 1/3 we have

        u = 0.714251,   v = 98765.9,   and   w = 0.111111 × 10^(−4),

so that

        fl(u) = 0.71425 × 10^0,   fl(v) = 0.98765 × 10^5,   and   fl(w) = 0.11111 × 10^(−4).

Determine the five-digit chopping values of x ⊖ u, (x ⊖ u) ⊘ w, (x ⊖ u) ⊗ v, and u ⊕ v.


Solution   These numbers were chosen to illustrate some problems that can arise with
finite-digit arithmetic. Because x and u are nearly the same, their difference is small. The
absolute error for x ⊖ u is

        |(x − u) − (x ⊖ u)| = |(x − u) − (fl(fl(x) − fl(u)))|
                            = |(5/7 − 0.714251) − fl(0.71428 × 10^0 − 0.71425 × 10^0)|
                            = |0.347143 × 10^(−4) − fl(0.00003 × 10^0)| = 0.47143 × 10^(−5).

This approximation has a small absolute error, but a large relative error

        (0.47143 × 10^(−5)) / (0.347143 × 10^(−4)) ≤ 0.136.

The subsequent division by the small number w or multiplication by the large number v
magnifies the absolute error without modifying the relative error. The addition of the large
and small numbers u and v produces large absolute error but not large relative error. These
calculations are shown in Table 1.3.

Table 1.3
        Operation     Result              Actual value        Absolute error    Relative error
        x ⊖ u         0.30000 × 10^(−4)   0.34714 × 10^(−4)   0.471 × 10^(−5)   0.136
        (x ⊖ u) ⊘ w   0.27000 × 10^1      0.31242 × 10^1      0.424             0.136
        (x ⊖ u) ⊗ v   0.29629 × 10^1      0.34285 × 10^1      0.465             0.136
        u ⊕ v         0.98765 × 10^5      0.98766 × 10^5      0.161 × 10^1      0.163 × 10^(−4)

One of the most common error-producing calculations involves the cancelation of
significant digits due to the subtraction of nearly equal numbers. Suppose two nearly equal
numbers x and y, with x > y, have the k-digit representations

        fl(x) = 0.d1 d2 . . . dp αp+1 αp+2 . . . αk × 10^n,

and

        fl(y) = 0.d1 d2 . . . dp βp+1 βp+2 . . . βk × 10^n.

The floating-point form of x − y is

        fl(fl(x) − fl(y)) = 0.σp+1 σp+2 . . . σk × 10^(n−p),

where

        0.σp+1 σp+2 . . . σk = 0.αp+1 αp+2 . . . αk − 0.βp+1 βp+2 . . . βk.

The floating-point number used to represent x − y has at most k − p digits of significance.
However, in most calculation devices, x − y will be assigned k digits, with the last p being
either zero or randomly assigned. Any further calculations involving x − y retain the problem
of having only k − p digits of significance, since a chain of calculations is no more accurate
than its weakest portion.
If a finite-digit representation or calculation introduces an error, further enlargement of
the error occurs when dividing by a number with small magnitude (or, equivalently, when


multiplying by a number with large magnitude). Suppose, for example, that the number z
has the finite-digit approximation z + δ, where the error δ is introduced by representation
or by previous calculation. Now divide by ε = 10^(−n), where n > 0. Then

        z/ε ≈ fl(fl(z)/fl(ε)) = (z + δ) × 10^n.

The absolute error in this approximation, |δ| × 10^n, is the original absolute error, |δ|,
multiplied by the factor 10^n.
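A two-line check of this magnification effect, with values chosen by us for illustration
rather than taken from the text:

```python
# A hypothetical error delta in z is magnified by 10^n when dividing by eps = 10^-n.
z, delta = 1.0, 1e-6     # true value and its representation error
eps = 1e-4               # small divisor: 10^-n with n = 4
approx = (z + delta) / eps
exact = z / eps
print(abs(approx - exact))   # ≈ 0.01 = |delta| × 10^4
```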

Example 5   Let p = 0.54617 and q = 0.54601. Use four-digit arithmetic to approximate p − q and
determine the absolute and relative errors using (a) rounding and (b) chopping.

Solution   The exact value of r = p − q is r = 0.00016.

(a) Suppose the subtraction is performed using four-digit rounding arithmetic. Rounding
p and q to four digits gives p∗ = 0.5462 and q∗ = 0.5460, respectively, and
r∗ = p∗ − q∗ = 0.0002 is the four-digit approximation to r. Since

        |r − r∗| / |r| = |0.00016 − 0.0002| / |0.00016| = 0.25,

the result has only one significant digit, whereas p∗ and q∗ were accurate to four
and five significant digits, respectively.

(b) If chopping is used to obtain the four digits, the four-digit approximations to p, q,
and r are p∗ = 0.5461, q∗ = 0.5460, and r∗ = p∗ − q∗ = 0.0001. This gives

        |r − r∗| / |r| = |0.00016 − 0.0001| / |0.00016| = 0.375,

which also results in only one significant digit of accuracy.

The loss of accuracy due to round-off error can often be avoided by a reformulation of
the calculations, as illustrated in the next example.

Illustration   The quadratic formula states that the roots of ax² + bx + c = 0, when a ≠ 0, are

        x1 = (−b + √(b² − 4ac)) / (2a)   and   x2 = (−b − √(b² − 4ac)) / (2a).   (1.1)

The roots x1 and x2 of a general quadratic equation are related to the coefficients by the
fact that x1 + x2 = −b/a and x1 x2 = c/a. This is a special case of Vièta's Formulas for
the coefficients of polynomials.

Consider this formula applied to the equation x² + 62.10x + 1 = 0, whose roots are
approximately

        x1 = −0.01610723   and   x2 = −62.08390.

We will again use four-digit rounding arithmetic in the calculations to determine the roots.
In this equation, b² is much larger than 4ac, so the numerator in the calculation for x1
involves the subtraction of nearly equal numbers. Because

        √(b² − 4ac) = √((62.10)² − (4.000)(1.000)(1.000)) = √(3856. − 4.000) = √3852. = 62.06,

we have

        fl(x1) = (−62.10 + 62.06)/2.000 = −0.04000/2.000 = −0.02000,


a poor approximation to x1 = −0.01611, with the large relative error

        |−0.01611 + 0.02000| / |−0.01611| ≈ 2.4 × 10^(−1).

On the other hand, the calculation for x2 involves the addition of the nearly equal numbers
−b and −√(b² − 4ac). This presents no problem since

        fl(x2) = (−62.10 − 62.06)/2.000 = −124.2/2.000 = −62.10

has the small relative error

        |−62.08 + 62.10| / |−62.08| ≈ 3.2 × 10^(−4).
To obtain a more accurate four-digit rounding approximation for x1, we change the form of
the quadratic formula by rationalizing the numerator:

        x1 = ((−b + √(b² − 4ac)) / (2a)) · ((−b − √(b² − 4ac)) / (−b − √(b² − 4ac)))
           = (b² − (b² − 4ac)) / (2a(−b − √(b² − 4ac))),

which simplifies to an alternate quadratic formula

        x1 = −2c / (b + √(b² − 4ac)).   (1.2)

Using (1.2) gives

        fl(x1) = −2.000/(62.10 + 62.06) = −2.000/124.2 = −0.01610,

which has the small relative error 6.2 × 10^(−4).

The rationalization technique can also be applied to give the following alternative quadratic
formula for x2:

        x2 = −2c / (b − √(b² − 4ac)).   (1.3)

This is the form to use if b is a negative number. In the Illustration, however, the mistaken use
of this formula for x2 would result in not only the subtraction of nearly equal numbers, but
also the division by the small result of this subtraction. The inaccuracy that this combination
produces,

        fl(x2) = −2c / (b − √(b² − 4ac)) = −2.000/(62.10 − 62.06) = −2.000/0.04000 = −50.00,

has the large relative error 1.9 × 10^(−1).

• The lesson: Think before you compute!
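That lesson can be packaged as a sketch of a cancellation-avoiding quadratic solver: for
each root, choose between the standard formula (1.1) and the rationalized forms (1.2)/(1.3)
so that −b and the radical are never subtracted. The function name and the sign test are
ours; real roots and a ≠ 0 are assumed.

```python
import math

def stable_roots(a: float, b: float, c: float):
    """Roots of ax^2 + bx + c = 0, avoiding subtraction of nearly equal numbers.
    Assumes a != 0 and b^2 >= 4ac (real roots)."""
    disc = math.sqrt(b * b - 4 * a * c)
    if b >= 0:
        x1 = -2 * c / (b + disc)        # rationalized form (1.2)
        x2 = (-b - disc) / (2 * a)      # standard form: same-sign addition
    else:
        x1 = (-b + disc) / (2 * a)      # standard form: same-sign addition
        x2 = -2 * c / (b - disc)        # rationalized form (1.3)
    return x1, x2

x1, x2 = stable_roots(1.0, 62.10, 1.0)
print(x1, x2)   # ≈ -0.01610723 and ≈ -62.08389, as in the Illustration
```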

Nested Arithmetic
Accuracy loss due to round-off error can also be reduced by rearranging calculations, as
shown in the next example.

Example 6   Evaluate f(x) = x³ − 6.1x² + 3.2x + 1.5 at x = 4.71 using three-digit arithmetic.
Solution Table 1.4 gives the intermediate results in the calculations.


Table 1.4
                                  x      x²        x³           6.1x²       3.2x
        Exact                     4.71   22.1841   104.487111   135.32301   15.072
        Three-digit (chopping)    4.71   22.1      104.         134.        15.0
        Three-digit (rounding)    4.71   22.2      105.         135.        15.1

To illustrate the calculations, let us look at those involved with finding x³ using
three-digit rounding arithmetic. First we find

        x² = 4.71² = 22.1841,   which rounds to 22.2.

Then we use this value of x² to find

        x³ = x² · x = 22.2 · 4.71 = 104.562,   which rounds to 105.

Also,

        6.1x² = 6.1(22.2) = 135.42,   which rounds to 135.,

and

        3.2x = 3.2(4.71) = 15.072,   which rounds to 15.1.

The exact result of the evaluation is

        Exact: f(4.71) = 104.487111 − 135.32301 + 15.072 + 1.5 = −14.263899.

Using finite-digit arithmetic, the way in which we add the results can affect the final result.
Suppose that we add left to right. Then for chopping arithmetic we have

        Three-digit (chopping): f(4.71) = ((104. − 134.) + 15.0) + 1.5 = −13.5,

and for rounding arithmetic we have

        Three-digit (rounding): f(4.71) = ((105. − 135.) + 15.1) + 1.5 = −13.4.

(You should carefully verify these results to be sure that your notion of finite-digit arithmetic
is correct.) Note that the three-digit chopping values simply retain the leading three digits,
with no rounding involved, and differ significantly from the three-digit rounding values.
The relative errors for the three-digit methods are

        Chopping: |(−14.263899 + 13.5) / (−14.263899)| ≈ 0.05,   and
        Rounding: |(−14.263899 + 13.4) / (−14.263899)| ≈ 0.06.

Illustration   As an alternative approach, the polynomial f(x) in Example 6 can be written in a nested
manner as

        f(x) = x³ − 6.1x² + 3.2x + 1.5 = ((x − 6.1)x + 3.2)x + 1.5.

Remember that chopping (or rounding) is performed after each calculation.

Using three-digit chopping arithmetic now produces

        f(4.71) = ((4.71 − 6.1)4.71 + 3.2)4.71 + 1.5 = ((−1.39)(4.71) + 3.2)4.71 + 1.5
                = (−6.54 + 3.2)4.71 + 1.5 = (−3.34)4.71 + 1.5 = −15.7 + 1.5 = −14.2.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
28 CHAPTER 1 Mathematical Preliminaries and Error Analysis

In a similar manner, we now obtain a three-digit rounding answer of −14.3. The new relative
errors are

        Three-digit (chopping): |(−14.263899 + 14.2) / (−14.263899)| ≈ 0.0045;

        Three-digit (rounding): |(−14.263899 + 14.3) / (−14.263899)| ≈ 0.0025.

Nesting has reduced the relative error for the chopping approximation to less than 10%
of that obtained initially. For the rounding approximation the improvement has been even
more dramatic; the error in this case has been reduced by more than 95%.

Polynomials should always be expressed in nested form before performing an evaluation,
because this form minimizes the number of arithmetic calculations. The decreased
error in the Illustration is due to the reduction in computations from four multiplications
and three additions to two multiplications and three additions. One way to reduce round-off
error is to reduce the number of computations.
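Nested (Horner) evaluation is a one-loop algorithm. A sketch (the function name is ours;
coefficients are listed from the highest power down):

```python
def horner(coeffs, x: float) -> float:
    """Evaluate a polynomial in nested (Horner) form: one multiply and one add
    per coefficient. coeffs lists coefficients from the highest power down."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

# f(x) = x^3 - 6.1x^2 + 3.2x + 1.5 from Example 6, at full double precision
# (no three-digit chopping here, so the value is essentially the exact one).
print(horner([1.0, -6.1, 3.2, 1.5], 4.71))   # ≈ -14.263899
```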

E X E R C I S E S E T 1.2
1. Compute the absolute error and relative error in approximations of p by p∗.
   a. p = π, p∗ = 22/7                  b. p = π, p∗ = 3.1416
   c. p = e, p∗ = 2.718                 d. p = √2, p∗ = 1.414
   e. p = e^10, p∗ = 22000              f. p = 10^π, p∗ = 1400
   g. p = 8!, p∗ = 39900                h. p = 9!, p∗ = √(18π) (9/e)^9
2. Find the largest interval in which p∗ must lie to approximate p with relative error at most 10^(−4) for
   each value of p.
   a. π          b. e
   c. √2         d. ³√7
3. Suppose p∗ must approximate p with relative error at most 10^(−3). Find the largest interval in which
   p∗ must lie for each value of p.
   a. 150        b. 900
   c. 1500       d. 90
4. Perform the following computations (i) exactly, (ii) using three-digit chopping arithmetic, and (iii)
   using three-digit rounding arithmetic. (iv) Compute the relative errors in parts (ii) and (iii).
   a. 4/5 + 1/3                   b. (4/5) · (1/3)
   c. (1/3 − 3/11) + 3/20         d. (1/3 + 3/11) − 3/20
5. Use three-digit rounding arithmetic to perform the following calculations. Compute the absolute error
   and relative error with the exact value determined to at least five digits.
   a. 133 + 0.921                    b. 133 − 0.499
   c. (121 − 0.327) − 119            d. (121 − 119) − 0.327
   e. (13/14 − 6/7) / (2e − 5.4)     f. −10π + 6e − 3/62
   g. (2/9) · (9/7)                  h. (π − 22/7) / (1/17)
6. Repeat Exercise 5 using four-digit rounding arithmetic.
7. Repeat Exercise 5 using three-digit chopping arithmetic.
8. Repeat Exercise 5 using four-digit chopping arithmetic.
