Lecture 2

Carnegie Mellon

Floating Point Numbers


N. Navet - Computing Infrastructure 1 / Lecture 2

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition



IEEE Floating Point standard


 IEEE 754 Standard
▪ Established in 1985 as a uniform standard for floating point arithmetic
▪ Before that, there were many proprietary formats, leading to non-portable
applications
▪ Intel hired Prof. William Kahan (Berkeley) in the mid-1970s to devise a floating
point coprocessor (the 8087) for the 8086 processor → this work was later
re-used in the IEEE standard
▪ Nowadays, IEEE 754 is supported in HW by virtually all CPUs that have a
floating point unit (otherwise it can be implemented in SW)
 Driven by numerical concerns
▪ Good standards for rounding, overflow, underflow
▪ Hard to make fast in hardware
▪ Numerical analysts predominated over hardware designers in defining
the standard


Principles of floating point numbers


 Basis for the support (of an approximation) of arithmetic with real
numbers
 A floating point number is a rational number (i.e., a quotient of two
integers)
 Real numbers that cannot be represented as floating points are
approximated, which leads to numerical imprecision (real numbers
form a continuum, floating points do not → rounding to the
nearest value that can be expressed is needed)
 A floating point number has the form significand · base^exponent,
where significand, exponent and base are all integers, e.g. in base
10, 5.367 = 5367 · 10⁻³
 “Floating point” because the point can “float”: it can be placed
anywhere relative to the significant digits of the number (depending
on the value of the exponent), e.g. 536.7 · 10⁻² = 5367 · 10⁻³


Principles of floating point numbers


 As there is more than one way to represent a number, we need
a single standardized representation
 Familiar base-10 (normalized) scientific notation used in
physics, math and engineering: n = f × 10^e where
▪ f is the fraction (aka mantissa or significand) with one non-zero decimal
digit before the decimal point
▪ e is a positive or negative integer called the exponent

[Figure: examples of normalized scientific notation on the right]

 Range is determined by the number of digits of the exponent
 Precision is determined by the number of digits in the fraction
 In computers the base is 2: the floating-point representation
encodes rational numbers of the form V = x × 2^y

Tiny Floating Point Example #1


 Base 10
 Signed 3-digit significand that can be either 0, or 0.1 ≤ f < 1, or −1 < f ≤ −0.1
 Signed 2-digit exponent (what are the min and max exponents?)
 Range over nearly 200 orders of magnitude: −0.999 · 10⁹⁹ to +0.999 · 10⁹⁹
 The separation between expressible numbers is not constant: e.g., the
separation between +0.998 × 10⁹⁹ and +0.999 × 10⁹⁹ is much larger than the
separation between +0.998 × 10⁰ and +0.999 × 10⁰
 But the relative error introduced by rounding is about the same (i.e., the
separation between a number and its successor, expressed as a percentage of
that number, is approximately the same over the whole range)

How can the accuracy of representation be increased?
How can the range of expressible numbers be increased?

Course reading – “Structured Computer Organization”:
Appendix B: floating point numbers

Example #1: the real line is divided up into seven regions


1. Large negative numbers less than −0.999 × 10⁹⁹
2. Negative numbers between −0.999 × 10⁹⁹ and −0.100 × 10⁻⁹⁹
3. Small negative numbers between −0.100 × 10⁻⁹⁹ and zero
4. Zero
5. Small positive numbers between 0 and 0.100 × 10⁻⁹⁹
6. Positive numbers between 0.100 × 10⁻⁹⁹ and 0.999 × 10⁹⁹
7. Large positive numbers greater than 0.999 × 10⁹⁹

It is not possible to express any number in regions 1, 3, 5 and 7,
e.g., 10⁶⁰ × 10⁶⁰ = 10¹²⁰ → positive overflow

[Number line ticks: −0.999 · 10⁹⁹   −0.1 · 10⁻⁹⁹   0.1 · 10⁻⁹⁹   0.999 · 10⁹⁹]

NB: an underflow error is less serious than an overflow, since 0
is usually a satisfactory approximation in regions 3 and 5

Normalized numbers and hidden bits


 The “normalized” format represents all numbers except those
close to 0, which are represented with the “denormalized” format
(seen later in the lecture)
 312.25 can be represented with the integer 31225 as the
significand and 10⁻² as the power term, but in many other ways too
 Its normalized scientific notation in base 10 is 3.1225 × 10², that
is, with one non-zero decimal digit before the decimal point
 Same principle for the normalized form in base 2: 1.xxx × 2^y
 As the most significant bit is always a 1, it is not necessary to
store it → this is the hidden bit
 IEEE 754 double precision: the size of the significand is 52 bits not
including the hidden bit, 53 bits with it


Floating Point Representation – normalized numbers


 The IEEE 754 standard represents FP numbers of the following form:
(–1)^s · M · 2^E
▪ Sign bit s determines whether the number is negative or positive
▪ Significand M is (except in special cases) a fractional binary number in the
range [1.0, 2.0) (the interval starts at 1 because of the leading 1: 1.xxxx…x × 2^E)
▪ Exponent E weights the value by a power of two
How to express 0?
 The encoding of a FP number is done over 3 fields:
▪ the Most Significant Bit s is the sign bit
▪ the exp field encodes E (but is not equal to E)
▪ the frac field encodes M (but is not equal to M)

s exp frac


Precision options

As a programmer, you can expect a precision of 7 decimal digits in
single precision and 15 in double precision. Unless you have good
reasons not to, you should always use double precision numbers.

 Single precision: 32 bits

s exp frac
1 8-bits 23-bits

 Double precision: 64 bits

s exp frac
1 11-bits 52-bits

 Extended precision: 80 bits (not supported by all CPUs and
compilers) – out of the scope of the course

s exp frac
1 15-bits 64-bits


3 types of floating point encodings


 Determined by the value of the exp field – here we consider
single precision numbers, that is, with an exponent field of 8 bits

[Table: exp = 000…0 → denormalized; exp ≠ 000…0 and ≠ 111…1 → normalized;
exp = 111…1 → special values (∞, NaN)]

Denormalized numbers are a “sub-format” within the IEEE 754 floating-point format

Not a Number (NaN): a value that is undefined,
examples: 0/0, √−5


Visualization: Floating Point Encodings


[Number line: −∞ … −Normalized … −Denorm … −0 +0 … +Denorm … +Normalized … +∞;
values beyond the largest normalized magnitude cannot be represented]

Denormalized encoding is for 0 and numbers that are very close to 0


Case 1: “Normalized” Values        v = (–1)^s · M · 2^E

 Most common case: when the bit pattern of exp ≠ 000…0 and ≠
111…1 (i.e., ≠ 255 for single precision, ≠ 2047 for double)
 Exponent coded as a biased value: E = Exp – Bias
▪ Exp: unsigned value of the exp field of the floating point number
▪ Bias = 2^(k−1) − 1, where k is the number of exponent bits
▪ Single precision: bias = 127 (Exp: 1…254, E: −126…127)
▪ Double precision: bias = 1023 (Exp: 1…2046, E: −1022…1023)

 Significand coded with an implied leading 1: M = 1.xxx…x₂
▪ xxx…x: bits of the frac field
▪ Minimum when frac = 000…0 (M = 1.0)
▪ Maximum when frac = 111…1 (M = 2.0 – ε)
▪ Get an extra leading bit for “free” (the hidden bit)

Beyond the lecture’s scope: thanks to the bias, the exp field can be
encoded as unsigned (as it is positive) rather than in two’s
complement, which allows for faster comparison of FP numbers

Normalized Encoding: example

v = (–1)^s · M · 2^E, in single precision, E = Exp – Bias

 Value: float F = 15213.0;
▪ 15213₁₀ = 11101101101101.0₂ × 2⁰
          = 1.1101101101101₂ × 2¹³

5 steps: a) (unsigned) binary form, b) normalized form, c) encode the
significand, d) encode the exponent, e) sign bit

 Significand
M = 1.1101101101101₂
frac field (23 bits) = 11011011011010000000000₂

 Exponent
E = 13
Bias = 127
Exp field (8 bits) = 140 = 10001100₂

 Result:
Bit 31          Bit 22                    Bit 0
0    10001100    11011011011010000000000
s    exp         frac

Example #2        v = (–1)^s · M · 2^E,  E = Exp – Bias
http://www.binaryconvert.com/convert_float.html

1) Write 4.0 as v = (–1)^s · M · 2^E:   4 = (–1)⁰ · 1.0 · 2²

2) Encode 4.0 as a floating point
number (single precision)


Example #2 (answer)        v = (–1)^s · M · 2^E,  E = Exp – Bias
http://www.binaryconvert.com/convert_float.html

4 = (–1)⁰ · 1.0 · 2², so Exp = 2 + 127 = 129 = 10000001₂ and frac = 000…0:

0 10000001 00000000000000000000000   (32 bits = 4 bytes)


Example #3        v = (–1)^s · M · 2^E,  E = Exp – Bias

Encode 4.75 as a floating point number
in single precision format


Example #4        v = (–1)^s · M · 2^E,  E = Exp – Bias

Encode 1.0 in IEEE 754 single precision format

1 = (–1)⁰ · (1+0) · 2⁰

How would 1.0 be encoded without the bias?


Case 2: Denormalized numbers        v = (–1)^s · M · 2^E,  E = 1 – Bias

 exp = 000…0 indicates a denormalized number
 Purpose: represent 0 and numbers very close to 0 that normalized
numbers cannot represent (why can 0 not be represented with
normalized encoding?)
 The exponent value is constant: E = 1 – Bias (i.e., E = −126 in single
precision, E = −1022 in double precision)
 Significand coded with an implied leading 0: M = 0.xxx…x₂
▪ xxx…x: bits of frac
 Cases
▪ exp = 000…0, frac = 000…0: represents the value zero
– two distinct values, +0 and –0 (all bits are zero, except possibly the sign bit)
▪ exp = 000…0, frac ≠ 000…0: numbers very close to 0
– numbers are equi-spaced in that range, as the exponent is constant


Example #5        v = (–1)^s · M · 2^E,  E = −126

a) Encode the smallest strictly positive denormalized number in
single precision floating point  b) Express this value as a power of 2

0 00000000 00000000000000000000001 = (–1)⁰ · 2⁻²³ · 2⁻¹²⁶ = 2⁻¹⁴⁹


Example #6        v = (–1)^s · M · 2^E,  E = −126

Single precision floating point: what is the encoding of the largest
positive denormalized number, in binary?

0 00000000 11111111111111111111111
= (–1)⁰ · (2⁻¹ + 2⁻² + … + 2⁻²² + 2⁻²³) · 2⁻¹²⁶
= 2⁻¹²⁶ · (1 − 2⁻²³)


Case 3: Special Values

 Condition: exp = 111…1

 Case: exp = 111…1, frac = 000…0
▪ Represents the value ∞ (infinity)
▪ Can be used as an operand and behaves according to the usual
mathematical rules for ∞
▪ As expected, there are both positive and negative ∞
▪ E.g., 1.0/0.0 = −1.0/−0.0 = +∞, 1.0/−0.0 = −∞

 Case: exp = 111…1, frac ≠ 000…0
▪ Not-a-Number (NaN)
▪ Represents the case when no numeric value can be determined
▪ E.g., sqrt(–1), ∞ − ∞, ∞ × 0


IEEE 754: a recap

[Summary table:
  exp = 000…0                  → denormalized (0 and values close to 0)
  exp ≠ 000…0 and ≠ 111…1      → normalized
  exp = 111…1, frac = 000…0    → ±∞
  exp = 111…1, frac ≠ 000…0    → NaN]

 Floating Point Zero: same bit pattern as Integer Zero
▪ All bits = 0


Supplementary material
Outside the scope of the course


Tiny Floating Point Example #2


s exp frac
1 4-bits 3-bits

 8-bit Floating Point Representation
▪ the sign bit is in the most significant bit
▪ the next four bits are the exponent, with a bias of 7
▪ the last three bits are the frac

v = (–1)^s · M · 2^E
Normalized: E = Exp – Bias
Denormalized: E = 1 – Bias

 Same general form as the IEEE format
▪ normalized, denormalized
▪ representation of 0, NaN, infinity

a) What is the smallest strictly positive normalized number,
and what is the largest?
b) List all positive denormalized numbers

Range (Positive Only)        v = (–1)^s · M · 2^E
                             Normalized: E = Exp – Bias
                             Denormalized: E = 1 – Bias

               s exp  frac    E    Value
Denormalized   0 0000 001    -6    1/8 * 1/64 = 1/512    closest to zero
numbers        0 0000 010    -6    2/8 * 1/64 = 2/512
               …
               0 0000 110    -6    6/8 * 1/64 = 6/512
               0 0000 111    -6    7/8 * 1/64 = 7/512    largest denorm
               0 0001 000    -6    8/8 * 1/64 = 8/512    smallest norm
               0 0001 001    -6    9/8 * 1/64 = 9/512
               …
               0 0110 110    -1    14/8 * 1/2 = 14/16
               0 0110 111    -1    15/8 * 1/2 = 15/16    closest to 1 below
Normalized     0 0111 000     0    8/8 * 1 = 1
numbers        0 0111 001     0    9/8 * 1 = 9/8         closest to 1 above
               0 0111 010     0    10/8 * 1 = 10/8
               …
               0 1110 110     7    14/8 * 128 = 224
               0 1110 111     7    15/8 * 128 = 240      largest norm
               0 1111 000   n/a    +∞

Tiny Floating Point Example #3

 6-bit IEEE-like format
▪ e = 3 exponent bits                s exp frac
▪ f = 2 fraction bits                1 3-bits 2-bits
▪ Bias is 2^(3−1) − 1 = 3

 Notice how the distribution gets denser toward zero.

[Number line from −15 to +15: Denormalized, Normalized and Infinity
regions; an “8 values” annotation marks one cluster of values]


Distribution of Values (close-up view)


 6-bit IEEE-like format
▪ e = 3 exponent bits                s exp frac
▪ f = 2 fraction bits                1 3-bits 2-bits
▪ Bias is 3

[Number line from −1 to +1: Denormalized, Normalized, Infinity]

