15 - Floating Point Encoding
Reading and Exercises
• P & H: Section 3.5
Objective
At the end of this section, you will understand
1. How floating-point numbers are represented
Floating-Point Numbers
• Most modern architectures support floating-point
representations for fractional quantities
▪ Usually implement the IEEE 754 standard
• Most CPUs now include floating-point units
(FPUs)
▪ Provide instructions that do floating-point arithmetic in hardware
• Very quick
▪ If the FPU is missing, floating-point operations are emulated in software
• Very slow
Floating-Point Numbers (cont’d)
• A floating-point number stored in a fixed-size
register may only approximate a real value
▪ Beware: rounding errors may accumulate when doing
repeated floating-point calculations
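The accumulation of rounding error can be seen in a small sketch. It assumes Python's `float` is an IEEE 754 double, which holds on typical platforms; the variable names are illustrative only.

```python
import math

# 0.1 has no exact binary representation, so each addition
# carries a small rounding error that accumulates in the total.
total = 0.0
for _ in range(10):
    total += 0.1

print(total)            # slightly less than 1.0
print(total == 1.0)     # False

# math.fsum keeps track of the lost low-order bits and
# returns the correctly rounded sum of the same ten values.
exact = math.fsum([0.1] * 10)
print(exact == 1.0)     # True
```

The usual consequence is that floating-point results should be compared with a tolerance (for example `math.isclose`) rather than with `==`.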
Fixed-Point Numbers
• With integers, the binary point is assumed to be to
the right of the LSB
▪ E.g., 4-bit register
0 1 0 1 .   (binary point after the LSB, so the value is 5)
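As a rough illustration of where the binary point sits, the sketch below interprets an unsigned bit string with a chosen number of bits to the right of the point. The helper name `fixed_point_value` and its `frac_bits` parameter are hypothetical, not part of the original slides.

```python
def fixed_point_value(bits: str, frac_bits: int = 0) -> float:
    """Interpret an unsigned bit string with frac_bits bits right of the binary point."""
    # Moving the binary point left by k places divides the integer value by 2**k.
    return int(bits, 2) / (1 << frac_bits)

print(fixed_point_value("0101"))       # 5.0   (0101. : point right of the LSB)
print(fixed_point_value("0101", 2))    # 1.25  (01.01 : two fractional bits)
```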
Floating-Point Single Format (cont’d)
• IEEE 754 Standard:
Field:  s  | e[7:0] | f[22:0]
Bits:   31 | 30-23  | 22-0
▪ Uses 4 bytes
▪ Number represented is: $N = (-1)^{s} \times 1.f \times 2^{e-127}$
• s: sign bit
• e: biased exponent
▪ e = 127 (0x7f) + unbiased exponent
• f: fractional part of the significand
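The field layout and formula above can be sketched as a small decoder. The function name `decode_single` is hypothetical, and the formula it applies is only valid for normal numbers (biased exponent 1 to 254).

```python
import struct

def decode_single(bits: int):
    """Split a 32-bit pattern into (s, e, f) and evaluate (-1)^s * 1.f * 2^(e-127)."""
    s = (bits >> 31) & 0x1       # sign bit (bit 31)
    e = (bits >> 23) & 0xFF      # biased exponent (bits 30-23)
    f = bits & 0x7FFFFF          # fraction (bits 22-0)
    # 1.f means 1 + f / 2^23; valid only for normal numbers.
    value = (-1) ** s * (1 + f / 2**23) * 2.0 ** (e - 127)
    return s, e, f, value

s, e, f, value = decode_single(0xC0400000)
print(s, e, hex(f), value)       # 1 128 0x400000 -3.0

# Cross-check against the platform's own interpretation of the same bytes.
reference = struct.unpack(">f", (0xC0400000).to_bytes(4, "big"))[0]
print(reference == value)        # True
```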
Floating-Point Single Format (cont’d)
▪ Largest biased exponent allowed for a normal number is 254 (0xfe)
• 255 (0xff) is reserved for ±∞ and for quantities that are not
numbers (so-called NaNs)
• Thus, the largest possible unbiased exponent is 127
▪ The smallest biased exponent allowed for a normal number is 1
• 0 is used for zero and for subnormal numbers (tiny fractional quantities)
• Thus, the smallest possible unbiased exponent is -126
▪ Range of normalized magnitudes: $1.0 \times 2^{-126}$ to $(2.0 - \varepsilon) \times 2^{127}$, where $\varepsilon = 2^{-23}$
• 1.17549435e-38 to 3.40282346e+38
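The two endpoints of the range can be checked against their bit patterns. This is a sketch; the helper `single_from_bits` is a hypothetical name, and the patterns 0x00800000 (e = 1, f = 0) and 0x7f7fffff (e = 254, f all ones) follow from the exponent limits above.

```python
import struct

def single_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-format value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

eps = 2.0 ** -23                             # weight of the last fraction bit
smallest_normal = 1.0 * 2.0 ** -126          # e = 1,   f = 0
largest_normal = (2.0 - eps) * 2.0 ** 127    # e = 254, f = all ones

print(smallest_normal, single_from_bits(0x00800000))
print(largest_normal, single_from_bits(0x7F7FFFFF))
```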
Floating-Point Single Format (cont’d)
▪ E.g., 0x3e800000
Value:  0  | 01111101 | 00000000000000000000000
Bits:   31 | 30-23    | 22-0
• f is 0
• e is 125
• s is 0
• $(-1)^{0} \times 1.0 \times 2^{125-127} = 1 \times 1.0 \times 2^{-2} = +0.25$
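The same example can be run in the other direction, from the value 0.25 to its bit pattern. This is a minimal sketch using the standard `struct` module; the helper name `single_bits` is hypothetical.

```python
import struct

def single_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-format pattern for x."""
    return int.from_bytes(struct.pack(">f", x), "big")

bits = single_bits(0.25)
print(hex(bits))                 # 0x3e800000

# Print the three fields separately: sign, biased exponent, fraction.
fields = f"{bits >> 31:01b} {(bits >> 23) & 0xFF:08b} {bits & 0x7FFFFF:023b}"
print(fields)                    # 0 01111101 00000000000000000000000
```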
Floating-Point Double Format
• IEEE 754 Standard:
Field:  s  | e[10:0] | f[51:0]
Bits:   63 | 62-52   | 51-0
▪ Uses 8 bytes
▪ Number represented is: $N = (-1)^{s} \times 1.f \times 2^{e-1023}$
• s: sign bit
• e: biased exponent
▪ Created by adding 1023 (0x3ff) to the unbiased exponent
• f: fractional part of the significand
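The double format can be decoded the same way as the single format, with wider fields and a bias of 1023. The function name `decode_double` is hypothetical, and the formula applies only to normal numbers (biased exponent 1 to 2046).

```python
import struct

def decode_double(bits: int):
    """Split a 64-bit pattern into (s, e, f) and evaluate (-1)^s * 1.f * 2^(e-1023)."""
    s = (bits >> 63) & 0x1           # sign bit (bit 63)
    e = (bits >> 52) & 0x7FF         # biased exponent (bits 62-52)
    f = bits & ((1 << 52) - 1)       # fraction (bits 51-0)
    # 1.f means 1 + f / 2^52; valid only for normal numbers.
    value = (-1) ** s * (1 + f / 2**52) * 2.0 ** (e - 1023)
    return s, e, f, value

s, e, f, value = decode_double(0x3FD0000000000000)
print(s, e, f, value)                # 0 1021 0 0.25

# Cross-check against the platform's own interpretation of the same bytes.
reference = struct.unpack(">d", (0x3FD0000000000000).to_bytes(8, "big"))[0]
print(reference == value)            # True
```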
Floating-Point Double Format (cont’d)
▪ Range of normalized magnitudes: ~2.2e-308 to ~1.8e+308
▪ Has about 15 to 17 significant decimal digits of precision (53-bit significand)
▪ ±∞ and NaNs are represented using a biased exponent of 2047
(0x7ff)
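These limits can be inspected at run time. The sketch below assumes Python's `float` is an IEEE 754 double, which holds on typical platforms, and reads the values from the standard `sys.float_info` structure.

```python
import sys

info = sys.float_info
print(info.min)         # smallest positive normal double, about 2.2e-308
print(info.max)         # largest finite double, about 1.8e+308
print(info.mant_dig)    # 53 significand bits (52 stored plus the implicit leading 1)
print(info.dig)         # 15 decimal digits always survive a decimal -> double -> decimal trip

# 17 significant digits are enough to recover any double exactly.
x = 0.1
print(float(f"{x:.17g}") == x)   # True
```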
Floating-Point NaNs
• A NaN (“Not a Number”) is the result of an operation that
has no representable numeric value
• In single format, NaNs and ±∞ share a biased exponent of
0xff
▪ positive ∞ : 0x7f800000 (fraction is 0)
▪ negative ∞ : 0xff800000 (fraction is 0)
▪ NaN (e.g., √-1) : any nonzero fraction, such as 0x7fffffff
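The distinction between these patterns comes down to the fraction field when the exponent field is all ones. A minimal sketch follows; the function name `classify_single` and its return strings are hypothetical.

```python
def classify_single(bits: int) -> str:
    """Classify a 32-bit single-format pattern as finite, +inf, -inf, or nan."""
    e = (bits >> 23) & 0xFF      # biased exponent
    f = bits & 0x7FFFFF          # fraction
    if e != 0xFF:
        return "finite"          # zero, subnormal, or normal number
    if f == 0:
        return "-inf" if bits >> 31 else "+inf"
    return "nan"                 # exponent all ones, fraction nonzero

for pattern in (0x7F800000, 0xFF800000, 0x7FFFFFFF, 0x3E800000):
    print(hex(pattern), classify_single(pattern))
```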
Floating-Point NaNs (cont’d)
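• In double format, NaNs and ±∞ share a biased exponent of
0x7ff
▪ positive ∞ : 0x7ff00000 00000000
▪ negative ∞ : 0xfff00000 00000000
▪ NaN (e.g., √-1) : any nonzero fraction, such as 0x7fffffff ffffffff
• ±∞ results from dividing a nonzero number by 0 (or from overflow);
a NaN results from an invalid operation such as 0/0 or taking the
square root of a negative number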
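The double-format special values can be produced and inspected as follows. This is a sketch with one caveat: Python itself raises `ZeroDivisionError` for `1.0 / 0.0` and `ValueError` for `math.sqrt(-1)` instead of returning the IEEE 754 result, so the example reaches ∞ through overflow and NaN through ∞ − ∞. The helper name `double_bits` is hypothetical.

```python
import math
import struct

def double_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 double-format pattern for x."""
    return int.from_bytes(struct.pack(">d", x), "big")

pos_inf = 1e308 * 10          # overflow rounds to +infinity
nan = pos_inf - pos_inf       # inf - inf is an invalid operation, giving a NaN

print(hex(double_bits(pos_inf)))     # 0x7ff0000000000000
print(hex(double_bits(-pos_inf)))    # 0xfff0000000000000
print(math.isnan(nan))               # True

# A NaN has an all-ones exponent field and a nonzero fraction.
nan_bits = double_bits(nan)
print((nan_bits >> 52) & 0x7FF == 0x7FF, nan_bits & ((1 << 52) - 1) != 0)
```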
Floating-Point NaNs (cont’d)
• Using a NaN as an operand to most instructions
produces a NaN result; a signaling NaN also raises an
invalid-operation exception
▪ However, ±∞ can be compared to a conventional
number using fcmp
▪ ±∞ may also be used as a valid argument to some
functions
• E.g., atan(∞) returns π / 2
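The behaviour of ±∞ and NaN as operands can be observed from a high-level language. The sketch below uses Python comparisons as a stand-in for the hardware `fcmp` instruction, which is an analogy rather than the same mechanism, and assumes Python's `float` is an IEEE 754 double.

```python
import math

inf = math.inf
nan = math.nan

# Infinity orders above (or below) every finite value.
print(inf > 1e308)               # True
print(-inf < -1e308)             # True

# Infinity is a valid argument to some functions.
print(math.atan(inf))            # approximately pi / 2

# A quiet NaN propagates through arithmetic and compares unequal to everything.
print(math.isnan(nan + 1.0))     # True
print(nan == nan)                # False
```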