0% found this document useful (0 votes)
11 views17 pages

15 - Floating Point Encoding

Uploaded by

ranbir singh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views17 pages

15 - Floating Point Encoding

Uploaded by

ranbir singh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Floating-point Number Encoding

IEEE 754 standard

1
Reading and Exercises
• P & H: Section 3.5

2
Objective
At the end of this section, you will understand
1. How floating-point numbers are represented

3
Floating-Point Numbers
• Most modern architectures support floating point
representations for fractional quantities
▪ Usually implement the IEEE 754 standard
• Most CPUs now include floating-point units
(FPUs)
▪ Have instructions to do floating-point arithmetic
• Very quick
▪ If missing, FPU operation is simulated in software
• Very slow
4
Floating-Point Numbers (cont’d)
• A floating point number stored in a fixed size
register may only approximate a real value
▪ Beware: precision errors may accumulate when doing
repeated FP calculations

5
Fixed-Point Numbers
• With integers, the binary point is assumed to be to
the right of the LSB
▪ Eg: 4-bit register
0 1 0 1 .

• Gives the value 5.0:


0 × 23 + 1 × 22 + 0 × 21 + 1 × 20 = 5
• Fractional quantities can be represented by
moving the binary point to a new position
▪ i.e. scaling the integer by a power of 2
6
Fixed-Point Numbers (cont’d)
▪ Eg:
0 1 1 1
.
• Gives the value 1.75:
0 × 21 + 1 × 20 + 1 × 2-1 + 1 × 2-2 =
0 + 1 + 0.5 + .25 = 1.75

• The position of the point is established by


convention
• Fixed point is used for the significand in floating
point representations
7
Floating-Point Single Format
• Any number N can be represented as:
N = (-1)s × 1.f × 2u
▪ s = sign bit
▪ 1.f = significand or mantissa
▪ u = unbiased exponent
• 11.01 = (-1)0 × 1.101× 21
• -101.011 = (-1)1 × 1.01011× 22
• That is, we need only know s, f, and u
8
Floating-Point Single Format
• Analogous to scientific notation
▪ The differences are:
• Base 2 instead of base 10
• Uses a sign bit s
• The significand and exponent are in binary
• The number is normalized so that: 1.0 ≤ significand < 2.0
▪ The significand is stored in fixed point format
▪ Since the MSB is always 1, it is not stored
• Provides one more bit of precision
• The exponent is biased by adding the constant 127
▪ Ensures it will always be stored as a positive integer

9
Floating-Point Single Format (cont’d)
• IEEE 754 Standard:
s e[7:0] f[22:0]
31 30 23 22 0

▪ Uses 4 bytes
▪ Number represented is: N = (-1)s × 1.f × 2e−127
• s: sign bit
• e: biased exponent
▪ =127 (0x7f) + unbiased exponent
• f: fractional part of the significand

10
Floating-Point Single Format (cont’d)
▪ Largest biased exponent allowed is 254 (0xfe)
• 255 (0xff) is used to represent quantities that are not
numbers (so-called NaNs)
• Thus, the largest possible unbiased exponent is 127
▪ The smallest biased exponent allowed is 1
• 0 is used for subnormal numbers (tiny fractional quantities)
• Thus, the smallest possible unbiased exponent is -126
▪ Range of magnitudes: 1.0 × 2-126 to (2.0 - ε) × 2127
• 1.17549435e-38 to 3.40282346e+38

11
Floating-Point Single Format (cont’d)
▪ Eg: 0x3e800000
0 01111101 00000000000000000000000
31 30 23 22 0

• f is 0
• e is 125
• s is 0
• (-1)0 × 1.0 × 2125−127 = 1 × 1.0 × 2-2 = +0.25

12
Floating-Point Double Format
• IEEE 754 Standard:
s e[10:0] f[51:0]
63 62 52 51 0

▪ Uses 8 bytes
▪ Number represented is: N = (-1)s × 1.f × 2e−1023
• s: sign bit
• e: biased exponent
▪ Created by adding 1023 (0x3ff) to the unbiased exponent
• f: fractional part of the significand

13
Floating-Point Double Format
(cont’d)
▪ Range of magnitudes: ~2.2e-308 to ~1.8e+308
▪ Has about 17 decimal digits of precision
▪ NaNs are represented using a biased exponent of 2047
(0x7ff)

14
Floating-Point NaNs
• A NaN (“Not a Number”) is an entity that cannot
be represented using conventional numbers
• In single format, NaNs use a biased exponent of
0xff
▪ positive ∞ : 0x7f800000
▪ negative ∞ : 0xff800000
▪ √-1 : 0x7fffffff

15
Floating-Point NaNs (cont’d)
• In double format, NaNs use a biased exponent of
0x7ff
▪ positive ∞ : 0x7ff00000 00000000
▪ negative ∞ : 0xfff00000 00000000
▪ √-1 : 0x7fffffff ffffffff
• Result from dividing by 0 or taking the square
root of a negative number

16
Floating-Point NaNs (cont’d)
• Using a NaN as an operand to most instructions
causes an exception
▪ However, ±∞ can be compared to a conventional
number using fcmp
▪ May also be used as a valid argument to some
functions
• Eg: atan(∞) returns π / 2

17

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy