
Lecture 10 (Temp)

Floating point numbers allow for a wider range of values to be represented compared to fixed point numbers by allowing the radix or decimal point to vary its position. The IEEE-754 standard defines common floating point formats including 32-bit single precision and 64-bit double precision that specify the layout of the sign, exponent, and significand fields. Special values like infinities, zeros, and NaN are also defined. Floating point arithmetic operations like addition, subtraction, multiplication and division follow specific steps involving aligning operands, performing operations on significands, and adjusting exponents.


Lecture 10: Floating-Point

EEE 105 Computer Organization


1st Semester AY 2019-20
Floating Point Numbers
• Many values in scientific calculations cannot be represented as integers
• Floating-point can also represent a wider range of numbers compared to fixed-point numbers
• Radix point can vary in position (hence "floating" point)
• Examples:
◦ 6.0221×10^23 mol^-1 (Avogadro's Number)
◦ 1.6022×10^-19 C (magnitude of electron charge)
Floating Point Numbers
• A number is said to be normalized when the radix point is placed immediately to the right of the first non-zero significant digit
• For a decimal system

±X1.X2X3X4X5X6X7 × 10^(±Y1Y2)

◦ Number of significant digits = 7
◦ Range of exponent = ±99
◦ Mantissa: X2X3…X6X7
▪ The set of digits (or bits, in binary) following the radix point
IEEE-754 Standard
• Defines the functionality of floating-point representation and
arithmetic
• Specifies the basic and extended floating-point number
formats
◦ Five basic formats based on the definition of the standard:
▪ Three binary floating-point formats (encoded w/ 32, 64, and 128 bits)
▪ Two decimal floating-point formats (encoded w/ 64 and 128 bits)
◦ Two most-common formats:
▪ 32-bit single precision format (binary32)
▪ 64-bit double precision format (binary64)
IEEE-754 Standard
• 32-bit Single Precision Format

field         S      E          M
meaning       sign   exponent   mantissa
size (bits)   1      8          23

Conversion: N = (-1)^S × 2^(E-127) × 1.M
(the leading 1 of 1.M is the implied 1)
◦ "float" format
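The conversion formula can be checked with a few lines of Python. This is a minimal sketch for normal numbers only; the bit pattern 0x41C80000 is just an arbitrary example value.

```python
# Decode a normal binary32 pattern using N = (-1)^S x 2^(E-127) x 1.M
bits = 0x41C80000          # example pattern (assumption: a normal number)
S = bits >> 31             # 1-bit sign
E = (bits >> 23) & 0xFF    # 8-bit biased exponent
M = bits & 0x7FFFFF        # 23-bit mantissa
N = (-1) ** S * 2.0 ** (E - 127) * (1 + M / 2 ** 23)  # add back the implied 1
print(N)  # 25.0
```

Here E = 131, so the true exponent is 4, and 1.M = 1.1001₂ = 1.5625, giving 1.5625 × 16 = 25.0.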
IEEE-754 Standard
• 64-bit Extended/Double Precision Format

field         S      E          M
meaning       sign   exponent   "mantissa"
size (bits)   1      11         52

Conversion: N = (-1)^S × 2^(E-1023) × 1.M
(the leading 1 of 1.M is the implied 1)
◦ "double" format
◦ To allow for extended range and precision
◦ Reduces round-off errors during intermediate calculations
IEEE-754 Standard
• Implied one is used to maximize the bits in encoding
◦ Since for a non-zero number, the first non-zero (binary) digit is
always one
• However, the implied one is used only when E is strictly between min(E) and max(E)
◦ max(E) = all 1's
◦ min(E) = all 0's

• Exponent bias (i.e. 127 for single and 1023 for double) allows for the
representation of very small and very large numbers
◦ Max(E) and min(E) have special meanings – different interpretation
◦ Effective range of E
▪ -126 to 127 (single)
▪ -1022 to 1023 (double)
Special Numbers
• Zero
◦ 𝐸 = 0 (min(E) or all 0’s), 𝑀 = 0
◦ Can have positive and negative zero

• Subnormal/Denormal
◦ E = 0 (min(E) or all 0's), M ≠ 0
◦ No implied one (a zero is used)
◦ Resulting representation is not normalized
◦ Actual exponent is -126 instead of -127
◦ (-1)^S × 0.M × 2^-126
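The subnormal formula can be verified in Python. A small sketch (assuming binary32) for the smallest positive subnormal, cross-checked against the machine's own binary32 decoding via `struct`:

```python
import struct

# Smallest positive subnormal: E = 0, M = 1 -> 0.00...01 x 2^-126 = 2^-149
bits = 0x00000001
M = bits & 0x7FFFFF
value = (M / 2 ** 23) * 2.0 ** -126   # no implied 1: significand is 0.M
assert value == 2.0 ** -149
# Cross-check against the native binary32 interpretation of the same bits
assert value == struct.unpack('>f', struct.pack('>I', bits))[0]
```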
Special Numbers
• Infinities
◦ 𝐸 = max(𝐸) or all 1’s, 𝑀 = 0
◦ Also used as replacement for the result when overflow occurs

• Not a number (NaN)


◦ 𝐸 = max(𝐸) or all 1’s, 𝑀 ≠ 0
◦ Result of invalid operations like
▪ 0/0
▪ Sqrt(-1)
▪ Infinity*0
▪ [+infinity] + [-infinity]
Subnormal Numbers
• Example (single precision but assuming that M is only 2 bits long)
◦ E = 1 (exponent factor is 2^(1-127) = 2^-126)

M     Significand   Value
11₂   1.11₂         1.11₂ × 2^-126
10₂   1.10₂         1.10₂ × 2^-126
01₂   1.01₂         1.01₂ × 2^-126
00₂   1.00₂         1.00₂ × 2^-126

(continuation towards zero)

◦ E = 0 (exponent factor is still 2^-126 but no implied 1)

M     Significand   Value
11₂   0.11₂         0.11₂ × 2^-126
10₂   0.10₂         0.10₂ × 2^-126
01₂   0.01₂         0.01₂ × 2^-126
00₂   0.00₂         0.00₂ × 2^-126
Examples: Minimum and Maximum Numbers (binary32)
• It is good to identify these numbers when identifying possible cases for overflow and underflow

◦ Maximum magnitude (E = max(E) − 1, M = all 1's)

▪ 1.M × 2^127 = (2^24 − 1)(2^-23)(2^127) ⇒ ~2^128
▪ Must become infinity once the magnitude reaches 2^128 (since then E = max(E))
▪ Note: normalized

◦ Minimum non-zero magnitude (E = 0, M = 0…01)

▪ 0.0…01 × 2^-126 = (2^-23)(2^-126) = 2^-149
▪ Note: denormal
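Both bounds can be checked against the actual binary32 encodings: 0x7F7FFFFF is the largest finite pattern (E = 254, M = all 1's) and 0x00000001 the smallest non-zero one. The `f32` helper below is a hypothetical name introduced for this sketch.

```python
import struct

def f32(pattern):
    """Interpret a 32-bit integer pattern as a binary32 value."""
    return struct.unpack('>f', struct.pack('>I', pattern))[0]

# Maximum magnitude: (2^24 - 1) * 2^-23 * 2^127, just under 2^128
assert f32(0x7F7FFFFF) == (2 ** 24 - 1) * 2.0 ** -23 * 2.0 ** 127
# Minimum non-zero magnitude (denormal): 2^-149
assert f32(0x00000001) == 2.0 ** -149
```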
Floating-Point Operations
Some Required Operations in the IEEE-754 standard
• Add
• Subtract
• Multiply
• Divide
• Square Root
• Remainder
• Comparisons
• Conversion between formats
Floating-Point Exceptions
• Causes:
◦ Invalid floating-point operations (e.g. square root of neg.)
◦ Division by zero – result must be “infinity”
◦ Overflow – resulting 𝐸 becomes more than max 𝐸
◦ Underflow – resulting 𝐸 becomes less than 0
◦ Inexact result
▪ By default, result is rounded-off to fit into the format
Floating Point Conversion
• If 𝜋 is approximately 3.14159265359, how do you represent
this in single precision format?
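One way to answer this is to let Python's `struct` module do the conversion and then split the result into the S/E/M fields from the earlier slides. The well-known binary32 pattern for π is 0x40490FDB.

```python
import math
import struct

# Round pi to binary32 and recover the raw bit pattern
bits = struct.unpack('>I', struct.pack('>f', math.pi))[0]
print(hex(bits))                 # 0x40490fdb
S = bits >> 31                   # 0 (positive)
E = (bits >> 23) & 0xFF          # 128, so the true exponent is 128 - 127 = 1
M = bits & 0x7FFFFF              # mantissa bits of 1.M, with pi ~ 1.M x 2^1
assert (S, E) == (0, 128)
# Decoding the fields reproduces pi to within single-precision accuracy
assert abs((-1) ** S * 2.0 ** (E - 127) * (1 + M / 2 ** 23) - math.pi) < 2 ** -23
```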
Floating-Point Addition and Subtraction
• Has more steps (because of formatting) than fixed point
addition or subtraction
• Addition of significands (implied one (if any) with mantissa) is
the same as fixed point addition
• Similar procedure is used for subtraction
Floating-Point Addition and Subtraction
1. Compare exponents
2. Significand of the number with smaller exponent must be
shifted right by the difference in exponents
◦ To align the radix points before addition/subtraction
◦ Possible shifting out of some significant digits
◦ Example: E1 = 100 and E2 = 95
▪ Then, the 2nd significand must be shifted to the right 5 times

Significand2 = 1.00001111000011110001011
Significand2 = 0.0000100001111000011110001011
(Note: simplification only – assumes NO guard bits)
Floating-Point Addition and Subtraction
3. Sign bit is used to determine whether true addition or
subtraction must be done
4. Addition/subtraction of aligned significands is performed
5. Normalize the result (and adjust the exponent 𝐸) if needed
◦ Limit the exponent to -126 (single) or -1022 (double)
6. Round-off the resulting mantissa before placing into the final
format
◦ Extract the part of the mantissa that must be encoded in 𝑀
◦ Possibly need to remove an implied 1 in encoding
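Steps 1–5 can be sketched in Python for the simple case of two positive normal operands. This is a hypothetical helper, not the full algorithm: rounding is replaced by truncation, and signs, special values, and exponent limits are ignored.

```python
def fp_add(e1, sig1, e2, sig2):
    """Add two positive normal binary32 values, each given as a biased
    exponent and a 24-bit significand (implied 1 included)."""
    if e1 < e2:                           # step 1: compare exponents
        e1, sig1, e2, sig2 = e2, sig2, e1, sig1
    # steps 2+4: align and add; the larger operand is scaled UP instead of
    # shifting the smaller one right, so no significant bits are lost here
    total = (sig1 << (e1 - e2)) + sig2
    shift = total.bit_length() - 24       # step 5: renormalize to 24 bits
    return e2 + shift, total >> shift     # truncation stands in for rounding

# 1.0 + 1.0 = 2.0 and 2.0 + 1.0 = 3.0, in (biased E, significand) form:
print(fp_add(127, 1 << 23, 127, 1 << 23))  # (128, 8388608)   i.e. 1.0 x 2^1
print(fp_add(128, 1 << 23, 127, 1 << 23))  # (128, 12582912)  i.e. 1.1b x 2^1
```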
Floating-Point Addition and Subtraction
Additional Notes:
• Addition
◦ Addition of significands may require normalization (shifting to the right)
▪ Resulting sum may have more than 1 significant digit on the left of the radix point
◦ Easiest to add magnitudes (positive numbers)

• Subtraction
◦ Subtraction of significands may require normalization (shifting to the
left)
▪ Resulting difference may have its first non-zero digit to the right of the radix point
◦ Result of subtraction may be negative – need to get 2’s complement and
update the sign bit of the result
▪ Do the 2’s complement before normalizing
Floating-Point Addition
  +1.111 0010 0000 0000 0000 0010 × 2^4
  +1.100 0000 0000 0001 1000 0101 × 2^2

• Need to align radix points first

  + 1.111 0010 0000 0000 0000 0010    × 2^4
  + 0.011 0000 0000 0000 0110 0001 01 × 2^4
  +10.010 0010 0000 0000 0110 0011 01 × 2^4

• Need to normalize to identify the correct E and M to use

  + 1.001 0001 0000 0000 0011 0001 101 × 2^5

M = 001 0001 0000 0000 0011 0001₂ while E = 5 + 127 = 132
Floating-Point Subtraction
  1.000 0000 0101 1000 1000 1101 × 2^-6
− 1.000 0000 0000 0000 1001 1010 × 2^-1

• Align radix points first

  0.000 0100 0000 0010 1100 0100 01101 × 2^-1
− 1.000 0000 0000 0000 1001 1010       × 2^-1

• Perform subtraction (or perform 2's complement then add)

  00.000 0100 0000 0010 1100 0100 01101 × 2^-1
+ 10.111 1111 1111 1111 0110 0110       × 2^-1
  11.000 0100 0000 0010 0010 1010 01101 × 2^-1
Floating-Point Subtraction
• After subtraction of aligned operands:

  11.000 0100 0000 0010 0010 1010 01101 × 2^-1

• Perform 2's complement (since the result is negative and we encode the magnitude)

 − 0.111 1011 1111 1101 1101 1101 10011 × 2^-1

• Normalize the result

 − 1.111 0111 1111 1011 1011 1011 0011 × 2^-2
Floating-Point Addition and Subtraction
• Assuming that both operands are positive
◦ Addition may require normalization of the sum by shifting to
the right
▪ Magnitude of answer is larger
◦ Subtraction may require normalization of the difference by
shifting to the left
▪ Magnitude of answer is smaller
Floating-Point Multiplication and Division
1. Determine the sign of the result
2. Determine the initial exponent of the result
◦ For multiplication: 𝐸 = 𝐸1 + 𝐸2 − 𝑏𝑖𝑎𝑠
◦ For division: 𝐸 = 𝐸1 − 𝐸2 + 𝑏𝑖𝑎𝑠
3. Perform multiplication and division of significands (similar to
integer multiplication and division)
4. Normalize result for the mantissa if needed
5. Round-off the resulting mantissa before placing into the final
format
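For multiplication, steps 1–5 can be sketched as follows. This is a hypothetical helper under simplifying assumptions: both operands normal, truncation instead of rounding, and no overflow/underflow checks.

```python
def fp_mul(s1, e1, sig1, s2, e2, sig2):
    """Multiply two normal binary32 values given as sign bit, biased
    exponent, and 24-bit significand (implied 1 included)."""
    s = s1 ^ s2                    # step 1: sign of the result
    e = e1 + e2 - 127              # step 2: initial exponent (one bias removed)
    prod = sig1 * sig2             # step 3: 48-bit product, radix after 2nd MSB
    if prod >= 1 << 47:            # step 4: first 1 at bit 47 -> shift right once
        prod >>= 1
        e += 1
    return s, e, prod >> 23        # step 5: keep 24 bits (truncated)

# 1.5 x 1.5 = 2.25 = 1.001b x 2^1, so the biased exponent is 128
print(fp_mul(0, 127, 3 << 22, 0, 127, 3 << 22))  # (0, 128, 9437184)
```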
Floating-Point Multiplication
Operands
• Both normal
◦ Result: normal, subnormal, overflow, or underflow

• Normal and subnormal


◦ Result: normal, subnormal, underflow

• Both subnormal
◦ Result: underflow
Floating-Point Multiplication
• Consider two normalized (positive) single precision numbers 𝑁1 and 𝑁2
N1 × N2 = (1.M1 × 2^(E1−127)) × (1.M2 × 2^(E2−127))
N1 × N2 = (1.M1 × 1.M2) × 2^((E1+E2−127)−127)

• If both operands are normal, initial 𝐸 of the result is


◦ 𝐸1 + 𝐸2 − 127
• If one operand is a subnormal number, initial 𝐸 is either
◦ −126 + 𝐸1 or −126 + 𝐸2
• Note: Initial exponent cannot fully identify if overflow or
underflow will occur especially with (but not limited to)
subnormal operands.
Floating-Point Multiplication
Normal only: N1 × N2 = (1.M1 × 1.M2) × 2^((E1+E2−127)−127)

• Significands are 24-bits wide each


◦ Whole product must be 48-bits
◦ Radix point is after the 2nd MSB
• When both operands are normalized, the first 1 can either be
at the
◦ 1st MSB (47th bit): extract bits 46 to 24
▪ Happens with last carry out in summing partial products
◦ 2nd MSB (46th bit): extract bits 45 to 23
▪ Happens without last carry out in summing partial products
Floating-Point Multiplication
Normal only: N1 × N2 = (1.M1 × 1.M2) × 2^((E1+E2−127)−127)

With subnormal: N1 × N2 = (0.M1 × 1.M2) × 2^((−126+E2)−127)

• When one operand is subnormal, the first 1 can appear almost "anywhere" in the 48-bit product
◦ First 1 can be at 47th bit (MSB) down to 23rd bit
◦ Possibly need to store the LSBs of the 48-bit product
▪ When first 1 is at 23rd bit, need bits 22 to 0 for the mantissa
◦ Do not truncate partial sums! Store whole 48-bit product.

• Is this the only option?


Floating-Point Multiplication
General Case: N1 × N2 = (b1.M1 × b2.M2) × 2^(E_actual,1 + E_actual,2)
where b1 and b2 can each be 1 or 0

Possible Strategies:
1. Multiply as is (same as previous slide)
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can be “anywhere”
2. Move the radix points to the right of both mantissas
▪ Adjust exponent/s
▪ Operands will always look like X23X22…X0. and Y23Y22…Y0.
▪ Radix point of 48-bit product is always after the LSB
▪ First 1 of the 48-bit product can be “anywhere”
Floating-Point Multiplication
General Case: N1 × N2 = (b1.M1 × b2.M2) × 2^((E1+E2−127)−127)

where b1 and b2 can each be 1 or 0

Possible Strategies:
3. Normalize subnormal operands before multiplication
▪ Adjust exponent/s
▪ Will always be similar to having “1. 𝑀1 ” and “1. 𝑀2 ”
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can only be at the MSB (bit 47) or next
MSB (bit 46)
▫ Need only at most 1 shift to normalize (if result can be normal)
▫ Otherwise, shift right until exponent is -126 (for subnormal results)
▫ Allows the LSBs of the partial sum to be shifted out in the addition of
partial products (no need to store the whole 48 bits throughout)
Floating-Point Division
Operands
• Both normal
◦ Result: normal, subnormal, overflow, or underflow

• Dividend is normal, divisor is subnormal


◦ Result: normal, overflow

• Dividend is subnormal, divisor is normal


◦ Result: normal, subnormal, underflow

• Both subnormal
◦ Result: normal
Floating-Point Division
• Consider two normalized (positive) single precision numbers 𝑁1 and 𝑁2
N1 / N2 = (1.M1 × 2^(E1−127)) / (1.M2 × 2^(E2−127))
N1 / N2 = (1.M1 / 1.M2) × 2^((E1−E2+127)−127)

• If both operands are normal, initial 𝐸 of the result is


◦ 𝐸1 − 𝐸2 + 127
• Adjust initial 𝐸 when at least one operand is subnormal

• Note: Initial exponent cannot fully identify if overflow or underflow will occur especially with (but not limited to) subnormal operands.
Floating-Point Division
Possible Cases for 𝑁1 /𝑁2
• N/N: (1.M1 / 1.M2) × 2^((E1−E2+127)−127)

• N/D: (1.M1 / 0.M2) × 2^((E1+126)−127)

• D/N: (0.M1 / 1.M2) × 2^((128−E2)−127)

• D/D: (0.M1 / 0.M2) × 2^(127−127)

Legend: (N) normal, (D) denormal


Floating-Point Division
General Case: N1 / N2 = (b1.M1 / b2.M2) × 2^(E_init−127)

where b1 and b2 can each be 1 or 0

• Goal: Divide the significands WITHOUT REMAINDER

◦ May need to extend the dividend to the left with zeros
• Quotient can be anywhere from 1 bit to infinitely long!
• Moving the radix point for both operands has no effect
◦ i.e. b1.M1 / b2.M2 = b1M1 / b2M2
• Radix point is after the "last" bit of the mantissas
• However, in general, the first 1 of the quotient can appear anywhere (to the right or to the left) of the radix point.
Floating-Point Division
General Case: N1 / N2 = (b1M1 / b2M2) × 2^(E_init−127)
where b1 and b2 can each be 1 or 0

Possible Strategy:
• Proceed with the division process discussed (without remainder) with the
following changes
◦ Assume that the first resulting quotient bit is a 1 and that the result will
be a normal number
◦ Given this assumption, it will take 24 cycles of the division to complete
the significand of the quotient (to get a 23-bit mantissa)
▪ Quotient becomes 24-bits (with an MSB of 1 as assumed)
◦ For this quotient, the radix point must be shifted to the left 23 times
▪ The 23 LSBs of the quotient must be used as the result mantissa (M)
▪ 23 must be added to the initial exponent
▫ Radix point moves from after the LSB to after the MSB of the 24-bit quotient
Floating-Point Division
General Case: N1 / N2 = (b1M1 / b2M2) × 2^(E_init−127)

where b1 and b2 can each be 1 or 0

Possible Strategy:
• Adjust the approach to cover all cases (first quotient bit is not a 1)
• To simplify things, let 𝑋 represent the numerator 𝑏1 𝑀1 and 𝑌 represent
the denominator 𝑏2 𝑀2
• Initially we assumed:

               ? × 2^(E_init−127)                 1xxxx…x. × 2^(E_init−127)
  b2M2 ) b1M1                      Y23Y22…Y0 ) X23X22…X0

  normalized: 1.xxxx…x × 2^(E_init+23−127)
Floating-Point Division
• What if the first 1 appears at the middle of the quotient?
               01xxx…x.x × 2^(E_init−127)   →   normalized: 1.xxxx…x × 2^(E_init+22−127)
  Y23Y22…Y0 ) X23X22…X0.0

◦ Need to extend the dividend to the right to get 23 bits of quotient after the first 1
◦ Subtract 1 from the exponent for every additional cycle
◦ Use the last 23 bits of the quotient as the mantissa of the result
▪ These 23 bits do not include the first/implied one
Floating-Point Division
• The described approach can also be used even when the final
quotient must be subnormal
◦ The algorithm stops only when 24 significant digits are obtained
◦ If the actual exponent becomes less than -126
▪ Shift the quotient to the right (throw away the LSB and shift in a zero)
▪ Add 1 to the exponent
▪ Repeat until the exponent is -126
◦ If 23 shifts are done and the exponent is still less than -126, then
an underflow occurred (return a floating point zero as the
answer)

• Further optimization?
◦ Identify the exponent
◦ Identify along the way if the result is subnormal?
Floating-Point Division
• Considering 1 bit at a time (from the MSB side) of the dividend is
like shifting in the bits into a “current dividend” register
◦ Recall: (non-)restoring division and division hardware
◦ When extending the dividend, if needed, simply shift in a zero
• A 24-bit quotient is needed where results are placed one bit at a
time (placing new results at the LSB) and shifting old results to the
left
◦ When a 1 is shifted (left) into the MSB of this quotient, then the
last division cycle must have completed the 24-bit result
◦ Indicator that the division process can be stopped
• However, the exponent can also be used to flag the division process
to stop (when the answer is a subnormal)
◦ How to do this?
Floating-Point Division
• From previous described process, every additional division
cycle (extension of the dividend) requires the exponent to be
increased by 1
◦ This process can be traced back from the initial division
without waiting for the 24-bit quotient to be completed!

• This time, assume that every new result bit from a division cycle is the last bit of the mantissa, even from the first division cycle
◦ Simplifies the process by always extracting the 23 LSBs of the current quotient
◦ Additional step: add 47 to the initial exponent
Floating-Point Division
• Revised process
1. Determine the sign of the result
2. Compute the initial exponent 𝐸𝑖𝑛𝑖𝑡
◦ Note the case (N/N, N/D, D/N, or D/D)
3. Add 47 to the 𝐸𝑖𝑛𝑖𝑡
4. Perform division of significands (see details next slide)
◦ Exponent, mantissa, and normalization are handled in this step
5. Place into the final registers
Floating-Point Division
• Division of significands
1. Shift a bit of the dividend (from the MSB side) into a
current dividend (also “remainder”) register
2. Perform division as with restoring division
▪ Subtract, evaluate and shift in a new quotient bit, restore if needed
▪ Check if 23rd bit of quotient is already a 1
▫ If yes, then 24-bit significand is already complete (last cycle)
· Still perform the step below
3. Subtract 1 from the exponent
▪ Check if the actual exponent is -126
▫ If yes, then the result is subnormal. Extract the 23 LSBs of the quotient and save as the mantissa.
4. Repeat if not yet done
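The division of significands above can be sketched as a restoring-style loop. This is a hypothetical model only: it assumes a non-zero dividend, truncates instead of rounding, omits the subnormal early exit, and returns the cycle count so the caller can apply the "add 47, then subtract 1 per cycle" exponent rule.

```python
def divide_significands(x, y, nbits=24):
    """Divide 24-bit significands x / y (x > 0, radix point after the LSB of
    each), producing the first `nbits` significant quotient bits and the number
    of division cycles performed (each cycle subtracts 1 from the exponent)."""
    q, r, cycles, pos = 0, 0, 0, 23
    while q < 1 << (nbits - 1):                    # stop once bit 23 of q is 1
        bit = (x >> pos) & 1 if pos >= 0 else 0    # extend dividend with zeros
        pos -= 1
        r = (r << 1) | bit       # shift a dividend bit into the remainder
        q <<= 1
        if r >= y:               # restoring step: keep the subtraction
            r -= y
            q |= 1
        cycles += 1
    return q, cycles

# 1.1b / 1.0b = 1.1b: 24-bit quotient after 47 cycles, so E = E_init + 47 - 47
print(divide_significands(3 << 22, 1 << 23))  # (12582912, 47)
# 1.0b / 1.1b = 0.1010...b: one extra cycle, so the exponent drops by 1
print(divide_significands(1 << 23, 3 << 22))  # (11184810, 48)
```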
Rounding off
• Intermediate values of significands may need to be
represented using more bits to minimize errors
• Guard bits – additional bits in the mantissa retained during
intermediate steps or operations
◦ Maintains additional accuracy in final results
• Three common-rounding off methods
◦ Truncation
◦ Von Neumann rounding
◦ Round to nearest (even)
Rounding Off (Truncation)
Truncation
• Extra bits are simply discarded
• Biased approximation – the error range is not symmetric, since results are always rounded in one direction (toward zero)
• "Rounding towards 0"
• Example: truncate from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.010110 → 0.010
Rounding Off (Von Neumann)
Von Neumann rounding
• If bits to be removed are all 0, truncate
• If any of the bits to be removed is 1, set LSB of retained bits to 1
• Unbiased approximation (symmetric error)
• Larger error range
• Example: from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.001001 → 0.001
◦ 0.010001 → 0.011
Rounding Off (Nearest Even)
Round to nearest (even)
• Requires 3 guard bits
◦ First two guard bits are part of the mantissa to be removed
◦ Third bit is the OR of all bits beyond the first two (the "sticky" bit: 1 if at least one of them is 1)
• Achieves least range of error but is also most difficult to
implement because of addition and possible re-normalization
Rounding Off (Nearest Even)
Round to nearest (even)
• Do the following for each case of guard bits
◦ 0xx – truncate
◦ 100 – round to nearest (even)
▪ +1 to LSB if LSB is 1 (odd)
▪ Truncate if LSB is 0 (already even)
◦ 101, 110, 111 – round up (+1 to LSB)

• Example: From 6 bits to 3 bits


◦ 0.010101 → 0.010 + 0.001 = 0.011 (+1 LSB)
◦ 0.001101 → 0.001 + 0.001 = 0.010 (+1 LSB)
◦ 0.010011 → 0.010 (truncate)
◦ 0.001100 → 0.001 + 0.001 = 0.010 (nearest even, +1 LSB)
◦ 0.010100 → 0.010 (nearest even, truncate)
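The four guard-bit cases can be written as a small Python helper (a hypothetical sketch: `bits` is the retained mantissa as an integer and `guard` the 3 guard bits, with the third bit being the sticky OR):

```python
def round_nearest_even(bits, guard):
    """Round-to-nearest-even: round `bits` given the 3 guard bits removed
    below it (third guard bit = OR of everything beyond the first two)."""
    if guard > 0b100:                    # 101, 110, 111: round up
        return bits + 1
    if guard == 0b100 and bits & 1:      # exact tie: +1 only if LSB is odd
        return bits + 1
    return bits                          # 0xx, or tie with even LSB: truncate

# The slide's examples, with 0.xxxxxx split as 3 kept bits + 3 guard bits:
print(round_nearest_even(0b010, 0b101))  # 3 : 0.010101 -> 0.011
print(round_nearest_even(0b001, 0b100))  # 2 : 0.001100 -> 0.010 (odd LSB, +1)
print(round_nearest_even(0b010, 0b100))  # 2 : 0.010100 -> 0.010 (even, truncate)
```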
Rounding Off (Nearest Even)
• Consider the following:
  1.000 × 2^5    (32.00)
− 1.111 × 2^1  − ( 3.75)
                  28.25

What happens with infinite precision?


What happens with nearest even rounding off?
What happens with 2 guard bits?
Rounding Off (Nearest Even)
• With infinite precision for partial operands

   1.000 × 2^5          1.000 0000 × 2^5     01.000 0000 × 2^5
 − 1.111 × 2^1        − 0.000 1111 × 2^5   + 11.111 0001 × 2^5
                                             00.111 0001 × 2^5

normalized complete answer: 1.110 001 × 2^4 (28.25)
normalized rounded answer: 1.110 × 2^4 (28)

• With nearest even rounding off (3 guard bits)

   1.000 × 2^5            1.000 × 2^5        01.000 × 2^5
 − 0.000 1111 × 2^5     − 0.000 111 × 2^5  + 11.111 001 × 2^5
                                             00.111 001 × 2^5

normalized answer: 1.110 01 × 2^4 (28.5₁₀)
rounded answer: 1.110 × 2^4 (28)
Rounding Off (Nearest Even)
• With infinite precision (2 guard bits)
◦ Last guard bit (2nd guard bit) acts as a sticky bit

   1.000 × 2^5            1.000 × 2^5       01.000 × 2^5
 − 0.000 1111 × 2^5     − 0.000 11 × 2^5  + 11.111 01 × 2^5
                                            00.111 01 × 2^5

normalized answer: 1.110 1 × 2^4 (29)
rounded answer: 1.111 × 2^4 (30)
