Multiplying Floating Point Numbers
Introduction
We'll do multiplication using the one byte floating point representation discussed in the
other class notes. IEEE 754 single precision has so many bits to work with that it's
simply easier to explain how floating point multiplication works using a small float
representation.
Multiplication is simple. Suppose you want to multiply two floating point
numbers, X and Y. Here's how:
1. First, convert the two representations to scientific notation. Thus, we explicitly
represent the hidden 1.
2. Let x be the exponent of X, and let y be the exponent of Y. The resulting exponent
(call it z) is the sum of the two exponents: z = x + y. z may need to be adjusted
after step 4.
3. Multiply the mantissa of X by the mantissa of Y. Call this result m.
4. If m does not have a single 1 to the left of the radix point, then adjust the radix
point so it does, and adjust the exponent z to compensate.
5. Add the sign bits, mod 2, to get the sign of the result.
6. Convert back to the one byte floating point representation, truncating bits if
needed.
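The steps above can be sketched in code. This is our own illustration (the name fp8_mul is made up), assuming the one byte layout of 1 sign bit, 4 exponent bits with a bias of 7, and 3 fraction bits, and ignoring denormals, overflow, and underflow:

```python
def fp8_mul(a, b):
    # Unpack: 1 sign bit, 4 exponent bits (bias 7), 3 fraction bits.
    sa, ea, fa = a >> 7, (a >> 3) & 0xF, a & 0x7
    sb, eb, fb = b >> 7, (b >> 3) & 0xF, b & 0x7

    # Step 1: make the hidden 1 explicit (mantissa is 1.fff, a 4-bit integer).
    ma, mb = 8 | fa, 8 | fb

    # Step 2: add the (unbiased) exponents.
    z = (ea - 7) + (eb - 7)

    # Step 3: multiply the mantissas. The 8-bit product has its radix
    # point after the top two bits (xx.yyyyyy).
    m = ma * mb

    # Step 4: renormalize so there is a single 1 left of the radix point.
    if m & 0x80:      # product is 1x.yyyyyy: shift right, bump the exponent
        m >>= 1
        z += 1

    # Step 5: the sign is the sum of the sign bits mod 2 (i.e., XOR).
    s = sa ^ sb

    # Step 6: repack, truncating the mantissa to 3 fraction bits.
    return (s << 7) | ((z + 7) << 3) | ((m >> 3) & 0x7)
```

Running this on the example below (X = 0 1001 010, Y = 0 0111 110) reproduces the result 0 1010 000.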
Example
Let's multiply the following two numbers (both signs are 0; the exponent uses a bias of 7):

Variable   sign   exponent   fraction
X          0      1001       010
Y          0      0111       110

In scientific notation, X = 1.010 x 2^2 and Y = 1.110 x 2^0. The sum of the
exponents is z = 2 + 0 = 2. Multiplying the mantissas gives 1.010 x 1.110 = 10.0011,
which has two digits left of the radix point, so we shift the radix point left by one
and adjust the exponent: the product is 1.00011 x 2^3. The sign is 0 + 0 = 0, mod 2.
Truncating the fraction to 3 bits and converting back:

Variable   sign   exponent   fraction
X*Y        0      1010       000
Negative Values
Unlike floating point addition, negative values are simple to take care of in floating
point multiplication. Treat each sign bit as a 1 bit unsigned binary (UB) number, and
add them modulo 2. This is the same as XORing the sign bits.
Floating Point Addition
Example 1
Let's add the following two numbers:

Variable   sign   exponent   fraction
X          0      1001       110
Y          0      0111       000

In scientific notation, X = 1.110 x 2^2 and Y = 1.000 x 2^0. Shifting Y's mantissa
right to match X's exponent gives 0.010 x 2^2. Adding the mantissas, 1.110 + 0.010 =
10.000, so we renormalize to 1.000 x 2^3 and convert back:

Variable   sign   exponent   fraction
X+Y        0      1010       000
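The align-add-renormalize procedure for addition can also be sketched in code. Again this is our own illustration (fp8_add is a made-up name), assuming the 1 sign / 4 exponent (bias 7) / 3 fraction layout, non-negative normalized inputs, 3 extra bits kept during alignment, and simple truncation at the end:

```python
def fp8_add(a, b):
    # Unpack the 1/4/3 layout; assumes both inputs are non-negative.
    ea, fa = (a >> 3) & 0xF, a & 0x7
    eb, fb = (b >> 3) & 0xF, b & 0x7
    ma, mb = 8 | fa, 8 | fb        # make the hidden 1 explicit: 1.fff

    # Make (ea, ma) the number with the larger exponent.
    if eb > ea:
        ea, ma, eb, mb = eb, mb, ea, ma

    # Align: give both mantissas 3 extra low bits, then shift the smaller
    # number's mantissa right by the exponent difference.
    shift = ea - eb
    ma <<= 3
    mb = (mb << 3) >> shift

    m = ma + mb                    # add the aligned mantissas
    z = ea

    # Renormalize: a carry means the sum is 1x.yyy..., so shift right
    # and bump the exponent.
    if m >> 7:
        m >>= 1
        z += 1

    # Truncate back to 3 fraction bits and repack (sign is 0).
    return (z << 3) | ((m >> 3) & 0x7)
```

On the two worked examples this gives 0 1010 000 for 1.110 x 2^2 + 1.000 x 2^0, and 0 1001 111 (that is, 1.111 x 2^2) for Example 2.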
Example 2
Let's add the following two numbers:

Variable   sign   exponent   fraction
X          0      1001       110
Y          0      0110       110

In scientific notation, X = 1.110 x 2^2 and Y = 1.110 x 2^-1. The exponents differ
by 3, so we shift Y's mantissa right three places: 0.00111 x 2^2 (keeping two extra
bits). Adding the mantissas gives 1.110 + 0.00111 = 1.11111 x 2^2, which has two more
fraction bits than we can store.
However, for simplicity, we're going to truncate the additional two bits. After
truncating, we get 1.111 x 2^2. We convert this back to floating point.
Sum

Variable   sign   exponent   fraction
X+Y        0      1001       111

(The biased exponent is 2 + 7 = 9, which is 1001.)
This example illustrates what happens if the exponents are separated by too much. In
fact, if the exponents differ by 4 or more, then effectively, you are adding 0 to the
larger of the two numbers.
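To see why a gap of 4 or more makes the smaller number vanish, here is a small self-contained sketch (the helper name add_mantissas and the choice of 3 guard bits are ours, not from the notes). Mantissas are integers scaled by 8, so 1.110 is 14:

```python
def add_mantissas(mx, my, exp_diff, frac_bits=3):
    # Align the smaller addend, keeping frac_bits extra guard bits.
    guard = frac_bits
    aligned = (my << guard) >> exp_diff
    total = (mx << guard) + aligned
    # Renormalize if the sum carried into the 2s place.
    # (In a full adder the exponent would also be bumped here.)
    if total >> (2 * frac_bits + 1):
        total >>= 1
    return total >> guard   # truncate back to frac_bits fraction bits

print(add_mantissas(14, 15, 3))  # 15: Y still changes the result
print(add_mantissas(14, 15, 4))  # 14: Y vanishes; X is unchanged
```

With 3 stored fraction bits, a gap of 4 pushes all of Y's bits below the positions that survive the final truncation, so X comes back unchanged.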
Negative Values
So far, we've only considered adding two non-negative numbers. What happens with
negative values?
If you're doing it on paper, then you proceed with the sum as usual. Just do normal
addition or subtraction.
If it's in hardware, you would probably convert the mantissas to two's complement
and perform the addition, while keeping track of the radix point (read about fixed
point representation).
Bias
Does the bias representation help us in floating point addition? The main difficulty
lies in computing the difference of the exponents. Still, that's not so bad, because we
can just do unsigned subtraction. For the most part, the bias doesn't pose too many
problems.
Overflow/Underflow
It's possible for a result to overflow (a result that's too large to be represented) or
underflow (smaller in magnitude than the smallest denormal, but not zero). Real
hardware has rules to handle this. We won't worry about it much, except to
acknowledge that it can happen.
Summary
Adding two floating point values isn't so difficult. It basically consists of adjusting the
exponent of the number with the smaller exponent (call it Y) to match that of the larger
(call it X), shifting the mantissa of Y right to compensate.
Once the addition is done, we may have to renormalize and truncate bits if there are
too many bits to be represented.
If the difference in the exponents is too great, then adding X + Y effectively
results in X.
Real floating point hardware uses more sophisticated means to round the summed
result. We take the simplification of truncating bits if there are more bits than can be
represented.