Floating Point Representation
Coder's Corner
These numbers are called floating point numbers because the binary point is not fixed.
Until about the 1980s, different computer manufacturers used different formats for
representing floating point numbers. Since the introduction of the IEEE 754 standard,
almost all computers follow it, which has greatly increased the portability of
floating point data.
With this standard, normalized floating point numbers are represented in the form

    (-1)^s × 1.F × 2^E

Here s represents the sign of the number: when s = 1 the number is negative, and
when s = 0 it is positive. F represents the fraction (also called the mantissa) and E is
the exponent.
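The two most commonly used formats are structured as follows. Single precision (32 bits) has 1 sign bit, 8 exponent bits and 23 fraction bits. Double precision (64 bits) has 1 sign bit, 11 exponent bits and 52 fraction bits.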
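To make the field layout concrete, here is a minimal Python sketch that splits a number into its sign, exponent and fraction fields for both formats. The helper names float32_fields and float64_fields are made up for this illustration; only the standard struct module is used.

```python
import struct

def float32_fields(x):
    """Return (sign, exponent, fraction) fields of x stored as single precision."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, still biased
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction

def float64_fields(x):
    """Return (sign, exponent, fraction) fields of x stored as double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, still biased
    fraction = bits & ((1 << 52) - 1)  # 52 bits
    return sign, exponent, fraction

print(float32_fields(-4.40625))
print(float64_fields(-4.40625))
```

The exponent values returned here are the raw stored fields, so they still include the bias.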
Now let's see how to convert a given decimal number to its floating point binary
representation. Let's take -4.40625 as an example.
Step 1:
First convert the integral part, which is 4, to binary. This gives 100.
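As a small illustration of Step 1, the integral part can be converted by repeatedly dividing by 2 and collecting the remainders. The function name int_to_binary is hypothetical, and Python's built-in bin() would give the same digits.

```python
def int_to_binary(n):
    """Convert a non-negative integer to a string of binary digits."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # remainder is the next least significant bit
        n //= 2
    # Remainders come out least significant first, so reverse them.
    return "".join(reversed(bits))

print(int_to_binary(4))
```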
Step 2:
Then we can multiply the fractional part repeatedly by 2 and pick the bit that appears on
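the left of the decimal point to get the binary representation of the fractional part. For
example,

    0.40625 × 2 = 0.8125  → 0
    0.8125  × 2 = 1.625   → 1
    0.625   × 2 = 1.25    → 1
    0.25    × 2 = 0.5     → 0
    0.5     × 2 = 1.0     → 1

Reading the bits from top to bottom, 0.40625 is 0.01101 in binary, so 4.40625 is 100.01101.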
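The repeated multiplication in Step 2 can be sketched in a few lines of Python. The name fraction_to_binary and the max_bits cutoff are assumptions made for this example; the cutoff is there because many decimal fractions never terminate in binary.

```python
def fraction_to_binary(frac, max_bits=23):
    """Return the binary digits of a fraction in [0, 1), at most max_bits of them."""
    bits = []
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)   # the digit that moved to the left of the point
        bits.append(str(bit))
        frac -= bit       # keep only the remaining fractional part
    return "".join(bits)

print(fraction_to_binary(0.40625))
```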
Step 3:
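Now we need to normalize the number by moving the binary point so that it takes the
form 1.F × 2^E. In binary, 4.40625 is 100.01101, and moving the binary point two places
to the left gives 1.0001101 × 2^2.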
One very important thing to remember here is that the leading 1 bit does not need to be
stored, since it is implied. This is a clever trick the standard uses to gain an extra
bit of precision for the fraction. So in the fraction field (23 or 52 bits) we only need to
store 0001101 in this case and fill the remaining bits to the right with 0s.
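Normalization and the hidden leading 1 can be sketched as a small string manipulation. The helper normalize and its parameters are hypothetical, and this version simply truncates extra bits instead of rounding them the way real hardware would.

```python
def normalize(int_bits, frac_bits, field_width=23):
    """Return (exponent, stored fraction bits) for the binary number int_bits.frac_bits."""
    digits = int_bits + frac_bits
    first_one = digits.index("1")  # raises ValueError if the number is zero
    # How many places the binary point moves to sit just after the first 1.
    exponent = len(int_bits) - 1 - first_one
    # Drop the implied leading 1, then pad on the right with zeros (truncating if too long).
    stored = digits[first_one + 1:][:field_width].ljust(field_width, "0")
    return exponent, stored

print(normalize("100", "01101"))
```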
Step 4:
Now the exponent is represented as an integer in biased form. If the exponent field has
k bits, then the bias equals

    bias = 2^(k-1) - 1
Add this bias to the exponent and place the result in the exponent field. With single
precision, k is 8, so the bias is 127 and the exponent field in this example equals

    2 + 127 = 129 = 10000001 in binary
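The bias calculation is easy to check in code. The names bias and exponent_field below are made up for this sketch.

```python
def bias(k):
    """Bias for an exponent field that is k bits wide."""
    return 2 ** (k - 1) - 1

def exponent_field(exponent, k):
    """Biased exponent as a k-bit binary string."""
    return format(exponent + bias(k), f"0{k}b")

print(bias(8), bias(11))
print(exponent_field(2, 8))
```

For double precision the same exponent of 2 would be stored with k = 11 and a bias of 1023.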
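Finally, the sign bit is set according to the sign of the original number: 1 for
negative and 0 for positive. Since -4.40625 is negative, s = 1, and the complete single
precision pattern is 1 10000001 00011010000000000000000.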
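As a sanity check, the three fields worked out by hand can be glued together and compared with what Python's struct module produces for the same number. The variable names are only for illustration.

```python
import struct

sign = "1"                          # the number is negative
exponent = "10000001"               # 2 + 127 = 129
fraction = "0001101".ljust(23, "0") # stored bits, padded with zeros on the right

pattern = sign + exponent + fraction
raw = int(pattern, 2).to_bytes(4, "big")

# Reinterpret the 4 bytes as a big-endian single precision float.
value = struct.unpack(">f", raw)[0]
print(pattern, raw.hex(), value)
```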
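Infinity: When the exponent bits are all ones and the fraction is zero, the value is infinity. The sign bit distinguishes positive infinity from negative infinity.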
NaN: When the exponent bits are all ones but the fraction is nonzero, the
resulting value is said to be NaN, which is short for Not a Number. You get this value
when you perform invalid operations such as dividing zero by zero or subtracting infinity
from infinity.
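These special values can be recognised from the bit pattern alone. The sketch below uses a hypothetical classify helper on single precision values. Note that Python raises ZeroDivisionError for 0.0 / 0.0 instead of returning NaN, so the example produces a NaN by subtracting infinity from infinity.

```python
import struct

def classify(x):
    """Classify x (stored as single precision) by inspecting its exponent and fraction."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:  # exponent bits are all ones
        return "nan" if fraction != 0 else "infinity"
    return "finite"

inf = float("inf")
print(classify(inf), classify(-inf), classify(inf - inf), classify(-4.40625))
```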
Rukshani Athapathu