Coder's Corner

Floating Point Representation


Rukshani Athapathu
May 22, 2018 · 4 min read

Image Courtesy: Pexels

Numbers with fractions, which can be written in the form (-1)^s × m × 2^e, can be represented as floating point numbers in computers.

These numbers are called floating point numbers because the binary point is not fixed. Up until about the 1980s, different computer manufacturers used different formats for representing floating point numbers, but with the introduction of IEEE Standard 754, almost all computers now follow the same standard, which has greatly increased the portability of floating point data.

IEEE Floating Point Representation


The IEEE standard defines three formats for representing floating point numbers:

1. Single Precision (32 bits)

2. Double Precision (64 bits)

3. Extended Precision (80 bits)

and with this standard, floating point numbers are represented in the form

(-1)^s × F × 2^E

s represents the sign of the number: when s = 1 the floating point number is negative, and when s = 0 it is positive. F represents the fraction (which is also called the mantissa) and E is the exponent.

The structure of the two most commonly used formats is shown below.

Single Precision (32-bit): 1 sign bit | 8 exponent bits | 23 fraction bits

Double Precision (64-bit): 1 sign bit | 11 exponent bits | 52 fraction bits
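
To make that layout concrete, here is a minimal sketch that pulls the three fields out of a single precision number. Python and its struct module are my choice for the example, and the float32_fields helper is made up for illustration; none of this is from the original article.

```python
# Sketch: extract the sign, exponent and fraction fields of a 32-bit float.
import struct

def float32_fields(x):
    """Return the (sign, exponent, fraction) bit fields of x as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                   # 1 bit
    exponent = (bits >> 23) & 0xFF      # 8 bits, stored in biased form
    fraction = bits & 0x7FFFFF          # 23 bits, the implied leading 1 is not stored
    return sign, exponent, fraction

print(float32_fields(-4.40625))  # (1, 129, 851968)
```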

Now let's see how we can convert a given decimal number to its floating point binary representation. Let's take -4.40625 as an example.

Step 1:
First convert the integral part, which is 4, to binary:

4 = 100 (binary)

Step 2:
Then multiply the fractional part repeatedly by 2 and pick the bit that appears on the left of the binary point to get the binary representation of the fractional part. For example,

0.40625 × 2 = 0.8125 → 0
0.8125 × 2 = 1.625 → 1
0.625 × 2 = 1.25 → 1
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1

So 0.40625 = 0.01101 in binary, and 4.40625 = 100.01101.
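
Here is a small sketch of this multiply-by-2 procedure in Python (the language choice and the fraction_to_binary name are my own for illustration, not from the article):

```python
# Sketch: convert the fractional part of a number to binary by repeated doubling.
def fraction_to_binary(frac, max_bits=23):
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # the digit that crossed the binary point
        bits.append(str(bit))
        frac -= bit            # keep only the remaining fractional part
    return "".join(bits)

print(fraction_to_binary(0.40625))  # '01101'  ->  0.40625 = 0.01101 in binary
```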

Step 3:
Now we need to normalize the number by moving the binary point so that it takes the form

1.xxxxx × 2^E

In this example, 100.01101 becomes 1.0001101 × 2².

One very important thing to remember here is that the leading 1 bit does not need to be stored, since it is implied. That is a clever trick used by the standard to gain an extra bit of precision for the fractional part. So for the fraction field (23 or 52 bits) we only need to store 0001101 in this case and fill the rest of the bits to the right with 0s.
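
As a rough sketch of this normalization step, Python's math.frexp can do most of the work; it returns a mantissa in [0.5, 1), so one extra shift gives the 1.xxxxx form (again, using Python here is my assumption, not something from the article):

```python
# Sketch: normalize 4.40625 into the 1.xxxxx * 2**E form.
import math

m, e = math.frexp(4.40625)   # frexp gives 4.40625 == 0.55078125 * 2**3
m, e = m * 2, e - 1          # rescale to the normalized 1.xxxxx * 2**E form
print(m, e)                  # 1.1015625  2   ->  4.40625 == 1.0001101 (binary) * 2**2
```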

Step 4:
Now the exponent is represented as an integer in biased form. If the exponent field has k bits, then the bias equals

2^(k−1) − 1

Add this bias to the exponent and place the result in the exponent field. With single precision, k is 8 bits, so the bias is 127 and the exponent value in this example equals

2 + 127 = 129 = 10000001 (binary)

Finally, the sign bit is set according to the original sign of the number: 1 for negative and 0 for positive.

So the decimal value -4.40625 in binary form can be represented as

1 10000001 00011010000000000000000

(sign = 1, exponent = 10000001, fraction = 00011010000000000000000)
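A quick way to check the worked example is to pack -4.40625 into 32 bits and decode the fields back again. This is a sketch assuming Python's struct module, not code from the article:

```python
# Sketch: -4.40625 should encode as sign 1, exponent 10000001, fraction 0001101 + zeros.
import struct

bits = struct.unpack(">I", struct.pack(">f", -4.40625))[0]
s = bits >> 31
e = (bits >> 23) & 0xFF
f = bits & 0x7FFFFF
print(f"{s:01b} {e:08b} {f:023b}")    # 1 10000001 00011010000000000000000

# Decode the fields back with (-1)^s * 1.F * 2^(E - 127) to confirm the value:
value = (-1) ** s * (1 + f / 2**23) * 2 ** (e - 127)
print(value)                           # -4.40625
```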


Special Cases — Infinity and NaN
Infinity — When the exponent bits are all ones and the fraction bits are all 0s, the resulting value represents infinity.


NaN — When the exponent bits are all ones but the fraction is non-zero, the resulting value is NaN, which is short for Not a Number. You get this value when you perform invalid operations like dividing zero by zero or subtracting infinity from infinity.

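Here is a short sketch of both special cases (again assuming Python and struct; the fields helper is made up for illustration):

```python
# Sketch: the bit patterns of infinity and NaN in single precision.
import struct

def fields(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return f"{bits >> 31:01b} {(bits >> 23) & 0xFF:08b} {bits & 0x7FFFFF:023b}"

print(fields(float("inf")))          # 0 11111111 00000000000000000000000
print(fields(float("nan")))          # exponent all ones, fraction non-zero (exact bits vary by platform)
print(float("inf") - float("inf"))   # nan -- an invalid operation yields NaN
```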

