
Floating Point Numbers: CS101 Introduction To Computing

Floating point numbers use a binary representation scheme called IEEE 754 that represents numbers as a sign bit, exponent field, and mantissa to support a wide range of values much larger and smaller than can be represented with integers. Floating point numbers have limited precision due to their fixed field sizes, resulting in a non-uniform density of representable values across their range. Arithmetic and logical operations on floating point numbers in programming languages like C may require special handling of conversions and type casting between integer and floating point types.

Uploaded by

Mihir Kumar Mech

CS101 Introduction to computing 

Floating Point Numbers

A. Sahu and S. V. Rao

Dept of Comp. Sc. & Engg.
Indian Institute of Technology Guwahati

Outline
• Need for floating point numbers
• Number representation: IEEE 754
• Floating point range
• Floating point density
  – Accuracy
• Arithmetic and logical operations on FP
• Conversions and type casting in C

Need to go beyond integers
• integer    7
• rational   5/8
• real       √3
• complex    2 - 3i
[Figure: diagram relating the integer, rational, real and complex numbers]

Extremely large and small values:
distance Pluto - Sun = 5.9 x 10^{12} m
mass of electron = 9.1 x 10^{-28} gm

Representing fractions
• Integer pairs (for rational numbers)
    5 | 8  =  5/8
• Strings with explicit decimal point
    - 2 4 7 . 0 9
• Implicit point at a fixed position
    0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1   (implicit point at an agreed position)
• Floating point
    fraction x base^{power}

Numbers with binary point
101.11 = 1x2^2 + 0x2^1 + 1x2^0 + 1x2^{-1} + 1x2^{-2}
       = 4 + 0 + 1 + 0.5 + 0.25 = 5.75 (base 10)
0.6 = 0.10011001100110011001.....
.6 x 2 = 1 + .2
.2 x 2 = 0 + .4
.4 x 2 = 0 + .8
.8 x 2 = 1 + .6

Numeric Data Type
• char, short, int, long int
  – char : 8 bit number (1 byte = 1B)
  – short : 16 bit number (2B)
  – int : 32 bit number (4B)
  – long int : 64 bit number (8B)
• float, double, long double
  – float : 32 bit number (4B)
  – double : 64 bit number (8B)
  – long double : 128 bit number (16B)
(Sizes are those of a typical 64 bit Linux system; the C standard fixes only minimum sizes.)

Numeric Data Type
[Figure: bit layouts of unsigned char, char, unsigned short, short, unsigned int and int]

Numeric Data Type
• char, short, int, long int
  – We have signed and unsigned versions
  – char (8 bit)
    • char : -128 to 127 (two's complement, so there is a single zero and no separate -0)
    • unsigned char : 0 to 255
  – int : -2^{31} to 2^{31}-1
  – unsigned int : 0 to 2^{32}-1
• float, double, long double
  – For fractional, real number data
  – All of these types are signed and are stored in a different format

Numeric Data Type
[Figure: field layouts. float: sign bit | exponent | mantissa. double: sign bit | exponent | mantissa-1, continued in mantissa-2]

FP numbers with base = 10
(-1)^S x F x 10^E
S = Sign
F = Fraction (fixed point number), usually called Mantissa or Significand
E = Exponent (positive or negative integer)
Examples: 5.9 x 10^{12},  -2.6 x 10^3,  9.1 x 10^{-28}

Only one non-zero digit to the left of the point

FP numbers with base = 2
(-1)^S x F x 2^E
S = Sign
F = Fraction (fixed point number), usually called Mantissa or Significand
E = Exponent (positive or negative integer)

• How to divide a word into S, F and E?
• How to represent S, F and E?

Examples: 1.0101 x 2^{12},  -1.1101 x 2^3,  1.101 x 2^{-18}
Only one non-zero digit to the left of the point: in binary that digit is always 1,
so there is no need to store it.

IEEE 754 standard
Single precision numbers
  field:  S    E              F
  bits:   1    8              23
  e.g.:   0    1011 0101      1101 0110 1011 0001 0110 110
Double precision numbers
  field:  S    E              F
  bits:   1    11             20+32
  e.g.:   0    1011 0101 111  1101 0110 1011 0001 0110
                              1011 0001 0110 1100 1011 0101 1101 0110

Representing F in IEEE 754
Single precision numbers: 23 stored bits
  F = 1. 1101 0110 1011 0001 0110 110
Double precision numbers: 20+32 stored bits
  F = 1. 1101 0110 1011 0001 0110
         1011 0001 0110 1100 1011 0101 1101 0110

Only one non-zero digit to the left of the point: in binary it is always 1,
so there is no need to store this bit.

Value Range for F
Single precision numbers
  1 ≤ F ≤ 2 - 2^{-23}   or   1 ≤ F < 2
Double precision numbers
  1 ≤ F ≤ 2 - 2^{-52}   or   1 ≤ F < 2

These are "normalized".

Representing E in IEEE 754
Single precision numbers: 8 bits, bias 127
  E: 10110101
Double precision numbers: 11 bits, bias 1023
  E: 10110101110

Floating point values
• E = E' - 127, V = (-1)^S x 1.M x 2^{E' - 127}
• V = 1.1101... x 2^{(40 - 127)} = 1.1101... x 2^{-87}
Single precision numbers
  field:  S    E'           F
  bits:   1    8            23
  value:  0    0010 1000    1101 0110 1011 0001 0110 110

Floating point values
• E = E' - 127, V = (-1)^S x 1.M x 2^{E' - 127}
• V = -1.1 x 2^{(126 - 127)} = -1.1 x 2^{-1} = -0.11 x 2^0
    = -0.11 (base 2) = -11 (base 2) / 2^2 = -3/4 = -0.75 (base 10)
Single precision numbers
  field:  S    E'           F
  bits:   1    8            23
  value:  1    0111 1110    1000 0000 0000 0000 0000 000

Value Range for E
Single precision numbers
  -126 ≤ E ≤ 127
  (all 0's and all 1's have special meanings)
Double precision numbers
  -1022 ≤ E ≤ 1023
  (all 0's and all 1's have special meanings)

Floating point demo applet on the web
• https://www.h-schmidt.net/FloatConverter/IEEE754.html
• Google "Float applet" to get the above link

Overflow and underflow
Largest positive/negative number (SP) =
  ±(2 - 2^{-23}) x 2^{127} ≅ ±3.4 x 10^{38}
Smallest positive/negative normalized number (SP) =
  ±1 x 2^{-126} ≅ ±1.2 x 10^{-38}

Largest positive/negative number (DP) =
  ±(2 - 2^{-52}) x 2^{1023} ≅ ±1.8 x 10^{308}
Smallest positive/negative normalized number (DP) =
  ±1 x 2^{-1022} ≅ ±2.2 x 10^{-308}

Density of int vs float
Int : 32 bit
Float : 32 bit (sign | exponent | mantissa)
• Number of numbers that can be represented
  – Both cases (float, int): 2^{32}
• Range
  – int: -2^{31} to 2^{31}-1
  – float: large ±(2 - 2^{-23}) x 2^{127}, small ±1 x 2^{-126}
• 50% of float numbers are small (between -1 and +1)

Density of Floating Points
• 256 persons in a room of capacity 256 (range)
  – 8 bit integer: 256/256 = 1
• 256 persons in a room of capacity 200000 (range)
  – The 1st row should be filled with 128 persons
  – 50% of the numbers have a negative power: -1 < N < +1
• Density of floating point numbers is
  – Dense towards 0
  – Sparse towards ∞
  -∞ ...   -2   -1   0   +1   +2   ... +∞

Expressible Numbers (int and float)
Expressible integers
  - overflow | -2^{31} ... 0 ... 2^{31}-1 | + overflow
Expressible floats
  - overflow | -(1-2^{-24}) x 2^{128} ... -0.5 x 2^{-125} | - underflow | 0 | + underflow | 0.5 x 2^{-125} ... (1-2^{-24}) x 2^{128} | + overflow

Distribution of Values
• 6-bit IEEE-like format
  – e = 3 exponent bits
  – f = 2 fraction bits
  – Bias is 3
• Notice how the distribution gets denser toward zero.
[Figure: representable values on a number line from -15 to 15; legend: Denormalized, Normalized, Infinity]

Distribution of Values (close-up view)
• 6-bit IEEE-like format
  – e = 3 exponent bits
  – f = 2 fraction bits
  – Bias is 3
[Figure: representable values on a number line from -1 to 1; legend: Denormalized, Normalized, Infinity]

Density of 32 bit float SP
• Fraction/mantissa is 23 bits
• Number of different numbers that can be stored for a particular value of the exponent
  – Assume exp=1: 2^{23} = 8x1024x1024 ≈ 8x10^6
  – Between 1 and 2 we can store 8x10^6 numbers
• Similarly
  – for exp=2, between 2 and 4, 8x10^6 numbers can be stored
  – for exp=3, between 4 and 8, 8x10^6 numbers can be stored
  – for exp=4, between 8 and 16, 8x10^6 numbers can be stored

Density of 32 bit float SP
• Similarly
  – for exp=23, between 2^{22} and 2^{23}, 8x10^6 numbers can be stored
  – for exp=24, between 2^{23} and 2^{24}, 8x10^6 numbers can be stored: OK
  – for exp=25, between 2^{24} and 2^{25}, 8x10^6 numbers can be stored
    • the range 2^{24} to 2^{25} contains more than 8x10^6 integers: BAD
  – …
  – for exp=127, between 2^{126} and 2^{127}, 8x10^6 numbers can be stored: WORST

Density of 32 bit float SP
• 2^{23} = 8x1024x1024 ≈ 8x10^6
[Figure: number line marked at 0, 1, 2, 4, 8, 16]

Numbers in float format
• Largest positive/negative number (SP) =
  ±(2 - 2^{-23}) x 2^{127} ≅ ±3.4 x 10^{38}
• Second largest number:
  ±(2 - 2^{-22}) x 2^{127}
• Difference: largest FP - 2nd largest FP
  = (2^{-22} - 2^{-23}) x 2^{127} = 2^{-23} x 2^{127} = 2^{104} ≅ 2 x 10^{31}
• Smallest positive/negative normalized number (SP) =
  ±1 x 2^{-126} ≅ ±1.2 x 10^{-38}

Addition/Sub of Floating Point
3.2 x 10^8 ± 2.8 x 10^6
Step 1: Align exponents
  320 x 10^6 ± 2.8 x 10^6
Step 2: Add mantissas
  322.8 x 10^6
Step 3: Normalize
  3.228 x 10^8

Floating point operations: ADD
• Add/subtract A = A1 ± A2
  [(-1)^{S1} x F1 x 2^{E1}] ± [(-1)^{S2} x F2 x 2^{E2}]
  suppose E1 > E2, then we can write it as
  [(-1)^{S1} x F1 x 2^{E1}] ± [(-1)^{S2} x F2' x 2^{E1}]
  where F2' = F2 / 2^{E1-E2}
  The result is
  (-1)^{S1} x (F1 ± F2') x 2^{E1}
  It may need to be normalized
• Example: 3.2 x 10^8 ± 2.8 x 10^6 = 320 x 10^6 ± 2.8 x 10^6, giving 322.8 x 10^6 = 3.228 x 10^8 for the sum
Testing Associativity with FP
• X = -1.5 x 10^{38}, Y = 1.5 x 10^{38}, Z = 1000.0
• X+(Y+Z) = -1.5 x 10^{38} + (1.5 x 10^{38} + 1000.0)
          = -1.5 x 10^{38} + 1.5 x 10^{38}
          = 0
• (X+Y)+Z = (-1.5 x 10^{38} + 1.5 x 10^{38}) + 1000.0
          = 0.0 + 1000.0
          = 1000

Multiply Floating Point
3.2 x 10^8 X 5.8 x 10^6
Step 1: Multiply mantissas
  3.2 X 5.8 X 10^8 x 10^6
Step 2: Add exponents
  18.56 x 10^{14}
Step 3: Normalize
  1.856 x 10^{15}

Floating point operations
• Multiply
  [(-1)^{S1} x F1 x 2^{E1}] x [(-1)^{S2} x F2 x 2^{E2}]
  = (-1)^{S1⊕S2} x (F1 x F2) x 2^{E1+E2}
  Since 1 ≤ (F1 x F2) < 4, the result may need to be normalized
• Example: 3.2 x 10^8 X 5.8 x 10^6 = 3.2 X 5.8 X 10^8 x 10^6 = 18.56 x 10^{14} = 1.856 x 10^{15}

Floating point operations
• Divide
  [(-1)^{S1} x F1 x 2^{E1}] ÷ [(-1)^{S2} x F2 x 2^{E2}]
  = (-1)^{S1⊕S2} x (F1 ÷ F2) x 2^{E1-E2}
  Since 0.5 < (F1 ÷ F2) < 2, the result may need to be normalized
  (assume F2 ≠ 0)
Float and double
• Float : single precision floating point
• Double : double precision floating point
• Floating point operations are slower
  – But not on newer PCs ☺
• Double operations are even slower
[Figure: Integer, Float and Double compared on precision/accuracy in calculation and on speed of calculation]

Floating point Comparison
• Three phases
• Phase I: Compare signs (give result)
• Phase II: If (signs of both numbers are the same)
  – Compare exponents and give result
  – 90% of cases fall in this category
  – Faster compared to integer comparison: requires only an 8 bit comparison for float and an 11 bit comparison for double (example: sorting of float numbers)
• Phase III: If (both signs and exponents are the same)
  – Compare fraction/mantissa
Storing and Printing Floating Point
float x = 145.0, y;
y = sqrt(sqrt(x));
x = (y*y)*(y*y);
printf("\nx=%f", x);      /* x=145.000015: many round-offs cause loss of accuracy */

float x = 1.0/3.0;
if (x == 1.0/3.0)         /* value stored in x is not exactly the same as 1.0/3.0 */
    printf("YES");
else
    printf("NO");         /* prints NO */
Storing and Printing Floating Point
float a = 34359243.5366233;
float b = 3.5366233;
float c = 0.00000212363;
printf("\na=%8.6f, b=%8.6f, c=%8.12f\n", a, b, c);

Output:
a=34359244.000000, b=3.536623, c=0.000002123630
A big number and a small fraction cannot be stored together.

Storing and Printing Floating Point
// 15 significant digits to store
float a = 34359243.5366233;
// 8 significant digits to store
float b = 3.5366233;
// 6 significant digits to store
float c = 0.00000212363;

Thumb rule: about 7 significant decimal digits of a number can be stored in a 32 bit float (24 bit significand)
Thanks
