Representing and Manipulating Information
Floating-Point Number Representation
A floating-point number (or real number) can represent a very large value (e.g., 1.23×10^88) or a very small value (e.g., 1.23×10^-88). It can also represent a very large negative number (e.g., -1.23×10^88) and a very small negative number (e.g., -1.23×10^-88), as well as zero.

A floating-point number is typically expressed in scientific notation, with a fraction (F) and an exponent (E) of a certain radix (r), in the form F×r^E. Decimal numbers use a radix of 10 (F×10^E), while binary numbers use a radix of 2 (F×2^E).

The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on.

The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 can be normalized as 1.234567×10^2; the binary number 1010.1011B can be normalized as 1.0101011B×2^3.

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real numbers (even within a small range of, say, 0.0 to 0.1). On the other hand, an n-bit binary pattern can represent at most 2^n distinct numbers. Hence, not all real numbers can be represented. The nearest approximation is used instead, resulting in a loss of accuracy.

Floating-point arithmetic is generally less efficient than integer arithmetic. It can be sped up with a dedicated floating-point coprocessor. Hence, use integers if your application does not require floating-point numbers.
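The loss of precision described above can be observed directly. The following is a minimal sketch, assuming a Python interpreter whose float type is an IEEE 754 64-bit double (true on common platforms); the helper name round_to_single is made up for this illustration.

```python
import struct
from decimal import Decimal

# 0.1 has no finite binary expansion, so the nearest 64-bit double is stored.
# Decimal(0.1) shows the exact value of that stored approximation.
stored_tenth = Decimal(0.1)

# The rounding error surfaces in ordinary arithmetic.
sum_is_exact = (0.1 + 0.2 == 0.3)


def round_to_single(x: float) -> float:
    """Round a double to the nearest 32-bit single-precision value.

    Hypothetical helper: packs x as a 4-byte float and unpacks it again.
    """
    return struct.unpack(">f", struct.pack(">f", x))[0]


print(stored_tenth)                 # slightly different from 0.1
print(sum_is_exact)                 # False
print(round_to_single(0.1) == 0.1)  # False: 32 bits keep fewer digits than 64
print(round_to_single(0.5) == 0.5)  # True: 0.5 is exactly 1.0B×2^-1
```

Values such as 0.5 that are sums of a few powers of two survive unchanged; most decimal fractions do not.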
In computers, floating-point numbers are represented in scientific notation with a fraction (F) and an exponent (E) of radix 2, in the form F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

IEEE-754 32-bit Single-Precision Floating-Point Numbers

In 32-bit single-precision floating-point representation, the most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers. The following 8 bits represent the exponent (E). The remaining 23 bits represent the fraction (F).

Normalized Form

Let's illustrate with an example. Suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:
S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000

In the normalized form, the actual fraction is normalized with an implicit leading 1, in the form 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000B = 1 + 1×2^-2 + 1×2^-3 = 1.375D.

The sign bit represents the sign of the number, with S=0 for a positive and S=1 for a negative number. In this example, with S=1, this is a negative number, i.e., -1.375D.

The exponent field is interpreted as representing a signed integer in biased form. That is, the actual exponent is e − Bias, where e is the unsigned number with bit representation e_(k−1) ... e_1 e_0 (the k-bit exponent field E) and Bias is a bias value equal to 2^(k−1) − 1. For single precision (k = 8, Bias = 127), normalized numbers have 1 ≤ e ≤ 254, which yields actual exponents ranging from −126 to +127.

In this example, e = 1000 0001B = 129D, so the actual exponent is 129 − 127 = 2D. Hence, the number represented is -1.375×2^2 = -5.5D.

Why is the actual exponent of denormalized values (exponent field all zeros) set to 1 − Bias rather than simply −Bias? This choice provides a smooth transition from denormalized to normalized values.
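The worked example can be checked mechanically. The sketch below splits the 32-bit pattern into its three fields and evaluates the normalized form; the function names decode_single and normalized_value are hypothetical, and the cross-check assumes the standard struct module's big-endian "f" format is an IEEE 754 single.

```python
import struct


def decode_single(bits: int):
    """Split a 32-bit pattern into (sign, exponent field, fraction field)."""
    sign = (bits >> 31) & 0x1
    exponent_field = (bits >> 23) & 0xFF   # 8 bits
    fraction_field = bits & 0x7FFFFF       # 23 bits
    return sign, exponent_field, fraction_field


def normalized_value(sign: int, exponent_field: int, fraction_field: int) -> float:
    """Evaluate (-1)^S × 1.F × 2^(e-127); valid only for 1 <= e <= 254."""
    if not 1 <= exponent_field <= 254:
        raise ValueError("not a normalized single-precision pattern")
    actual_fraction = 1 + fraction_field / 2**23   # implicit leading 1
    return (-1) ** sign * actual_fraction * 2.0 ** (exponent_field - 127)


# The pattern from the text: S=1, E=1000 0001, F=011 followed by 20 zeros.
bits = int("1" + "10000001" + "011" + "0" * 20, 2)

fields = decode_single(bits)
value = normalized_value(*fields)

# Cross-check against the machine's own interpretation of the same 4 bytes.
machine_value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]

print(fields)         # (1, 129, 3145728)
print(value)          # -5.5
print(machine_value)  # -5.5
```

The fraction field prints as the integer 3145728, which is 011 followed by 20 zero bits; dividing by 2^23 recovers 0.375.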
De-Normalized Form

Normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero. When the exponent field is all zeros, the represented number is in denormalized form. In this case, the actual exponent is 1 − Bias, and the actual fraction is the value of the fraction field without an implied leading 1, i.e., 0.F.

Denormalized numbers serve two purposes. First, they provide a way to represent the numeric value 0, since in normalized form the actual fraction 1.F is always at least 1, and hence we cannot represent 0. In fact, the floating-point representation of +0.0 has a bit pattern of all zeros: the sign bit is 0, the exponent field is all zeros (indicating a denormalized value), and the fraction field is all zeros, giving an actual fraction of 0. When the sign bit is 1 but the other fields are all zeros, we get the value −0.0.

A second function of denormalized numbers is to represent numbers that are very close to 0.0. We can represent very small positive and negative numbers in denormalized form with E=0. For example, suppose S=1, E=0, and F=011 0000 0000 0000 0000 0000. The actual fraction is 0.011B = 1×2^-2 + 1×2^-3 = 0.375D. Since S=1, it is a negative number. With E=0, the actual exponent is -126. Hence the number is -0.375×2^-126 ≈ -4.4×10^-39, which is an extremely small negative number (close to zero).

Special Values

A final category of values occurs when the exponent field is all ones. When the fraction field is all zeros, the resulting values represent infinity, either +∞ when S = 0 or −∞ when S = 1. Infinity can represent results that overflow, as when we multiply two very large numbers, or when we divide by zero.
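The denormalized example, the two zeros, and the infinities can be reproduced from their bit patterns. This is a minimal sketch under the same assumption that struct's "f" format is an IEEE 754 single; single_from_bits is a hypothetical helper, and the hexadecimal constants are simply the patterns described above written in base 16.

```python
import math
import struct


def single_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]


# S=1, exponent field all zeros, F=011 followed by 20 zeros.
denormal = single_from_bits(0x80300000)

# Sign bit set, all other bits zero: negative zero.
negative_zero = single_from_bits(0x80000000)

# Neighbours on either side of the denormalized/normalized boundary.
smallest_denormal = single_from_bits(0x00000001)
largest_denormal = single_from_bits(0x007FFFFF)
smallest_normal = single_from_bits(0x00800000)

# Exponent field all ones, fraction field all zeros.
positive_infinity = single_from_bits(0x7F800000)
negative_infinity = single_from_bits(0xFF800000)

print(denormal == -0.375 * 2.0**-126)                        # True
print(negative_zero == 0.0, math.copysign(1.0, negative_zero))
print(smallest_normal - largest_denormal == smallest_denormal)  # True
print(positive_infinity, negative_infinity)                  # inf -inf
```

The third check illustrates the smooth transition: because the denormalized exponent is 1 − Bias, the gap between the largest denormalized and the smallest normalized value equals the spacing between denormalized values themselves.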
When the exponent field is all ones and the fraction field is nonzero, the resulting value is called a NaN, short for "not a number."

IEEE-754 64-bit Double-Precision Floating-Point Numbers

The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision: the most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers. The following 11 bits represent the exponent (E). The remaining 52 bits represent the fraction (F).

The value (N) is calculated as follows:
Normalized form: for 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
Denormalized form: for E = 0, N = (-1)^S × 0.F × 2^(-1022).
Special values: for E = 2047, N represents ±INF (infinity) or NaN (not a number).