
03 Integersfloats

The document discusses different ways of representing integers in binary, including unsigned integers, signed integers using sign-and-magnitude representation, and arithmetic operations on integers. It also discusses encoding playing cards using binary representations and comparing card values and suits.

Uploaded by

oreh2345

University of Washington

Roadmap

Course topics: Memory & data; Integers & floats; Machine code & C; x86 assembly; Procedures & stacks; Arrays & structs; Memory & caches; Processes; Virtual memory; Memory allocation; Java vs. C

C:
car *c = malloc(sizeof(car));
c->miles = 100;
c->gals = 17;
float mpg = get_mpg(c);
free(c);

Java:
Car c = new Car();
c.setMiles(100);
c.setGals(17);
float mpg = c.getMPG();

Assembly language:
get_mpg:
    pushq %rbp
    movq %rsp, %rbp
    ...
    popq %rbp
    ret

Machine code:
0111010000011000
100011010000010000000010
1000100111000010
110000011111101000011111

[Figure: the layers continue down from machine code to the OS and the computer system.]

Autumn 2013 Integers & Floats 1



Integers
¢ Representation of integers: unsigned and signed
¢ Casting
¢ Arithmetic and shifting
¢ Sign extension


But before we get to integers….


¢ Encode a standard deck of playing cards.
¢ 52 cards in 4 suits
§ How do we encode suits, face cards?
¢ What operations do we want to make easy to implement?
§ Which is the higher value card?
§ Are they the same suit?


Two possible representations


¢ 52 cards – 52 bits with bit corresponding to card set to 1

low-order 52 bits of 64-bit word


§ “One-hot” encoding
§ Drawbacks:
§ Hard to compare values and suits
§ Large number of bits required

¢ 4 bits for suit, 13 bits for card value – 17 bits with two set to 1

§ Pair of one-hot encoded values


§ Easier to compare suits and values
§ Still an excessive number of bits
¢ Can we do better?


Two better representations


¢ Binary encoding of all 52 cards – only 6 bits needed

low-order 6 bits of a byte


§ Fits in one byte
§ Smaller than one-hot encodings.
§ How can we make value and suit comparisons easier?

¢ Binary encoding of suit (2 bits) and value (4 bits) separately

suit value

§ Also fits in one byte, and easy to do comparisons


Compare Card Suits

mask: a bit vector that, when bitwise ANDed with another bit vector v, turns all but the bits of interest in v to 0

#define SUIT_MASK 0x30   // 0 0 1 1 0 0 0 0: selects the two suit bits

// Returns an int: nonzero if the two cards have the same suit.
int sameSuitP(char card1, char card2) {
    return !((card1 & SUIT_MASK) ^ (card2 & SUIT_MASK));
    // equivalent: return (card1 & SUIT_MASK) == (card2 & SUIT_MASK);
}

char hand[5];        // represents a 5-card hand
char card1, card2;   // two cards to compare
card1 = hand[0];
card2 = hand[1];
...
if ( sameSuitP(card1, card2) ) { ... }

Compare Card Values

#define VALUE_MASK 0x0F   // 0 0 0 0 1 1 1 1: selects the four value bits

int greaterValue(char card1, char card2) {
    // The unsigned casts make the comparison work even if the value is stored in the high bits.
    return ((unsigned int)(card1 & VALUE_MASK) >
            (unsigned int)(card2 & VALUE_MASK));
}

char hand[5];        // represents a 5-card hand
char card1, card2;   // two cards to compare
card1 = hand[0];
card2 = hand[1];
...
if ( greaterValue(card1, card2) ) { ... }
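The suit/value encoding and the two mask comparisons can be tried end to end with a small sketch. It assumes the suit occupies bits 5-4 and the value bits 3-0 of one byte; the helper make_card and the numbering of suits (0-3) and values (1-13) are illustrative assumptions, not part of the original slides.

```c
#include <stdint.h>

#define SUIT_MASK  0x30  /* bits 5-4 hold the suit  */
#define VALUE_MASK 0x0F  /* bits 3-0 hold the value */

/* Hypothetical helper: pack a suit (0-3) and a value (1-13) into one byte. */
static uint8_t make_card(unsigned suit, unsigned value) {
    return (uint8_t)(((suit << 4) & SUIT_MASK) | (value & VALUE_MASK));
}

/* Nonzero when both cards carry the same suit bits. */
static int same_suit(uint8_t card1, uint8_t card2) {
    return (card1 & SUIT_MASK) == (card2 & SUIT_MASK);
}

/* Nonzero when card1 has the strictly greater value field. */
static int greater_value(uint8_t card1, uint8_t card2) {
    return (card1 & VALUE_MASK) > (card2 & VALUE_MASK);
}
```

For example, make_card(2, 12) packs suit 2 and value 12 into the byte 0x2C, and same_suit ignores the value bits entirely.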

Encoding Integers
¢ The hardware (and C) supports two flavors of integers:
§ unsigned – only the non-negatives
§ signed – both negatives and non-negatives

¢ There are only $2^W$ distinct bit patterns of W bits, so...
§ Cannot represent all the integers
§ Unsigned values: $0 \dots 2^W - 1$
§ Signed values: $-2^{W-1} \dots 2^{W-1} - 1$

¢ Reminder: terminology for binary representations
§ In 0110010110101001, the leftmost bits are the “most-significant” or “high-order” bit(s)
§ The rightmost bits are the “least-significant” or “low-order” bit(s)

Unsigned Integers
¢ Unsigned values are just what you expect
§ $b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 = b_7 2^7 + b_6 2^6 + b_5 2^5 + \dots + b_1 2^1 + b_0 2^0$
§ Useful formula: $1 + 2 + 4 + 8 + \dots + 2^{N-1} = 2^N - 1$

¢ Add and subtract using the normal “carry” and “borrow” rules, just in binary.

    00111111      63
  + 00001000    +  8
    01000111      71

¢ How would you make signed integers?

Signed Integers: Sign-and-Magnitude


¢ Let's do the natural thing for the positives
§ They correspond to the unsigned integers of the same value
§ Example (8 bits): 0x00 = 0, 0x01 = 1, …, 0x7F = 127
¢ But, we need to let about half of them be negative
§ Use the high-order bit to indicate negative: call it the “sign bit”
§ Call this a “sign-and-magnitude” representation
§ Examples (8 bits):
§ 0x00 = $00000000_2$ is non-negative, because the sign bit is 0
§ 0x7F = $01111111_2$ is non-negative
§ 0x85 = $10000101_2$ is negative
§ 0x80 = $10000000_2$ is negative...

Sign-and-Magnitude Negatives
¢ How should we represent -1 in binary?
§ $10000001_2$
Use the MSB (most significant bit) for + or -, and the other bits to give magnitude.
(Unfortunate side effect: there are two representations of 0!)
§ Another problem: arithmetic is cumbersome.
§ Example: 4 - 3 != 4 + (-3)

    0100    (+4)
  + 1011    (-3)
    1111    (reads as -7 in sign-and-magnitude, not +1)

4-bit sign-and-magnitude values:

bits   value     bits   value
0000   +0        1000   –0
0001   +1        1001   –1
0010   +2        1010   –2
0011   +3        1011   –3
0100   +4        1100   –4
0101   +5        1101   –5
0110   +6        1110   –6
0111   +7        1111   –7

¢ How do we solve these problems?
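Both drawbacks of sign-and-magnitude can be checked mechanically. The sketch below assumes a 4-bit pattern held in the low bits of an unsigned int; the function names are illustrative and do not come from the slides.

```c
/* Decode the low 4 bits as sign-and-magnitude:
   bit 3 is the sign, bits 2-0 are the magnitude. */
static int sm4_value(unsigned bits) {
    int magnitude = (int)(bits & 0x7u);
    return (bits & 0x8u) ? -magnitude : magnitude;
}

/* Add two 4-bit patterns with ordinary binary addition, keeping 4 bits. */
static unsigned add4(unsigned a, unsigned b) {
    return (a + b) & 0xFu;
}
```

Patterns 0000 and 1000 both decode to zero, and adding the patterns for +4 and -3 with plain binary addition yields 1111, which decodes to -7 rather than +1.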

Two’s Complement Negatives
¢ How should we represent -1 in binary?
Rather than a sign bit, let MSB have same value, but negative weight.
§ For bits $b_{w-1} b_{w-2} \dots b_0$: $b_{w-1} = 1$ adds $-2^{w-1}$ to the value; for $i < w-1$, $b_i = 1$ adds $+2^i$ to the value.
e.g. unsigned $1010_2$:
$1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 10_{10}$
2’s compl. $1010_2$:
$-1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = -6_{10}$
¢ -1 is represented as $1111_2 = -2^3 + (2^3 - 1)$
All negative integers still have MSB = 1.
¢ Advantages: single zero, simple arithmetic
¢ To get negative representation of any integer, take bitwise complement and then add one!
~x + 1 == -x

4-bit two’s complement values:

bits   value     bits   value
0000    0        1000   –8
0001   +1        1001   –7
0010   +2        1010   –6
0011   +3        1011   –5
0100   +4        1100   –4
0101   +5        1101   –3
0110   +6        1110   –2
0111   +7        1111   –1
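The negative-weight reading of the MSB and the "complement, then add one" rule can be sketched for 4-bit patterns as follows. The function names are illustrative assumptions; the patterns live in the low 4 bits of an unsigned int so that no signed overflow is involved.

```c
/* Interpret the low 4 bits as two's complement: bit 3 has weight -8. */
static int tc4_value(unsigned bits) {
    int low = (int)(bits & 0x7u);          /* weights +4, +2, +1 */
    return (bits & 0x8u) ? low - 8 : low;
}

/* Negate within 4 bits: bitwise complement, add one, keep 4 bits. */
static unsigned tc4_negate(unsigned bits) {
    return (~bits + 1u) & 0xFu;
}
```

Negating 0101 (+5) gives 1011, which reads as -5. Negating 1000 (-8) gives 1000 again, because +8 has no 4-bit two's complement representation.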

4-bit Unsigned vs. Two’s Complement

The same bit pattern 1 0 1 1 under both interpretations:

Unsigned: $2^3 \cdot 1 + 2^2 \cdot 0 + 2^1 \cdot 1 + 2^0 \cdot 1 = 11$
Two’s complement: $-2^3 \cdot 1 + 2^2 \cdot 0 + 2^1 \cdot 1 + 2^0 \cdot 1 = -5$
(math) difference = 16 = $2^4$

[Figure: two number wheels over the sixteen 4-bit patterns. Patterns 0000 to 0111 read as 0 to 7 on both wheels; patterns 1000 to 1111 read as 8 to 15 unsigned and as –8 to –1 in two’s complement.]

Two’s Complement Arithmetic


¢ The same addition procedure works for both unsigned and
two’s complement integers
§ Simplifies hardware: only one algorithm for addition
§ Algorithm: simple addition, discard the highest carry bit
§ Called “modular” addition: result is sum modulo $2^W$
¢ Examples:

Two’s Complement
¢ Why does it work?
§ Put another way, for all positive integers x, we want:
§ bits( x ) + bits( –x ) = 0 (ignoring the carry-out bit)
§ This turns out to be the bitwise complement plus one
§ What should the 8-bit representation of -1 be? (we want whichever bit string gives the right result)

     00000001       00000010       00000011
   + 11111111     + 11111110     + 11111101
    100000000      100000000      100000000

Unsigned & Signed Numeric Values


bits   Unsigned   Signed
0000    0          0
0001    1          1
0010    2          2
0011    3          3
0100    4          4
0101    5          5
0110    6          6
0111    7          7
1000    8         –8
1001    9         –7
1010   10         –6
1011   11         –5
1100   12         –4
1101   13         –3
1110   14         –2
1111   15         –1

¢ Signed and unsigned integers have limits.
§ If you compute a number that is too big (positive), it wraps: 6 + 4 = ? 15U + 2U = ?
§ If you compute a number that is too small (negative), it wraps: -7 - 3 = ? 0U - 2U = ?
§ Answers are only correct mod $2^b$
¢ The CPU may be capable of “throwing an exception” for overflow on signed values.
§ It won't for unsigned.
¢ But C and Java just cruise along silently when overflow occurs... Oops.
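The four wrap-around questions on this slide can be answered with a sketch that emulates 4-bit arithmetic using unsigned operations and a mask. The helper names are illustrative assumptions; unsigned arithmetic is used throughout because it is well defined in C even when it wraps.

```c
/* Keep only the low 4 bits of a sum or difference (arithmetic mod 16). */
static unsigned wrap4(unsigned x) {
    return x & 0xFu;
}

/* Read a 4-bit pattern as two's complement (bit 3 has weight -8). */
static int as_signed4(unsigned bits) {
    bits &= 0xFu;
    return (bits & 0x8u) ? (int)bits - 16 : (int)bits;
}
```

Under these assumptions, 15U + 2U wraps to 1 and 0U - 2U wraps to 14, while the signed sums read back as 6 + 4 = -6 and -7 - 3 = +6.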

Conversion Visualized
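¢ Two’s Complement → Unsigned
§ Ordering Inversion
§ Negative → Big Positive

[Figure: the two’s complement range (TMin ... –2, –1, 0 ... TMax) drawn beside the unsigned range (0 ... TMax, TMax + 1 ... UMax – 1, UMax). Values 0 ... TMax map to themselves; negative values map to the top of the unsigned range, with –2 → UMax – 1 and –1 → UMax.]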

Overflow/Wrapping: Unsigned
addition: drop the carry bit

    15        1111
  +  2      + 0010
    17       10001    (drop the carry bit: 0001 = 1)

Modular Arithmetic
[Figure: number wheel of the 4-bit unsigned values 0000 = 0 through 1111 = 15; stepping past 15 wraps around to 0.]

Overflow/Wrapping: Two’s Complement
addition: drop the carry bit

    -1        1111
  +  2      + 0010
     1       10001    (drop the carry bit: 0001 = 1)

     6        0110
  +  3      + 0011
     9        1001    (reads as -7)

Modular Arithmetic
[Figure: number wheel of the 4-bit two’s complement values 1000 = –8 through 0111 = +7; stepping past +7 wraps around to –8.]

Values To Remember
¢ Unsigned Values
§ UMin = 0 (000…0)
§ UMax = $2^w - 1$ (111…1)
¢ Two’s Complement Values
§ TMin = $-2^{w-1}$ (100…0)
§ TMax = $2^{w-1} - 1$ (011…1)
§ Negative one: 111…1 (0xF...F)

Values for W = 32
       Decimal          Hex           Binary
UMax   4,294,967,295    FF FF FF FF   11111111 11111111 11111111 11111111
TMax   2,147,483,647    7F FF FF FF   01111111 11111111 11111111 11111111
TMin   -2,147,483,648   80 00 00 00   10000000 00000000 00000000 00000000
-1     -1               FF FF FF FF   11111111 11111111 11111111 11111111
0      0                00 00 00 00   00000000 00000000 00000000 00000000

Signed vs. Unsigned in C


¢ Constants
§ By default are considered to be signed integers
§ Use “U” suffix to force unsigned:
§ 0U, 4294967259U


Signed vs. Unsigned in C !!!


¢ Casting
§ int tx, ty;
§ unsigned ux, uy;
§ Explicit casting between signed & unsigned:
§ tx = (int) ux;
§ uy = (unsigned) ty;
§ Implicit casting also occurs via assignments and function calls:
§ tx = ux;
§ uy = ty;
§ The gcc flag -Wsign-conversion produces warnings for implicit casts, but -Wall does not!
§ How does casting between signed and unsigned work?
§ What values are going to be produced?
§ Bits are unchanged, just interpreted differently!
Casting Surprises !!!
¢ Expression Evaluation
§ If you mix unsigned and signed in a single expression, then signed values are implicitly cast to unsigned.
§ Including comparison operations <, >, ==, <=, >=
§ Examples for W = 32: TMIN = -2,147,483,648 TMAX = 2,147,483,647

Constant1        Constant2           Relation   Evaluation
0                0U                  ==         unsigned
-1               0                   <          signed
-1               0U                  >          unsigned
2147483647       -2147483648         >          signed
2147483647U      -2147483648         <          unsigned
-1               -2                  >          signed
(unsigned) -1    -2                  >          unsigned
2147483647       2147483648U         <          unsigned
2147483647       (int) 2147483648U   >          signed
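A few rows of the table can be reproduced with two small comparison helpers. This is a sketch with illustrative names: less_signed compares two ints, while less_mixed takes one unsigned parameter so that the int operand is implicitly converted to unsigned before the comparison.

```c
#include <limits.h>

/* Signed comparison: both operands are int. */
static int less_signed(int a, int b) {
    return a < b;
}

/* Mixed comparison: b is unsigned, so a is converted to unsigned first. */
static int less_mixed(int a, unsigned b) {
    return a < b;
}
```

With these helpers, -1 is less than 0 in the signed comparison but not less than 0U in the mixed one, since -1 converts to UMax.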

Sign Extension
¢ What happens if you convert a 32-bit signed integer to a 64-bit signed integer?
¢ Task:
§ Given w-bit signed integer x
§ Convert it to w+k-bit integer with same value
¢ Rule:
§ Make k copies of sign bit:
§ $X' = x_{w-1}, \dots, x_{w-1}, x_{w-1}, x_{w-2}, \dots, x_0$ (the first k entries are copies of the MSB)

[Figure: the w-bit value X above the (k + w)-bit value X′, whose top k bits are all copies of the MSB of X.]

8-bit representations

00001001 10000001

11111111 00100111

C: casting between unsigned and signed just reinterprets the same bits.

Sign Extension

0010       4-bit    2
00000010   8-bit    2

1100       4-bit   -4
????1100   8-bit   -4

Candidates for the missing bits:

00001100   8-bit   12     (filling with 0s changes the value)
10001100   8-bit   -116   (setting only the new MSB changes the value)
11111100   8-bit   -4     (copying the sign bit into every new position preserves the value)

Sign Extension Example


¢ Converting from smaller to larger integer data type
¢ C automatically performs sign extension (Java too)

short int x = 12345;
int ix = (int) x;
short int y = -12345;
int iy = (int) y;

     Decimal   Hex           Binary
x    12345     30 39         00110000 00111001
ix   12345     00 00 30 39   00000000 00000000 00110000 00111001
y    -12345    CF C7         11001111 11000111
iy   -12345    FF FF CF C7   11111111 11111111 11001111 11000111
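The hex column of the table can be checked with a short sketch that widens a 16-bit value and exposes the resulting 32-bit pattern. It uses the fixed-width types from stdint.h instead of short and int so the widths are explicit; the function name is an illustrative assumption.

```c
#include <stdint.h>

/* Widen a 16-bit signed value to 32 bits and return the bit pattern.
   The int16_t -> int32_t conversion preserves the value, which means
   the upper 16 bits become copies of the sign bit. */
static uint32_t widened_bits(int16_t x) {
    int32_t wide = x;
    return (uint32_t)wide;
}
```

Here 12345 widens to 0x00003039 and -12345 widens to 0xFFFFCFC7.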


Shift Operations
¢ Left shift: x << y
§ Shift bit vector x left by y positions
§ Throw away extra bits on left
§ Fill with 0s on right
¢ Right shift: x >> y
§ Shift bit-vector x right by y positions
§ Throw away extra bits on right
§ Logical shift (for unsigned values)
§ Fill with 0s on left
§ Arithmetic shift (for signed values)
§ Replicate most significant bit on left
§ Maintains sign of x
§ Why is this useful? x >> 9?

Argument x        01100010   10100010
<< 3              00010000   00010000
Logical >> 2      00011000   00101000
Arithmetic >> 2   00011000   11101000

The behavior of >> on negative signed values in C depends on the compiler! It is arithmetic shift right in GCC.
Java: >>> is logical shift right; >> is arithmetic shift right.
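The shift table can be reproduced with a sketch over 8-bit values. Because the result of >> on a negative signed value is compiler-dependent in C, the arithmetic shift below is built from unsigned operations only; that is a deliberate substitution, and the function names are illustrative assumptions. The shift count n is assumed to be in the range 0 to 7.

```c
#include <stdint.h>

/* Left shift within 8 bits: bits shifted past bit 7 are discarded. */
static uint8_t shl8(uint8_t x, unsigned n) {
    return (uint8_t)(x << n);
}

/* Logical right shift: unsigned operands always shift in 0s. */
static uint8_t logical_shr(uint8_t x, unsigned n) {
    return (uint8_t)(x >> n);
}

/* Arithmetic right shift emulated with unsigned operations. */
static uint8_t arithmetic_shr(uint8_t x, unsigned n) {
    uint8_t shifted = (uint8_t)(x >> n);
    if (x & 0x80u) {
        /* Fill the n vacated high bits with copies of the sign bit. */
        shifted |= (uint8_t)~(0xFFu >> n);
    }
    return shifted;
}
```

For the argument 10100010 (0xA2), the logical shift by 2 gives 0x28 and the arithmetic shift gives 0xE8, matching the table.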

What happens when…

¢ x >> n: divide by $2^n$

¢ x << m: multiply by $2^m$

faster than general multiply or divide operations

Shifting and Arithmetic

unsigned, $x \cdot 2^n$ (logical shift left: shift in zeros from the right)
x = 27;        00011011
y = x << 2;    00 01101100   (the two bits shifted out on the left are lost: overflow)
y == 108

unsigned, $x / 2^n$ (logical shift right: shift in zeros from the left)
x = 237;       11101101
y = x >> 2;    00111011 01   (the two bits shifted out on the right are lost: rounding down)
y == 59

signed, $x \cdot 2^n$ (logical shift left: shift in zeros from the right)
x = -101;      10011011
y = x << 2;    10 01101100   (overflow)
y == 108

signed, $x / 2^n$ (arithmetic shift right: shift in copies of most significant bit from the left)
x = -19;       11101101
y = x >> 2;    11111011 01   (rounding down)
y == -5

clarification from Mon.: shifts by n < 0 or n >= word size are undefined

Using Shifts and Masks

¢ Extract the 2nd most significant byte of an integer:
§ First shift, then mask: ( x >> 16 ) & 0xFF

x                    01100001 01100010 01100011 01100100
x >> 16              00000000 00000000 01100001 01100010
0xFF                 00000000 00000000 00000000 11111111
( x >> 16 ) & 0xFF   00000000 00000000 00000000 01100010

¢ Extract the sign bit of a signed integer:
§ ( x >> 31 ) & 1 - need the “& 1” to clear out all other bits except LSB
¢ Conditionals as Boolean expressions (assuming x is 0 or 1)
§ if (x) a=y; else a=z; which is the same as a = x ? y : z;
§ Can be re-written (assuming arithmetic right shift) as:
a = ( ((x << 31) >> 31) & y ) | ( (((!x) << 31) >> 31) & z );
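The three idioms on this slide can be sketched with 32-bit fixed-width types. One substitution is worth naming: instead of the slide's shift-left-then-arithmetic-shift-right pair, the selection mask below is built by unsigned negation (0 - cond), which gives the same all-ones or all-zeros mask without relying on signed shift behavior. All function names are illustrative assumptions.

```c
#include <stdint.h>

/* Second most significant byte: shift it down to the bottom, then mask. */
static uint32_t second_msb_byte(uint32_t x) {
    return (x >> 16) & 0xFFu;
}

/* Sign bit of a 32-bit signed value, returned as 0 or 1. */
static uint32_t sign_bit(int32_t x) {
    return ((uint32_t)x >> 31) & 1u;
}

/* Branch-free select, assuming cond is 0 or 1: returns y when cond is 1, z when 0. */
static uint32_t select_branchless(uint32_t cond, uint32_t y, uint32_t z) {
    uint32_t mask = 0u - cond;   /* 0 -> 0x00000000, 1 -> 0xFFFFFFFF */
    return (mask & y) | (~mask & z);
}
```

With x = 0x61626364 (the bit pattern shown on the slide), second_msb_byte returns 0x62.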

Multiplication
¢ What do you get when you multiply 9 x 9?

¢ What about $2^{30} \times 3$?

¢ $2^{30} \times 5$?

¢ $-2^{31} \times -2^{31}$?

Unsigned Multiplication in C

Operands: w bits           u, v
True Product: 2*w bits     u · v
Discard w bits: w bits     $\mathrm{UMult}_w(u, v)$

¢ Standard Multiplication Function
§ Ignores high order w bits
¢ Implements Modular Arithmetic
$\mathrm{UMult}_w(u, v) = u \cdot v \bmod 2^w$

Power-of-2 Multiply with Shift


¢ Operation
§ u << k gives $u \cdot 2^k$
§ Both signed and unsigned

Operands: w bits           u and $2^k$ (binary 0 ••• 0 1 0 ••• 0 0, a single 1 followed by k zeros)
True Product: w+k bits     $u \cdot 2^k$ (the bits of u followed by k zeros)
Discard k bits: w bits     $\mathrm{UMult}_w(u, 2^k)$, $\mathrm{TMult}_w(u, 2^k)$

¢ Examples
§ u << 3 == u * 8
§ (u << 5) - (u << 3) == u * 24
§ Most machines shift and add faster than multiply
§ Compiler generates this code automatically
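The shift-and-subtract identity, and the modular behavior of unsigned multiplication, can be checked with a short sketch. The parentheses matter: in C, binary minus binds tighter than <<, so the shifts must be grouped explicitly. Function names are illustrative assumptions.

```c
#include <stdint.h>

/* u * 24 computed as u * 32 - u * 8, using shifts. */
static uint32_t times24_shift(uint32_t u) {
    return (u << 5) - (u << 3);
}

/* 32-bit unsigned multiply: the low 32 bits of the 64-bit true product. */
static uint32_t umult32(uint32_t u, uint32_t v) {
    uint64_t true_product = (uint64_t)u * v;
    return (uint32_t)true_product;      /* discard the high 32 bits */
}
```

Both helpers agree with the ordinary uint32_t product even when the true product does not fit in 32 bits, because all three compute the result modulo 2^32.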

Code Security Example


/* Kernel memory region holding user-accessible data */
#define KSIZE 1024
char kbuf[KSIZE];

/* Copy at most maxlen bytes from kernel region to user buffer */


int copy_from_kernel(void* user_dest, int maxlen) {
/* Byte count len is minimum of buffer size and maxlen */
int len = KSIZE < maxlen ? KSIZE : maxlen;
memcpy(user_dest, kbuf, len);
return len;
}

#define MSIZE 528

void getstuff() {
char mybuf[MSIZE];
copy_from_kernel(mybuf, MSIZE);
printf("%s\n", mybuf);
}

Malicious Usage

/* Declaration of library function memcpy */
void* memcpy(void* dest, const void* src, size_t n);

/* Kernel memory region holding user-accessible data */


#define KSIZE 1024
char kbuf[KSIZE];

/* Copy at most maxlen bytes from kernel region to user buffer */


int copy_from_kernel(void* user_dest, int maxlen) {
/* Byte count len is minimum of buffer size and maxlen */
int len = KSIZE < maxlen ? KSIZE : maxlen;
memcpy(user_dest, kbuf, len);
return len;
}

#define MSIZE 528

void getstuff() {
char mybuf[MSIZE];
/* -MSIZE is negative, so KSIZE < maxlen is false and len = -528.
   memcpy then converts len to size_t, a huge unsigned byte count. */
copy_from_kernel(mybuf, -MSIZE);
. . .
}
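The failure can be isolated without performing any copy. The sketch below repeats the length computation from copy_from_kernel, shows the value that would reach memcpy's size_t parameter, and adds one possible guard. The names chosen_len, as_byte_count, and checked_len are hypothetical, and the guard is only one way to close the hole, not the fix prescribed by the slides.

```c
#include <stddef.h>

#define KSIZE 1024

/* Same length computation as copy_from_kernel, without the copy. */
static int chosen_len(int maxlen) {
    return KSIZE < maxlen ? KSIZE : maxlen;
}

/* The byte count memcpy would actually see after int -> size_t conversion. */
static size_t as_byte_count(int len) {
    return (size_t)len;
}

/* Hypothetical safer variant: reject negative requests before taking the minimum. */
static size_t checked_len(int maxlen) {
    if (maxlen < 0) {
        return 0;
    }
    return (size_t)(KSIZE < maxlen ? KSIZE : maxlen);
}
```

A request of -528 passes the minimum check unchanged, then becomes a byte count far larger than KSIZE once it is converted to size_t.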

Floating point topics


¢ Background: fractional binary numbers
¢ IEEE floating-point standard
¢ Floating-point operations and rounding
¢ Floating-point in C

¢ There are many more details that we won’t cover


§ It’s a 58-page standard…

Fractional Binary Numbers

$b_i \, b_{i-1} \cdots b_2 \, b_1 \, b_0 \, . \, b_{-1} \, b_{-2} \, b_{-3} \cdots b_{-j}$

Bit weights: $b_i$ has weight $2^i$, $b_{i-1}$ has weight $2^{i-1}$, ..., $b_2$ has weight 4, $b_1$ has weight 2, $b_0$ has weight 1; after the binary point, $b_{-1}$ has weight 1/2, $b_{-2}$ has weight 1/4, $b_{-3}$ has weight 1/8, ..., $b_{-j}$ has weight $2^{-j}$.

¢ Representation
§ Bits to right of “binary point” represent fractional powers of 2
§ Represents rational number: $\sum_{k=-j}^{i} b_k \times 2^k$

Fractional Binary Numbers


Value        Representation
5 and 3/4    $101.11_2$
2 and 7/8    $10.111_2$
47/64        $0.101111_2$

¢ Observations
§ Shift left = multiply by power of 2
§ Shift right = divide by power of 2
§ Numbers of the form $0.111111\ldots_2$ are just below 1.0
¢ Limitations:
§ Exact representation possible only for numbers of the form $x \times 2^y$
§ Other rational numbers have repeating bit representations
§ $1/3 = 0.333333\ldots_{10} = 0.01010101[01]\ldots_2$
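The weighted sum behind fractional binary numbers can be sketched as a small evaluator. It assumes the input string contains only the characters '0' and '1' and at most one '.'; the function name is an illustrative assumption.

```c
#include <stddef.h>

/* Evaluate a string such as "101.11" as a fractional binary number. */
static double frac_binary_value(const char *s) {
    double value = 0.0;
    double weight = 0.5;        /* weight of the first bit after the point */
    int after_point = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '.') {
            after_point = 1;
        } else if (!after_point) {
            /* Integer part: shift left by one position, then add the bit. */
            value = value * 2.0 + (s[i] - '0');
        } else {
            /* Fractional part: weights 1/2, 1/4, 1/8, ... */
            value += (s[i] - '0') * weight;
            weight /= 2.0;
        }
    }
    return value;
}
```

The three examples from the table evaluate to 5.75, 2.875, and 47/64, all of which a double represents exactly.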

Fixed Point Representation


¢ Implied binary point. Examples:
#1: the binary point is between bits 2 and 3
b7 b6 b5 b4 b3 [.] b2 b1 b0
#2: the binary point is between bits 4 and 5
b7 b6 b5 [.] b4 b3 b2 b1 b0

¢ Same hardware as for integer arithmetic.


#3: integers! the binary point is after bit 0
b7 b6 b5 b4 b3 b2 b1 b0 [.]

¢ Fixed point = fixed range and fixed precision


§ range: difference between largest and smallest numbers possible
§ precision: smallest possible difference between any two numbers


IEEE Floating Point


¢ Analogous to scientific notation
§ 12000000 = $1.2 \times 10^7$ (C: 1.2e7)
§ 0.0000012 = $1.2 \times 10^{-6}$ (C: 1.2e-6)

¢ IEEE Standard 754 used by all major CPUs today

¢ Driven by numerical concerns


§ Rounding, overflow, underflow
§ Numerically well-behaved, but hard to make fast in hardware


Floating Point Representation


¢ Numerical form:
$V_{10} = (-1)^s \times M \times 2^E$

§ Sign bit s determines whether number is negative or positive
§ Significand (mantissa) M normally a fractional value in range [1.0,2.0)
§ Exponent E weights value by a (possibly negative) power of two

¢ Representation in memory:
§ MSB s is sign bit s
§ exp field encodes E (but is not equal to E)
§ frac field encodes M (but is not equal to M)

Field layout, from most to least significant bit: s | exp | frac

Precisions
¢ Single precision: 32 bits
s: 1 bit | exp: 8 bits | frac: 23 bits

¢ Double precision: 64 bits
s: 1 bit | exp: 11 bits | frac: 52 bits

¢ Finite representation means not all values can be represented exactly. Some will be approximated.
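The single-precision layout can be inspected by splitting a float into its three fields. The sketch assumes that float is the 32-bit IEEE 754 single-precision format, which holds on common platforms but is not guaranteed by the C standard; the struct and function names are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;   /* 1 bit   */
    uint32_t exp;    /* 8 bits  */
    uint32_t frac;   /* 23 bits */
} float_fields;

/* Split a float into its s, exp, and frac fields. */
static float_fields split_float(float f) {
    uint32_t bits;
    float_fields out;
    memcpy(&bits, &f, sizeof bits);   /* copy the bit pattern, avoiding pointer casts */
    out.sign = bits >> 31;
    out.exp  = (bits >> 23) & 0xFFu;
    out.frac = bits & 0x7FFFFFu;
    return out;
}
```

For 1.0f the fields are sign 0, exp 127, and frac 0, which illustrates that the exp field encodes E without being equal to it.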

Normalization and Special Values

$V = (-1)^s \times M \times 2^E$, stored as s | exp | frac

¢ “Normalized” = M has the form 1.xxxxx
§ As in scientific notation, but in binary
§ $0.011 \times 2^5$ and $1.1 \times 2^3$ represent the same number, but the latter makes better use of the available bits
§ Since we know the mantissa starts with a 1, we don't bother to store it

¢ How do we represent 0.0? Or special / undefined values like 1.0/0.0?

¢ Special values:
§ zero: s == 0 exp == 00...0 frac == 00...0
§ +∞, -∞: exp == 11...1 frac == 00...0
1.0/0.0 = -1.0/-0.0 = +∞, 1.0/-0.0 = -1.0/0.0 = -∞
§ NaN (“Not a Number”): exp == 11...1 frac != 00...0
Results from operations with undefined result: sqrt(-1), ∞ - ∞, ∞ * 0, etc.
§ note: exp=11…1 and exp=00…0 are reserved, limiting exp range…

Floating Point Operations: Basic Idea


$V = (-1)^s \times M \times 2^E$, stored as s | exp | frac

¢ $x +_f y = \mathrm{Round}(x + y)$

¢ $x \times_f y = \mathrm{Round}(x \times y)$

¢ Basic idea for floating point operations:
§ First, compute the exact result
§ Then, round the result to make it fit into desired precision:
§ Possibly overflow if exponent too large
§ Possibly drop least-significant bits of significand to fit into frac

Floating Point Multiplication


$(-1)^{s_1} M_1 2^{E_1} \times (-1)^{s_2} M_2 2^{E_2}$
¢ Exact Result: $(-1)^s M 2^E$
§ Sign s: s1 ^ s2
§ Significand M: M1 * M2
§ Exponent E: E1 + E2

¢ Fixing
§ If M ≥ 2, shift M right, increment E
§ If E out of range, overflow
§ Round M to fit frac precision

Floating Point Addition


$(-1)^{s_1} M_1 2^{E_1} + (-1)^{s_2} M_2 2^{E_2}$
Assume E1 > E2

¢ Exact Result: $(-1)^s M 2^E$
§ Sign s, significand M: result of signed align & add
§ Exponent E: E1

[Figure: $(-1)^{s_2} M_2$ is shifted right by E1–E2 bit positions to line up with $(-1)^{s_1} M_1$; the aligned values are added to give $(-1)^s M$.]

¢ Fixing
§ If M ≥ 2, shift M right, increment E
§ If M < 1, shift M left k positions, decrement E by k
§ Overflow if E out of range
§ Round M to fit frac precision

Rounding modes
¢ Possible rounding modes (illustrate with dollar rounding):

                       $1.40   $1.60   $1.50   $2.50   –$1.50
§ Round-toward-zero    $1      $1      $1      $2      –$1
§ Round-down (-∞)      $1      $1      $1      $2      –$2
§ Round-up (+∞)        $2      $2      $2      $3      –$1
§ Round-to-nearest     $1      $2      ??      ??      ??
§ Round-to-even        $1      $2      $2      $2      –$2

¢ Round-to-even avoids statistical bias in repeated rounding.
§ Rounds up about half the time, down about half the time.
§ Default rounding mode for IEEE floating-point

Mathematical Properties of FP Operations


¢ Exponent overflow yields +∞ or -∞

¢ Floats with value +∞, -∞, and NaN can be used in operations
§ Result usually still +∞, -∞, or NaN; sometimes intuitive, sometimes not

¢ Floating point operations are not always associative or distributive, due to rounding!
§ (3.14 + 1e10) - 1e10 != 3.14 + (1e10 - 1e10)
§ 1e20 * (1e20 - 1e20) != (1e20 * 1e20) - (1e20 * 1e20) (in single precision, where 1e20 * 1e20 overflows to +∞)
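The associativity example can be reproduced in double precision with a short sketch. The volatile intermediates are there only to stop the compiler from simplifying the arithmetic at compile time; the function names are illustrative assumptions.

```c
/* (small + big) - big: the small operand is rounded when added to big. */
static double add_then_subtract(double small, double big) {
    volatile double sum = small + big;
    return sum - big;
}

/* small + (big - big): the inner difference is exactly 0. */
static double subtract_then_add(double small, double big) {
    volatile double diff = big - big;
    return small + diff;
}
```

With small = 3.14 and big = 1e10, the second grouping returns 3.14 exactly, while the first returns a nearby value because 3.14 + 1e10 cannot be stored without rounding.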

Floating Point in C !!!


¢ C offers two levels of precision
float single precision (32-bit)
double double precision (64-bit)

¢ #include <math.h> to get INFINITY and NAN constants


¢ Equality (==) comparisons between floating point numbers are
tricky, and often return unexpected results
§ Just avoid them!


Floating Point in C !!!


¢ Conversions between data types:
§ Casting between int, float, and double changes the bit representation.
§ int → float
   § May be rounded; overflow not possible
§ int → double or float → double
   § Exact conversion (32-bit ints; 52-bit frac + 1-bit sign)
§ long int → double
   § Rounded or exact, depending on word size
§ double or float → int
   § Truncates fractional part (rounded toward zero)
   § Not defined when out of range or NaN: generally sets to Tmin
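
A sketch of these conversion rules, assuming a 32-bit int. The function names are hypothetical; the casts back go through long long so that the comparison itself cannot overflow.

```c
/* 1 if converting i to float and back recovers i exactly, else 0. */
int int_survives_float(int i) {
    volatile float f = (float)i;           /* may round: float keeps 24 significand bits */
    return (long long)f == (long long)i;
}

/* Same check for double, which keeps 53 significand bits. */
int int_survives_double(int i) {
    volatile double d = (double)i;
    return (long long)d == (long long)i;
}

/* double -> int drops the fractional part (rounds toward zero).
   The caller must keep d within int range; otherwise the result is undefined. */
int truncate_to_int(double d) {
    return (int)d;
}
```

16777216 (2^24) survives the trip through float, but 16777217 does not; both survive the trip through double.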


Number Representation Really Matters !!!


¢ 1991: Patriot missile targeting error
§ clock skew due to conversion from integer to floating point
¢ 1996: Ariane 5 rocket exploded ($1 billion)
§ overflow converting 64-bit floating point to 16-bit integer
¢ 2000: Y2K problem
§ limited (decimal) representation: overflow, wrap-around
¢ 2038: Unix epoch rollover
§ Unix time = seconds since the epoch (12am UTC, January 1, 1970)
§ signed 32-bit integer representation rolls over to TMin in 2038
¢ other related bugs
§ 1994: Intel Pentium FDIV (floating point division) HW bug ($475 million)
§ 1997: USS Yorktown “smart” warship stranded: divide by zero
§ 1999: Mars Climate Orbiter (launched 1998) lost: unit mismatch ($193 million)

Floating Point and the Programmer


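#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[]) {
    float f1 = 1.0;
    float f2 = 0.0;
    int i;
    for (i = 0; i < 10; i++) {
        f2 += 1.0/10.0;            /* 0.1 has no exact binary representation */
    }

    /* Copy the raw bits out with memcpy; *(int*)&f1 violates strict aliasing. */
    unsigned int b1, b2;
    memcpy(&b1, &f1, sizeof b1);
    memcpy(&b2, &f2, sizeof b2);
    printf("0x%08x 0x%08x\n", b1, b2);

    printf("f1 = %10.9f\n", f1);
    printf("f2 = %10.9f\n\n", f2);

    f1 = 1E30;
    f2 = 1E-30;
    float f3 = f1 + f2;            /* f2 is far too small to change f1 */
    printf("f1 == f3? %s\n", f1 == f3 ? "yes" : "no");

    return 0;
}

$ ./a.out
0x3f800000 0x3f800001
f1 = 1.000000000
f2 = 1.000000119

f1 == f3? yes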


Memory Referencing Bug


double fun(int i)
{
    volatile double d[1] = {3.14};
    volatile long int a[2];
    a[i] = 1073741824; /* Possibly out of bounds */
    return d[0];
}

fun(0) –> 3.14
fun(1) –> 3.14
fun(2) –> 3.1399998664856
fun(3) –> 2.00000061035156
fun(4) –> 3.14, then segmentation fault

Explanation (stack layout; the number on the right is the location accessed by fun(i)):

    Saved State   4
    d7 … d4       3
    d3 … d0       2
    a[1]          1
    a[0]          0

Representing 3.14 as a Double FP Number


¢ 1073741824 = 0100 0000 0000 0000 0000 0000 0000 0000
¢ 3.14 = 11.0010 0011 1101 0111 0000 1010 0011…
¢ $(-1)^{s} M 2^{E}$
§ s = 0, encoded as 0
§ M = 1.1001 0001 1110 1011 1000 0101 0001… (leading 1 left out)
§ E = 1, encoded as 1024 (with bias)

    s   exp (11)        frac (first 20 bits)
    0   100 0000 0000   1001 0001 1110 1011 1000

    frac (the other 32 bits)
    0101 0001 1110 1011 1000 0101 0001 1111
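
The encoding of 3.14, and the effect of overwriting either 32-bit half with 1073741824 (0x40000000), can be checked without any out-of-bounds access. This is a sketch with hypothetical helper names; it assumes an IEEE 754 double and copies bits with memcpy.

```c
#include <stdint.h>
#include <string.h>

uint64_t double_to_bits(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

double bits_to_double(uint64_t u) {
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

/* Replace the low-order 32 bits (d3 … d0) of d with word. */
double clobber_low_half(double d, uint32_t word) {
    uint64_t u = double_to_bits(d);
    return bits_to_double((u & 0xFFFFFFFF00000000ULL) | word);
}

/* Replace the high-order 32 bits (d7 … d4) of d with word. */
double clobber_high_half(double d, uint32_t word) {
    uint64_t u = double_to_bits(d);
    return bits_to_double((u & 0x00000000FFFFFFFFULL) | ((uint64_t)word << 32));
}
```

`double_to_bits(3.14)` is 0x40091EB851EB851F. Clobbering the low half with 0x40000000 gives about 3.1399998664856, and clobbering the high half gives about 2.00000061035156, the values returned by fun(2) and fun(3).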


Memory Referencing Bug (Revisited)

Same fun(i) as before. For fun(0) and fun(1) the write stays inside a[], so d still holds 3.14:

    Saved State                                           4
    d7 … d4   0100 0000 0000 1001 0001 1110 1011 1000     3
    d3 … d0   0101 0001 1110 1011 1000 0101 0001 1111     2
    a[1]                                                  1
    a[0]                                                  0
    (right column: location accessed by fun(i))

Memory Referencing Bug (Revisited)

fun(2): the write to a[2] lands on the low-order half of d, so fun returns 3.1399998664856:

    Saved State                                           4
    d7 … d4   0100 0000 0000 1001 0001 1110 1011 1000     3
    d3 … d0   0100 0000 0000 0000 0000 0000 0000 0000     2   <- overwritten
    a[1]                                                  1
    a[0]                                                  0
    (right column: location accessed by fun(i))

Memory Referencing Bug (Revisited)

fun(3): the write to a[3] lands on the high-order half of d (sign, exponent, and top of frac), so fun returns 2.00000061035156:

    Saved State                                           4
    d7 … d4   0100 0000 0000 0000 0000 0000 0000 0000     3   <- overwritten
    d3 … d0   0101 0001 1110 1011 1000 0101 0001 1111     2
    a[1]                                                  1
    a[0]                                                  0
    (right column: location accessed by fun(i))

Summary
¢ As with integers, floats suffer from the fixed number of bits available to represent them
§ Can get overflow/underflow, just like ints
§ Some “simple fractions” have no exact representation (e.g., 0.2)
§ Can also lose precision, unlike ints
   § “Every operation gets a slightly wrong result”

¢ Mathematically equivalent ways of writing an expression may compute different results
§ Violates associativity/distributivity

¢ Never test floating point values for equality!


¢ Careful when converting between ints and floats!

Many more details for the curious...


¢ Exponent bias
¢ Denormalized values – to get finer precision near zero
¢ Distribution of representable values
¢ Floating point multiplication & addition algorithms
¢ Rounding strategies

¢ We won’t be using or testing you on any of these extras in 351.


Normalized Values
$V = (-1)^{s} \cdot M \cdot 2^{E}$

    s (1 bit) | exp (k bits) | frac (n bits)

¢ Condition: exp ≠ 000…0 and exp ≠ 111…1
¢ Exponent coded as biased value: E = exp - Bias
§ exp is an unsigned value ranging from 1 to $2^{k} - 2$ (k == # bits in exp)
§ Bias = $2^{k-1} - 1$
   § Single precision: 127 (so exp: 1…254, E: -126…127)
   § Double precision: 1023 (so exp: 1…2046, E: -1022…1023)
§ These enable negative values for E, for representing very small values

¢ Significand coded with implied leading 1: M = $1.xxx…x_{2}$
§ xxx…x: the n bits of frac
§ Minimum when 000…0 (M = 1.0)
§ Maximum when 111…1 (M = 2.0 – ε)
§ Get extra leading bit for “free”

Normalized Encoding Example


$V = (-1)^{s} \cdot M \cdot 2^{E}$

    s (1 bit) | exp (k bits) | frac (n bits)

¢ Value: float f = 12345.0;
§ $12345_{10} = 11000000111001_{2}$
  $= 1.1000000111001_{2} \times 2^{13}$ (normalized form)
¢ Significand:
    M = $1.1000000111001_{2}$
    frac = $10000001110010000000000_{2}$

¢ Exponent: E = exp - Bias, so exp = E + Bias
    E = 13
    Bias = 127
    exp = 140 = $10001100_{2}$

¢ Result:
    0 10001100 10000001110010000000000
    s exp      frac
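
The three fields of this example can be pulled out of a float in C. A minimal sketch, assuming a 32-bit IEEE 754 float; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float, copied out with memcpy. */
uint32_t float_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

unsigned float_sign(float f) { return float_to_bits(f) >> 31; }           /* 1 bit   */
unsigned float_exp(float f)  { return (float_to_bits(f) >> 23) & 0xFF; }  /* 8 bits  */
uint32_t float_frac(float f) { return float_to_bits(f) & 0x7FFFFF; }      /* 23 bits */
```

For 12345.0f the bit pattern is 0x4640E400: sign 0, exp 140 (so E = 140 - 127 = 13), and frac 0x40E400, which is the 23-bit string shown in the result above.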


Denormalized Values
¢ Condition: exp = 000…0

¢ Exponent value: E = exp – Bias + 1 (instead of E = exp – Bias)


¢ Significand coded with implied leading 0: M = $0.xxx…x_{2}$
§ xxx…x: bits of frac
¢ Cases
§ exp = 000…0, frac = 000…0
   § Represents value 0
   § Note distinct values: +0 and –0 (why?)
§ exp = 000…0, frac ≠ 000…0
   § Numbers very close to 0.0
   § Lose precision as get smaller
   § Equispaced
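
A sketch for poking at denormalized single-precision values by building them from bit patterns. The helper names are hypothetical, and it assumes IEEE 754 floats with denormals enabled (no flush-to-zero).

```c
#include <stdint.h>
#include <string.h>

float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Denormalized: exp field all zeros, frac field nonzero. */
int is_denormalized(float f) {
    uint32_t u = float_bits(f);
    return ((u >> 23) & 0xFF) == 0 && (u & 0x7FFFFF) != 0;
}
```

`bits_to_float(1)` is the smallest positive denormalized value, and consecutive bit patterns 1, 2, 3 are the same distance apart. The patterns 0x00000000 and 0x80000000 are +0 and –0, which differ in bits but compare equal with ==.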

Special Values
¢ Condition: exp = 111…1

¢ Case: exp = 111…1, frac = 000…0
§ Represents value ∞ (infinity)
§ Operation that overflows
§ Both positive and negative
§ E.g., 1.0/0.0 = –1.0/–0.0 = +∞, 1.0/–0.0 = –1.0/0.0 = –∞

¢ Case: exp = 111…1, frac ≠ 000…0
§ Not-a-Number (NaN)
§ Represents case when no numeric value can be determined
§ E.g., sqrt(–1), ∞ – ∞, ∞ * 0
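
These cases can be observed directly in C on a platform with IEEE 754 doubles. A sketch with hypothetical helper names; the division is done in a function so that it happens at run time.

```c
/* Floating-point division; dividing by zero yields an infinity or NaN. */
double divide(double a, double b) {
    return a / b;
}

/* NaN is the only value that compares unequal to itself. */
int is_nan(double x) {
    return x != x;
}
```

`divide(1.0, 0.0)` and `divide(-1.0, -0.0)` give +∞, `divide(1.0, -0.0)` gives –∞, and both ∞ – ∞ and ∞ * 0 produce a NaN.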


Visualization: Floating Point Encodings

    NaN | –∞ | –Normalized | –Denorm | –0 +0 | +Denorm | +Normalized | +∞ | NaN


Tiny Floating Point Example


    s (1 bit) | exp (4 bits) | frac (3 bits)

¢ 8-bit Floating Point Representation


§ the sign bit is in the most significant bit.
§ the next four bits are the exponent, with a bias of 7.
§ the last three bits are the frac

¢ Same general form as IEEE Format


§ normalized, denormalized
§ representation of 0, NaN, infinity


Dynamic Range (Positive Only)


                  s exp  frac    E     Value
                  0 0000 000    -6     0
                  0 0000 001    -6     1/8*1/64 = 1/512     closest to zero
  Denormalized    0 0000 010    -6     2/8*1/64 = 2/512
  numbers         …
                  0 0000 110    -6     6/8*1/64 = 6/512
                  0 0000 111    -6     7/8*1/64 = 7/512     largest denorm
                  0 0001 000    -6     8/8*1/64 = 8/512     smallest norm
                  0 0001 001    -6     9/8*1/64 = 9/512
                  …
                  0 0110 110    -1     14/8*1/2 = 14/16
                  0 0110 111    -1     15/8*1/2 = 15/16     closest to 1 below
  Normalized      0 0111 000     0     8/8*1 = 1
  numbers         0 0111 001     0     9/8*1 = 9/8          closest to 1 above
                  0 0111 010     0     10/8*1 = 10/8
                  …
                  0 1110 110     7     14/8*128 = 224
                  0 1110 111     7     15/8*128 = 240       largest norm
                  0 1111 000    n/a    inf
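
The table rows can be regenerated by a small decoder for the 8-bit format (1 sign bit, 4 exp bits with bias 7, 3 frac bits). This is a sketch; `tiny_decode` and `pow2` are hypothetical names, and the result is returned as a double.

```c
#include <math.h>    /* INFINITY and NAN macros only */
#include <stdint.h>

#define TINY_BIAS 7

/* 2^e for small integer e, computed by repeated doubling or halving. */
static double pow2(int e) {
    double r = 1.0;
    while (e > 0) { r *= 2.0; e--; }
    while (e < 0) { r /= 2.0; e++; }
    return r;
}

double tiny_decode(uint8_t bits) {
    int sign = (bits >> 7) & 1;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    double value;

    if (exp == 0xF) {                       /* special values */
        value = (frac == 0) ? INFINITY : NAN;
    } else if (exp == 0) {                  /* denormalized: E = 1 - Bias, M = 0.frac */
        value = (frac / 8.0) * pow2(1 - TINY_BIAS);
    } else {                                /* normalized: E = exp - Bias, M = 1.frac */
        value = (1.0 + frac / 8.0) * pow2(exp - TINY_BIAS);
    }
    return sign ? -value : value;
}
```

For example, `tiny_decode(0x07)` is 7/512 (largest denorm), `tiny_decode(0x08)` is 8/512 (smallest norm), and `tiny_decode(0x77)` is 240 (largest norm).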

Distribution of Values

¢ 6-bit IEEE-like format
§ e = 3 exponent bits
§ f = 2 fraction bits
§ Bias is $2^{3-1} - 1 = 3$

    s (1 bit) | exp (3 bits) | frac (2 bits)

¢ Notice how the distribution gets denser toward zero.

[Figure: number line from -15 to 15 marking the denormalized, normalized, and infinity values]


Distribution of Values (close-up view)


¢ 6-bit IEEE-like format
§ e = 3 exponent bits
§ f = 2 fraction bits
§ Bias is 3

    s (1 bit) | exp (3 bits) | frac (2 bits)

[Figure: close-up of the number line from -1 to 1 marking the denormalized, normalized, and infinity values]


Interesting Numbers {single,double}

Description               exp     frac    Numeric Value

¢ Zero                    00…00   00…00   0.0
¢ Smallest Pos. Denorm.   00…00   00…01   $2^{-\{23,52\}} \times 2^{-\{126,1022\}}$
§ Single ≈ $1.4 \times 10^{-45}$
§ Double ≈ $4.9 \times 10^{-324}$
¢ Largest Denormalized    00…00   11…11   $(1.0 - \varepsilon) \times 2^{-\{126,1022\}}$
§ Single ≈ $1.18 \times 10^{-38}$
§ Double ≈ $2.2 \times 10^{-308}$
¢ Smallest Pos. Norm.     00…01   00…00   $1.0 \times 2^{-\{126,1022\}}$
§ Just larger than largest denormalized
¢ One                     01…11   00…00   1.0
¢ Largest Normalized      11…10   11…11   $(2.0 - \varepsilon) \times 2^{\{127,1023\}}$
§ Single ≈ $3.4 \times 10^{38}$
§ Double ≈ $1.8 \times 10^{308}$
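
Several of these values are available as macros in `<float.h>`, and their bit patterns can be checked against the exp and frac columns. A sketch for single precision, assuming IEEE 754 floats; `bits_of` is a hypothetical helper.

```c
#include <float.h>   /* FLT_MIN, FLT_MAX */
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float. */
uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

`bits_of(FLT_MIN)` is 0x00800000 (exp 00…01, frac 00…00: smallest positive normalized), `bits_of(1.0f)` is 0x3F800000 (exp 01…11), and `bits_of(FLT_MAX)` is 0x7F7FFFFF (exp 11…10, frac 11…11: largest normalized).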


Special Properties of Encoding


¢ Floating point zero (+0) exactly the same bits as integer zero
§ All bits = 0

¢ Can (Almost) Use Unsigned Integer Comparison
§ Must first compare sign bits
§ Must consider –0 = +0 = 0
§ NaNs problematic
   § Will be greater than any other values
   § What should comparison yield?
§ Otherwise OK
   § Denorm vs. normalized
   § Normalized vs. infinity
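
The comparison rules above can be sketched as a less-than test on raw bit patterns. The function names are hypothetical, 32-bit IEEE 754 floats are assumed, and NaN inputs are deliberately not handled.

```c
#include <stdint.h>
#include <string.h>

static uint32_t raw_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* a < b using only integer operations on the encodings (no NaN support). */
int float_less_bits(float a, float b) {
    uint32_t ua = raw_bits(a), ub = raw_bits(b);

    if (((ua | ub) << 1) == 0) return 0;   /* both are +0 or -0: treated as equal */

    unsigned sa = ua >> 31, sb = ub >> 31;
    if (sa != sb) return sa;               /* signs differ: the negative one is smaller */

    /* Same sign: magnitudes order like unsigned ints, reversed for negatives. */
    return sa ? (ua > ub) : (ua < ub);
}
```

This works across the denormalized/normalized boundary because the exp field sits above frac in the encoding, so a larger exponent always means a larger unsigned integer.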


Floating Point Multiplication


$(-1)^{s_1} M_1 2^{E_1} \times (-1)^{s_2} M_2 2^{E_2}$

¢ Exact Result: $(-1)^{s} M 2^{E}$
§ Sign s: s1 ^ s2 // xor of s1 and s2
§ Significand M: M1 * M2
§ Exponent E: E1 + E2

¢ Fixing
§ If M ≥ 2, shift M right, increment E
§ If E out of range, overflow
§ Round M to fit frac precision
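
The multiplication steps can be sketched on a decomposed value. As an illustration only: `fp_parts` and `fp_mul` are hypothetical names, the significand is held in a double, and the overflow check and final rounding to frac precision are omitted.

```c
/* Hypothetical decomposed value: (-1)^s * m * 2^e, with 1 <= m < 2. */
typedef struct { int s; double m; int e; } fp_parts;

fp_parts fp_mul(fp_parts a, fp_parts b) {
    fp_parts r;
    r.s = a.s ^ b.s;        /* sign: xor of s1 and s2 */
    r.m = a.m * b.m;        /* significand: M1 * M2, in [1, 4) */
    r.e = a.e + b.e;        /* exponent: E1 + E2 */

    if (r.m >= 2.0) {       /* fixing: at most one right shift is needed */
        r.m /= 2.0;
        r.e += 1;
    }
    return r;
}
```

For example, 1.5 × 2^1 (3) times 1.5 × 2^2 (6) gives M = 2.25, E = 3, which is fixed up to 1.125 × 2^4 (18).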


Floating Point Addition


$(-1)^{s_1} M_1 2^{E_1} + (-1)^{s_2} M_2 2^{E_2}$, assuming $E_1 > E_2$

¢ Exact Result: $(-1)^{s} M 2^{E}$
§ Sign s, significand M: result of signed align & add
  (shift $(-1)^{s_2} M_2$ right by $E_1 - E_2$ positions so it lines up with $(-1)^{s_1} M_1$, then add)
§ Exponent E: $E_1$

¢ Fixing
§ If M ≥ 2, shift M right, increment E
§ If M < 1, shift M left k positions, decrement E by k
§ Overflow if E out of range
§ Round M to fit frac precision


Closer Look at Round-To-Even


¢ Default Rounding Mode
§ Hard to get any other kind without dropping into assembly
§ All others are statistically biased
   § Sum of set of positive numbers will consistently be over- or under-estimated

¢ Applying to Other Decimal Places / Bit Positions
§ When exactly halfway between two possible values
   § Round so that least significant digit is even
§ E.g., round to nearest hundredth

    1.2349999 → 1.23 (Less than half way)
    1.2350001 → 1.24 (Greater than half way)
    1.2350000 → 1.24 (Half way: round up)
    1.2450000 → 1.24 (Half way: round down)

Rounding Binary Numbers


¢ Binary Fractional Numbers
§ “Half way” when bits to right of rounding position = $100…_{2}$

¢ Examples
§ Round to nearest 1/4 (2 bits right of binary point)

    Value    Binary           Rounded       Action          Rounded Value
    2 3/32   $10.00011_{2}$   $10.00_{2}$   (<1/2: down)    2
    2 3/16   $10.00110_{2}$   $10.01_{2}$   (>1/2: up)      2 1/4
    2 7/8    $10.11100_{2}$   $11.00_{2}$   ( 1/2: up)      3
    2 5/8    $10.10100_{2}$   $10.10_{2}$   ( 1/2: down)    2 1/2
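
The same rule can be written as bit operations on an unsigned integer that holds the value in units of the smallest bit position. A sketch; the function name is hypothetical and it assumes 0 < drop < 32.

```c
/* Drop the low `drop` bits of x, rounding to nearest with ties to even. */
unsigned round_bits_half_even(unsigned x, unsigned drop) {
    unsigned kept = x >> drop;                  /* bits that survive */
    unsigned rest = x & ((1u << drop) - 1u);    /* bits being discarded */
    unsigned half = 1u << (drop - 1u);          /* pattern 100…0: exactly half way */

    if (rest > half) return kept + 1u;          /* more than half: round up */
    if (rest < half) return kept;               /* less than half: round down */
    return kept + (kept & 1u);                  /* tie: round so the last kept bit is 0 */
}
```

With the table values expressed in 32nds, dropping 3 bits leaves quarters: 67 (2 3/32) gives 8 (2), 70 (2 3/16) gives 9 (2 1/4), 92 (2 7/8) gives 12 (3), and 84 (2 5/8) gives 10 (2 1/2).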
