chapter02b float (Chinese)
A Programmer’s Perspective
Computer Systems
周学海 (Zhou Xuehai)
xhzhou@ustc.edu.cn
0551-63492149
University of Science and Technology of China (中国科学技术大学)
Review
• Representing information as bits
• Bit-level manipulations
• Integers
– Representation: unsigned and signed
– Conversion, casting
– Expanding, truncating
– Addition, negation, multiplication, shifting
• Summary
Floating Point
10/17/2024 Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, Third Edition 3
Floating Point Puzzles
• For each of the following C expressions, either:
– Argue that it is true for all argument values
– Explain why not true
• Declarations:
int x = …;
float f = …;
double d = …;
• Assume neither d nor f is NaN
• x == (int)(float) x
• x == (int)(double) x
• f == (float)(double) f
• d == (float) d
• f == -(-f);
• 2/3 == 2/3.0
• d < 0.0 ⇒ ((d*2) < 0.0)
• d > f ⇒ -f > -d
• d * d >= 0.0
• (d+f)-d == f
Fractional binary numbers
• What is $1011.101_2$?
Fractional Binary Numbers
[Figure: bit positions with weights $2^i$, $2^{i-1}$, …, 4, 2, 1 to the left of the binary point and fractional weights down to $2^{-j}$ to the right]
• Fractional binary numbers
– Bits to right of “binary point” represent fractional powers of 2
– Represents rational number: $\sum_{k=-j}^{i} b_k \times 2^k$
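A minimal sketch in C of the positional sum above, assuming the input is a plain string of 0/1 digits with at most one binary point. The function name `binary_fraction_value` is hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Evaluate a binary string such as "1011.101" as the sum of b_k * 2^k.
 * Hypothetical helper for illustration; returns -1.0 on a malformed string. */
double binary_fraction_value(const char *s)
{
    double value = 0.0;
    double weight = 0.5;      /* weight of the first bit after the point: 2^-1 */
    int seen_point = 0;

    for (size_t n = 0; s[n] != '\0'; n++) {
        char c = s[n];
        if (c == '.') {
            if (seen_point)
                return -1.0;               /* second binary point */
            seen_point = 1;
        } else if (c == '0' || c == '1') {
            int bit = c - '0';
            if (!seen_point) {
                value = value * 2.0 + bit; /* integer part: shift left, add bit */
            } else {
                value += bit * weight;     /* fractional part: add bit * 2^-j */
                weight /= 2.0;
            }
        } else {
            return -1.0;                   /* not a binary digit */
        }
    }
    return value;
}
```

For instance, `binary_fraction_value("1011.101")` evaluates to 11.625, that is 8 + 2 + 1 + 1/2 + 1/8.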
Frac. Binary Number Examples
• Value      Representation
  5 + 3/4    $101.11_2$ = 4 + 1 + 1/2 + 1/4
  2 + 7/8    $10.111_2$ = 2 + 1/2 + 1/4 + 1/8
  63/64      $0.111111_2$ = 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64
• Observations
– Divide by 2 by shifting right
– Multiply by 2 by shifting left
– Numbers of form $0.111111\ldots_2$ are just below 1.0
• $1/2 + 1/4 + 1/8 + \dots + 1/2^i + \dots \to 1.0$
• Use notation $1.0 - \epsilon$
Representable Numbers
• Limitation #1
– Can only exactly represent numbers of the form $x/2^k$
• Other rational numbers have repeating bit representations
• Value    Representation
– 1/3      $0.0101010101[01]\ldots_2$
– 1/5      $0.001100110011[0011]\ldots_2$
– 1/10     $0.0001100110011[0011]\ldots_2$
• Limitation #2
– Just one setting of binary point within the w bits
• Limited range of numbers (very small values? very large?)
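Limitation #1 can be observed directly in C. The sketch below assumes IEEE 754 double arithmetic without excess precision (as on x86-64); the helper name `repeated_sum_is_exact` is made up for this example.

```c
/* Add `step` to an accumulator `count` times and report whether the
 * result equals `expected` exactly. Hypothetical helper for illustration. */
int repeated_sum_is_exact(double step, int count, double expected)
{
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += step;
    return sum == expected;
}
```

With these assumptions, `repeated_sum_is_exact(0.125, 8, 1.0)` holds because 1/8 has the form $x/2^k$, while `repeated_sum_is_exact(0.1, 10, 1.0)` does not, because 1/10 is stored only approximately.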
IEEE Floating Point
IEEE 754-2019
Floating Point Representation
• Numerical form: $(-1)^s \times M \times 2^E$
• Sign bit s determines whether number is negative or positive
• Significand M is normally a fractional value in the range [1.0, 2.0)
• Exponent E weights value by power of two
• Encoding
– MSB is sign bit s
– exp field encodes E
– frac field encodes M
[Field layout: s | exp | frac]
Floating Point Precisions
• Single precision: 32 bits, ≈ 7 decimal digits, range ≈ $10^{\pm 38}$
s | exp    | frac
1 | 8 bits | 23 bits
• Extended precision (80 bits, Intel x87 only)
s | exp     | frac
1 | 15 bits | 63 or 64 bits
Model inference accuracy with different Float8 formats
Three “kinds” of floating point numbers
“Normalized” Numeric Values
• When the normalized representation applies:
– Condition: exp ≠ 000…0 and exp ≠ 111…1
• The exponent E (a signed integer) is encoded as a non-negative integer (exp) by adding a bias
– Exponent coded as biased value: E = exp – Bias
– exp: unsigned value denoted by the exp field
– Bias: bias value
• Single precision: 127 (exp: 1…254, E: -126…127; 8 exponent bits)
• Double precision: 1023 (exp: 1…2046, E: -1022…1023; 11 exponent bits)
• In general: Bias = $2^{e-1} - 1$, where e is the number of exponent bits
• The frac field is interpreted as the fractional part of the significand
– Significand coded with implied leading 1: M = $1.xxx\ldots x_2$
– xxx…x: bits of frac
– Minimum when 000…0 (M = 1.0)
– Maximum when 111…1 (M = 2.0 – ε)
– Get extra leading bit for “free”
Normalized Encoding Example
• Value: float F = 15213.0;
– $15213_{10} = 11101101101101_2 = 1.1101101101101_2 \times 2^{13}$
• Significand
M = $1.1101101101101_2$
frac = $11011011011010000000000_2$ (23 bits)
• Exponent
E = 13
Bias = 127
exp = 140 = $10001100_2$
Floating Point Representation (Class 02):
Hex:       4    6    6    D    B    4    0    0
Binary: 0100 0110 0110 1101 1011 0100 0000 0000
140:     100 0110 0
15213:            1110 1101 1011 01   (leading 1 is implied, not stored)
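The encoding of 15213.0 can be checked mechanically. This is a small sketch that assumes `float` is the IEEE 754 32-bit single-precision format; the function names are illustrative and not part of any library.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float
 * (assumes float is IEEE 754 single precision). */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined way to reinterpret the bytes */
    return u;
}

/* Field extractors for the s | exp | frac layout (1, 8, and 23 bits). */
uint32_t sign_field(uint32_t u) { return u >> 31; }
uint32_t exp_field(uint32_t u)  { return (u >> 23) & 0xFF; }
uint32_t frac_field(uint32_t u) { return u & 0x7FFFFF; }
```

Under that assumption, `float_bits(15213.0f)` is `0x466DB400`, with `exp_field` equal to 140 and `frac_field` equal to `0x6DB400`.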
Denormalized Values
• When the denormalized representation applies
– Condition: exp = 000…0
• Interpretation of the exponent and significand
– Exponent value: E = –Bias + 1
– Significand coded with implied leading 0: M = $0.xxx\ldots x_2$
• xxx…x: bits of frac
• Two cases
– exp = 000…0, frac = 000…0
• Represents value 0
• Note that there are distinct values +0 and –0
– exp = 000…0, frac ≠ 000…0
• Numbers very close to 0.0
• Lose precision as they get smaller
• “Gradual underflow”
Special Values
• Condition for special values: exp = 111…1
• Case 1: exp = 111…1, frac = 000…0
– Represents value ∞ (infinity)
– Operation that overflows
– Both positive and negative
– E.g., 1.0/0.0 = −1.0/−0.0 = +∞, 1.0/−0.0 = −∞
• Case 2: exp = 111…1, frac ≠ 000…0
– Not-a-Number (NaN)
– Represents case when no numeric value can be determined
– E.g., sqrt(–1), ∞ − ∞, ∞ × 0
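The exp/frac conditions for the three kinds of values can be sketched as a classifier. The code assumes IEEE 754 single precision and IEEE division semantics (C Annex F); the enum and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum value_kind { KIND_ZERO, KIND_DENORM, KIND_NORM, KIND_INF, KIND_NAN };

/* Classify a float purely from its exp and frac fields. */
enum value_kind classify_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    uint32_t exp  = (u >> 23) & 0xFF;
    uint32_t frac = u & 0x7FFFFF;

    if (exp == 0)                         /* exp = 000...0 */
        return frac == 0 ? KIND_ZERO : KIND_DENORM;
    if (exp == 0xFF)                      /* exp = 111...1 */
        return frac == 0 ? KIND_INF : KIND_NAN;
    return KIND_NORM;
}

/* Division with the operands as parameters, so that experiments such as
 * 1.0/0.0 are carried out by the floating point unit at run time. */
float quotient(float num, float den)
{
    return num / den;
}
```

For example, `quotient(1.0f, 0.0f)` classifies as `KIND_INF`, and subtracting that result from itself classifies as `KIND_NAN`.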
C float Decoding Example
• float: 0xC0A00000
• binary: 1100 0000 1010 0000 0000 0000 0000 0000
C float Decoding Example #2
• float: 0x001C0000
• binary: 0000 0000 0001 1100 0000 0000 0000 0000
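The two decoding examples can be worked through with a short decoder. This is a sketch for finite single-precision patterns only (it assumes the caller has excluded exp = 111…1); the struct and function names are illustrative.

```c
#include <stdint.h>

struct decoded {
    int sign;      /* 0 or 1 */
    int E;         /* unbiased exponent */
    double M;      /* significand, in [0, 2) */
};

/* Decode a finite single-precision bit pattern (exp != 0xFF). */
struct decoded decode_float_bits(uint32_t u)
{
    struct decoded d;
    uint32_t exp  = (u >> 23) & 0xFF;
    uint32_t frac = u & 0x7FFFFF;
    double fraction = frac / 8388608.0;   /* frac / 2^23 = 0.xxx...x */

    d.sign = (int)(u >> 31);
    if (exp == 0) {                       /* denormalized: implied leading 0 */
        d.E = 1 - 127;
        d.M = fraction;
    } else {                              /* normalized: implied leading 1 */
        d.E = (int)exp - 127;
        d.M = 1.0 + fraction;
    }
    return d;
}

/* Rebuild (-1)^s * M * 2^E using repeated exact scaling by 2. */
double decoded_value(struct decoded d)
{
    double v = d.M;
    for (int i = 0; i < d.E; i++)
        v *= 2.0;
    for (int i = 0; i > d.E; i--)
        v /= 2.0;
    return d.sign ? -v : v;
}
```

With this sketch, `0xC0A00000` decodes to s = 1, E = 2, M = 1.25, which is -5.0, and `0x001C0000` decodes to a denormalized value with E = -126 and M = 7/32.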
Summary of Floating Point Real Number Encoding
[Figure: number line of the encodings, with NaN at both ends and −0 / +0 at the center]
Tiny Floating Point Example
Bit positions: 7 | 6…3 | 2…0
Fields:        s | exp | frac
Widths:        1 |  4  |  3
• 8-bit floating point representation
– The sign bit is in the most significant bit
– The next four bits are the exponent, with a bias of 7 ($2^{4-1} - 1$)
– The last three bits are the frac
• Same general form as the IEEE format
– Normalized, denormalized
– Representation of 0, NaN, infinity
Values Related to the Exponent
                 s exp  frac   E    Value
                 0 0000 000    -6   0
                 0 0000 001    -6   1/8 * 1/64 = 1/512     closest to zero
Denormalized     0 0000 010    -6   2/8 * 1/64 = 2/512
numbers          …
                 0 0000 110    -6   6/8 * 1/64 = 6/512
                 0 0000 111    -6   7/8 * 1/64 = 7/512     largest denorm
                 0 0001 000    -6   8/8 * 1/64 = 8/512     smallest norm
                 0 0001 001    -6   9/8 * 1/64 = 9/512
                 …
                 0 0110 110    -1   14/8 * 1/2 = 14/16
                 0 0110 111    -1   15/8 * 1/2 = 15/16     closest to 1 below
Normalized       0 0111 000     0   8/8 * 1 = 1
numbers          0 0111 001     0   9/8 * 1 = 9/8          closest to 1 above
                 0 0111 010     0   10/8 * 1 = 10/8
                 …
                 0 1110 110     7   14/8 * 128 = 224
                 0 1110 111     7   15/8 * 128 = 240       largest norm
                 0 1111 000    n/a  inf
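The rows of the table can be regenerated with a decoder for the 8-bit format (1 sign bit, 4 exponent bits with bias 7, 3 fraction bits). This is an illustrative sketch; `tiny_float_value` is a made-up name, and `<math.h>` is used only for the `INFINITY` and `NAN` macros.

```c
#include <math.h>     /* INFINITY and NAN macros only */
#include <stdint.h>

/* Decode the 8-bit tiny floating point format into a double. */
double tiny_float_value(uint8_t bits)
{
    int sign = bits >> 7;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    const int bias = 7;
    double value;

    if (exp == 0xF) {                      /* special values */
        value = (frac == 0) ? INFINITY : NAN;
    } else {
        int E;
        double M;
        if (exp == 0) {                    /* denormalized: M = 0.xxx */
            E = 1 - bias;
            M = frac / 8.0;
        } else {                           /* normalized: M = 1.xxx */
            E = exp - bias;
            M = 1.0 + frac / 8.0;
        }
        value = M;
        for (int i = 0; i < E; i++)        /* scale by 2^E, exactly */
            value *= 2.0;
        for (int i = 0; i > E; i--)
            value /= 2.0;
    }
    return sign ? -value : value;
}
```

For example, the pattern `0 0000 111` (0x07) decodes to 7/512, `0 0111 000` (0x38) to 1, and `0 1110 111` (0x77) to 240.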
Dynamic Range
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is $2^{3-1} - 1 = 3$
[Figure: number line from -15 to 15 marking denormalized, normalized, and infinity values]
Distribution of Values (close-up view)
[Figure: close-up number line from -1 to 1 marking denormalized, normalized, and infinity values]
Interesting Numbers
  Description                 exp     frac    Numeric Value
• Zero                        00…00   00…00   0.0
• Smallest Pos. Denorm.       00…00   00…01   $2^{-\{23,52\}} \times 2^{-\{126,1022\}}$
– Single ≈ $1.4 \times 10^{-45}$
– Double ≈ $4.9 \times 10^{-324}$
• Largest Denormalized        00…00   11…11   $(1.0 - \epsilon) \times 2^{-\{126,1022\}}$
– Single ≈ $1.18 \times 10^{-38}$
– Double ≈ $2.2 \times 10^{-308}$
• Smallest Pos. Normalized    00…01   00…00   $1.0 \times 2^{-\{126,1022\}}$
– Just larger than largest denormalized
• One                         01…11   00…00   1.0
• Largest Normalized          11…10   11…11   $(2.0 - \epsilon) \times 2^{\{127,1023\}}$
– Single ≈ $3.4 \times 10^{38}$
– Double ≈ $1.8 \times 10^{308}$
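Several of these values are exposed by `<float.h>`. The sketch below compares the standard macros against the formulas in the table, written as C99 hexadecimal floating constants; it assumes `float` and `double` are the IEEE 754 single and double formats, and the function names are hypothetical.

```c
#include <float.h>

/* Returns 1 if the single-precision limits match the table's formulas. */
int single_limits_match(void)
{
    float largest_denorm = 0x1p-126f - 0x1p-149f;   /* (1.0 - eps) * 2^-126 */

    return FLT_MIN == 0x1p-126f                     /* smallest pos. normalized */
        && FLT_MAX == 0x1.fffffep127f               /* (2.0 - eps) * 2^127 */
        && largest_denorm > 0.0f
        && largest_denorm < FLT_MIN;                /* just below smallest norm */
}

/* Returns 1 if the double-precision limits match the table's formulas. */
int double_limits_match(void)
{
    double largest_denorm = 0x1p-1022 - 0x1p-1074;  /* (1.0 - eps) * 2^-1022 */

    return DBL_MIN == 0x1p-1022
        && DBL_MAX == 0x1.fffffffffffffp1023
        && largest_denorm > 0.0
        && largest_denorm < DBL_MIN;
}
```

Both functions are expected to return 1 on a platform that meets the stated assumption.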
Special Properties of Encoding
• FP zero has the same representation as integer zero
– All bits = 0
• In most cases, unsigned integer comparison rules can be applied to floating point values
– Must first compare sign bits
– Must consider –0 = 0
– NaNs problematic
• As bit patterns, they will be greater than any other values
• What should comparison yield? The answer is complicated.
– Otherwise OK
• Denorm vs. normalized
• Normalized vs. infinity
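One common way to exploit this property is to map each bit pattern to an unsigned key whose integer order follows the numeric order. The sketch below is an illustration rather than anything taken from the slides; it assumes IEEE 754 single precision and non-NaN inputs, and the names `order_key` and `float_less_by_bits` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Map a non-NaN float to a key such that unsigned comparison of keys
 * follows numeric order. The sign bit is handled first: negative values
 * have all bits flipped, non-negative values get the top bit set. */
uint32_t order_key(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

/* Compare two non-NaN floats using only integer operations.
 * Caveat: this orders -0 strictly below +0, whereas IEEE comparison
 * treats them as equal. */
int float_less_by_bits(float a, float b)
{
    return order_key(a) < order_key(b);
}
```

The treatment of –0 and +0 shows why the slide says the zeros must be considered separately: `float_less_by_bits(-0.0f, 0.0f)` is 1, while `-0.0f < 0.0f` is 0.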
Floating Point
Floating Point Operations: Basic Idea
• $x +_f y = \mathrm{Round}(x + y)$
• $x \times_f y = \mathrm{Round}(x \times y)$
• Basic idea
– First compute exact result
– Make it fit into desired precision
• Possibly overflow if exponent too large
• Possibly round to fit into frac
Floating Point Operations
• Rounding modes (illustrated with $ rounding)
                              $1.40  $1.60  $1.50  $2.50  –$1.50
– Towards zero                $1     $1     $1     $2     –$1
– Round down (−∞)             $1     $1     $1     $2     –$2
– Round up (+∞)               $2     $2     $2     $3     –$1
– Nearest even (default)      $1     $2     $2     $2     –$2
Note:
1. Round down: the rounded result is the closest value no greater than the true result.
2. Round up: the rounded result is the closest value no less than the true result.
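The four rounding modes in the dollar table can be sketched with integer arithmetic on cents. This is only an illustration of the rules, not how floating point hardware is implemented; the function names are made up, and amounts are assumed to be given in whole cents.

```c
/* Amounts are integer cents; each function returns whole dollars. */

long round_toward_zero(long cents)
{
    return cents / 100;                 /* C integer division truncates */
}

long round_down(long cents)             /* toward -infinity */
{
    long q = cents / 100;
    return (cents % 100 < 0) ? q - 1 : q;
}

long round_up(long cents)               /* toward +infinity */
{
    long q = cents / 100;
    return (cents % 100 > 0) ? q + 1 : q;
}

long round_nearest_even(long cents)
{
    long q = round_down(cents);
    long r = cents - q * 100;           /* remainder in [0, 100) */
    if (r > 50 || (r == 50 && q % 2 != 0))
        q += 1;                         /* above halfway, or a tie with odd q */
    return q;
}
```

For the –$1.50 column, the four functions return –1, –2, –1, and –2 respectively.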
Closer Look at Round-To-Even
• Default rounding mode in IEEE 754: Round-To-Even
– Hard to get any other kind without dropping into assembly
• C99 has support for rounding mode management
– All others are statistically biased
• Sum of set of positive numbers will consistently be over- or underestimated
• Round-To-Even
– When exactly halfway between two possible values
• Round so that least significant digit is even
– E.g., round to nearest hundredth
1.2349999 → 1.23 (Less than half way)
1.2350001 → 1.24 (Greater than half way)
1.2350000 → 1.24 (Half way: round up)
1.2450000 → 1.24 (Half way: round down)
Rounding Binary Numbers
• Binary fractional numbers
– “Even” when least significant bit is 0
• Example
– Round to nearest 1/4 (2 bits right of binary point)
Value    Binary         Rounded      Action          Rounded Value
2 3/32   $10.00011_2$   $10.00_2$    (<1/2: down)    2
2 3/16   $10.00110_2$   $10.01_2$    (>1/2: up)      2 1/4
2 7/8    $10.11100_2$   $11.00_2$    (1/2: up)       3
2 5/8    $10.10100_2$   $10.10_2$    (1/2: down)     2 1/2
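The table can be reproduced with a short round-to-even routine on fixed-point integers. In this sketch the values are assumed to be stored in units of 1/32 (five bits right of the binary point), so rounding to the nearest 1/4 drops three bits; the function name is hypothetical.

```c
/* Drop the low `drop` bits of a non-negative fixed-point value,
 * rounding to nearest and breaking ties toward an even result.
 * Requires 0 < drop < number of bits in unsigned. */
unsigned round_to_even(unsigned v, unsigned drop)
{
    unsigned half = 1u << (drop - 1);          /* bit pattern 100...0 */
    unsigned rem  = v & ((1u << drop) - 1);    /* the bits being removed */
    unsigned q    = v >> drop;                 /* the bits being kept */

    if (rem > half || (rem == half && (q & 1u)))
        q += 1;                                /* above halfway, or tie with odd LSB */
    return q;
}
```

For example, 2 7/8 is 92 in units of 1/32, and `round_to_even(92, 3)` gives 12, which is 3 in units of 1/4.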
Rounding
1.BBGRXXX
• Guard bit (G): LSB of result
• Round bit (R): first bit removed
• Sticky bit: OR of the remaining bits (XXX)
• Conditions for rounding up
– Round = 1, Sticky = 1 ➙ > 0.5
– Guard = 1, Round = 1, Sticky = 0 ➙ Round to even
FP Addition
• Operands: $(-1)^{s_1} M_1 2^{E_1} + (-1)^{s_2} M_2 2^{E_2}$
– Assume $E_1 > E_2$
• Exact result: $(-1)^s M 2^E$
– Sign s, significand M:
• Result of signed align & add
– Exponent E: $E_1$
[Figure: $(-1)^{s_2} M_2$ is shifted right by $E_1 - E_2$ bit positions and added to $(-1)^{s_1} M_1$, giving $(-1)^s M$]
• Adjusting the result
– If M ≥ 2, shift M right, increment E
– If M < 1, shift M left k positions, decrement E by k
– Overflow if E out of range
– Round M to fit frac precision
Mathematical Properties of FP Add
• Does it form an Abelian group?
– Closed under addition? YES
• But may generate infinity or NaN
– Commutative? YES
Math. Properties of FP Mult
• Does it form a commutative ring?
– Closed under multiplication? YES
• But may generate infinity or NaN
– Multiplication commutative? YES
– Multiplication associative? NO
• Possibility of overflow, inexactness of rounding
– 1 is multiplicative identity? YES
– Multiplication distributes over addition? NO
• Possibility of overflow, inexactness of rounding
• In single precision: 1e20*(1e20-1e20) = 0.0, but 1e20*1e20 – 1e20*1e20 = NaN (1e20*1e20 overflows to infinity)
• Does it satisfy monotonicity?
– a ≥ b & c ≥ 0 ⇒ a*c ≥ b*c? ALMOST
• Except for infinities & NaNs
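The failures of distributivity and associativity can be reproduced in single precision. The sketch assumes IEEE 754 `float` arithmetic; intermediates are stored in `float` variables so that each step is rounded to single precision. The function names are illustrative.

```c
/* a * (b - c), with the difference rounded to float first. */
float product_of_difference(float a, float b, float c)
{
    float diff = b - c;
    return a * diff;
}

/* a*b - a*c, with each product rounded to float first. */
float difference_of_products(float a, float b, float c)
{
    float ab = a * b;
    float ac = a * c;
    return ab - ac;
}

/* (a*b)*c and a*(b*c), again rounding each intermediate to float. */
float mul_left(float a, float b, float c)  { float ab = a * b; return ab * c; }
float mul_right(float a, float b, float c) { float bc = b * c; return a * bc; }

/* NaN is the only value that compares unequal to itself. */
int is_nan(float x)
{
    return x != x;
}
```

With a = b = c = 1e20f, `product_of_difference` returns 0.0 while `difference_of_products` returns NaN, because 1e20f * 1e20f overflows to infinity. Similarly, `mul_left(1e20f, 1e20f, 1e-20f)` is infinite while `mul_right` with the same arguments stays finite.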
Floating Point in C
• C supports floating point operations at two precisions
float     single precision
double    double precision
• Conversion rules between data types
– Casting between int, float, & double changes numeric values and bit representation
– double or float to int
• Truncates fractional part
• Like rounding toward zero
• Not defined when out of range
– Generally saturates to TMin
– int to double
• Exact conversion, as long as int has ≤ 53 bit word size
– int to float
• Will round according to rounding mode
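Because an out-of-range conversion to int is undefined in C, portable code usually checks the range first. The sketch below makes the "TMin" outcome explicit instead of relying on what a particular machine does; it assumes a 32-bit `int`, and the function name is hypothetical.

```c
#include <limits.h>

/* Convert double to int with truncation toward zero.
 * Returns INT_MIN (TMin) when d is NaN or outside the int range,
 * instead of performing a conversion with undefined behavior.
 * Assumes a 32-bit int, so both bounds below are exact doubles. */
int double_to_int_or_tmin(double d)
{
    double lower = (double)INT_MIN - 1.0;
    double upper = (double)INT_MAX + 1.0;

    if (!(d > lower && d < upper))   /* also true for NaN: comparisons fail */
        return INT_MIN;
    return (int)d;                   /* in range: fractional part is dropped */
}
```

For example, 2.9 converts to 2 and -2.9 converts to -2 (rounding toward zero), while 1e10 is reported as `INT_MIN`.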
Answers to Floating Point Puzzles
int x = …;
float f = …;
double d = …;
Assume neither d nor f is NaN
• x == (int)(float) x           No: 24 bit significand
• x == (int)(double) x          Yes: 53 bit significand
• f == (float)(double) f        Yes: increases precision
• d == (float) d                No: loses precision
• f == -(-f);                   Yes: just change sign bit
• 2/3 == 2/3.0                  No: 2/3 == 0
• d < 0.0 ⇒ ((d*2) < 0.0)       Yes!
• d > f ⇒ -f > -d               Yes!
• d * d >= 0.0                  Yes!
• (d+f)-d == f                  No: not associative
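Concrete counterexamples for the "No" answers can be tried out with a few one-line predicates. The sketch assumes a 32-bit `int` and IEEE 754 `float` and `double` without excess precision; the function names are made up for this illustration.

```c
/* Each function evaluates one puzzle expression for the given arguments. */

int puzzle_int_float(int x)       { return x == (int)(float)x; }
int puzzle_int_double(int x)      { return x == (int)(double)x; }
int puzzle_double_float(double d) { return d == (float)d; }
int puzzle_division(void)         { return 2/3 == 2/3.0; }

int puzzle_reassociate(double d, float f)
{
    double sum = d + f;           /* rounded to double before subtracting */
    return (sum - d) == f;
}
```

With x = 16777217 (that is, 2^24 + 1) the float round trip fails while the double round trip succeeds; d = 0.1 is not preserved by the cast to float; and d = 1e20 with f = 1.0 shows the lack of associativity, since the 1.0 is absorbed by the addition.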
Summary
• IEEE 754 floating point arithmetic has clear mathematical properties
– Its behavior can be predicted independently of the implementation
– As if computed with perfect precision and then rounded
• Floating point numbers have the form $M \times 2^E$
• Differences from real arithmetic:
– Violates associativity/distributivity
– Makes life difficult for compilers & serious numerical applications programmers
Additional Slides: Creating Floating Point Number
• Steps
– Normalize to have leading 1
– Round to fit within fraction
– Postnormalize to deal with effects of rounding
• Example
– Convert 8-bit unsigned numbers to tiny floating point format
Normalize
• Steps
– Set binary point so that numbers are of the form 1.xxxxx
– Adjust all to have leading one
• Decrement exponent as shift left
Postnormalize
• Postnormalization
– Rounding may have caused overflow
– Handle by shifting right once & incrementing exponent
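The three steps (normalize, round, postnormalize) can be put together for the running example of converting an 8-bit unsigned number to the tiny format (1 sign bit, 4 exponent bits with bias 7, 3 fraction bits). This is a sketch under those assumptions, using guard, round, and sticky bits with round-to-even; the function name is hypothetical.

```c
#include <stdint.h>

/* Convert an 8-bit unsigned integer to the 8-bit tiny floating point format.
 * Values that round above the largest normalized number (240) come out as
 * the infinity pattern 0 1111 000. */
uint8_t unsigned_to_tiny(uint8_t u)
{
    if (u == 0)
        return 0;                         /* +0 is all zero bits */

    /* Normalize: shift left until the leading 1 sits in bit 7,
     * decrementing the exponent for each shift. */
    unsigned m = u;
    int E = 7;
    while ((m & 0x80u) == 0) {
        m <<= 1;
        E--;
    }

    /* m now reads 1.BBGRXXX: bits 6..4 are the fraction (bit 4 is the
     * guard bit), bit 3 is the round bit, bits 2..0 feed the sticky bit. */
    unsigned kept   = m >> 4;             /* 1BBG, a 4-bit significand */
    unsigned guard  = kept & 1u;
    unsigned round  = (m >> 3) & 1u;
    unsigned sticky = (m & 0x7u) != 0;

    /* Round: up when above halfway, or on a tie with an odd guard bit. */
    if (round && (sticky || guard))
        kept += 1;

    /* Postnormalize: rounding may have carried out to 10.000. */
    if (kept & 0x10u) {
        kept >>= 1;
        E++;
    }

    unsigned exp = (unsigned)(E + 7);     /* add the bias */
    return (uint8_t)((exp << 3) | (kept & 0x7u));
}
```

For example, 19 ($1.0011_2 \times 2^4$) is exactly halfway between 18 and 20 and rounds to even, giving the pattern for 20; 63 rounds up to $10.000_2 \times 2^5$ and is postnormalized to $1.000_2 \times 2^6$ = 64.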
This is important!
• Ariane 5 exploded on its maiden launch, a loss of $500 million
– Exploded 37 seconds after liftoff; cargo worth $500 million
• Cause:
– 64-bit floating point number assigned to 16-bit integer
• Computed horizontal velocity as floating point number
• Converted to 16-bit integer
• Worked OK for Ariane 4
• Overflowed for Ariane 5
• Used same software
– Caused the rocket to get an incorrect value of horizontal velocity and crash
• Patriot missile defense system failed to intercept a Scud missile: 28 people killed
– System tracks time in tenths of second
– Converted from integer to floating point number
– Accumulated rounding error causes drift. 20% drift over 8 hours.
– Eventually (on 2/25/1991 system was on for 100 hours) causes range misestimation sufficiently large to miss incoming missiles
Acknowledgements
Definitions of groups, rings, and fields