Lecture9 - Fixed Point

/INFOMOV/
Optimization & Vectorization

J. Bikker - Sep-Nov 2015 - Lecture 9: “Fixed Point Math”
Welcome!
Today’s Agenda:
▪ Introduction
▪ Float to Fixed Point and Back
▪ Operations
▪ Fixed Point & Accuracy
INFOMOV – Lecture 9 – “Fixed Point Math”
Introduction
The Concept of Fixed Point Math
Basic idea: we have 𝜋: 3.1415926536.
▪ Multiplying that by $10^{10}$ yields 31415926536.
▪ Adding 1 to 𝜋 yields 4.1415926536.
▪ Adding $1 \cdot 10^{10}$ to the scaled-up version of 𝜋 yields 41415926536.
▪ In base 10, we get 𝑁 digits of fractional precision if we multiply our numbers by $10^{N}$ (and remember where we put that dot).
Some consequences:
▪ π · 2 ≡ 31415926536 * 20000000000 = 628318530720000000000 (the product is scaled by $10^{20}$ and must be divided by $10^{10}$ again).
▪ π / 2 ≡ 31415926536 / 20000000000 = 1 (or 2, if we use proper rounding): the scale factors cancel and all fractional digits are lost.
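The two consequences above can be sketched in C. This is a minimal illustration that assumes only four fractional digits (a scale of $10^{4}$ rather than $10^{10}$) so that intermediate products fit in a 64-bit integer; the names `DEC_SCALE`, `dec_mul` and `dec_div` are hypothetical and not part of the lecture material.

```c
#include <stdint.h>

/* Assumption: 4 fractional decimal digits, so 3.1416 is stored as 31416. */
#define DEC_SCALE 10000

/* The raw product carries the scale twice; divide once to restore it. */
static int64_t dec_mul(int64_t a, int64_t b) { return a * b / DEC_SCALE; }

/* Pre-scale the dividend so the quotient keeps its fractional digits. */
static int64_t dec_div(int64_t a, int64_t b) { return a * DEC_SCALE / b; }
```

With π stored as 31416 and 2 stored as 20000, `dec_mul` gives 62832 (6.2832) and `dec_div` gives 15708 (1.5708), whereas the plain integer division 31416 / 20000 gives 1.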
Introduction
The Concept of Fixed Point Math
On a computer, this is naturally done in base 2. Starting with π again:
▪ Multiplying by $2^{16}$ yields 205887.
▪ Adding $1 \cdot 2^{16}$ to the scaled-up version of 𝜋 yields 271423.
In binary:
▪ 205887 = 00000000 00000011 00100100 00111111
▪ 271423 = 00000000 00000100 00100100 00111111
Looking at the first number (205887), and splitting it into two sets of 16 bits, we get:
▪ 11 (base 2) = 3 (base 10);
▪ 10010000111111 (base 2) = 9279 (base 10); $\frac{9279}{2^{16}} = 0.141586304$.
Introduction
But… Why!?
Nintendo DS has two CPUs: ARM946E-S (main) and ARM7TDMI (coproc).

Characteristics: 32-bit processor, no floating point support.
Many DSPs do not support floating point. A DSP that supports floating
point is more complex, and more expensive.
Pixel operations can be dominated by int-to-float and float-to-int
conversions if we use float arithmetic.
Floating point and integer instructions can execute at the same time on a
superscalar processor architecture.
Introduction
But… Why!?
Texture mapping in Quake 1: Perspective Correction
▪ Affine texture mapping: interpolate u/v linearly over polygon
▪ Perspective correct texture mapping: interpolate 1/z, u/z and v/z.
▪ Reconstruct u and v per pixel using the reciprocal of 1/z.
Quake’s solution:
▪ Divide a horizontal line of pixels into segments of 8 pixels;
▪ Calculate u and v for the start and end of the segment;
▪ Interpolate linearly (fixed point!) over the 8 pixels.
And:
Start the floating point division (21 cycles) for the next segment, so it can complete while we execute integer code for the linear interpolation.
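The fixed point interpolation over one 8-pixel segment might look roughly like the sketch below. It is not Quake's actual code: the 16:16 format, the function name `interpolate_span` and the assumption that the perspective-correct u values at both ends of the segment are already known are all illustrative choices.

```c
#include <stdint.h>

#define SPAN 8  /* pixels per segment, as on the slide */

/* Fill out[0..SPAN-1] with whole texel coordinates, stepping linearly
   from u0 toward u1 using 16:16 integer arithmetic only inside the loop. */
static void interpolate_span(float u0, float u1, int out[SPAN])
{
    int32_t fp_u  = (int32_t)(u0 * 65536.0f);                 /* 16:16 start value */
    int32_t fp_du = (int32_t)((u1 - u0) * 65536.0f) / SPAN;   /* 16:16 step per pixel */
    for (int i = 0; i < SPAN; i++) {
        out[i] = fp_u >> 16;   /* drop the fractional bits */
        fp_u += fp_du;         /* one integer add per pixel */
    }
}
```

The two float-to-int conversions happen once per segment; the per-pixel work is an add and a shift.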
Introduction
But… Why!?
Epsilon: required to prevent registering a hit at distance 0.

What is the optimal epsilon?
Too large: light leaks because we miss the left wall;
Too small: we get the hit at distance 0.
Solution: use fixed point math, and set epsilon to 1.
For an example, see “Fixed Point Hardware Ray Tracing”, J. Hanika, 2007.
https://www.uni-ulm.de/fileadmin/website_uni_ulm/iui.inst.100/institut/mitarbeiter/jo/dreggn2.pdf
Conversions
Practical Things
Converting a floating point number to fixed point:
Multiply the float by a power of 2 represented by a floating point value, and cast the result to an integer. E.g.:
int fp_pi = (int)(3.141593f * 65536.0f); // 16 bits fractional
After calculations, convert the result to int by discarding the fractional bits. E.g.:
int result = fp_pi >> 16; // divide by 65536 (rounds down)
Or, get the original float back by casting to float and dividing by $2^{\text{fractionalbits}}$:
float result = (float)fp_pi / 65536.0f;
Note that this last option has significant overhead, which should be outweighed by the gains.
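The three conversions above can be collected into small helpers. This is a minimal sketch assuming a signed 16:16 format; the type name `fixed16` and the function names are hypothetical.

```c
#include <stdint.h>

#define FP_BITS 16
#define FP_ONE  (1 << FP_BITS)   /* 65536, i.e. 1.0 in 16:16 */

typedef int32_t fixed16;         /* hypothetical name for a 16:16 value */

/* float -> fixed: scale by 2^16 and truncate. */
static fixed16 float_to_fixed(float f) { return (fixed16)(f * (float)FP_ONE); }

/* fixed -> int: discard the fractional bits. */
static int32_t fixed_to_int(fixed16 x) { return x >> FP_BITS; }

/* fixed -> float: the expensive way back. */
static float fixed_to_float(fixed16 x) { return (float)x / (float)FP_ONE; }
```

Values that are exactly representable in 16 fractional bits, such as 1.5, survive the round trip unchanged; everything else is truncated to a multiple of 1/65536.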
Conversions
Practical Things - Considerations
Example: precomputed sin/cos table
#define FP_SCALE 1073741824.0f // 2^30; the first guess was 65536.0f (2^16)
int sintab[256], costab[256];
for( int i = 0; i < 256; i++ )
{
    sintab[i] = (int)(FP_SCALE * sinf( (float)i / 128.0f * PI ));
    costab[i] = (int)(FP_SCALE * cosf( (float)i / 128.0f * PI ));
}
What is the best value for FP_SCALE in this case? And should we use int or unsigned int for the table?
Sine/cosine: range is [-1, 1]. In this case, we need 1 sign bit, and 1 bit for the whole part of the number. So:
▪ We use 30 bits for fractional precision, 1 for sign, 1 for range: a signed int with FP_SCALE = $2^{30}$.
In base 10, the fractional precision is ~9 digits (float has 7).
Conversions
Example: values in a z-buffer
A 3D engine needs to keep track of the depth of pixels on the screen for depth sorting. For this, it uses a z-buffer.
We can make two observations:
1. All values are positive (no objects behind the camera are drawn);
2. Further away we need less precision.
By adding 1 to z, we guarantee that z + 1 is in the range [1..infinity].
The reciprocal 1/(z+1) is then in the range [0..1].
We store 1/(z+1) as a 0:32 unsigned fixed point number for maximum precision.
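A possible encoding of such a depth value is sketched below. The function name `encode_depth`, the use of `double` for the intermediate value and the clamp are assumptions made for this illustration: the value 1.0 itself (z = 0) does not fit in a 0:32 number, so it is clamped to the largest representable value.

```c
#include <stdint.h>

/* Encode z >= 0 as 1/(z+1) in unsigned 0:32 fixed point. */
static uint32_t encode_depth(double z)
{
    double r = 1.0 / (z + 1.0);            /* in (0, 1] for z >= 0 */
    double scaled = r * 4294967296.0;      /* scale by 2^32 */
    if (scaled >= 4294967295.0)
        return 0xFFFFFFFFu;                /* clamp: 1.0 does not fit in 0:32 */
    return (uint32_t)scaled;
}
```

Nearby surfaces map to large values and distant ones approach zero, so the depth test is reversed compared to storing z directly.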
Conversions
Example: particle simulation
Your particle simulation operates on particles inside a 100x100x100 box centered around the origin. What fixed point format do you use for the coordinates of the particles?
1. Since all coordinates are in the range [-50,50], we need a sign bit.
2. The maximum integer value of 50 fits in 6 bits.
3. This leaves 25 bits fractional precision (a bit more than 8 decimal digits).
▪ We use a 7:25 fixed point representation.
Better: scale the simulation to a box of 127x127x127 for better use of the full range; this gets you ~8.5 decimal digits of precision.
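The bit counting used for the sine table and for the particle box can be written down as a small helper. This is only a sketch of the reasoning; the name `frac_bits_for` is hypothetical, and it assumes a signed 32-bit word with one bit reserved for the sign.

```c
/* Number of fractional bits left in a signed 32-bit word after reserving
   the sign bit and enough integer bits to hold max_abs. */
static int frac_bits_for(unsigned max_abs)
{
    int int_bits = 0;
    while (int_bits < 31 && (1u << int_bits) <= max_abs)
        int_bits++;              /* grow until 2^int_bits exceeds max_abs */
    return 31 - int_bits;
}
```

A maximum magnitude of 1 leaves 30 fractional bits (the sine table), 50 leaves 25 (the 7:25 particle format), and 32767 leaves 16 (plain 16:16).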
Conversions
Mixing fixed point formats:
Suppose you want to add a sine wave to your 7:25 particle coordinates using
the precalculated 2:30 sine table. How do we get from 2:30 to 7:25?
Simple: shift the sine values 5 bits to the right (losing some precision).
(What happens if you used the 127x127x127 grid, and adding the sine wave
makes particles exceed this range?)
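The format change amounts to a single shift before the addition. The sketch below uses hypothetical function names and assumes that the compiler implements `>>` on negative values as an arithmetic shift, which mainstream compilers do but the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* 2:30 -> 7:25: drop the 5 least significant fractional bits. */
static int32_t fp_2_30_to_7_25(int32_t v) { return v >> 5; }

/* Add a 2:30 sine value to a 7:25 coordinate; the result is 7:25.
   There is no overflow check: a coordinate close to the edge of the
   representable range can overflow when the sine value is added. */
static int32_t add_sine(int32_t fp_pos_7_25, int32_t fp_sin_2_30)
{
    return fp_pos_7_25 + fp_2_30_to_7_25(fp_sin_2_30);
}
```

For example, 10.0 in 7:25 plus 0.5 in 2:30 yields 10.5 in 7:25.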
Conversions
Practical Things – 64 bit
So far, we assumed the use of 32-bit integers to represent our fixed point numbers. What about 64-bit?
▪ Process is the same
▪ But storage requirements double.
In many cases, we do not need the extra precision; but we will use 64-bit to overcome problems with multiplication and division.
Operations
Addition & Subtraction
Adding two fixed point numbers is straightforward:
fp_a = … ;
fp_b = … ;
fp_sum = fp_a + fp_b;
Subtraction is done in the same way.
Note that this does require that fp_a and fp_b have the same
number of fractional bits. Also don’t mix signed and unsigned
carelessly.
fp_a = … ; // 8:24
fp_b = … ; // 16:16
fp_sum = (fp_a >> 8) + fp_b; // result is 16:16
Operations
Multiplication
Multiplying fixed point numbers:
fp_a = … ; // 10:22
fp_b = … ; // 10:22
fp_prod = fp_a * fp_b; // 20:44
Situation 1: fp_prod is a 64 bit value.
▪ Divide fp_prod by $2^{22}$ to reduce it to 20:22 fixed point (shift right by 22 bits).
Situation 2: fp_prod is a 32 bit value.
▪ Ensure that intermediate results never exceed 32 bits.
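Situation 1 can be sketched as follows, assuming a signed 10:22 format stored in `int32_t`. The function name `fp_mul_10_22` is hypothetical, and the caller is assumed to guarantee that the result fits in 10:22 again.

```c
#include <stdint.h>

/* Multiply two 10:22 numbers via a 64-bit intermediate value. */
static int32_t fp_mul_10_22(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;   /* 20:44 */
    return (int32_t)(wide >> 22);             /* back to 22 fractional bits */
}
```

Multiplying 3.0 by 0.5 this way gives exactly 1.5, with no precision lost in either operand.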

Operations
Multiplication
▪ “Ensure that intermediate results never exceed 32 bits.”
Using the 10:22 * 10:22 example from the previous slide:
1. (fp_a * fp_b) >> 22; // good if fp_a and fp_b are very small
2. (fp_a >> 22) * fp_b; // good if fp_a is a whole number
3. (fp_a >> 11) * (fp_b >> 11); // good if fp_a and fp_b are large
4. ((fp_a >> 5) * (fp_b >> 5)) >> 12; // in between: keeps 17 fractional bits per operand
Which option we choose depends on the parameters:
fp_a = (int)(PI * 4194304.0f); // PI in 10:22 (4194304 = 2^22)
fp_b = (int)(0.5f * 4194304.0f); // 0.5 in 10:22
int fp_prod = fp_a >> 1; // multiplying by 0.5 is a shift right by one bit
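To get a feeling for the trade-offs, options 3 and 4 can be compared against a 64-bit reference. This is a sketch with hypothetical function names; it assumes non-negative 10:22 inputs, and option 4 is only safe while the real-valued product stays below 1/8, because its intermediate result has 34 fractional bits.

```c
#include <stdint.h>

/* Reference: full-precision product via a 64-bit intermediate. */
static int32_t mul_ref(int32_t a, int32_t b)  { return (int32_t)(((int64_t)a * b) >> 22); }

/* Option 3: drop 11 fractional bits from each operand before multiplying. */
static int32_t mul_opt3(int32_t a, int32_t b) { return (a >> 11) * (b >> 11); }

/* Option 4: drop 5 bits from each operand, then 12 from the product. */
static int32_t mul_opt4(int32_t a, int32_t b) { return ((a >> 5) * (b >> 5)) >> 12; }
```

For a slightly-more-than-3.0 value times 3.0, option 3 throws away the low 11 bits of the first operand and ends up 6141 units of $2^{-22}$ below the reference; for 0.25 times 0.25, option 4 matches the reference exactly.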
Operations
Division
Dividing fixed point numbers:
fp_a = … ; // 10:22
fp_b = … ; // 10:22
fp_quot = fp_a / fp_b; // 10:0: the fractional bits cancel out
Situation 1: we can use a 64-bit intermediate value.
▪ Multiply fp_a by $2^{22}$ before the division (shift left by 22 bits).
Situation 2: we need to respect the 32-bit limit.

Operations
Division
1. (fp_a << 22) / fp_b; // good if fp_a and fp_b are very small
2. fp_a / (fp_b >> 22); // good if fp_b is a whole number
3. (fp_a << 11) / (fp_b >> 11); // good if fp_a and fp_b are large
4. ((fp_a << 5) / (fp_b >> 5)) >> ?;
Note that a division by a constant can be replaced by a multiplication by its reciprocal:
fp_reci = (int)((1LL << 44) / fp_b); // 1.0 in 20:44 divided by 10:22 yields 10:22
fp_prod = (fp_a * fp_reci) >> 22; // or one of the alternatives
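Division with a 64-bit intermediate value and the reciprocal trick can be sketched together. The function names are hypothetical; the sketch assumes 10:22 inputs, a non-zero divisor, and a divisor large enough that its reciprocal still fits in 10:22.

```c
#include <stdint.h>

#define FP22_ONE 4194304   /* 2^22, i.e. 1.0 in 10:22 */

/* a / b with the dividend pre-scaled in 64 bits; the result is 10:22. */
static int32_t fp_div_10_22(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * FP22_ONE) / b);
}

/* 1 / b in 10:22, computed once for a constant divisor. */
static int32_t fp_reciprocal_10_22(int32_t b)
{
    return (int32_t)((((int64_t)1) << 44) / b);
}

/* 10:22 * 10:22 -> 10:22 via a 64-bit product. */
static int32_t fp_mul_10_22_wide(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 22);
}
```

Dividing 3.0 by 2.0 directly and multiplying 3.0 by the precomputed reciprocal of 2.0 both yield 1.5. For divisors whose reciprocal is not exactly representable, the two results can differ in the last bits.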
Operations
Square Root
For square roots of fixed point numbers, optimal performance is achieved via
_mm_rsqrt_ps (via float). If precision is of little concern, use a lookup table, optionally
combined with interpolation and / or a Newton-Raphson iteration.
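For completeness, an integer-only alternative is sketched here. It is neither the `_mm_rsqrt_ps` route nor a lookup table: it swaps in a plain Newton iteration on 64-bit integers, which is slower but needs no floating point at all. The function names are hypothetical and the input is assumed to be an unsigned 16:16 value.

```c
#include <stdint.h>

/* floor(sqrt(n)) by Newton iteration on integers. */
static uint64_t isqrt64(uint64_t n)
{
    if (n < 2) return n;
    uint64_t x = n;
    uint64_t y = (x + 1) / 2;
    while (y < x) {             /* the estimate decreases until it converges */
        x = y;
        y = (x + n / x) / 2;
    }
    return x;
}

/* sqrt of a 16:16 value: sqrt(v * 2^16) * 2^8 = sqrt(v * 2^32), so
   pre-scaling the raw value by 2^16 makes the root come out as 16:16. */
static uint32_t fp_sqrt_16_16(uint32_t fp_x)
{
    return (uint32_t)isqrt64((uint64_t)fp_x << 16);
}
```

The square root of 4.0 comes out as exactly 2.0, and the square root of 2.0 as 92681, which is 1.41421 in 16:16.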
Sine / Cosine / Log / Pow / etc.
Almost always a LUT is the best option.

Operations
Fixed Point & SIMD
For a world of hurt, combine SIMD and fixed point:
_mm_mul_epu32
_mm_mullo_epi16
_mm_mulhi_epu16
_mm_srl_epi32
_mm_srai_epi32
See MSDN for more details.

Accuracy
Error
In base 10, error is clear:
PI = 3.14 means: 3.145 > 𝑃𝐼 > 3.135
The maximum error is thus 0.005.
In base 2, we apply the same principle:
16:16 fixed point numbers have a maximum error of $\frac{1}{2^{17}} \approx 7.6 \cdot 10^{-6}$.
▪ We get slightly more than 5 digits of decimal precision.
A 32-bit floating point number represents ~7 digits of decimal precision.
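The error bound can be checked empirically. The sketch below uses hypothetical names and samples values in [0, 1]. Note that the bound of $\frac{1}{2^{17}}$ corresponds to round-to-nearest conversion; a plain cast truncates and can be off by almost $\frac{1}{2^{16}}$.

```c
#include <stdint.h>

/* Plain cast: truncates toward zero. */
static int32_t to_fixed_trunc(double v) { return (int32_t)(v * 65536.0); }

/* Round to nearest; assumes v >= 0. */
static int32_t to_fixed_round(double v) { return (int32_t)(v * 65536.0 + 0.5); }

/* Largest absolute conversion error over samples+1 evenly spaced values in [0, 1]. */
static double max_error(int32_t (*conv)(double), int samples)
{
    double worst = 0.0;
    for (int i = 0; i <= samples; i++) {
        double v = (double)i / samples;
        double err = v - (double)conv(v) / 65536.0;
        if (err < 0.0) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Comparing `max_error(to_fixed_round, 1000)` with `max_error(to_fixed_trunc, 1000)` shows the difference between the two conversion styles.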

Accuracy
Error
During some operations, precision may suffer greatly:
x = y / z
fp_x = (fp_y << 8) / (fp_z >> 8)
Assuming 16:16 input, fp_z briefly becomes 16:8, with a precision of only 2 decimal digits.
Similarly:
fp_x = (fp_y >> 8) * (fp_z >> 8)
Here, both fp_y and fp_z become 16:8, and the cumulative error will exceed $\frac{1}{2^{9}}$.
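The loss in the division case can be made visible by comparing against a 64-bit computation. This is a sketch with hypothetical function names; it assumes non-negative 16:16 inputs that are small enough for the pre-scaled dividend to fit in 32 bits.

```c
#include <stdint.h>

/* 32-bit only: the divisor is cut down to 8 fractional bits. */
static int32_t div_lossy_16_16(int32_t y, int32_t z)
{
    return (y * 256) / (z >> 8);
}

/* 64-bit intermediate: the divisor keeps all 16 fractional bits. */
static int32_t div_wide_16_16(int32_t y, int32_t z)
{
    return (int32_t)(((int64_t)y * 65536) / z);
}
```

Dividing 1.0 (65536) by 1 + 255/65536 (65791) gives 65536 with the 32-bit version, because the divisor is truncated to exactly 1.0, and 65281 with the 64-bit version: an error of 255 units, or about 0.0039.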
Accuracy
Error
Careful balancing of range and precision in fixed point calculations can reduce this problem.
Note that accuracy problems also occur in float calculations; they are just exposed more
clearly in fixed point. And: this time we can do something about it.
Accuracy
Error - Example
Accuracy
Improving the function.zip example
The following slides contain a step-by-step improvement of the fixed point evaluation of the function $f(x) = \sin^3(4x) - \cos^2(4x) + \frac{1}{x}$, which failed during the real-time session in class.
Starting point is the working, but inaccurate version available from the website.
Initial accuracy, expressed as summed error relative to the ‘double’ evaluation, is 246.84.
For comparison, the summed error of the ‘float’ evaluation is just 0.013.
Accuracy
Improving the function.zip example
int EvaluateFixed( double x )
{
    int fp_pi = (int)(PI * 65536.0); // 16:16
    int fp_x = (int)(x * 65536.0); // 16:16
    if ((fp_x >> 8) == 0) return 0; // safety net for division
    int fp_4x = fp_x * 4; // 16:16 * 3:0 = 19:16
    int a = (fp_4x << 8) / ((2 * fp_pi) >> 8); // map radians to 0..4095
    int fp_sin4x = sintab[(a >> 4) & 4095]; // 16:16
    int fp_sin4x3 = (((fp_sin4x >> 8) * (fp_sin4x >> 8)) >> 8) * (fp_sin4x >> 8); // 16:16
    int fp_cos4x = costab[(a >> 4) & 4095]; // 16:16
    int fp_cos4x2 = (fp_cos4x >> 8) * (fp_cos4x >> 8); // 16:16
    int fp_recix = (65536 << 8) / (fp_x >> 8); // 16:16
    return fp_sin4x3 - fp_cos4x2 + fp_recix; // 16:16
}
In the original code, almost everything is 16:16. This allows for a range of 0..32767 (+/-), which is a waste for most values here.
Accuracy
Improving the function.zip example
Notice how many values do not use the full integer range: e.g., PI is 3 and needs two bits; x is -9..+9 and needs four bits; sin/cos is -1..1 and needs only one bit for range.
int EvaluateFixed( double x )
{
    int fp_pi = (int)(PI * 65536.0); // 2:16
    int fp_x = (int)(x * 65536.0); // 4:16
    if ((fp_x >> 8) == 0) return 0; // safety net for division
    int fp_4x = fp_x * 4; // 16:16 * 3:0 = 19:16
    int a = (fp_4x << 8) / ((2 * fp_pi) >> 8); // map radians to 0..4095
    int fp_sin4x = sintab[(a >> 4) & 4095]; // 1:16
    int fp_sin4x3 = (((fp_sin4x >> 8) * (fp_sin4x >> 8)) >> 8) * (fp_sin4x >> 8); // 1:16
    int fp_cos4x = costab[(a >> 4) & 4095]; // 1:16
    int fp_cos4x2 = (fp_cos4x >> 8) * (fp_cos4x >> 8); // 1:16
    int fp_recix = (65536 << 8) / (fp_x >> 8); // 16:16
    return fp_sin4x3 - fp_cos4x2 + fp_recix; // 16:16
}
Accuracy
Improving the function.zip example
Here, x is adjusted to use maximum precision: 4:27. 4x is then just a reinterpretation of this number, 6:25.
The calculation of sin4x3 is interesting: since sin(x) is -1..1, sin(x)^3 is also -1..1. We drop a minimal amount of bits and keep precision.
Error is now down to 14.94.
int EvaluateFixed( double x )
{
    int fp_pi = (int)(PI * 65536.0); // 2:16
    int fp_x = (int)(x * (double)(1 << 27)); // 4:27
    if ((fp_x >> 13) == 0) return 0; // safety net for division
    int fp_4x = fp_x; // 6:25
    int a = fp_4x / ((2 * fp_pi) >> 3); // 6:25 / 3:13 = 4:12
    int fp_sin4x = sintab[a & 4095]; // 1:16
    int fp_sin4x3 = (((fp_sin4x >> 1) * (fp_sin4x >> 1)) >> 15) * (fp_sin4x >> 1); // 0:15 * 0:15 = 0:30; 0:15 * 0:15 = 0:30
    int fp_cos4x = costab[a & 4095]; // 1:16
    int fp_cos4x2 = (fp_cos4x >> 1) * (fp_cos4x >> 1); // 0:15 * 0:15 = 0:30
    int fp_recix = (1 << 30) / (fp_x >> 13); // 1:30 / 5:14 = 0:16, stored as 16:16
    return ((fp_sin4x3 - fp_cos4x2) >> 14) + fp_recix; // 16:16
}
Accuracy
Improving the function.zip example
int EvaluateFixed( double x )
{
    int fp_pi = (int)(PI * 65536.0);
    int fp_x = (int)(x * (double)(1 << 27));
    int fp_4x = fp_x;
    int a = fp_4x / ((2 * fp_pi) >> 3);
    int fp_sin4x = sintab[a & 4095];
    int fp_sin4x3 = (((fp_sin4x >> 1) * (fp_sin4x >> 1)) >> 15) * (fp_sin4x >> 1);
    int fp_cos4x = costab[a & 4095];
    int fp_cos4x2 = (fp_cos4x >> 1) * (fp_cos4x >> 1);
    int fp_recix = (1 << 30) / (fp_x >> 13);
    return ((fp_sin4x3 - fp_cos4x2) >> 14) + fp_recix;
}
Where do we go from here?
▪ The sin/cos tables still contain 1:16 data. However, the way their data is used means that increasing precision here doesn’t help.
▪ We could calculate fp_sin4x3 and fp_cos4x2 via 64-bit intermediate variables. I tried it; impact is minimal…
▪ We can return a value more precise than the 16:16 we return currently. Problem is around x = 0, where the function returns large values and needs the range.
▪ Perhaps 4096 entries in the sin/cos tables is not enough?
To be continued. :)
Accuracy
Error – Take-away
▪ Fixed point code should carefully balance range and precision.
▪ Do not default to 16:16!
▪ In multiplications / divisions, carefully conserve precision.
▪ Use of 64-bit intermediate results is expensive in 32-bit mode. In 64-bit mode, the only disadvantage of 64-bit numbers is increased storage requirements.
Take-away
Fixed point:
▪ Only practical option on some devices
▪ On your CPU: good for preventing excessive type conversions
▪ On your CPU: good for mixing float and integer code
▪ In some scenarios, fixed point offers more accuracy than float
▪ Range / precision issues become clear.
/INFOMOV/
END of “Fixed Point Math”

next lecture: “GPGPU (1)”
