Arithmetic Circuits: Didn't I Learn How To Do Addition in The Second Grade? MIT Courses Aren't What They Used To Be..
Acknowledgements:
• R. Katz, “Contemporary Logic Design”, Addison Wesley Publishing Company, Reading, MA, 1993. (Chapter 5)
• J. Rabaey, A. Chandrakasan, B. Nikolic, “Digital Integrated Circuits: A Design Perspective” Prentice Hall, 2003.
• Kevin Atkinson, Alice Wang, Rex Min
N bits, with bit weights (most significant first):
$-2^{N-1},\ 2^{N-2},\ \ldots,\ 2^3,\ 2^2,\ 2^1,\ 2^0$
[Katz93, chapter 5]
6.111 Fall 2004 Lectures 9/10, Slide 4
Binary Addition
So let’s build an arithmetic unit that does both addition and subtraction.
Operation selected by control input:
N (negative): result is < 0, so N = S_{N-1}
C (carry): indicates that the add in the most significant position produced a carry, e.g., "1 + (-1)"; taken from the last FA
V (overflow): indicates that the answer has too many bits to be represented correctly by the result width, e.g., "(2^{N-1} - 1) + (2^{N-1} - 1)"

$V = A_{N-1} B_{N-1} \overline{S_{N-1}} + \overline{A_{N-1}}\,\overline{B_{N-1}}\, S_{N-1}$
$V = COUT_{N-1} \oplus CIN_{N-1}$

Signed comparison:
LT   N⊕V
LE   Z+(N⊕V)
EQ   Z
NE   ~Z
GE   ~(N⊕V)
GT   ~(Z+(N⊕V))

Unsigned comparison:
LTU  C
LEU  C+Z
GEU  ~C
GTU  ~(C+Z)
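The signed comparison rows can be checked with a small model. This is a minimal sketch, assuming the flags come from computing A - B as A + ~B + 1 on n-bit words; the function names and the default width are illustrative and not from the lecture. Only the signed rows are exercised here.

```python
def flags_after_subtract(a: int, b: int, n: int = 8) -> dict:
    """Compute A - B as A + ~B + 1 on n-bit words and return the Z, N, V flags."""
    mask = (1 << n) - 1
    a &= mask
    nb = ~b & mask                      # second adder operand is the complement of B
    s = (a + nb + 1) & mask             # n-bit sum
    a_msb, nb_msb, s_msb = a >> (n - 1), nb >> (n - 1), s >> (n - 1)
    z = int(s == 0)                     # Z: result is zero
    neg = s_msb                         # N = S_{N-1}
    # V: both adder operands share a sign and the sum has the other sign
    v = (a_msb & nb_msb & (1 - s_msb)) | ((1 - a_msb) & (1 - nb_msb) & s_msb)
    return {"Z": z, "N": neg, "V": v}


def signed_lt(a: int, b: int, n: int = 8) -> bool:
    f = flags_after_subtract(a, b, n)
    return bool(f["N"] ^ f["V"])        # LT = N xor V


def signed_le(a: int, b: int, n: int = 8) -> bool:
    f = flags_after_subtract(a, b, n)
    return bool(f["Z"] | (f["N"] ^ f["V"]))   # LE = Z + (N xor V)
```

For example, `signed_lt(-8, 7, n=4)` relies on V to correct the sign bit, because -8 - 7 does not fit in 4 bits.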
Worst-case path: the carry ripples from CI to CO through the full adders, then from C_{N-1} to S_{N-1} in the last one.
Θ(N) is read “order N” and tells us that the latency of our adder
grows in proportion to the number of bits in the operands.
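A bit-level model makes the Θ(N) behaviour visible: each full adder needs the carry from the one before it. This is a minimal sketch; the names `full_adder` and `ripple_carry_add` are illustrative, and the loop models the logic only, not gate delays.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(a: int, b: int, n: int, cin: int = 0) -> tuple[int, int]:
    """Add two n-bit values; the carry passes through all n full adders in turn."""
    carry = cin
    total = 0
    for i in range(n):                  # n sequential carry steps: Θ(N)
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry                 # (n-bit sum, carry out of the last FA)
```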
Faster carry logic
Let’s see if we can improve the speed by rewriting the equations for COUT:
$C_{OUT} = AB + (A + B)\,C_{IN} = G + P\,C_{IN}$, where G = AB is the generate term and P = A + B is the propagate term.

For adding two N-bit numbers:
$C_N = G_{N-1} + P_{N-1} C_{N-1}$
$\phantom{C_N} = G_{N-1} + P_{N-1} G_{N-2} + P_{N-1} P_{N-2} C_{N-2}$
$\phantom{C_N} = G_{N-1} + P_{N-1} G_{N-2} + P_{N-1} P_{N-2} G_{N-3} + \ldots + P_{N-1} \cdots P_0 C_{IN}$

Actually, P is usually defined as P = A⊕B, which won’t change C_OUT but will allow us to express S as a simple function of P and C_IN: S = P⊕C_IN
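The expanded carry equation can be evaluated directly for every bit position. This is a minimal sketch, assuming P = A⊕B and S = P⊕C_IN as in the note; the function name is illustrative. It computes each carry from the G and P terms and the incoming carry only, never from the previous carry.

```python
def lookahead_add(a: int, b: int, n: int, cin: int = 0) -> tuple[int, int]:
    """n-bit add where every carry is the fully expanded sum of products."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # generate:  G_i = A_i B_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # propagate: P_i = A_i xor B_i
    carries = [cin]
    for k in range(1, n + 1):
        # C_k = G_{k-1} + P_{k-1} G_{k-2} + ... + P_{k-1} ... P_0 C_IN
        c, prod = 0, 1
        for j in range(k - 1, -1, -1):
            c |= prod & g[j]
            prod &= p[j]
        c |= prod & cin
        carries.append(c)
    s = sum((p[i] ^ carries[i]) << i for i in range(n))  # S_i = P_i xor C_i
    return s, carries[n]
```

In hardware each product term is evaluated in parallel, so the number of gate levels stays small while the gate fan-in grows with N; the nested loop here only mirrors the algebra.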
[Figure: four full adders (FA) in a ripple chain with carries Ci,0, Co,0, Co,1, Co,2, Co,3. A second version adds per-bit P,G signals (P0 G0 … P3 G3) and the block signal BP = P0 P1 P2 P3.]
What is the worst case propagation delay for the 16-bit adder?
[Figure: 16-bit adder with final carry Co,15, annotated "BP2 = 0 or BP2 = 1?"]
[Figure: tree of lookahead blocks: P/G generation, then the 1st level of lookahead and further levels, log2(N) levels in total.]
tPD = Θ(log(N))
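One way to see where the log2(N) levels come from is a parallel-prefix computation over (G, P) pairs. This is a sketch of a Kogge-Stone style prefix network, chosen for brevity; it is an assumption and not necessarily the tree structure drawn on the slide. The function name and the returned level count are illustrative.

```python
def prefix_add(a: int, b: int, n: int, cin: int = 0) -> tuple[int, int, int]:
    """n-bit add using log2(n) levels of (G, P) combining. Returns (sum, cout, levels)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & cin)          # fold the carry in into bit 0
    d, levels = 1, 0
    while d < n:
        new_g, new_p = G[:], P[:]
        for i in range(d, n):           # all positions combine in parallel in hardware
            new_g[i] = G[i] | (P[i] & G[i - d])
            new_p[i] = P[i] & P[i - d]
        G, P = new_g, new_p
        d *= 2
        levels += 1
    carries = [cin] + G[:-1]            # G[i] is now the carry out of bit i
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, G[n - 1], levels
```

For a 16-bit add the loop runs 4 times, matching log2(16).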
     A3 A2 A1 A0
  x  B3 B2 B1 B0

[Figure: sequential multiplier datapath with registers P, B and A and an adder, labeled with widths M bits, N and N+1.]

Repeat M times {
  P ← P + (B_LSB==1 ? A : 0)
  shift P/B right one bit
}
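The loop above can be modelled in a few lines. This is a minimal sketch for unsigned operands, assuming P starts at 0 and that P and B are shifted right together as one combined register, so the low bit of P moves into the top of B; the function name is illustrative.

```python
def shift_add_multiply(a: int, b: int, m: int) -> int:
    """Multiply unsigned a by the m-bit unsigned multiplier b, one bit per step."""
    mask_m = (1 << m) - 1
    p = 0                                # assumed initial value of P
    b &= mask_m
    for _ in range(m):                   # Repeat M times
        if b & 1:                        # P <- P + (B_LSB == 1 ? A : 0)
            p += a
        combined = ((p << m) | b) >> 1   # shift P/B right one bit
        p, b = combined >> m, combined & mask_m
    return (p << m) | b                  # product ends up spread across P and B
```

After M steps the multiplier bits have all been shifted out of B, and B holds the low M bits of the product.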
[Figure: 4x4 combinational array multiplier. Each row combines x3 x2 x1 x0 with one of y0, y1, y2, y3; rows of HA and FA cells sum the partial products to give z7 … z0.]
• Partial product computation is simple (single AND gate)
2’s Complement Multiplication
Step 1: two’s complement operands so high order bit is $-2^{N-1}$. Must sign extend partial products and subtract the last one.
Step 2: don’t want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1): –B = ~B + 1.
Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!
Step 4: finish computing the constants…

[Figure: 4x4 partial-product arrays for the steps. The sign-extended rows are X3Y0 X3Y0 X3Y0 X3Y0 X3Y0 X2Y0 X1Y0 X0Y0, then X3Y1 X3Y1 X3Y1 X3Y1 X2Y1 X1Y1 X0Y1, X3Y2 X3Y2 X3Y2 X2Y2 X1Y2 X0Y2 and X3Y3 X3Y3 X2Y3 X1Y3 X0Y3, interleaved with added constant 1s; the final array has four rows X3Yj X2Yj X1Yj X0Yj plus constant 1s.]

Result: multiplying 2’s complement operands takes just about the same amount of hardware as multiplying unsigned operands!
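Step 1 can be checked numerically before any of the constant tricks are applied. This is a minimal sketch covering Step 1 only: it sign-extends each partial product to 2N bits and subtracts the last one using –B = ~B + 1. It does not model the constant folding of Steps 2 to 4, and the function name is illustrative.

```python
def signed_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit two's complement bit patterns; returns the 2n-bit product pattern."""
    width = 2 * n
    mask = (1 << width) - 1
    # numeric value of x, so that shifting it yields a sign-extended row
    xs = x - (1 << n) if (x >> (n - 1)) & 1 else x
    acc = 0
    for j in range(n):
        if (y >> j) & 1:
            pp = (xs << j) & mask            # sign-extended partial product, 2n bits
            if j == n - 1:
                acc += (~pp + 1) & mask      # last row has weight -2^(N-1): subtract it
            else:
                acc += pp
    return acc & mask
```

For example, `signed_multiply(0b1000, 0b1000, 4)` models (-8) x (-8) and returns 64.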
2’s Complement Multiplication
[Figure: the 4x4 array multiplier adapted for 2’s complement operands: the same x3 … x0 and y0 … y3 inputs and HA/FA rows producing z7 … z0, with constant 1 inputs added.]
[Figure: chain of CSA stages followed by a CPA that produces the product.]
[Figure: CSA stages split into even and odd streams, merged by a final CPA.]
Even and odd streams pass through half the adders so even/odd design runs at almost twice the speed of simple implementation.
Wallace Tree
[Figure: tree of CSAs followed by a CPA, labeled $O(\log_{1.5} M)$.]
[Figure: CSA stages, labeled M/2.]
$B_{K+1,K} \cdot A$
  = 0*A → 0
  = 1*A → A
  = 2*A → 4A – 2A
  = 3*A → 4A – A
Booth’s insight: rewrite 2*A and 3*A cases, leave 4A for next partial product to do!
Booth recoding
[Table: recoding as a function of the current bit pair and the bit from the previous bit pair.]
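The recoding can be written as a small function of the current bit pair and the top bit of the previous pair. This is a sketch of the standard radix-4 Booth digit rule (digit = low bit + previous top bit - 2 x high bit), assuming a two's complement multiplier with an even number of bits; the function names are illustrative and the slide's own table may be laid out differently.

```python
def booth_radix4_digits(b: int, m: int) -> list[int]:
    """Recode an m-bit two's complement multiplier (m even) into digits in {-2..2}, LSB first."""
    digits = []
    prev_top = 0                         # the "4A left for the next partial product"
    for k in range(0, m, 2):
        lo = (b >> k) & 1
        hi = (b >> (k + 1)) & 1
        digits.append(lo + prev_top - 2 * hi)
        prev_top = hi                    # pairs 2 and 3 pass 4A on to the next pair
    return digits


def booth_multiply(a: int, b: int, m: int) -> int:
    """Sum the M/2 partial products digit * A * 4^i."""
    return sum(d * a * 4 ** i for i, d in enumerate(booth_radix4_digits(b, m)))
```

For the multiplier 0110 (6) the digits are [-2, 2], that is 6 = -2 + 2*4, so only two partial products are needed for a 4-bit multiplier.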
$Y = (1001)_2 = 2^3 + 2^0$
[Figure: X and X << 3 combine to give Z; shifts are done using wiring.]
Transform: Canonical Signed Digits (CSD)
Canonical signed digit representation is used to increase the number of
zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.
0 1 1 0 1 1 1 1 (01101111) = 0 1 1 1 0 0 0 -1 = 1 0 0 -1 0 0 0 -1 (nonzero digits at the positions of 10010001)
[Figure: Z formed from X, X << 7 and X << 4.]
Shift translates to re-wiring
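The digit rewriting can be automated. This is a minimal sketch using the standard non-adjacent-form recurrence to produce CSD digits for a non-negative integer; the function names are illustrative, and `csd_multiply` stands in for the shift-and-add/subtract wiring.

```python
def to_csd(value: int) -> list[int]:
    """Return the CSD digits of a non-negative integer, most significant first."""
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)          # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits[::-1] or [0]


def csd_multiply(x: int, constant: int) -> int:
    """Multiply x by a constant using one shifted add or subtract per nonzero CSD digit."""
    digits = to_csd(constant)[::-1]      # least significant first
    return sum(d * (x << i) for i, d in enumerate(digits) if d)
```

`to_csd(0b01101111)` returns `[1, 0, 0, -1, 0, 0, 0, -1]`, so multiplying by 01101111 needs three shifted terms (X << 7, X << 4 and X) instead of six.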
Algebraic Transformations
Commutativity: A + B = B + A
Distributivity: (A + B) C = AC + BC
Associativity: (A + B) + C = A + (B + C)
[Figure: each identity drawn as a pair of equivalent dataflow graphs (⇔).]
Transforms for Efficient Resource Utilization
[Figure: dataflow graph on inputs A, C, B, D, E, F, G, H, I, rewritten using distributivity.]
Reduce number of operators to 2 multipliers and 2 adders
Cutset retiming: a cutset is a set of edges whose removal splits the graph into
two disjoint partitions. To retime, delays are moved from the edges entering a
partition to the edges leaving it, or vice versa.
Benefits of retiming:
• Modify critical path delay
• Reduce total number of registers
Retiming Example: FIR Filter
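[Figure: FIR filter taking x(n) through three delay elements D to produce y(n), with a legend for the multiplication symbol, shown before and after retiming.]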
Note: here we use a first cut analysis that assumes the delay of a chain of
operators is the sum of their individual delays. This is not accurate.
Pipelining = Adding Registers + Retiming
[Figure: datapath with operator delays 15, 5 and 5 and a register D.]
T_CLK = 25 (w/ ideal regs)
Latency = 1 clock cycle
Throughput = 1/clock cycle

Unlike retiming, pipelining adds extra registers to the system.

How to pipeline:
1. Add extra registers at all inputs (or, equivalently, all outputs)
2. Retime

[Figure: the same datapath after adding more input registers, then after retiming them into the datapath.]
T_CLK = 15 (w/ ideal regs)
Latency = 3 clock cycles
Throughput = 1/clock cycle
The Power of Transforms: Lookahead
y(n) = x(n) + A y(n-1)
[Figure: the recursive filter from x(n) to y(n) is transformed by loop unrolling, then associativity (with A² precomputed), then retiming; the delay D in the loop becomes 2D.]
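Substituting the recurrence into itself once gives y(n) = x(n) + A x(n-1) + A² y(n-2), which is the form the loop-unrolling and associativity steps arrive at. The sketch below checks that both forms produce the same output; it assumes zero initial conditions (x(-1) = y(-1) = y(-2) = 0), and the function names are illustrative.

```python
def direct(x: list[int], a: int) -> list[int]:
    """y(n) = x(n) + A*y(n-1): one multiply-add inside a loop with a single delay."""
    y, prev = [], 0
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y


def lookahead(x: list[int], a: int) -> list[int]:
    """y(n) = x(n) + A*x(n-1) + A^2*y(n-2): the feedback loop now spans two delays."""
    a2 = a * a                           # A^2 is precomputed once
    y = []
    for n, xn in enumerate(x):
        x1 = x[n - 1] if n >= 1 else 0   # x(n-1), zero before the first sample
        y2 = y[n - 2] if n >= 2 else 0   # y(n-2), zero before the first sample
        y.append(xn + a * x1 + a2 * y2)
    return y
```

With two delays in the loop, the feedback path can be retimed and pipelined, which is what the 2D in the figure reflects.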