Arithmetic Circuits: Didn't I Learn How To Do Addition in The Second Grade? MIT Courses Aren't What They Used To Be..

Arithmetic Circuits
“Didn’t I learn how to do addition in the second grade? MIT courses aren’t what they used to be...”

  01011
+ 00101
-------
  10000
Acknowledgements:
• R. Katz, “Contemporary Logic Design”, Addison Wesley Publishing Company, Reading, MA, 1993. (Chapter 5)
• J. Rabaey, A. Chandrakasan, B. Nikolic, “Digital Integrated Circuits: A Design Perspective” Prentice Hall, 2003.
• Kevin Atkinson, Alice Wang, Rex Min
6.111 Fall 2004 Lectures 9/10, Slide 1

Number Systems Basics
How to represent negative numbers?
• Three common schemes: sign-magnitude, ones complement, twos complement
• Sign-magnitude: MSB = 0 for positive, 1 for negative
– Range: $-(2^{N-1} - 1)$ to $+(2^{N-1} - 1)$
– Two representations for zero: 0000… & 1000…
– Simple multiplication but complicated addition/subtraction
• Ones complement: if N is positive then its negative is $\overline{N}$ (the bitwise complement of N)
– Example: 0111 = 7, 1000 = -7
– Range: $-(2^{N-1} - 1)$ to $+(2^{N-1} - 1)$
– Two representations for zero: 0000… & 1111…
– Subtraction implemented as addition followed by ones complement

2’s Complement
An N-bit word has the following bit weights, from MSB to LSB:
$-2^{N-1},\; 2^{N-2},\; \dots,\; 2^3,\; 2^2,\; 2^1,\; 2^0$
The MSB is the “sign bit”; the implicit “decimal” point sits to the right of the $2^0$ bit.
Range: $-2^{N-1}$ to $2^{N-1} - 1$

8-bit 2’s complement example:
$11010110 = -2^7 + 2^6 + 2^4 + 2^2 + 2^1 = -128 + 64 + 16 + 4 + 2 = -42$

If we use a two’s-complement representation for signed integers, the same binary addition procedure will work for adding both signed and unsigned numbers.

By moving the implicit location of the “decimal” point, we can represent fractions too:
$1101.0110 = -2^3 + 2^2 + 2^0 + 2^{-2} + 2^{-3} = -8 + 4 + 1 + 0.25 + 0.125 = -2.625$
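The two worked examples above can be checked with a short Python sketch. The function name `twos_complement_value` and the use of a `.` in the bit string to mark the binary point are illustrative choices, not part of the lecture.

```python
def twos_complement_value(bits):
    """Interpret a bit string as two's complement.

    An optional '.' marks the implicit binary ("decimal") point.
    """
    int_part, _, frac_part = bits.partition(".")
    digits = int_part + frac_part
    raw = int(digits, 2)                 # value if all weights were positive
    if digits[0] == "1":
        raw -= 1 << len(digits)          # the MSB really weighs -2^(N-1)
    # Moving the point left by len(frac_part) places divides by 2^len(frac_part).
    return raw / (1 << len(frac_part))


print(twos_complement_value("11010110"))    # -42.0
print(twos_complement_value("1101.0110"))   # -2.625
```

Both examples share the same bit pattern; only the position of the point changes, which scales the value by $2^{-4}$.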
Twos Complement Representation
Twos complement = bitwise complement + 1
0111 → 1000 + 1 = 1001 = -7
1001 → 0110 + 1 = 0111 = 7
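Asymmetric range: $-2^{N-1}$ to $+2^{N-1} - 1$
Only one representation for zero
Simple addition and subtraction
Most common representation

     4    0100       -4    1100        4    0100       -4    1100
   + 3    0011    + (-3)   1101      - 3    1101      + 3    0011
   ---   -----    ------  -----      ---   -----      ---   -----
     7    0111       -7   11001        1   10001       -1    1111

(The leading 1 in 11001 and 10001 is the carry out of the 4-bit sum; the 4-bit results are 1001 = -7 and 0001 = 1.)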
[Katz93, chapter 5]
Binary Addition
Here’s an example of binary addition as one might do it by “hand”:
Carries from previous column:   1101
                                 1101
                               + 0101
                               ------
                                10010
Adding two N-bit numbers produces an (N+1)-bit result.
We’ve already built the circuit that implements one column: the full adder (FA).
So we can quickly build a circuit to add two 4-bit numbers…
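As a rough software model of chaining one-column circuits, the following Python sketch ripples a carry through N full adders. The names `full_adder` and `ripple_carry_add` are illustrative, and the model ignores timing entirely.

```python
def full_adder(a, b, cin):
    """One column of the addition: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(a, b, n=4):
    """Add two n-bit unsigned integers column by column, LSB first.

    Returns the (n+1)-bit result and the carry out of each column.
    """
    carry = 0
    result = 0
    carries = []
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
        carries.append(carry)
    result |= carry << n          # the final carry becomes bit N of the result
    return result, carries


total, carries = ripple_carry_add(0b1101, 0b0101)
print(format(total, "05b"), carries)   # 10010 [1, 0, 1, 1]
```

The carry list is printed LSB first, so it reads 1 1 0 1 when written MSB first, matching the hand-worked example.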

Subtraction: A-B = A + (-B)
Using 2’s complement representation: –B = ~B + 1

~ = bit-wise complement
So let’s build an arithmetic unit that does both addition and subtraction.
Operation selected by control input:
But what about the “+1”?
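One common way to supply the “+1” is to reuse the control input as the carry-in of the least significant column; this is offered as an assumption, since the slide text leaves the question open. The sketch below models that arrangement in Python with a hypothetical `add_sub` function.

```python
def add_sub(a, b, sub, n=4):
    """n-bit adder/subtractor: computes A + B when sub == 0, A - B when sub == 1.

    Returns (n-bit result, carry out of the most significant column).
    """
    mask = (1 << n) - 1
    # XOR every bit of B with the control input: B when sub == 0, ~B when sub == 1.
    b_eff = (b ^ (mask if sub else 0)) & mask
    # Assumption: the control input also drives the LSB carry-in, giving the "+1".
    total = (a & mask) + b_eff + sub
    return total & mask, total >> n


print(add_sub(0b0100, 0b0011, 0))   # (7, 0)   4 + 3
print(add_sub(0b0100, 0b0011, 1))   # (1, 1)   4 - 3 = 0100 + 1100 + 1
```

With this wiring the same adder serves both operations, at the cost of one XOR gate per bit of B.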

Condition Codes
Besides the sum, one often wants four other bits of information from an arithmetic unit:

Z (zero): result is = 0 (big NOR gate)
N (negative): result is < 0 ($S_{N-1}$)
C (carry): indicates that add in the most significant position produced a carry, e.g., “1 + (-1)” (from last FA)
V (overflow): indicates that the answer has too many bits to be represented correctly by the result width, e.g., “$(2^{N-1} - 1) + (2^{N-1} - 1)$”

$V = A_{N-1} B_{N-1} \overline{S_{N-1}} + \overline{A_{N-1}}\,\overline{B_{N-1}}\, S_{N-1}$
$V = COUT_{N-1} \oplus CIN_{N-1}$

To compare A and B, perform A–B and use condition codes:

Signed comparison:
LT    N⊕V
LE    Z+(N⊕V)
EQ    Z
NE    ~Z
GE    ~(N⊕V)
GT    ~(Z+(N⊕V))

Unsigned comparison:
LTU   C
LEU   C+Z
GEU   ~C
GTU   ~(C+Z)
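The comparison table can be checked exhaustively for small widths. The Python sketch below computes A–B as A + ~B + 1 and derives Z, N, C and V. One assumption needs stating: the raw carry-out of that addition is 1 exactly when A ≥ B (unsigned), so for LTU = C to hold the sketch defines C as the complement of the raw carry-out, i.e. as a borrow flag. The function name `condition_codes` is illustrative.

```python
def condition_codes(a, b, n=4):
    """Compute A - B as A + ~B + 1 on n bits and return (Z, N, C, V)."""
    mask = (1 << n) - 1
    a &= mask
    nb = ~b & mask                      # second adder operand is ~B
    total = a + nb + 1
    s = total & mask
    z = int(s == 0)                     # big NOR of the result bits
    neg = s >> (n - 1)                  # sign bit S_{N-1}
    # Assumption: C is the borrow, the complement of the adder's carry-out.
    c = (total >> n) ^ 1
    a_msb, nb_msb = a >> (n - 1), nb >> (n - 1)
    # Overflow: both adder operands share a sign that the result does not.
    v = (a_msb & nb_msb & (neg ^ 1)) | ((a_msb ^ 1) & (nb_msb ^ 1) & neg)
    return z, neg, c, v


def signed(x, n=4):
    """Reinterpret an n-bit pattern as a two's complement value."""
    return x - (1 << n) if x >> (n - 1) else x


z, neg, c, v = condition_codes(0b0000, 0b1000)   # 0 - (-8) overflows 4 bits
print(z, neg, c, v)                              # 0 1 1 1
```

In that example N⊕V = 0, so LT is false, which agrees with 0 < -8 being false even though the raw result looks negative.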

tPD of Ripple-carry Adder
Worst-case path: carry propagation from LSB to MSB, e.g., when adding 11…111 to 00…001.
$t_{PD} = (N-1)\,(t_{PD,OR} + t_{PD,AND}) + t_{PD,XOR} = \Theta(N)$
The first term is the CI-to-CO delay accumulated along the carry chain; the second is the delay from $CI_{N-1}$ to $S_{N-1}$.
Θ(N) is read “order N” and tells us that the latency of our adder grows in proportion to the number of bits in the operands.
Faster carry logic
Let’s see if we can improve the speed by rewriting the equations for $C_{OUT}$:
$C_{OUT} = AB + A\,C_{IN} + B\,C_{IN}$
$= AB + (A + B)\,C_{IN}$
$= G + P\,C_{IN}$, where $G = AB$ (generate) and $P = A + B$ (propagate)

Actually, P is usually defined as $P = A \oplus B$, which won’t change $C_{OUT}$ but will allow us to express S as a simple function of P and $C_{IN}$: $S = P \oplus C_{IN}$.

For adding two N-bit numbers:
$C_N = G_{N-1} + P_{N-1} C_{N-1}$
$= G_{N-1} + P_{N-1} G_{N-2} + P_{N-1} P_{N-2} C_{N-2}$
$= G_{N-1} + P_{N-1} G_{N-2} + P_{N-1} P_{N-2} G_{N-3} + \dots + P_{N-1} \cdots P_0\, C_{IN}$

$C_N$ in only 3 (!) gate delays: 1 for P/G generation, 1 for ANDs, 1 for final OR.
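To make the expanded carry equation concrete, here is a small Python sketch that evaluates every carry directly from the P and G signals and $C_{IN}$, with no carry feeding into the next carry computation. It uses the $P = A \oplus B$ definition so the sum bits come out as $S = P \oplus C$; the function name `lookahead_add` is illustrative, and the unbounded fan-in of `all`/`any` stands in for the wide AND and OR gates.

```python
def lookahead_add(a, b, cin=0, n=4):
    """Add two n-bit numbers using the fully expanded carry equations."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # propagate (XOR form)
    carries = [cin]
    for k in range(1, n + 1):
        # AND level: G_j P_{j+1} ... P_{k-1} for each j, plus P_{k-1} ... P_0 C_IN
        terms = [g[j] & int(all(p[j + 1:k])) for j in range(k)]
        terms.append(cin & int(all(p[:k])))
        carries.append(int(any(terms)))                  # final OR
    s = sum((p[i] ^ carries[i]) << i for i in range(n))  # S_i = P_i xor C_i
    return s | (carries[n] << n)


print(lookahead_add(0b1101, 0b0101))   # 18
```

The number of AND terms grows with the bit position, which is why the required gate fan-in becomes the practical limit of this flat form.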
Carry Bypass Adder
[Figure: four full adders (FA) in a ripple chain, Ci,0 → Co,0 → Co,1 → Co,2 → Co,3, each fed by a P,G block computed from A0 B0 … A3 B3.]
Can compute P, G in parallel for all bits.
[Figure: the same chain with a 2:1 mux on the block carry-out, controlled by BP = P0 P1 P2 P3.]
Key Idea: if (P0 P1 P2 P3) then Co,3 = Ci,0
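The key idea can be checked functionally with a Python sketch of one 4-bit bypass block and a 16-bit adder built from four of them. This models only the logic, not the delays. It assumes the $P = A \oplus B$ definition of propagate; with $P = A + B$ the bypass would be wrong whenever a bit both propagates and generates. The names `bypass_block` and `bypass_add16` are illustrative.

```python
def bypass_block(a_bits, b_bits, cin):
    """One 4-bit carry-bypass block, bits LSB first. Returns (sum_bits, cout)."""
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # assumes P = A xor B
    g = [a & b for a, b in zip(a_bits, b_bits)]
    carry = cin
    sums = []
    for pi, gi in zip(p, g):
        sums.append(pi ^ carry)
        carry = gi | (pi & carry)                 # ripple through each FA
    bp = int(all(p))                              # BP = P0 P1 P2 P3
    cout = cin if bp else carry                   # 2:1 mux selects the bypass
    return sums, cout


def bypass_add16(a, b, cin=0):
    """16-bit adder made of four bypass blocks in series."""
    bits = []
    carry = cin
    for blk in range(4):
        a_bits = [(a >> (4 * blk + i)) & 1 for i in range(4)]
        b_bits = [(b >> (4 * blk + i)) & 1 for i in range(4)]
        s, carry = bypass_block(a_bits, b_bits, carry)
        bits += s
    return sum(bit << i for i, bit in enumerate(bits)) | (carry << 16)


print(hex(bypass_add16(0xFFFF, 0x0000, 1)))   # 0x10000, every block bypasses
```

When BP is 1 the rippled carry and Ci,0 are equal anyway, so the mux changes only how quickly the carry arrives, never its value.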

16-bit Carry Bypass Adder
[Figure: four 4-bit carry-bypass blocks in series. Each block has four P,G cells and four FAs followed by a 2:1 mux; the block bypass signals are BP = P0P1P2P3, P4P5P6P7, P8P9P10P11 and P12P13P14P15, and the block carry-outs are Co,3, Co,7, Co,11 and Co,15.]
Assume the following delay for each gate:
P, G from A, B: 1 delay unit
P, G, Ci to Co or Sum for a FA: 1 delay unit
2:1 mux delay: 1 delay unit
What is the worst case propagation delay for the 16-bit adder?

Critical Path Analysis
[Figure: the 16-bit carry-bypass adder again, with block bypass signals BP = P0P1P2P3, BP2 = P4P5P6P7, BP3 = P8P9P10P11 and BP4 = P12P13P14P15.]
For the second stage, is the critical path: BP2 = 0 or BP2 = 1?
Message: Timing Analysis is Very Tricky. Must Carefully Consider Data Dependencies For False Paths.
Carry-lookahead Adders (CLA)
We can choose the maximum fan-in we want for our logic gates and then build a hierarchical carry chain using these equations:
$C_{J+1} = G_{IJ} + P_{IJ}\,C_I$
$G_{IK} = G_{J+1,K} + P_{J+1,K}\,G_{IJ}$
$P_{IK} = P_{IJ}\,P_{J+1,K}$
where I < J and J+1 < K
“Generate a carry from bits I thru K if it is generated in the high-order (J+1,K) part of the block or if it is generated in the low-order (I,J) part of the block and then propagated thru the high part.”
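A Python sketch of these block equations follows. `combine` merges the (G, P) pair of a low block (I..J) with that of a high block (J+1..K), and `block_gp` splits a range in half recursively, which gives the logarithmic depth of the hierarchy. The sketch recomputes block signals for every carry rather than sharing them as real hardware would, and it also lets a block shrink to a single bit. All names are illustrative.

```python
def combine(low, high):
    """Merge (G, P) of the low block I..J with (G, P) of the high block J+1..K."""
    g_lo, p_lo = low
    g_hi, p_hi = high
    return g_hi | (p_hi & g_lo), p_lo & p_hi      # (G_IK, P_IK)


def block_gp(g, p, i, k):
    """Return (G_IK, P_IK) by recursively halving the block."""
    if i == k:
        return g[i], p[i]
    j = (i + k) // 2
    return combine(block_gp(g, p, i, j), block_gp(g, p, j + 1, k))


def cla_add(a, b, cin=0, n=8):
    """n-bit add where every carry comes from block generate/propagate signals."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    carries = [cin]
    for j in range(n):
        g_0j, p_0j = block_gp(g, p, 0, j)
        carries.append(g_0j | (p_0j & cin))       # C_{J+1} = G_0J + P_0J C_0
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total | (carries[n] << n)


print(cla_add(200, 100))   # 300
```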
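[Figure: hierarchical building block, with P/G generation feeding the 1st level of lookahead.]

8-bit CLA (P/G generation)
[Figure: tree of P/G blocks of depth $\log_2(N)$. From Hennessy & Patterson, Appendix A.]

8-bit CLA (carry generation)
[Figure: tree of carry-generation blocks of depth $\log_2(N)$.]

8-bit CLA (complete)
$t_{PD} = \Theta(\log N)$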

Unsigned Multiplication
                        A3    A2    A1    A0
                  x     B3    B2    B1    B0
                  --------------------------
                      A3B0  A2B0  A1B0  A0B0
                A3B1  A2B1  A1B1  A0B1
          A3B2  A2B2  A1B2  A0B2
  + A3B3  A2B3  A1B3  A0B3

$A B_i$ is called a “partial product”.
Multiplying N-bit number by M-bit number gives (N+M)-bit result.
Easy part: forming partial products (just an AND gate since $B_i$ is either 0 or 1).
Hard part: adding M N-bit partial products.
Sequential Multiplier
Assume the multiplicand (A) has N bits and the multiplier (B) has M bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single partial product at a time and then cycle the circuit M times:

[Figure: register P sits to the left of the M-bit register B and the two shift right together. An N-bit adder sums P with A gated by the LSB of B and writes the (N+1)-bit sum $S_N S_{N-1} \dots S_0$ back into P.]

Init: P←0, load A and B
Repeat M times {
  P ← P + (B_LSB==1 ? A : 0)
  shift P/B right one bit
}
Done: (N+M)-bit result in P/B
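The loop above translates almost line for line into Python. In this sketch the P and B registers are separate integers and the shared right shift is modelled by moving the LSB of P into the MSB of B; the function name `sequential_multiply` and the default widths are illustrative.

```python
def sequential_multiply(a, b, n=4, m=4):
    """Shift-and-add multiply of an n-bit A by an m-bit B, one partial product per cycle."""
    a &= (1 << n) - 1
    b &= (1 << m) - 1
    p = 0                                   # Init: P <- 0
    for _ in range(m):                      # Repeat M times
        if b & 1:                           # B_LSB == 1 ? A : 0
            p += a                          # N-bit add, (N+1)-bit sum
        # Shift the combined P/B register right one bit:
        # the LSB of P drops into the MSB of B.
        b = (b >> 1) | ((p & 1) << (m - 1))
        p >>= 1
    return (p << m) | b                     # (N+M)-bit result held in P/B


print(sequential_multiply(0b0011, 0b0101))   # 15
```

Each consumed multiplier bit frees one position in B, which is why the product's low bits can accumulate there without extra storage.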

Combinational Multiplier
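[Figure: 4x4 array multiplier. Each row ANDs x3 x2 x1 x0 with one multiplier bit (y0, y1, y2, y3) and adds the result to the running sum with a row of HA/FA cells; the rows emit z0, z1 and z2, and the last row emits z7 z6 z5 z4 z3.]
Partial product computation is simple (single AND gate).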
2’s Complement Multiplication
In the arrays below, ~XiYj denotes the complement of the product bit XiYj, and the columns run from bit 7 on the left to bit 0 on the right.

Step 1: two’s complement operands so high order bit is $-2^{N-1}$. Must sign extend partial products and subtract the last one.

                              X3    X2    X1    X0
*                             Y3    Y2    Y1    Y0
--------------------------------------------------
    X3Y0  X3Y0  X3Y0  X3Y0  X3Y0  X2Y0  X1Y0  X0Y0
+   X3Y1  X3Y1  X3Y1  X3Y1  X2Y1  X1Y1  X0Y1
+   X3Y2  X3Y2  X3Y2  X2Y2  X1Y2  X0Y2
-   X3Y3  X3Y3  X2Y3  X1Y3  X0Y3
--------------------------------------------------
      Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Step 2: don’t want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1), using –B = ~B + 1.

    X3Y0  X3Y0  X3Y0  X3Y0  X3Y0  X2Y0  X1Y0  X0Y0
+                              1                       (bit 3)
+   X3Y1  X3Y1  X3Y1  X3Y1  X2Y1  X1Y1  X0Y1
+                        1                             (bit 4)
+   X3Y2  X3Y2  X3Y2  X2Y2  X1Y2  X0Y2
+                  1                                   (bit 5)
+  ~X3Y3 ~X3Y3 ~X2Y3 ~X1Y3 ~X0Y3
+                              1                       (bit 3, the “+1” of ~B + 1)
+            1                                         (bit 6)
-            1     1     1     1                       (bits 6 to 3)

Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

                           ~X3Y0  X2Y0  X1Y0  X0Y0
+                    ~X3Y1  X2Y1  X1Y1  X0Y1
+              ~X3Y2  X2Y2  X1Y2  X0Y2
+         X3Y3 ~X2Y3 ~X1Y3 ~X0Y3
+                              1                       (bit 3)
-            1     1     1     1                       (bits 6 to 3)

Step 4: finish computing the constants: $+2^3 - (2^6 + 2^5 + 2^4 + 2^3) = -112 \equiv 2^7 + 2^4 \pmod{2^8}$.

                           ~X3Y0  X2Y0  X1Y0  X0Y0
+                    ~X3Y1  X2Y1  X1Y1  X0Y1
+              ~X3Y2  X2Y2  X1Y2  X0Y2
+         X3Y3 ~X2Y3 ~X1Y3 ~X0Y3
+      1                 1                             (bits 7 and 4)

Result: multiplying 2’s complement operands takes just about same amount of hardware as multiplying unsigned operands!
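The end result of the four steps can be verified exhaustively for 4-bit operands. The Python sketch below forms the sixteen product bits, complements those that involve exactly one sign bit (X3 or Y3 but not both), and adds constant 1s at bits 4 and 7. Those two positions follow from working the constants out modulo $2^8$: $+2^3 - (2^6 + 2^5 + 2^4 + 2^3) = -112 \equiv 2^7 + 2^4$. The function names are illustrative.

```python
def signed_multiply_4x4(x, y):
    """Multiply two 4-bit two's complement patterns (0..15); returns the 8-bit product pattern."""
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> j) & 1 for j in range(4)]
    total = 0
    for j in range(4):
        for i in range(4):
            bit = xb[i] & yb[j]                 # partial product bit XiYj
            if (i == 3) != (j == 3):
                bit ^= 1                        # complement terms with exactly one sign bit
            total += bit << (i + j)
    total += (1 << 4) + (1 << 7)                # leftover constants at bits 4 and 7
    return total & 0xFF


def signed4(v):
    """Reinterpret a 4-bit pattern as a two's complement value."""
    return v - 16 if v & 8 else v


print(format(signed_multiply_4x4(0b1111, 0b1111), "08b"))   # 00000001, i.e. (-1) * (-1)
```

Apart from six inverters and two constant inputs, the adder array is the same one used for unsigned operands.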
2’s Complement Multiplication
[Figure: the 4x4 array multiplier again, adapted for 2’s complement operands: constant 1 inputs are injected into the first and last adder rows, and the last row gains an extra HA so that it produces z7 z6 z5 z4 z3.]

Carry-Save Adder (CSA)
Good for pipelining: delay through each partial product (except the last) is just $t_{PD,AND} + t_{PD,FA}$. No carry propagation time!
Last stage is still a carry-propagate adder (CPA).
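A carry-save adder keeps the running total as two words, a sum word and a carry word, so no carry ever ripples sideways within a stage. The Python sketch below models one CSA as a bitwise 3:2 compressor and uses a chain of them to accumulate partial products, with a single ordinary addition standing in for the final CPA. The function names are illustrative.

```python
def carry_save_add(x, y, z):
    """Bitwise 3:2 compression: returns (s, c) with x + y + z == s + c."""
    s = x ^ y ^ z                                # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, moved up one column
    return s, c


def multiply_with_csa(a, b, m=8):
    """Multiply a by an m-bit b, accumulating partial products in carry-save form."""
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(m)]
    s, c = 0, 0
    for pp in partials:
        s, c = carry_save_add(s, c, pp)          # one full-adder delay per stage
    return s + c                                 # final carry-propagate add (CPA)


print(multiply_with_csa(13, 11))   # 143
```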

Latency Improvements
[Figure: abstract picture. The partial products feed a chain of M-2 carry-save adders (CSA), followed by a final CPA.]
Rewire so that first two adders work in parallel. Feed results into third and fourth adders which also work in parallel, etc.
[Figure: two parallel chains of CSAs, one for the even and one for the odd partial products, combined ahead of the final CPA.]
Even and odd streams pass through half the adders so even/odd design runs at almost twice the speed of simple implementation.

More Latency Improvements
Wallace Tree
[Figure: CSAs arranged as a tree feeding a final CPA; depth $O(\log_{1.5} M)$.]
We have been using full-adders (3 inputs, 2 outputs) in our array adders. Higher fan-in adders can be used to further reduce delays for large M.
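The logarithmic depth comes from each layer of 3:2 compressors turning every group of three operands into two. The Python sketch below reduces a list of operands layer by layer and counts the layers; the grouping order is one simple choice among many, and the function names are illustrative.

```python
def csa(x, y, z):
    """3:2 compressor on whole words: x + y + z == s + c."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def wallace_sum(operands):
    """Reduce operands to two with layers of CSAs, then add them.

    Returns (total, number_of_csa_layers).
    """
    operands = list(operands)
    layers = 0
    while len(operands) > 2:
        full = len(operands) - len(operands) % 3      # operands that form whole groups of 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(*operands[i:i + 3]))       # 3 operands in, 2 out
        nxt.extend(operands[full:])                   # leftovers pass straight through
        operands = nxt
        layers += 1
    return sum(operands), layers                      # final add stands in for the CPA


print(wallace_sum(range(1, 9)))   # (36, 4)
```

Eight operands need four layers here (8 → 6 → 4 → 3 → 2), compared with six CSAs in series for the linear chain.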

Higher-radix multiplication
Idea: If we could use, say, 2 bits of the multiplier in generating
each partial product we would halve the number of columns and
halve the latency of the multiplier!
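      $A_{N-1}\; A_{N-2}\; \dots\; A_4\; A_3\; A_2\; A_1\; A_0$
  x   $B_{M-1}\; B_{M-2}\; \dots\; B_3\; B_2\; B_1\; B_0$
[Figure: array of M/2 partial-product rows, each consuming 2 bits of B.]
$B_{K+1,K} \cdot A$ = 0*A → 0
         = 1*A → A
         = 2*A → 4A – 2A
         = 3*A → 4A – A
Booth’s insight: rewrite 2*A and 3*A cases, leave 4A for next partial product to do!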
Booth recoding
BK+1 and BK are the current bit pair; BK-1 comes from the previous bit pair.

BK+1  BK  BK-1   action
  0    0    0    add 0
  0    0    1    add A
  0    1    0    add A
  0    1    1    add 2*A
  1    0    0    sub 2*A
  1    0    1    sub A      (-2*A + A)
  1    1    0    sub A
  1    1    1    add 0      (-A + A)

A “1” in the BK-1 bit means the previous stage needed to add 4*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4*A in the previous stage is like adding A in this stage!
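The recoding table maps directly onto a lookup in Python. The sketch below multiplies by an unsigned M-bit multiplier (M even) and, as an assumption of this sketch, processes one extra bit pair of zeros at the top so that a pending “add 4*A” from the last real pair is not lost. The names `BOOTH_ACTION` and `booth_multiply` are illustrative.

```python
# (B_{K+1}, B_K, B_{K-1}) -> multiple of A to add at this stage
BOOTH_ACTION = {
    (0, 0, 0): 0,  (0, 0, 1): 1,  (0, 1, 0): 1,  (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_multiply(a, b, m=8):
    """Radix-4 Booth multiply of a by an unsigned m-bit b (m even)."""
    product = 0
    prev = 0                                   # B_{-1} = 0
    for k in range(0, m + 2, 2):               # extra zero pair absorbs the last carry-over
        b_k = (b >> k) & 1
        b_k1 = (b >> (k + 1)) & 1
        digit = BOOTH_ACTION[(b_k1, b_k, prev)]
        product += (digit * a) << k            # each stage is shifted by 2 bits
        prev = b_k1
    return product


print(booth_multiply(13, 0b11))   # 39, computed as -13 + 4*13
```

Every stage adds one of 0, ±A or ±2A, all of which need only a shift and an optional negation of A.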
Behavioral Transformations
There are a large number of implementations of the same functionality.
Each implementation presents a different point in the area-time-power design space.
Behavioral transformations allow exploring the design space at a high level.
[Figure: design space with axes area, time and power.]
Optimization metrics:
1. Area of the design
2. Throughput or sample time $T_S$
3. Latency: clock cycles between the input and associated output change
4. Power consumption
5. Energy of executing a task
6. …
Fixed-Coefficient Multiplication
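Conventional Multiplication: Z = X · Y

                                  X3      X2      X1      X0
                           x      Y3      Y2      Y1      Y0
                           ---------------------------------
                               X3·Y0   X2·Y0   X1·Y0   X0·Y0
                       X3·Y1   X2·Y1   X1·Y1   X0·Y1
               X3·Y2   X2·Y2   X1·Y2   X0·Y2
       X3·Y3   X2·Y3   X1·Y3   X0·Y3
  ----------------------------------------------------------
      Z7      Z6      Z5      Z4      Z3      Z2      Z1      Z0

Constant multiplication (becomes hardwired shifts and adds): Z = X · (1001)_2

                          X3   X2   X1   X0
                    x      1    0    0    1
                    -----------------------
                          X3   X2   X1   X0
           X3   X2   X1   X0
  -----------------------------------------
      Z7   Z6   Z5   Z4   Z3   Z2   Z1   Z0

$Y = (1001)_2 = 2^3 + 2^0$
[Figure: X feeds an adder both directly and through a “<< 3” block; the adder output is Z. Shifts are implemented using wiring.]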
Transform: Canonical Signed Digits (CSD)
Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.
Iterative encoding: replace a string of consecutive 1’s:
  0 1 1 … 1 1   →   1 0 0 … 0 -1
  $2^{N-2} + \dots + 2^1 + 2^0 = 2^{N-1} - 2^0$
Worst case CSD has 50% non zero bits.
Example:
  01101111 = 0 1 1 0 1 1 1 1
           = 0 1 1 1 0 0 0 -1
           = 1 0 0 -1 0 0 0 -1
[Figure: Z is formed from X << 7, X << 4 and X, i.e. Z = (X << 7) – (X << 4) – X.]
Shift translates to re-wiring.
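A CSD encoding can be computed digit by digit. The Python sketch below uses the standard non-adjacent-form rule (look at the value modulo 4 to decide between +1 and -1) rather than the run-replacement procedure described on the slide; the two arrive at the same digits for the example. The function names `to_csd` and `csd_multiply` are illustrative.

```python
def to_csd(x):
    """Return the CSD digits of a non-negative integer, LSB first, each in {-1, 0, 1}."""
    digits = []
    while x:
        if x & 1:
            d = 2 - (x & 3)        # x mod 4 == 1 -> +1, x mod 4 == 3 -> -1
            x -= d                 # clearing the digit may carry into higher bits
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits


def csd_multiply(x, constant):
    """Multiply x by a constant using one shifted add or subtract per nonzero CSD digit."""
    return sum(d * (x << i) for i, d in enumerate(to_csd(constant)) if d)


digits = to_csd(0b01101111)
print(digits[::-1])                  # [1, 0, 0, -1, 0, 0, 0, -1], MSB first
print(csd_multiply(5, 0b01101111))   # 555 = (5 << 7) - (5 << 4) - 5
```

The plain binary constant has six nonzero bits, so CSD cuts the shifted operands to be combined from six to three.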
Algebraic Transformations
[Figures: each transformation is drawn as a pair of equivalent data-flow graphs.]
Commutativity: A + B = B + A
Distributivity: (A + B) C = AC + BC
Associativity: (A + B) + C = A + (B + C)
Common sub-expressions: [Figure: data-flow graphs with inputs A, B, C and outputs X, Y.]
Transforms for Efficient Resource Utilization
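[Figure: data-flow graph with inputs A, B, C, D, E, F, G, H, I.]
Time multiplexing: mapped to 3 multipliers and 3 adders.
[Figure: the same computation after applying distributivity, with inputs regrouped as A, C, B, D, E, F, G, H, I.]
Reduce number of operators to 2 multipliers and 2 adders.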

Retiming: A very useful transform
Retiming is the action of moving delays around in the system.
Delays have to be moved from ALL inputs to ALL outputs or vice versa.
[Figure: delay elements (D) moved from the inputs of an operator to its outputs.]
Cutset retiming: A cutset is a set of edges whose removal splits the graph into two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges of the cutset or vice versa.
Benefits of retiming:
• Modify critical path delay
• Reduce total number of registers
Retiming Example: FIR Filter
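$y(n) = h(n) \otimes x(n) = \sum_{i=0}^{K} x(n-i) \cdot h(i)$

Direct form:
[Figure: x(n) passes through a delay line D D D; the taps are multiplied by h(0), h(1), h(2), h(3) and summed into y(n).]

After applying associativity of addition: Tclk = 22 ns
[Figure: the same structure with the adder chain reordered. Delay annotations: (10) on the multipliers, (4) on the adders.]

After retiming (transposed form): Tclk = 14 ns
[Figure: x(n) is multiplied by h(0), h(1), h(2), h(3) in parallel; the delays D D D sit in the adder chain that produces y(n).]

Note: here we use a first cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.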
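That the direct and transposed forms compute the same output can be checked with a short Python simulation. `fir_direct` keeps the delays on the input samples, while `fir_transposed` keeps them in the adder chain; both function names are illustrative, and the sketch assumes at least two coefficients. Under the first-cut delay model, and assuming the annotations mean 10 ns per multiplier and 4 ns per adder, the direct form's critical path is one multiply plus three adds (10 + 3·4 = 22 ns) and the transposed form's is one multiply plus one add (10 + 4 = 14 ns).

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, then multiply and sum all taps."""
    taps = [0] * len(h)                      # taps[i] holds x(n - i)
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]
        out.append(sum(hi * ti for hi, ti in zip(h, taps)))
    return out


def fir_transposed(x, h):
    """Transposed form: every product uses x(n); delays sit between the adders."""
    regs = [0] * (len(h) - 1)                # partial sums waiting one cycle
    out = []
    for sample in x:
        out.append(h[0] * sample + regs[0])  # only one add after the multiply
        shifted = regs[1:] + [0]
        regs = [h[i + 1] * sample + shifted[i] for i in range(len(regs))]
    return out


h = [1, 2, 3, 4]
print(fir_direct([1, 0, 0, 0, 0], h))       # [1, 2, 3, 4, 0]
print(fir_transposed([1, 0, 0, 0, 0], h))   # [1, 2, 3, 4, 0]
```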
Pipelining = Adding Registers + Retiming
[Figure: a circuit of combinational blocks annotated with delays 15, 5, 5 and 15, with registers (D) at its boundary.]
$T_{CLK}$ = 25 (w/ ideal regs)
Latency = 1 clock cycle
Throughput = 1/clock cycle

Unlike retiming, pipelining adds extra registers to the system.
How to pipeline:
1. Add extra registers at all inputs (or, equivalently, all outputs)
2. Retime

[Figure: more input registers are added, then retimed into the circuit between the blocks.]
$T_{CLK}$ = 15 (w/ ideal regs)
Latency = 3 clock cycles
Throughput = 1/clock cycle
The Power of Transforms: Lookahead
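y(n) = x(n) + A y(n-1)
[Figure: first-order recursive filter with one delay D and gain A in the feedback loop.]
Try pipelining this structure.

Loop unrolling:
y(n) = x(n) + A[x(n-1) + A y(n-2)]
[Figure: the feedback loop now contains 2D.]

Then apply distributivity, associativity and retiming in turn:
y(n) = x(n) + A x(n-1) + $A^2$ y(n-2)
[Figure: final structure with a feed-forward path through D and A, and a feedback path through 2D and the precomputed constant $A^2$.]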
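The unrolled recurrence can be compared against the original one in a few lines of Python. Both functions assume zero initial conditions, and the names `recursive_filter` and `lookahead_filter` are illustrative.

```python
def recursive_filter(x, a):
    """Original recurrence: y(n) = x(n) + A y(n-1)."""
    y, prev = [], 0
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y


def lookahead_filter(x, a):
    """Unrolled recurrence: y(n) = x(n) + A x(n-1) + A^2 y(n-2)."""
    a2 = a * a                              # precomputed constant
    y = []
    for n, xn in enumerate(x):
        x1 = x[n - 1] if n >= 1 else 0      # x(n-1), zero before the first sample
        y2 = y[n - 2] if n >= 2 else 0      # y(n-2), zero initial state
        y.append(xn + a * x1 + a2 * y2)
    return y


x = [1, 2, 3, 4, 5]
print(recursive_filter(x, 3))   # [1, 5, 18, 58, 179]
print(lookahead_filter(x, 3))   # [1, 5, 18, 58, 179]
```

Because the feedback now reaches back two samples, the loop contains two delays, and retiming can place one of them inside the loop's arithmetic.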
