Arithmetic Circuits: Didn't I Learn How To Do Addition in The Second Grade? MIT Courses Aren't What They Used To Be..


Arithmetic Circuits

Didn’t I learn how to do addition in the second grade? MIT courses aren’t what they used to be...

    01011
  + 00101
  -------
    10000

Acknowledgements:

• R. Katz, “Contemporary Logic Design”, Addison Wesley Publishing Company, Reading, MA, 1993. (Chapter 5)
• J. Rabaey, A. Chandrakasan, B. Nikolic, “Digital Integrated Circuits: A Design Perspective” Prentice Hall, 2003.
• Kevin Atkinson, Alice Wang, Rex Min

6.111 Fall 2004 Lectures 9/10, Slide 1


Number Systems Basics
How to represent negative numbers?
• Three common schemes: sign-magnitude, ones complement,
twos complement
• Sign-magnitude: MSB = 0 for positive, 1 for negative
– Range: $-(2^{N-1} - 1)$ to $+(2^{N-1} - 1)$
– Two representations for zero: 0000… & 1000…
– Simple multiplication but complicated addition/subtraction
• Ones complement: if N is positive then its negative is $\overline{N}$ (the bitwise complement of N)
– Example: 0111 = 7, 1000 = -7
– Range: $-(2^{N-1} - 1)$ to $+(2^{N-1} - 1)$
– Two representations for zero: 0000… & 1111…
– Subtraction implemented as addition followed by ones
complement



2’s Complement

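N bits, with bit weights (MSB to LSB):

$-2^{N-1}$   $2^{N-2}$   …   $2^3$   $2^2$   $2^1$   $2^0$

The MSB is the “sign bit”; the “decimal” point sits to the right of the $2^0$ bit.

Range: $-2^{N-1}$ to $2^{N-1} - 1$

8-bit 2’s complement example:

11010110 = $-2^7 + 2^6 + 2^4 + 2^2 + 2^1$ = –128 + 64 + 16 + 4 + 2 = –42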

If we use a two’s-complement representation for signed integers, the same binary addition procedure will work for adding both signed and unsigned numbers.

By moving the implicit location of the “decimal” point, we can represent fractions too:

1101.0110 = $-2^3 + 2^2 + 2^0 + 2^{-2} + 2^{-3}$ = –8 + 4 + 1 + 0.25 + 0.125 = –2.625
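The weighted-sum reading of a 2’s complement bit string can be checked with a short sketch. The helper name `twos_complement_value` and the string input with an optional binary point are illustrative choices, not part of the lecture.

```python
def twos_complement_value(bits: str) -> float:
    """Interpret a bit string as a 2's complement number.

    An optional '.' marks the binary point; the leftmost bit has negative weight.
    """
    int_part, _, frac_part = bits.partition(".")
    digits = int_part + frac_part
    top = len(int_part) - 1          # exponent of the leftmost (sign) bit
    value = 0.0
    for i, b in enumerate(digits):
        weight = 2.0 ** (top - i)
        if i == 0:
            weight = -weight         # the sign bit contributes -2^(N-1)
        value += weight * int(b)
    return value


print(twos_complement_value("11010110"))   # -42.0
print(twos_complement_value("1101.0110"))  # -2.625
```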
Twos Complement Representation
Twos complement = bitwise complement + 1
0111 → 1000 + 1 = 1001 = -7
1001 → 0110 + 1 = 0111 = 7

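• Asymmetric range: $-2^{N-1}$ to $+2^{N-1} - 1$
• Only one representation for zero
• Simple addition and subtraction
• Most common representation

Examples (a carry out of the 4-bit result is discarded):

    4    0100       -4    1100        4    0100       -4    1100
  + 3    0011   + (-3)    1101      - 3    1101      + 3    0011
  -----------   --------------      -----------      -----------
    7    0111       -7   11001        1   10001       -1    1111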

[Katz93, chapter 5]
Binary Addition

Here’s an example of binary addition as one might do it by “hand”:

   1101      carries from previous column
    1101
  + 0101
  ------
   10010

Adding two N-bit numbers produces an (N+1)-bit result.

We’ve already built the circuit that implements one column: the full adder (FA).

So we can quickly build a circuit to add two 4-bit numbers…
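As a rough behavioral sketch of that circuit, the one-column adder can be chained so each carry-out feeds the next column. The function names and the LSB-first bit lists are assumptions made for illustration only.

```python
def full_adder(a, b, cin):
    """One column of binary addition: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equal-length bit lists (LSB first); returns (sum_bits, carry_out)."""
    carry = cin
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # carry ripples to the next column
        out.append(s)
    return out, carry


def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


s, c = ripple_carry_add(to_bits(0b1101, 4), to_bits(0b0101, 4))
print(format(from_bits(s + [c]), "05b"))  # 10010
```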



Subtraction: A-B = A + (-B)

Using 2’s complement representation: –B = ~B + 1


~ = bit-wise complement

So let’s build an arithmetic unit that does both addition and subtraction.
Operation selected by control input:

But what about the “+1”?
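One common answer, assumed here because the slide’s figure is not reproduced, is to XOR every bit of B with the control input and to feed the same control input into the adder’s carry-in, which supplies the “+1”. A minimal behavioral sketch (the name `add_sub` and its argument order are illustrative):

```python
def add_sub(a, b, subtract, n=4):
    """n-bit add/subtract unit: returns (n-bit result, carry out).

    subtract = 0 computes a + b; subtract = 1 computes a + ~b + 1 = a - b.
    """
    mask = (1 << n) - 1
    b_in = (b ^ (mask if subtract else 0)) & mask   # XOR each bit of B with the control
    total = (a & mask) + b_in + (1 if subtract else 0)  # control also drives carry-in
    return total & mask, total >> n


print(add_sub(0b0100, 0b0011, 0))  # (7, 0)
print(add_sub(0b0100, 0b0011, 1))  # (1, 1): 4 - 3 = 1, carry out discarded
```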



Condition Codes

Besides the sum, one often wants four other bits of information from an arithmetic unit:

Z (zero): result is = 0 (big NOR gate)

N (negative): result is < 0 ($S_{N-1}$)

C (carry): indicates that add in the most significant position produced a carry, e.g., “1 + (-1)” (from last FA)

V (overflow): indicates that the answer has too many bits to be represented correctly by the result width, e.g., “$(2^{N-1} - 1) + (2^{N-1} - 1)$”

$V = A_{N-1} B_{N-1} \overline{S_{N-1}} + \overline{A_{N-1}}\,\overline{B_{N-1}} S_{N-1}$

$V = COUT_{N-1} \oplus CIN_{N-1}$

To compare A and B, perform A–B and use condition codes:

Signed comparison:
  LT    N⊕V
  LE    Z+(N⊕V)
  EQ    Z
  NE    ~Z
  GE    ~(N⊕V)
  GT    ~(Z+(N⊕V))

Unsigned comparison:
  LTU   C
  LEU   C+Z
  GEU   ~C
  GTU   ~(C+Z)
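The signed rows of the table can be checked exhaustively with a small model. This sketch assumes A–B is computed as A + ~B + 1 and that V is evaluated on the adder’s actual inputs (A and ~B); the function names are illustrative. The unsigned rows depend on how C is defined for subtraction, so they are not modeled here.

```python
def sub_with_flags(a, b, n=8):
    """Compute a - b as a + ~b + 1 on n bits; returns (result, Z, N, C, V)."""
    mask = (1 << n) - 1
    a &= mask
    nb = ~b & mask                       # bitwise complement of B
    total = a + nb + 1
    s = total & mask
    z = int(s == 0)                      # big NOR of all result bits
    neg = (s >> (n - 1)) & 1             # S_{N-1}
    c = (total >> n) & 1                 # raw carry out of the last FA
    a_msb, b_msb = a >> (n - 1), nb >> (n - 1)
    # Overflow: both adder inputs have the same sign, result has the other one.
    v = (a_msb & b_msb & (1 - neg)) | ((1 - a_msb) & (1 - b_msb) & neg)
    return s, z, neg, c, v


def to_signed(x, n=8):
    return x - (1 << n) if (x >> (n - 1)) & 1 else x


_, z, neg, _, v = sub_with_flags(0x7F, 0xFF)   # 127 - (-1) overflows
print(neg ^ v)                                  # 0, so 127 < -1 is false
```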



tPD of Ripple-carry Adder

Worst-case path: carry propagation from LSB to MSB, e.g., when adding 11…111 to 00…001.

$t_{PD} = (N-1)(t_{PD,OR} + t_{PD,AND}) + t_{PD,XOR} = \Theta(N)$

The first term is the CI to CO delay through the first N-1 stages; the second is the $CI_{N-1}$ to $S_{N-1}$ delay in the last stage.

$\Theta(N)$ is read “order N” and tells us that the latency of our adder grows in proportion to the number of bits in the operands.
Faster carry logic

Let’s see if we can improve the speed by rewriting the equations for $C_{OUT}$:

$C_{OUT} = AB + A C_{IN} + B C_{IN}$
$= AB + (A + B) C_{IN}$
$= G + P C_{IN}$, where G = AB (generate) and P = A + B (propagate)

Actually, P is usually defined as P = A⊕B, which won’t change $C_{OUT}$ but will allow us to express S as a simple function of P and $C_{IN}$: $S = P \oplus C_{IN}$

For adding two N-bit numbers:

$C_N = G_{N-1} + P_{N-1} C_{N-1}$
$= G_{N-1} + P_{N-1} G_{N-2} + P_{N-1} P_{N-2} C_{N-2}$
$= G_{N-1} + P_{N-1} G_{N-2} + P_{N-1} P_{N-2} G_{N-3} + \dots + P_{N-1} \cdots P_0 C_{IN}$

$C_N$ in only 3 (!) gate delays: 1 for P/G generation, 1 for ANDs, 1 for final OR
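A quick way to gain confidence in the flattened carry equation is to evaluate it term by term and compare against ordinary addition. The sketch below uses P = A + B (OR) as in the derivation; the function name and the LSB-first bit lists are illustrative assumptions.

```python
def lookahead_carry(a_bits, b_bits, cin):
    """Evaluate C_N = G_{N-1} + P_{N-1}G_{N-2} + ... + P_{N-1}...P_0 C_IN.

    a_bits and b_bits are LSB first.
    """
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate
    terms = []
    for i in range(n):
        term = g[i]
        for j in range(i + 1, n):                 # AND with every higher propagate
            term &= p[j]
        terms.append(term)
    term = cin
    for j in range(n):                            # carry-in propagated through all bits
        term &= p[j]
    terms.append(term)
    return int(any(terms))                        # final wide OR


bits = lambda x, n: [(x >> i) & 1 for i in range(n)]
print(lookahead_carry(bits(0b1101, 4), bits(0b0101, 4), 0))  # 1
```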
Carry Bypass Adder
[Figure: 4-bit chain of full adders with inputs A0 B0 … A3 B3, per-bit P,G blocks producing P0 G0 … P3 G3, carry-in Ci,0 and carries Co,0 … Co,3; a second copy adds a 2:1 mux on Co,3 selected by BP]

Can compute P, G in parallel for all bits.

BP = P0 P1 P2 P3

Key Idea: if (P0 P1 P2 P3) then Co,3 = Ci,0
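The key idea can be sanity-checked with a behavioral model of one block. This sketch assumes the propagate signal is P = A⊕B (with P = A + B the bypass would not be valid when both inputs are 1) and uses plain integer addition as a stand-in for the chain of full adders.

```python
def bypass_block(a, b, cin, n=4):
    """Return (sum, cout) for an n-bit block with a carry-bypass mux."""
    mask = (1 << n) - 1
    p = (a ^ b) & mask                  # per-bit propagate, assumed to be A XOR B
    bp = int(p == mask)                 # BP = P0 P1 ... P(n-1)
    total = a + b + cin                 # stands in for the ripple chain of FAs
    ripple_cout = total >> n
    cout = cin if bp else ripple_cout   # 2:1 mux selected by BP
    return total & mask, cout


print(bypass_block(0b1010, 0b0101, 1))  # (0, 1): all bits propagate, so cout = cin
```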


16-bit Carry Bypass Adder

[Figure: four 4-bit bypass blocks in series, with BP = P0P1P2P3, BP = P4P5P6P7, BP = P8P9P10P11 and BP = P12P13P14P15; carry-in Ci,0, block carry-outs Co,3, Co,7, Co,11 and Co,15 taken from 2:1 bypass muxes]

Assume the following delay for each gate:

P, G from A, B: 1 delay unit
P, G, Ci to Co or Sum for a FA: 1 delay unit
2:1 mux delay: 1 delay unit

What is the worst case propagation delay for the 16-bit adder?



Critical Path Analysis

[Figure: the same 16-bit carry bypass adder, with block bypass signals BP = P0P1P2P3, BP2 = P4P5P6P7, BP3 = P8P9P10P11 and BP4 = P12P13P14P15]

For the second stage, is the critical path: BP2 = 0 or BP2 = 1?

Message: Timing Analysis is Very Tricky – Must Carefully Consider Data Dependencies For False Paths
Carry-lookahead Adders (CLA)

We can choose the maximum fan-in we want for our logic gates and then build a hierarchical carry chain using these equations:

$C_{J+1} = G_{IJ} + P_{IJ} C_I$
$G_{IK} = G_{J+1,K} + P_{J+1,K} G_{IJ}$
$P_{IK} = P_{IJ} P_{J+1,K}$

where I < J and J+1 < K

“generate a carry from bits I thru K if it is generated in the high-order (J+1,K) part of the block or if it is generated in the low-order (I,J) part of the block and then propagated thru the high part”

[Figure: hierarchical building block, showing P/G generation and the 1st level of lookahead]
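The block equations can be exercised with a short recursive sketch that merges (G, P) pairs in a balanced tree. The function names, the tuple layout `(G, P)` and the choice of a two-way split are assumptions for illustration; a real design would pick the fan-in per level.

```python
def combine(low, high):
    """Merge (G, P) of a low block (bits I..J) with a high block (bits J+1..K)."""
    g_lo, p_lo = low
    g_hi, p_hi = high
    return g_hi | (p_hi & g_lo), p_lo & p_hi


def block_gp(g, p, lo, hi):
    """(G, P) for bits lo..hi inclusive, built as a balanced tree of combines."""
    if lo == hi:
        return g[lo], p[lo]
    mid = (lo + hi) // 2
    return combine(block_gp(g, p, lo, mid), block_gp(g, p, mid + 1, hi))


def cla_carry_out(a, b, cin, n=8):
    """Carry out of an n-bit add using C = G_block + P_block * C_in."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    big_g, big_p = block_gp(g, p, 0, n - 1)
    return big_g | (big_p & cin)


print(cla_carry_out(0b11110000, 0b00010000, 0))  # 1
```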


8-bit CLA (P/G generation)

[Figure: P/G generation tree for an 8-bit CLA, annotated $\log_2(N)$. From Hennessy & Patterson, Appendix A]



8-bit CLA (carry generation)

[Figure: carry generation tree for an 8-bit CLA, annotated $\log_2(N)$]



8-bit CLA (complete)

$t_{PD} = \Theta(\log N)$



Unsigned Multiplication

                    A3    A2    A1    A0
              x     B3    B2    B1    B0
              --------------------------
                    A3B0  A2B0  A1B0  A0B0
              A3B1  A2B1  A1B1  A0B1
        A3B2  A2B2  A1B2  A0B2
+ A3B3  A2B3  A1B3  A0B3

$AB_i$ is called a “partial product”.

Multiplying N-bit number by M-bit number gives (N+M)-bit result

Easy part: forming partial products (just an AND gate since $B_i$ is either 0 or 1)
Hard part: adding M N-bit partial products
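A small sketch of the same idea: each partial product is A ANDed bit by bit with one multiplier bit, shifted into its column, and the rows are then summed. The function name and default widths are illustrative.

```python
def multiply_unsigned(a, b, n=4, m=4):
    """Multiply an n-bit A by an m-bit B by summing AND-gate partial products."""
    total = 0
    for i in range(m):
        b_i = (b >> i) & 1
        # Easy part: AND every bit of A with B_i to form one partial product.
        partial = sum((((a >> j) & 1) & b_i) << j for j in range(n))
        # Hard part: add the partial product, shifted to column i.
        total += partial << i
    return total


print(multiply_unsigned(0b1101, 0b0101))  # 65
```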



Sequential Multiplier

Assume the multiplicand (A) has N bits and the multiplier (B) has M bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single partial product at a time and then cycle the circuit M times:

[Figure: register P and the M-bit register B form one shift register P/B; the LSB of B selects A or 0 as the second input of an N-bit adder, which produces the (N+1)-bit sum $S_N S_{N-1} \dots S_0$]

Init: P←0, load A and B

Repeat M times {
    P ← P + ($B_{LSB}$==1 ? A : 0)
    shift P/B right one bit
}

Done: (N+M)-bit result in P/B
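The loop above translates almost line for line into a software model. In this sketch `p` holds the adder output including its carry bit, and the shift moves the low bit of `p` into the top of `b`; the function name and default widths are illustrative assumptions.

```python
def sequential_multiply(a, b, n=4, m=4):
    """Shift-and-add multiplier: n-bit multiplicand a, m-bit multiplier b."""
    p = 0                                   # Init: P <- 0
    for _ in range(m):                      # Repeat M times
        if b & 1:                           # B_LSB == 1 ?
            p += a                          # (N+1)-bit sum from the N-bit adder
        # Shift P/B right one bit: LSB of P becomes MSB of B.
        b = (b >> 1) | ((p & 1) << (m - 1))
        p >>= 1
    return (p << m) | b                     # (N+M)-bit result in P/B


print(sequential_multiply(3, 5))  # 15
```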



Combinational Multiplier

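[Figure: 4×4 array multiplier. Each row ANDs x3 x2 x1 x0 with one multiplier bit (y0, y1, y2, y3); three rows of adders (HA FA FA HA, then FA FA FA HA, then FA FA FA HA) sum the partial products, producing z0, z1, z2 on the right edge and z7 z6 z5 z4 z3 from the last row]

• Partial product computation is simple (single AND gate)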
2’s Complement Multiplication
Step 1: two’s complement operands so high order bit is $-2^{N-1}$. Must sign extend partial products and subtract the last one.

                          X3    X2    X1    X0
                    *     Y3    Y2    Y1    Y0
  ----------------------------------------------
  X3Y0  X3Y0  X3Y0  X3Y0  X3Y0  X2Y0  X1Y0  X0Y0
+ X3Y1  X3Y1  X3Y1  X3Y1  X2Y1  X1Y1  X0Y1
+ X3Y2  X3Y2  X3Y2  X2Y2  X1Y2  X0Y2
- X3Y3  X3Y3  X2Y3  X1Y3  X0Y3
  ----------------------------------------------
  Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Step 2: don’t want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1), using –B = ~B + 1 (a leading ~ marks a complemented bit).

  X3Y0  X3Y0  X3Y0  X3Y0  X3Y0  X2Y0  X1Y0  X0Y0
+                         1
+ X3Y1  X3Y1  X3Y1  X3Y1  X2Y1  X1Y1  X0Y1
+                   1
+ X3Y2  X3Y2  X3Y2  X2Y2  X1Y2  X0Y2
+             1
+ ~X3Y3 ~X3Y3 ~X2Y3 ~X1Y3 ~X0Y3
+                         1
+       1
-       1     1     1     1

Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

                          ~X3Y0 X2Y0  X1Y0  X0Y0
+                   ~X3Y1 X2Y1  X1Y1  X0Y1
+             ~X3Y2 X2Y2  X1Y2  X0Y2
+       X3Y3  ~X2Y3 ~X1Y3 ~X0Y3
+                         1
-       1     1     1     1

Step 4: finish computing the constants…

                          ~X3Y0 X2Y0  X1Y0  X0Y0
+                   ~X3Y1 X2Y1  X1Y1  X0Y1
+             ~X3Y2 X2Y2  X1Y2  X0Y2
+       X3Y3  ~X2Y3 ~X1Y3 ~X0Y3
+ 1                 1

Result: multiplying 2’s complement operands takes just about same amount of hardware as multiplying unsigned operands!
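The end result of these four steps can be checked numerically. The sketch below assumes the usual final form for a 4×4 multiplier: partial-product bits that involve exactly one of the two sign bits (X3 or Y3) are complemented, and constant 1s are added in columns 7 and 4 (the constant +8 – 120 reduced modulo 256). The function names are illustrative.

```python
def signed_multiply_4x4(x, y):
    """x, y are 4-bit 2's complement bit patterns (0..15); returns the 8-bit product pattern."""
    total = (1 << 7) | (1 << 4)              # constants left over after Step 4
    for i in range(4):                       # multiplier bit Y_i
        for j in range(4):                   # multiplicand bit X_j
            bit = ((x >> j) & 1) & ((y >> i) & 1)
            if (i == 3) != (j == 3):         # exactly one sign bit involved
                bit ^= 1                     # complemented partial-product bit
            total += bit << (i + j)
    return total & 0xFF                      # keep Z7..Z0


def to_signed(v, n):
    return v - (1 << n) if (v >> (n - 1)) & 1 else v


print(to_signed(signed_multiply_4x4(0b1111, 0b1111), 8))  # 1, i.e. (-1) * (-1)
```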
2’s Complement Multiplication
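[Figure: the 4×4 array multiplier adapted for 2’s complement operands. Rows AND x3 x2 x1 x0 with y0, y1, y2 and y3; constant 1 inputs are added to the array; adder rows FA FA FA HA, FA FA FA HA and HA FA FA FA HA produce z0, z1, z2 and z7 z6 z5 z4 z3]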



Carry-Save Adder (CSA)

Good for pipelining: delay through each partial product (except the last) is just $t_{PD,AND} + t_{PD,FA}$. No carry propagation time!

Last stage is still a carry-propagate adder (CPA)
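A word-level sketch of the carry-save idea: each bit position is an independent full adder that turns three inputs into a sum bit and a carry bit, so no carry ripples within a stage. Only the final addition of the two words propagates carries. The function names are illustrative.

```python
def carry_save_add(x, y, z):
    """3-to-2 reduction applied to every bit position: returns (sum_word, carry_word)."""
    s = x ^ y ^ z                               # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit carry, weighted one column up
    return s, c


def sum_with_csa(values):
    """Accumulate many operands in carry-save form, then do one carry-propagate add."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)          # invariant: s + c equals the running total
    return s + c                                # the final CPA


print(sum_with_csa([13, 26, 52, 0]))  # 91
```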


Latency Improvements
Abstract partial product picture:

[Figure: a chain of CSAs (labeled M-2) followed by a CPA]

Rewire so that first two adders work in parallel. Feed results into third and fourth adders which also work in parallel, etc.

[Figure: two parallel CSA chains (labeled M-4 and 2) merged by further CSAs and a final CPA]

Even and odd streams pass through half the adders so even/odd design runs at almost twice the speed of simple implementation.



More Latency Improvements

[Figure: Wallace tree of CSAs feeding a final CPA, annotated $O(\log_{1.5} M)$]

We have been using full-adders (3 inputs, 2 outputs) in our array adders. Higher fan-in adders can be used to further reduce delays for large M.



Higher-radix multiplication
Idea: If we could use, say, 2 bits of the multiplier in generating
each partial product we would halve the number of columns and
halve the latency of the multiplier!
      $A_{N-1}$ $A_{N-2}$ … A4 A3 A2 A1 A0
   x  $B_{M-1}$ $B_{M-2}$ … B3 B2 B1 B0

[Figure: M/2 partial products, each offset by 2 bit positions]

$B_{K+1,K} \cdot A$ = 0*A → 0
                    = 1*A → A
                    = 2*A → 4A – 2A
                    = 3*A → 4A – A

Booth’s insight: rewrite 2*A and 3*A cases, leave 4A for next partial product to do!
Booth recoding
  current bit pair    from previous bit pair
  BK+1    BK          BK-1                      action
   0      0            0                        add 0
   0      0            1                        add A
   0      1            0                        add A
   0      1            1                        add 2*A
   1      0            0                        sub 2*A
   1      0            1                        sub A      (-2*A + A)
   1      1            0                        sub A
   1      1            1                        add 0      (-A + A)

A “1” in the BK-1 bit means the previous stage needed to add 4*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4*A in the previous stage is like adding A in this stage!
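The recoding table is equivalent to computing the digit $B_{K-1} + B_K - 2 B_{K+1}$ for each bit pair. The sketch below assumes the multiplier is an M-bit 2’s complement pattern with M even and $B_{-1} = 0$; the function names are illustrative.

```python
def booth_digits(b, m):
    """Radix-4 Booth digits (least significant pair first) for an m-bit pattern b."""
    digits = []
    prev = 0                                  # B_{-1} = 0
    for k in range(0, m, 2):
        b_k = (b >> k) & 1
        b_k1 = (b >> (k + 1)) & 1
        digits.append(prev + b_k - 2 * b_k1)  # one of -2, -1, 0, +1, +2
        prev = b_k1                           # becomes B_{K-1} for the next pair
    return digits


def booth_multiply(a, b, m):
    """Sum the recoded partial products; each stage is weighted by 4**stage."""
    return sum(d * a * 4 ** i for i, d in enumerate(booth_digits(b, m)))


def to_signed(v, n):
    return v - (1 << n) if (v >> (n - 1)) & 1 else v


print(booth_digits(0b0110, 4))       # [-2, 2], i.e. -2 + 2*4 = 6
print(booth_multiply(7, 0b0110, 4))  # 42
```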
Behavioral Transformations
• There are a large number of implementations of the same functionality
• Each implementation represents a different point in the area-time-power design space
• Behavioral transformations allow exploring the design space at a high level

Optimization metrics:

1. Area of the design
2. Throughput or sample time $T_S$
3. Latency: clock cycles between the input and associated output change
4. Power consumption
5. Energy of executing a task
6. …

[Figure: design space with axes area, time and power]
Fixed-Coefficient Multiplication
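Conventional Multiplication: Z = X · Y

                          X3    X2    X1    X0
                    x     Y3    Y2    Y1    Y0
  ----------------------------------------------
                          X3·Y0 X2·Y0 X1·Y0 X0·Y0
                    X3·Y1 X2·Y1 X1·Y1 X0·Y1
              X3·Y2 X2·Y2 X1·Y2 X0·Y2
+       X3·Y3 X2·Y3 X1·Y3 X0·Y3
  ----------------------------------------------
  Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Constant multiplication (becomes hardwired shifts and adds): Z = X · $(1001)_2$

                          X3    X2    X1    X0
                    x     1     0     0     1
  ----------------------------------------------
                          X3    X2    X1    X0
+       X3    X2    X1    X0
  ----------------------------------------------
  Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Y = $(1001)_2 = 2^3 + 2^0$

[Figure: X feeds an adder both directly and through a “<< 3” block to produce Z; shifts are implemented using wiring]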
Transform: Canonical Signed Digits (CSD)
Canonical signed digit representation is used to increase the number of
zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.

Iterative encoding: replace string of consecutive 1’s

  0 1 1 … 1 1  =  1 0 0 … 0 -1

  $2^{N-2} + \dots + 2^1 + 2^0 = 2^{N-1} - 2^0$

Worst case CSD has 50% non zero bits

Example:

  0 1 1 0 1 1 1 1  =  0 1 1 1 0 0 0 -1  =  1 0 0 -1 0 0 0 -1

[Figure: Z is formed from “X << 7”, “X << 4” and X]

Shift translates to re-wiring
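One way to compute a canonical signed digit form in software is the non-adjacent-form recurrence below, which looks at the two low bits of the constant at each step. This is a sketch for non-negative constants; the function names are illustrative, and the recurrence is a standard technique rather than something taken from the slide.

```python
def csd_digits(c):
    """Signed digits in {-1, 0, 1}, LSB first, with no two adjacent non-zero digits."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)      # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d               # a -1 digit rounds the constant up, ending a run of 1's
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def multiply_by_constant(x, c):
    """Multiply using only shifts (wiring) and one add or subtract per non-zero digit."""
    total = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            total += x << shift
        elif d == -1:
            total -= x << shift
    return total


print(csd_digits(0b01101111)[::-1])          # [1, 0, 0, -1, 0, 0, 0, -1]
print(multiply_by_constant(5, 0b01101111))   # 555
```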
Algebraic Transformations
Commutativity: A + B = B + A

Distributivity: (A + B) C = AC + BC

Associativity: (A + B) + C = A + (B + C)

Common sub-expressions: compute a shared term once and reuse it

[Figure: data-flow graphs before and after each transformation]
Transforms for Efficient Resource Utilization

[Figure: data-flow graph on inputs A B C D E F G H I]

Time multiplexing: mapped to 3 multipliers and 3 adders

[Figure: the graph after applying distributivity, on inputs A C B D E F G H I]

Reduce number of operators to 2 multipliers and 2 adders



Retiming: A very useful transform
Retiming is the action of moving delays around in the system
• Delays have to be moved from ALL inputs to ALL outputs or vice versa

[Figure: delay elements (D) moved from the inputs of a node to its output]

Cutset retiming: A cutset is a set of edges whose removal splits the graph into two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges of the cutset or vice versa.

Benefits of retiming:
• Modify critical path delay
• Reduce total number of registers
Retiming Example: FIR Filter
$y(n) = h(n) \otimes x(n) = \sum_{i=0}^{K} x(n-i) \cdot h(i)$

Direct form: Tclk = 22 ns

[Figure: x(n) passes through a chain of delays (D D D); the taps are multiplied by h(0), h(1), h(2), h(3) and summed in an adder chain to give y(n). Multiplier delay is (10), adder delay is (4).]

Apply associativity of addition, then retime:

Transposed form: Tclk = 14 ns

[Figure: x(n) is broadcast to the multipliers h(0), h(1), h(2), h(3); the delays (D D D) sit between the adders on the y(n) path.]

Note: here we use a first cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.
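A behavioral sketch can confirm that the direct and transposed forms compute the same output, and can reproduce the two clock periods under the first cut delay model (multiplier 10 ns, adder 4 ns, four taps). The function names and the register bookkeeping are illustrative assumptions.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, then multiply and sum."""
    delay = [0] * (len(h) - 1)                # holds x(n-1), x(n-2), ...
    y = []
    for sample in x:
        taps = [sample] + delay
        y.append(sum(t * c for t, c in zip(taps, h)))
        delay = taps[:-1]                     # shift the delay line
    return y


def fir_transposed(x, h):
    """Transposed form: input broadcast to all multipliers, registers between adders."""
    regs = [0] * (len(h) - 1)
    y = []
    for sample in x:
        products = [sample * c for c in h]
        y.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its product plus the register further from y(n).
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < len(regs) else 0)
                for i in range(len(regs))]
    return y


T_MULT, T_ADD, TAPS = 10, 4, 4
t_direct = T_MULT + (TAPS - 1) * T_ADD        # one multiplier, then the whole adder chain
t_transposed = T_MULT + T_ADD                 # one multiplier, then one adder
print(t_direct, t_transposed)                 # 22 14
```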
Pipelining = Adding Registers + Retiming
[Figure: original circuit with one input register (D) and blocks with delays 15, 5, 5 and 15]

$T_{CLK}$ = 25 (w/ ideal regs)
Latency = 1 clock cycle
Throughput = 1/clock cycle

Unlike retiming, pipelining adds extra registers to the system.

How to pipeline:
1. Add extra registers at all inputs (or, equivalently, all outputs)
2. Retime

[Figure: the same circuit after adding more input registers and retiming them into the logic]

$T_{CLK}$ = 15 (w/ ideal regs)
Latency = 3 clock cycles
Throughput = 1/clock cycle
The Power of Transforms: Lookahead
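y(n) = x(n) + A y(n-1)

[Figure: first-order recursive filter with one delay D and a multiplier A in the feedback loop. Try pipelining this structure.]

Loop unrolling:

y(n) = x(n) + A[x(n-1) + A y(n-2)]

[Figure: unrolled loop containing delays D and 2D and two multiplications by A]

Apply distributivity, then associativity, then retiming:

[Figure: final structure with a feed-forward path through A and D, and a feedback loop with delay 2D and the precomputed coefficient $A^2$]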
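The unrolled recurrence can be checked against the original one with a short sketch. Expanding the bracket gives y(n) = x(n) + A x(n-1) + $A^2$ y(n-2), so the feedback loop now spans two delays. Zero initial conditions and the function names are assumptions made for illustration.

```python
def iir_direct(x, a):
    """y(n) = x(n) + A*y(n-1), with y(-1) = 0."""
    y, prev = [], 0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def iir_lookahead(x, a):
    """y(n) = x(n) + A*x(n-1) + A^2*y(n-2): the loop contains two delays."""
    a2 = a * a                                # precomputed coefficient
    y = []
    for n, sample in enumerate(x):
        x1 = x[n - 1] if n >= 1 else 0        # x(n-1)
        y2 = y[n - 2] if n >= 2 else 0        # y(n-2)
        y.append(sample + a * x1 + a2 * y2)
    return y


x = [1, 0, 2, -3, 4, 0, 0, 5]
print(iir_direct(x, 3) == iir_lookahead(x, 3))  # True
```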