DSP Design - Lecture 6: Unfolding


DSP Design

DSP Design Lecture 6

Unfolding

Fredrik Edman

fredrik.edman@eit.lth.se

Fredrik Edman, Dept. of Electrical and Information Technology, Lund University, Sweden-www.eit.lth.se

Repetition
Critical path - the combinational path with maximum total execution time
Loop (= cycle) - a path beginning and ending at the same node
Loop bound for loop $j$: $T_j / W_j$
$T_j$: loop computation time
$W_j$: number of delays in the loop
[Figure: example DFG with nodes B (3), C (6), D (21) and delay edges D and 2D]
Iteration bound - maximum of all loop bounds
$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\}$
It is the lower bound on execution time for the DFG (assuming only pipelining, retiming, unfolding)
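
The iteration bound can be sketched in a few lines of Python, assuming the loops of the DFG have already been enumerated as (loop computation time, number of delays) pairs. The function name `iteration_bound` and the sample loop values are illustrative and not taken from the lecture.

```python
from fractions import Fraction

def iteration_bound(loops):
    """Return the maximum loop bound t_l / w_l over all loops.

    loops: iterable of (t_l, w_l) pairs, where t_l is the loop
    computation time and w_l > 0 is the number of delays in the loop.
    Fractions are used so that non-integer bounds such as 4/3 stay exact.
    """
    return max(Fraction(t, w) for t, w in loops)

# Hypothetical loops: (6, 2) has loop bound 3, (4, 3) has loop bound 4/3.
bound = iteration_bound([(6, 2), (4, 3)])
```

Enumerating the loops themselves is the harder part for large graphs and is left out of this sketch.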

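Retiming
Loop bound = $T_j / W_j$ = (loop computation time) / (number of delays in the loop)
Retiming does not change the number of delays in a loop, and hence not the iteration bound $T_\infty = \max_{l \in L} \{ t_l / w_l \}$ ...
[Figure: two-node loop with A (2) and B (4). With one delay D on each edge: critical path = 4, loop bound = 6/2 = 3. Retimed with 2D on a single edge: critical path = 6, loop bound = 6/2 = 3.]
...but it changes the critical path!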

Retiming Formulation
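$w(e)$ = weight of edge $e$ = # of delays
$r(x)$ = retiming value of node $x$
[Figure: edge $e$ from node U (source/send, retiming value $r(U)$) to node V (destination/receive, retiming value $r(V)$), with weight $w(e)$ before and $w_r(e)$ after retiming]

$w_r(e) = w(e) + r(V) - r(U)$

Valid retiming if $w_r(e) \ge 0$ for all edges!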
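
A minimal sketch of the retiming formula and the validity check, assuming a DFG stored as a list of (source, destination, delays) triples. The names `retime` and `is_valid` and the two-node example are illustrative assumptions, not part of the lecture.

```python
def retime(edges, r):
    """Apply w_r(e) = w(e) + r(V) - r(U) to every edge U -> V.

    edges: list of (u, v, w) triples, w = number of delays on the edge.
    r: dict mapping each node to its retiming value.
    """
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]

def is_valid(retimed_edges):
    """A retiming is valid only if no edge ends up with negative delays."""
    return all(w >= 0 for _, _, w in retimed_edges)

# Hypothetical two-node loop with one delay on each edge.
loop = [("A", "B", 1), ("B", "A", 1)]
moved = retime(loop, {"A": 0, "B": 1})      # moves one delay onto A -> B
too_far = retime(loop, {"A": 0, "B": 2})    # would need -1 delays on B -> A
```

The total number of delays around the loop is the same before and after, which is why the loop bound is unaffected.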

Ex. Cutset Retiming


Cutset: A set of edges that, if removed (cut), results in two disjoint graphs.
Cutset Retiming: Add k delays to edges going one way and remove k delays from the ones going the other.
[Figure: four-node DFG (nodes 1 to 4) with delays D, 2D, D; the cutset divides it into subgraphs G1 and G2]
Slow Down by k
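Replace each D by kD
[Figure: loop A (1) -> B (1) with one delay D]
Clock 0: A0 -> B0
Clock 1: A1 -> B1
Clock 2: A2 -> B2
Tclk = 2 t.u., Titer = 2 t.u.

After 2-slow transformation
[Figure: loop A (1) -> B (1) with 2D]
Clock 0: A0 -> B0
Clock 1: (null)
Clock 2: A1 -> B1
Clock 3: (null)
Clock 4: A2 -> B2
Tclk = 2 t.u., Titer = 2 x 2 t.u. = 4 t.u.

Input new samples every alternate cycle.
Null operations account for odd clock cycles.
Hardware is utilized only 50% of the time.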

Example: 3-stage Lattice Filter


Slowdown by factor 2
[Figure: 3-stage lattice filter after slowdown, each delay replaced by 2D, with cutsets marked]
Add delays on edges in one direction and remove in the other


Register Minimization
[Figure: a node fanning out to y1, y2, y3 through separate delays D, 3D, 7D, and the equivalent shared chain D -> 2D -> 4D with taps y1, y2, y3]

Register Sharing
When a node has multiple fan-out with different number of delays, the registers can be shared so that only the branch with max. # of delays will be needed.

Register reduction through node delay transfer from multiple input edges to output edges (e.g. r(v) > 0)
Should be done only when clock cycle constraint (if any) is not violated.

Algorithm for register minimization in 4.4.3
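
As a quick check of the register-sharing idea, the sketch below counts registers for one fan-out node with and without sharing. The function names are hypothetical, and the delay values 1, 3, 7 are read off the figure residue on this slide.

```python
def registers_without_sharing(fanout_delays):
    """Each fan-out branch keeps its own delay chain."""
    return sum(fanout_delays)

def registers_with_sharing(fanout_delays):
    """Branches tap one shared chain as long as the longest branch."""
    return max(fanout_delays)

# Fan-out with D, 3D and 7D: 11 registers unshared, 7 when shared
# (a chain D -> 2D -> 4D with taps after 1, 3 and 7 delays).
delays = [1, 3, 7]
```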



Drawbacks with Retiming


The state encoding of the circuit may be destroyed, making testing and verification more difficult.

Some retimed circuits may require complicated initialization logic to have the circuit start in a special initial state.

Retiming changes the circuit's topology, which has consequences in other logical and physical synthesis steps that make design closure more difficult.

Unfolding
Chapter 5


Unfolding
Unfolding is a structured way to achieve parallel processing

Unfolding creates a new program that describes more than one iteration of the original program

The number of iterations described, J, is called the unfolding factor

Applications
Reveal hidden concurrencies so that the program can be
scheduled to a smaller iteration period T
Parallel processing
Bit-serial and Digit-serial

Unfolding in software is called loop unrolling or loop unwinding (used in assembly programming and compiler theory)


Example: Loop unrolling + Software Pipelining
[Figure: schedule of operations 1, 2, 3 per clock cycle (CC), with iteration 1, iteration 2, iteration 3 and higher-order iterations overlapping]
GSM Speechcoder
Org. C-code = 250k cc
Mod. C-code = 90k cc
Hand Opt. = 50k cc


Example: Loop unrolling


Example: A procedure in a computer program is to delete 100 items from a collection.

This can be accomplished by means of a for-loop which calls the function delete(item_number) 100 times.

If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared
to those for the delete(x) function, unwinding can be used to speed it up.
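
A minimal sketch of the transformation in Python. The `delete` argument stands for the delete(item_number) function of the example, and the unrolling factor 5 is an arbitrary choice for illustration. The sketch shows the structure only; any actual speed-up depends on the language, compiler and target.

```python
def delete_all(delete, n=100):
    """Plain loop: one loop-control step per call to delete."""
    for x in range(n):
        delete(x)

def delete_all_unrolled(delete, n=100):
    """Same work unrolled by 5: one loop-control step per five calls.

    Assumes n is a multiple of 5; a remainder loop would be needed otherwise.
    """
    if n % 5 != 0:
        raise ValueError("n must be a multiple of the unrolling factor 5")
    for x in range(0, n, 5):
        delete(x)
        delete(x + 1)
        delete(x + 2)
        delete(x + 3)
        delete(x + 4)
```

Both versions call delete on items 0 to 99 in the same order; the unrolled one executes the loop test 20 times instead of 100.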


Unfolding Parallel Processing


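Original DFG: loop A (1) -> B (1) with 2D
2 nodes & 2 edges & 2 delays
$T_\infty = \max_{l \in L} \{ t_l / w_l \} = (1+1)/2 = 1$ ut, T = 2 ut
A0B0 => A2B2 => A4B4 => ..
A1B1 => A3B3 => A5B5 => ..

2-unfolded DFG: loop A0 (1) -> B0 (1) with D handles iterations 0, 2, 4, ...; loop A1 (1) -> B1 (1) with D handles iterations 1, 3, 5, ...
4 nodes & 4 edges & 2 delays
T = 2 ut for two samples, i.e. T = 2/2 = 1 ut per sample

In a J-unfolded system each delay is J-slow:
if the input to a delay element is x(kJ + m), the output is x((k-1)J + m) = x(kJ + m - J), i.e. the sample from J samples earlier.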

Unfolding, example
[Figure: recursion with input x(n), output y(n), multiplier a and 9D in the feedback loop]

$y(n) = a\,y(n-9) + x(n)$
Unfolding J=2, 2-times parallel

$y(2k) = a\,y(2k-9) + x(2k)$

$y(2k+1) = a\,y(2k-8) + x(2k+1)$

Unfolding, example

$y(2k) = a\,y(2k-9) + x(2k)$

$y(2k+1) = a\,y(2k-8) + x(2k+1)$

Write each index in the form $J(k-d) + m$ with J = 2:
$2k - 9 = 2(k-5) + 1$
$2k - 8 = 2(k-4) + 0$

$y(2k) = a\,y(2(k-5)+1) + x(2k)$

$y(2k+1) = a\,y(2(k-4)+0) + x(2k+1)$

Unfolding, example
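$y(2k) = a\,y(2(k-5)+1) + x(2k)$

$y(2k+1) = a\,y(2(k-4)+0) + x(2k+1)$

[Figure: 2-unfolded DFG with inputs x(2k), x(2k+1), outputs y(2k), y(2k+1), two multipliers a, and delays 5D and 4D on the cross-coupled feedback edges]
Not trivial even for a simple graph!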
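
The equivalence of the original and the 2-unfolded recursion can be checked numerically. The sketch below assumes zero initial conditions and an even-length input; the function names and test values are illustrative.

```python
def original(x, a):
    """y(n) = a*y(n-9) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        prev = y[n - 9] if n >= 9 else 0
        y.append(a * prev + x[n])
    return y

def unfolded(x, a):
    """2-unfolded form: y0[k] = y(2k) and y1[k] = y(2k+1).

    y(2k)   = a*y(2(k-5)+1) + x(2k)    -> reads y1 through 5 delays
    y(2k+1) = a*y(2(k-4)+0) + x(2k+1)  -> reads y0 through 4 delays
    """
    y0, y1 = [], []
    for k in range(len(x) // 2):
        from_y1 = y1[k - 5] if k >= 5 else 0
        from_y0 = y0[k - 4] if k >= 4 else 0
        y0.append(a * from_y1 + x[2 * k])
        y1.append(a * from_y0 + x[2 * k + 1])
    # Interleave the even and odd output streams back into y(n).
    out = []
    for even, odd in zip(y0, y1):
        out.extend([even, odd])
    return out

samples = list(range(1, 41))  # arbitrary integer test input
```

Integer inputs and an integer coefficient keep the comparison exact.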

Definitions

$\lfloor x \rfloor$ is the floor of x, the largest integer $\le x$

$\lceil x \rceil$ is the ceiling of x, the smallest integer $\ge x$

a%b is the remainder after dividing a by b



General Algorithm for unfolding


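Step 1. For each node U in the original DFG, draw J nodes $U_0, U_1, U_2, \ldots, U_{J-1}$

Step 2. For each edge U -> V with w delays in the original DFG, draw the J edges $U_i \to V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays for i = 0, 1, ..., J-1

Example, J = 4: edge U -> V with 37D
$\lfloor (i+w)/J \rfloor = \lfloor (i+37)/4 \rfloor = 9$ for i = 0, 1, 2 and $= 10$ for i = 3
[Figure: unfolded edges U0 -> V1 (9D), U1 -> V2 (9D), U2 -> V3 (9D), U3 -> V0 (10D)]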
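
The two steps translate almost directly into code. The sketch below assumes the DFG is given as a list of (source, destination, delays) triples and represents unfolded node $U_i$ as the pair ("U", i); this representation and the name `unfold` are choices made for illustration.

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (u, v, w) edges.

    For each edge U -> V with w delays, emit the J edges
    U_i -> V_((i + w) % J) carrying floor((i + w) / J) delays.
    Node U_i is represented as the tuple (U, i).
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded

# The slide's example: edge U -> V with 37 delays, J = 4.
example = unfold([("U", "V", 37)], 4)
```

Step 1 is implicit here: the J copies of each node appear as the first and second elements of the emitted edges.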

Properties of unfolding
[Figure: DFG with nodes U, V, T and its 3-unfolded version with nodes U0, V0, T0, U1, V1, T1, U2, V2, T2]
gcd = greatest common divisor, e.g. gcd(12, 3) = 3
Unfolding preserves the number of delays in a DFG
$\lfloor w/J \rfloor + \lfloor (w+1)/J \rfloor + \cdots + \lfloor (w+J-1)/J \rfloor = w$
Unfolding preserves precedence constraints
J-unfolding of a loop with $w_l$ delays in the original DFG leads to $\gcd(w_l, J)$ loops in the unfolded DFG. Each loop contains $w_l / \gcd(w_l, J)$ delays and $J / \gcd(w_l, J)$ copies of each node.
Unfolding a DFG with iteration bound $T_\infty$ results in a J-unfolded DFG with iteration bound $J T_\infty$.
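
Two of these properties are easy to check numerically. The helper names below are hypothetical; the sketch only evaluates the formulas stated on the slide.

```python
from math import gcd

def unfolded_edge_delays(w, J):
    """Delays on the J edges produced by unfolding one edge with w delays."""
    return [(i + w) // J for i in range(J)]

def unfolded_loop_stats(w_l, J):
    """Loops, delays per loop and node copies per loop after J-unfolding
    a loop that has w_l delays in the original DFG."""
    g = gcd(w_l, J)
    return {"loops": g, "delays_per_loop": w_l // g, "copies_per_node": J // g}

# Delay preservation for the earlier example: 9 + 9 + 9 + 10 = 37.
total = sum(unfolded_edge_delays(37, 4))
# A loop with 12 delays unfolded by 3 splits into gcd(12, 3) = 3 loops.
stats = unfolded_loop_stats(12, 3)
```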

Relation Unfolding and Iteration Bound


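$T_A = 3$, $T_M = 6$
Original DFG: $T_\infty = 9/9 = 1$
[Figure: original DFG with input x(n), output y(n), multiplier a and 9D in the loop; 2-unfolded DFG with inputs x(2k), x(2k+1), outputs y(2k), y(2k+1), and delays 5D and 4D]
2-unfolded DFG: gcd(9, 2) = 1, so 1 loop
$T'_\infty = 18/9 = 2 = J T_\infty$
But we process 2 samples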

Relation Unfolding and the Critical Path


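If an edge has $w < J$ delays: $(J - w)$ paths with zero delay and $w$ paths with 1 delay
Can lead to increased critical path!
[Figure: DFG with nodes A, B, C and single delays D, and its 3-unfolded version with nodes A0, B0, C0, A1, B1, C1, A2, B2, C2]
An edge with $w \ge J$ delays will not create a new critical path!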

Applications of Unfolding:
Sample Period Reduction
Case 1: A node in the DFG has computation time greater than $T_\infty$.

Case 2: Iteration bound is not an integer.

Case 3: Longest node computation is larger than the iteration bound $T_\infty$, and $T_\infty$ is not an integer


Sample Period Reduction: case 1

[Figure: DFG close to an IIR filter, with input x(n), output y(n), nodes P, Q, R, S, T, U, coefficients b1, b2, and delays D and 2D; S and T have computation time (4)]


Sample Period Reduction: case 1


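The original DFG cannot have sample period equal to the iteration bound because a node computation time is more than the iteration bound

[Figure: DFG with nodes P, Q, R, S, T, U; S and T have computation time (4); two loops with computation time 6, containing 3 and 2 delays]
$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\} = \max \left\{ \frac{6}{3}, \frac{6}{2} \right\} = 3$
$T_\infty = 6/2 = 3 < 4$, the max node time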
Sample Period Reduction: case 1
If the computation time of a node U, $t_u$, is greater than the iteration bound $T_\infty$, then $\lceil t_u / T_\infty \rceil$-unfolding should be used.

$t_u = 4$ and $T_\infty = 3$
$\lceil 4/3 \rceil = 2$-unfolding
[Figure: 2-unfolded DFG with nodes P0, Q0, R0, S0, T0, U0 and P1, Q1, R1, S1, T1, U1]
$T_\infty = 6$ for the unfolded DFG. But two samples!

Sample Period Reduction: case 2


The original DFG cannot have sample period equal to the iteration bound because the iteration bound is not an integer

[Figure: loop S (1) -> T (1) -> U (1) -> V (1) containing three delays D]
$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\} = \frac{4}{3}$
If a critical loop bound is of the form $t_l / w_l$ where $t_l$ and $w_l$ are mutually co-prime, then $w_l$-unfolding should be used.

Unfolding of 3

Sample Period Reduction: case 2 (2)


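[Figure: original loop S (1) -> T (1) -> U (1) -> V (1) with three delays D, and its 3-unfolded version, where every node has computation time (1), consisting of three loops:
S0 -> T1 -> U1 -> V2 -> D -> S0
S1 -> T2 -> U2 -> D -> V0 -> S1
S2 -> D -> T0 -> U0 -> V1 -> S2]

$T_\infty = 4$ and 3 samples gives minimum sample period 4/3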

Sample Period Reduction: case 3


The original DFG cannot have sample period equal to the iteration bound because the longest node computation is larger than the iteration bound $T_\infty$, and $T_\infty$ is not an integer

The minimum J that achieves the iteration bound is the minimum value of J such that $J T_\infty$ is an integer and is greater than or equal to the longest node computation time
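
The rule for choosing J can be sketched as a small search. The function name `min_unfolding_factor` is hypothetical, and the iteration bound is assumed to be given as an exact positive rational (an int or a `Fraction`), since floating-point values would make the integrality test unreliable.

```python
from fractions import Fraction

def min_unfolding_factor(t_bound, t_max):
    """Smallest J >= 1 such that J * t_bound is an integer
    and J * t_bound >= t_max (the longest node computation time).

    t_bound: iteration bound as an int or Fraction, assumed > 0.
    """
    t_bound = Fraction(t_bound)
    J = 1
    while (J * t_bound).denominator != 1 or J * t_bound < t_max:
        J += 1
    return J

# Case 1 numbers from the lecture: bound 3, longest node 4 -> J = 2.
j_case1 = min_unfolding_factor(3, 4)
# Case 2 numbers from the lecture: bound 4/3, unit-time nodes -> J = 3.
j_case2 = min_unfolding_factor(Fraction(4, 3), 1)
# Hypothetical case 3 input: bound 4/3 and a node taking 6 time units.
j_case3 = min_unfolding_factor(Fraction(4, 3), 6)
```

With a bound of 4/3 and a longest node time of 6, J must be a multiple of 3 with 4J/3 >= 6, so the search stops at J = 6.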

Sample Period Reduction: case 3


Basically case 3 = case 1 + case 2

The minimum J that achieves the iteration bound is the minimum value of J such that $J T_\infty$ is an integer and is greater than or equal to the longest node computation time.

If then


Parallel Processing and Unfolding


Parallel processing can be performed by unfolding (chapter 3)

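[Figure: 2-parallel 3-tap FIR filter with inputs x(2k+1) and x(2k), delayed samples x(2k-1) and x(2k-2), coefficients b0, b1, b2 in each branch, and outputs y(2k) and y(2k+1)]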

Parallel Processing Techniques
Word-level Parallel Processing
Unfolding a word-serial architecture by J creates a word-parallel
architecture that processes J words per clock cycle

Bit-level Parallel Processing


Bit-serial processing
One bit is processed per clock cycle and a complete word is processed in W
clock cycles, where W is the word-length.

Bit-parallel processing
One word of W bits is processed every clock cycle

Digit-serial processing
N bits are processed per clock cycle and a word is processed in W/N clock
cycles, where N is referred to as the digit size


Bit-Level Parallel Processing


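[Figure: 4-bit words a3 a2 a1 a0 and b3 b2 b1 b0 transferred in three ways. Bit-parallel: all four bits at once on four wires. Bit-serial: one bit per cycle on one wire, a0 first. Digit-serial (digit size = 2): two bits per cycle on two wires, a1 a0 first, then a3 a2]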

[Figure: bit-parallel, bit-serial and digit-serial adder structures built from full adders with inputs $a_i$, $b_i$, carry signals $c_{in}$ / $c_{out}$ and sum outputs $s_i$]



Bit-serial adder
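Bit-serial can be seen as a time-multiplexed architecture;
in this example one addition (i.e. 1 iteration) takes 4 cc.

[Figure: bit-serial adder with serial inputs a3 a2 a1 a0 and b3 b2 b1 b0, serial output s3 s2 s1 s0, and a carry delay D. A switch selects carry-in 0 at time 4l+0 and the delayed carry at times 4l+1, 4l+2, 4l+3]

Switch for carry signal (Wl+u)

How to unfold switches?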
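
To make the role of the switch concrete, here is a small behavioural model of the bit-serial adder, assuming LSB-first bit streams of back-to-back W-bit words. The function names are illustrative and the model ignores any hardware timing detail.

```python
def to_bits(value, W=4):
    """LSB-first list of the W low bits of value."""
    return [(value >> i) & 1 for i in range(W)]

def bit_serial_add(a_bits, b_bits, W=4):
    """Add two LSB-first bit streams one bit per clock cycle.

    The carry register models the delay D. The switch selects 0 as
    carry-in at times W*l + 0 and the stored carry at all other times.
    """
    carry = 0
    s_bits = []
    for t, (a, b) in enumerate(zip(a_bits, b_bits)):
        cin = 0 if t % W == 0 else carry
        total = a + b + cin
        s_bits.append(total & 1)
        carry = total >> 1
    return s_bits

# Two consecutive 4-bit additions: 5 + 6, then 9 + 9 (wraps modulo 16).
result = bit_serial_add(to_bits(5) + to_bits(9), to_bits(6) + to_bits(9))
```

Without the switch, the carry out of the first word would leak into the least significant bit of the second word.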
Unfolding of Switches
The following assumptions are made when unfolding an edge U -> V containing a switch:
The wordlength W is a multiple of the unfolding factor J, i.e. $W = W'J$.
All edges into and out of the switch have no delays.

If so, an edge U -> V can be unfolded as:

Write the switching instance as
$Wl + u = J(W'l + \lfloor u/J \rfloor) + (u \% J)$
Draw an edge from the node $U_{u\%J}$ -> $V_{u\%J}$, which is switched at time instance $(W'l + \lfloor u/J \rfloor)$.

[Figure: edge U -> V with a switch closed at time Wl + u]
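
The rule can be sketched as a short function that maps each switching instance Wl + u to the unfolded edge index u % J and the new switching offset floor(u / J). The name `unfold_switch` and the return format are assumptions made for illustration.

```python
def unfold_switch(W, offsets, J):
    """Unfold a switch that closes at times W*l + u for each u in offsets.

    Returns (W_prime, edges) where W_prime = W / J and edges maps the
    index i of the unfolded edge U_i -> V_i to the list of offsets u'
    such that the edge is switched at time W_prime*l + u'.
    Assumes W is a multiple of J and the switch edges carry no delays.
    """
    if W % J != 0:
        raise ValueError("W must be a multiple of the unfolding factor J")
    W_prime = W // J
    edges = {}
    for u in offsets:
        edges.setdefault(u % J, []).append(u // J)
    return W_prime, edges

# Switch at 9l + 1, 5 unfolded by 3: U1 -> V1 at 3l + 0, U2 -> V2 at 3l + 1.
first = unfold_switch(9, [1, 5], 3)
# Switch at 12l + 1, 7, 9, 11 unfolded by 3.
second = unfold_switch(12, [1, 7, 9, 11], 3)
```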
Example: Unfolding of Switches, J=3


[Figure: edge U -> V with a switch closed at 9l + 1, 5, unfolded by J = 3 into nodes U0, V0, U1, V1, U2, V2]
Write the switching instance as
$Wl + u = J(W'l + \lfloor u/J \rfloor) + (u \% J)$
$9l + 1 = 3(3l + \lfloor 1/3 \rfloor) + (1 \% 3) = 3(3l + 0) + 1$
$9l + 5 = 3(3l + \lfloor 5/3 \rfloor) + (5 \% 3) = 3(3l + 1) + 2$
Draw an edge from the node $U_{u\%J}$ -> $V_{u\%J}$, i.e. U1 -> V1 and U2 -> V2,
switched at time instance $(W'l + \lfloor u/J \rfloor)$, i.e. U1 -> V1 at (3l+0) and U2 -> V2 at (3l+1)
Switch with multiple instances


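Example: edge U -> V with a switch closed at 12l + 1, 7, 9, 11, unfolding by 3
$Wl + u = J(W'l + \lfloor u/J \rfloor) + (u \% J)$
To unfold the DFG by J = 3, the switching instances are as follows:
12l + 1 = 3(4l + 0) + 1
12l + 7 = 3(4l + 2) + 1
12l + 9 = 3(4l + 3) + 0
12l + 11 = 3(4l + 3) + 2
[Figure: unfolded DFG with U0 -> V0 switched at 4l + 3, U1 -> V1 switched at 4l + 0, 2, and U2 -> V2 switched at 4l + 3]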

End of Lecture
