DSP Design - Lecture 6: Unfolding
Unfolding
Fredrik Edman
fredrik.edman@eit.lth.se
Fredrik Edman, Dept. of Electrical and Information Technology, Lund University, Sweden-www.eit.lth.se
DSP Design
Repetition
Critical path - the combinational path with maximum total execution time
Loop (=cycle) - a path beginning and ending at same node
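Loop bound for loop j = Tj / Wj
Tj = loop computation time
Wj = number of delays in the loop
[Figure: example DFG with nodes B, C, D, computation times (3), (6), (21), and a 2D delay edge]
Iteration Bound - maximum of all loop bounds
$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\}$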
It is the lower bound on execution time for DFG (assuming only pipelining,
retiming, unfolding)
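Retiming
Loop bound = Tj / Wj (loop computation time / number of delays in the loop)
Retiming does not change the number of delays in a loop, so it does not change the iteration bound
$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\}$
...but it changes the critical path!
[Figure: loop with nodes A (2) and B (4). With both delays on one edge (2D): critical path = 6, loop bound = 6/2 = 3. With one delay (D) on each edge: critical path = 4, loop bound = 6/2 = 3.]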
Retiming Formulation
$\omega(e)$ = weight of edge e = # of delays
r(x) = retiming value of node x
For an edge e from U (source/send) to V (destination/receive):
$\omega_r(e) = \omega(e) + r(V) - r(U)$
[Figure: graph G1 with a cutset]
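The retiming relation can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard relation w_r(e) = w(e) + r(V) - r(U) for an edge from U to V; the function name `retime` and the edge-list representation are choices made for this example only. It moves one of two delays in a two-node loop onto the other edge and leaves the total number of delays in the loop unchanged.

```python
def retime(edges, r):
    """Apply retiming values r to a list of edges (u, v, w).

    Each edge u -> v with w delays becomes an edge with
    w + r[v] - r[u] delays. A negative result means the
    retiming is not feasible.
    """
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would get {w_r} delays")
        retimed.append((u, v, w_r))
    return retimed


# Two-node loop A -> B -> A with both delays on the edge B -> A.
edges = [("A", "B", 0), ("B", "A", 2)]
retimed = retime(edges, {"A": 0, "B": 1})
```

With r(A) = 0 and r(B) = 1, each edge ends up with one delay, so the loop still holds two delays in total.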
Slow Down by k
Replace each D by kD
Original DFG: loop A (1) -> B (1) with one delay D
Clock 0: A0 -> B0
Clock 1: A1 -> B1
Clock 2: A2 -> B2
Tclk = 2 t.u., Titer = 2 t.u.
After 2-slow transformation: loop A (1) -> B (1) with 2D
Clock 0: A0 -> B0
Clock 1: (null)
Clock 2: A1 -> B1
Clock 3: (null)
Clock 4: A2 -> B2
Tclk = 2 t.u., Titer = 2 x 2 t.u. = 4 t.u.
Input new samples every other clock cycle.
Null operations fill the odd clock cycles.
Hardware is utilized only 50% of the time.
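Register Minimization
[Figure: outputs y1, y2, y3 delayed by D, 3D, 7D, implemented either with separate delay lines or with one shared delay chain D, 2D, 4D]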
Unfolding
Chapter 5
Unfolding
Unfolding is a structured way to achieve parallel processing
Applications
Reveal hidden concurrencies so that the program can be
scheduled to a smaller iteration period T
Parallel processing
Bit-serial and Digit-serial
Loop unwinding in software is a useful analogy. Suppose a program must delete 100 items. This can be accomplished by means of a for-loop which calls the function delete(item_number) 100 times.
If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared
to those of the delete(x) call, unwinding can be used to speed it up.
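A minimal sketch of this unwinding, assuming an unwinding factor of 5 (the factor is an assumption for this example, and the `delete` argument is a hypothetical stand-in for the delete(item_number) function in the text). Both versions perform the same 100 calls in the same order; the unwound loop only executes the loop control 20 times instead of 100.

```python
def delete_all_rolled(delete, n=100):
    """One call per loop iteration: n loop-control steps."""
    for x in range(n):
        delete(x)


def delete_all_unrolled(delete, n=100):
    """Five calls per loop iteration: n / 5 loop-control steps.

    Assumes n is a multiple of 5, as in the 100-item case.
    """
    for x in range(0, n, 5):
        delete(x)
        delete(x + 1)
        delete(x + 2)
        delete(x + 3)
        delete(x + 4)


# Record the call order of both versions; list.append plays the role of delete.
rolled, unrolled = [], []
delete_all_rolled(rolled.append)
delete_all_unrolled(unrolled.append)
```

Unfolding a DFG by J plays the same role in hardware: J iterations of the original algorithm are described in one iteration of the unfolded graph.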
Unfolding, example
$y(n) = a\,y(n-9) + x(n)$
[Figure: DFG with input x(n), output y(n), multiplier a and a 9D feedback delay]
Unfolding J=2, 2-times parallel
$y(2k) = a\,y(2(k-5)+1) + x(2k)$
$y(2k+1) = a\,y(2(k-4)+0) + x(2k+1)$
[Figure: 2-unfolded DFG; the feedback into y(2k) carries 5D and the feedback into y(2k+1) carries 4D]
Not trivial even for a simple graph!
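The equivalence of the original recursion and its 2-unfolded form can be checked numerically. The sketch below assumes zero initial conditions and an even number of input samples; the function names are chosen for this example only.

```python
def original(x, a):
    """y(n) = a*y(n-9) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        prev = y[n - 9] if n >= 9 else 0
        y.append(a * prev + x[n])
    return y


def unfolded_by_2(x, a):
    """Two samples per iteration k:
    y(2k)   = a*y(2(k-5)+1) + x(2k)
    y(2k+1) = a*y(2(k-4)+0) + x(2k+1)
    """
    half = len(x) // 2
    y0 = [0] * half  # y0[k] holds y(2k)
    y1 = [0] * half  # y1[k] holds y(2k+1)
    for k in range(half):
        prev1 = y1[k - 5] if k >= 5 else 0  # odd output, 5 iterations back
        prev0 = y0[k - 4] if k >= 4 else 0  # even output, 4 iterations back
        y0[k] = a * prev1 + x[2 * k]
        y1[k] = a * prev0 + x[2 * k + 1]
    # Interleave the two output streams back into y(0), y(1), y(2), ...
    return [v for pair in zip(y0, y1) for v in pair]


x = list(range(1, 41))
same = original(x, 2) == unfolded_by_2(x, 2)
```

Integer inputs are used so that the comparison is exact.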
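Definitions
For each edge U -> V with w delays in the original DFG, draw the J edges $U_i \to V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays, for i = 0, 1, ..., J-1.
Example: edge U -> V with 37D, J = 4
$\lfloor (i+w)/J \rfloor = \lfloor (i+37)/4 \rfloor = 9$ for i = 0, 1, 2, and $= 10$ for i = 3
Unfolded edges: U0 -> V1 (9D), U1 -> V2 (9D), U2 -> V3 (9D), U3 -> V0 (10D)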
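The edge rule is short enough to write out directly. The sketch below assumes the usual unfolding rule, in which copy i of the source node connects to copy (i + w) % J of the destination node with floor((i + w) / J) delays; `unfold_edge` is a name chosen for this example.

```python
def unfold_edge(u, v, w, J):
    """Return the J unfolded edges of an edge u -> v with w delays.

    Each result is (source copy, destination copy, delays).
    """
    return [(f"{u}{i}", f"{v}{(i + w) % J}", (i + w) // J) for i in range(J)]


# The 37D edge from the example, unfolded by J = 4.
edges = unfold_edge("U", "V", 37, 4)
for src, dst, delays in edges:
    print(f"{src} -> {dst}: {delays}D")
```

For w = 37 and J = 4 this gives three edges with 9 delays and one edge with 10 delays, which together still hold 37 delays.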
Properties of unfolding
[Figure: a loop through nodes U, V, T with edge delays D, 5D, 6D (12 delays in total) and its 3-unfolded DFG with nodes U0..U2, V0..V2, T0..T2, which contains gcd(12, 3) = 3 loops. gcd = greatest common divisor.]
Unfolding preserves the number of delays in a DFG
$\lfloor w/J \rfloor + \lfloor (w+1)/J \rfloor + \dots + \lfloor (w+J-1)/J \rfloor = w$
Unfolding preserves precedence constraints
J-unfolding of a loop with $w_l$ delays in the original DFG leads to $\gcd(w_l, J)$ loops in the unfolded DFG. Each loop contains $w_l/\gcd(w_l, J)$ delays and $J/\gcd(w_l, J)$ copies of each node.
Unfolding a DFG with iteration bound $T_\infty$ results in a J-unfolded DFG with iteration bound $J T_\infty$.
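Two of these properties are easy to check numerically. The following sketch (helper names are assumptions for this example) verifies the delay-preservation sum over a range of w and J, and evaluates the loop structure for a loop with 12 delays unfolded by 3.

```python
from math import gcd


def unfolded_delays(w, J):
    """Delays on the J unfolded copies of an edge with w delays."""
    return [(w + i) // J for i in range(J)]


def unfolded_loop_structure(w_l, J):
    """Loops, delays per loop and node copies per loop after J-unfolding."""
    g = gcd(w_l, J)
    return {"loops": g, "delays_per_loop": w_l // g, "copies_per_node": J // g}


# Delay preservation: the floor terms always add up to w.
preserved = all(
    sum(unfolded_delays(w, J)) == w for w in range(50) for J in range(1, 13)
)

# A loop with 12 delays, unfolded by J = 3.
structure = unfolded_loop_structure(12, 3)
```

For w_l = 12 and J = 3 the result is 3 loops, each with 4 delays and one copy of every node.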
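The J-unfolded DFG has iteration bound $J T_\infty$, but we process J samples per iteration (2 samples in the 2-unfolded example with outputs y(2k) and y(2k+1)).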
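[Figure: DFG A -> B -> C with one delay (D) on each edge, and its 3-unfolded version with nodes A0..A2, B0..B2, C0..C2]
Unfolding can lead to an increased critical path!
An edge with $w \ge J$ delays will not create a new critical path, since every unfolded copy of it keeps at least one delay.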
Applications of Unfolding:
Sample Period Reduction
Case 1: A node in the DFG has a computation time greater than the iteration bound $T_\infty$.
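[Figure: example DFG, close to an IIR filter: input x(n), output y(n), coefficients b1 and b2, nodes P, Q, R, S, T, U with computation times between 0 and 4, and delay elements D and 2D]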
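$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\} = \max\left\{ \frac{6}{3}, \frac{6}{2} \right\} = 3$
$T_\infty = 6/2 = 3 < 4$, the maximum node computation time
[Figure: the example DFG with its two loops, loop bounds 6/3 and 6/2]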
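As a small worked check, the iteration bound of this example can be computed from its two loops, taken here as (computation time, delays) pairs (6, 3) and (6, 2). The function name and the use of exact fractions are choices made for this sketch.

```python
from fractions import Fraction


def iteration_bound(loops):
    """Maximum loop bound t_l / w_l over all loops (t_l, w_l)."""
    return max(Fraction(t, w) for t, w in loops)


loops = [(6, 3), (6, 2)]  # loop bounds 6/3 and 6/2
t_inf = iteration_bound(loops)
max_node_time = 4
node_exceeds_bound = max_node_time > t_inf
```

Here the bound evaluates to 3, which is less than the longest node computation time of 4, so the bound cannot be reached without unfolding.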
Sample Period Reduction: case 1
If the computation time $t_u$ of a node U is greater than the iteration bound $T_\infty$, then $\lceil t_u / T_\infty \rceil$-unfolding should be used.
$t_u = 4$ and $T_\infty = 3$
$\lceil 4/3 \rceil = 2$, so 2-unfolding is used
[Figure: 2-unfolded DFG with nodes P0, Q0, R0, S0, T0, U0 and P1, Q1, R1, S1, T1, U1]
The unfolded DFG has iteration bound 6, but it processes two samples per iteration.
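[Figure: loop through nodes S, T, U, V with three delays (D)]
$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\} = \frac{4}{3}$
If a critical loop bound is of the form $t_l / w_l$ where $t_l$ and $w_l$ are mutually co-prime, then $w_l$-unfolding should be used.
Here $w_l = 3$, so unfolding by 3 is used.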
[Figure: the loop S, T, U, V (each node with computation time (1)) and its 3-unfolded DFG with nodes S0..S2, T0..T2, U0..U2, V0..V2]
Iteration bound of the 3-unfolded DFG: $J T_\infty = 3 \cdot 4/3 = 4$
The minimum J that achieves the iteration bound is the minimum value of J such that $J T_\infty$ is an integer and is greater than or equal to the longest node computation time.
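The rule for the minimum unfolding factor can be written as a direct search. This is a sketch under the stated rule only (smallest J for which J times the iteration bound is an integer and at least the longest node computation time); `min_unfolding_factor` is a name chosen for this example, and the two calls reuse the numbers from the examples above.

```python
from fractions import Fraction


def min_unfolding_factor(t_inf, max_node_time):
    """Smallest J with J * t_inf an integer and J * t_inf >= max_node_time."""
    t_inf = Fraction(t_inf)
    if t_inf <= 0:
        raise ValueError("iteration bound must be positive")
    J = 1
    while True:
        bound = J * t_inf
        if bound.denominator == 1 and bound >= max_node_time:
            return J
        J += 1


# Node time 4 exceeds the iteration bound 3.
j_case1 = min_unfolding_factor(3, 4)
# Iteration bound 4/3 is not an integer; all node times are 1.
j_case2 = min_unfolding_factor(Fraction(4, 3), 1)
```

The first call returns 2, matching the 2-unfolding used when a node time of 4 exceeds a bound of 3; the second returns 3, matching the 3-unfolding used for the bound 4/3.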
[Figure: 2-parallel 3-tap FIR filter with coefficients b0, b1, b2: inputs x(2k) and x(2k+1), delayed samples x(2k-1) and x(2k-2), outputs y(2k) and y(2k+1)]
Parallel Processing Techniques
Word-level Parallel Processing
Unfolding a word-serial architecture by J creates a word-parallel
architecture that processes J words per clock cycle
Bit-parallel processing
One word of W bits is processed every clock cycle
Digit-serial processing
N bits are processed per clock cycle and a word is processed in W/N clock
cycles, where N is referred to as the digit size
Bit-serial: a3 a2 a1 a0 and b3 b2 b1 b0, one bit per clock cycle
Digit-serial (digit size = 2): two bits per clock cycle
  a2 a0 and b2 b0 on one line
  a3 a1 and b3 b1 on the other line
[Figure: bit-parallel and bit-serial adders built from full-adder cells with inputs ai, bi (up to amsb, bmsb), sum outputs si and carry outputs cout_i, shown for bit positions i, i+1, i+2]
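Bit-serial adder
Bit-serial can be seen as a time-multiplexed architecture;
in this example one addition (i.e. 1 iteration) takes 4 clock cycles.
[Figure: bit-serial adder with inputs a3 a2 a1 a0 and b3 b2 b1 b0, output s3 s2 s1 s0, and the carry fed back through a delay D. A switch selects 0 at time instances 4l+0 and the delayed carry at 4l+1, 4l+2, 4l+3.]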
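The behaviour of such a bit-serial adder can be simulated cycle by cycle. The sketch below makes several assumptions that the slide does not state explicitly: bits are fed least-significant bit first, the carry input is forced to 0 in clock cycles 4l+0, and the carry out of the most significant bit is dropped (the sum is taken modulo 2^W).

```python
def bit_serial_adder(words_a, words_b, W=4):
    """Add pairs of W-bit words one bit per clock cycle.

    Returns the list of sums and the number of clock cycles used.
    """
    sums = []
    carry = 0  # contents of the carry delay element D
    cycle = 0
    for a, b in zip(words_a, words_b):
        s = 0
        for bit in range(W):
            # Switch: select 0 at cycles W*l + 0, the stored carry otherwise.
            carry_in = 0 if cycle % W == 0 else carry
            total = ((a >> bit) & 1) + ((b >> bit) & 1) + carry_in
            s |= (total & 1) << bit
            carry = total >> 1
            cycle += 1
        sums.append(s)
    return sums, cycle


sums, cycles = bit_serial_adder([5, 9], [6, 8])
```

Two additions take 8 clock cycles, i.e. 4 clock cycles per addition, which is the time-multiplexing the slide refers to.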
Unfolding of Switches
The following assumptions are made when unfolding an edge U -> V
containing a switch:
The wordlength W is a multiple of the unfolding factor J, i.e. W = W'J.
All edges into and out of the switch have no delays.
[Figure: edge U -> V with a switch closed at time instances Wl + u]
Write the switching instance as
$Wl + u = J\left(W'l + \lfloor u/J \rfloor\right) + (u \% J)$
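Example: J = 3, switching instances 9l + 1 and 9l + 5 (W = 9, W' = 3)
$9l + 1 = 3(3l + \lfloor 1/3 \rfloor) + (1 \% 3) = 3(3l + 0) + 1$
$9l + 5 = 3(3l + \lfloor 5/3 \rfloor) + (5 \% 3) = 3(3l + 1) + 2$
The remainder (1 and 2) selects the nodes the edge is drawn between; the term in parentheses (3l + 0 and 3l + 1) is the time instance at which it is switched.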
Draw an edge from node $U_{u \% J}$ to node $V_{u \% J}$, switched at time instance $W'l + \lfloor u/J \rfloor$,
i.e. U1 -> V1 and U2 -> V2.
Example: edge U -> V switched at 12l + 1, 7, 9, 11, unfolding by 3
$Wl + u = J\left(W'l + \lfloor u/J \rfloor\right) + (u \% J)$
To unfold the DFG by J = 3, the switching instances are written as follows:
12l + 1 = 3(4l + 0) + 1
12l + 7 = 3(4l + 2) + 1
12l + 9 = 3(4l + 3) + 0
12l + 11 = 3(4l + 3) + 2
Unfolded edges and their switching instances:
U0 -> V0 switched at 4l + 3
U1 -> V1 switched at 4l + 0, 2
U2 -> V2 switched at 4l + 3
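The decomposition of switching instances can be automated. The sketch below assumes the rewriting Wl + u = J(W'l + floor(u/J)) + (u % J) with W = W'J; the function name and the dictionary result format are choices made for this example.

```python
def unfold_switch(W, instances, J):
    """Unfold a switch closed at time instances W*l + u, for u in instances.

    Returns (W_prime, mapping), where mapping[i] lists the offsets v such
    that the unfolded edge U_i -> V_i is switched at W_prime*l + v.
    """
    if W % J != 0:
        raise ValueError("W must be a multiple of the unfolding factor J")
    W_prime = W // J
    mapping = {}
    for u in instances:
        # u % J selects the node copy, u // J the new switching offset.
        mapping.setdefault(u % J, []).append(u // J)
    return W_prime, mapping


W_prime, mapping = unfold_switch(12, [1, 7, 9, 11], 3)
for i in sorted(mapping):
    offsets = ", ".join(str(v) for v in mapping[i])
    print(f"U{i} -> V{i} switched at {W_prime}l + {offsets}")
```

For the instances 12l + 1, 7, 9, 11 and J = 3 this reproduces the result above: U1 -> V1 at 4l + 0, 2, and both U0 -> V0 and U2 -> V2 at 4l + 3.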
End of Lecture