Lecture 8
Why DSP?
Growing Market
Dedicated ASIC not always the best option for implementing signal processing
Flexibility
Sample Products
Types of Processing
Continuous / Real-Time: limited storage, hard constraints
Offline: entire signal stored in memory, softer constraints
May 2, 2001 C54 Architecture and Programming 3
Inefficient for memory-intensive operations
One memory space
Example: 20-tap FIR
4 memory accesses per tap, 1 parallel MAC
At least 80 cycles per output (20 taps x 4 accesses)!
Separate program and data memory
Enables parallel memory access (improves with DARAM)
May store coefficients in program memory (ROM)
8 Auxiliary Registers (ARs)
DARAM
Compare, Select and Store Unit (CSSU) for Viterbi
Number Crunching
40-bit accumulators (A and B)
40-bit barrel shifter
Temporary register
Dedicated support: CSSU (Viterbi), bit reverse (FFT)
Pipeline Phases
P - generate program address
F - get opcode
D - decode instruction
A - generate read address
R - read operands
X - execute
Full Pipeline
P F D A R X
  P F D A R X
    P F D A R X
      P F D A R X
        P F D A R X
          P F D A R X
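The overlap of the six phases can be checked with a small Python model. This is only a sketch under simplifying assumptions: one instruction enters phase P per cycle and nothing stalls. The function names are made up for illustration and are not part of any TI tool.

```python
PHASES = ["P", "F", "D", "A", "R", "X"]  # phase letters from the slide


def phase_at(instr, cycle):
    """Phase of instruction `instr` (0-based) during `cycle` (0-based), or None."""
    stage = cycle - instr  # assumption: instruction i enters phase P in cycle i
    if 0 <= stage < len(PHASES):
        return PHASES[stage]
    return None


def busy_phases(cycle, num_instructions):
    """Number of instructions that occupy some phase during `cycle`."""
    return sum(phase_at(i, cycle) is not None for i in range(num_instructions))


def total_cycles(num_instructions):
    """Cycles needed to finish all instructions when nothing stalls."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(PHASES) - 1


for cycle in range(total_cycles(6)):
    print(cycle, busy_phases(cycle, 6))
```

With six instructions, the model reports all six phases busy in exactly one cycle (the sixth), which is the "full pipeline" column of the diagram.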
[Figure: program and data memory maps. Address markers: 0000, 03FF, 1400, 147F, E000, FFFF. Region labels: I/O, I/O Memory, External memory, SARAM, DARAM Block a, VECTORS, DROM bit, PAGE 0 (64K).]
Shorthand Notation
Term   What it means
Smem   16-bit single data memory operand
Xmem   16-bit dual data memory operand used in dual-operand instructions and some single-operand instructions. Read through D bus.
Ymem   16-bit dual data memory operand used in dual-operand instructions. Read through C bus.
lk     16-bit long constant
dmad   16-bit immediate data memory address (0 - 65,535)
pmad   16-bit immediate program memory address (0 - 65,535). This includes extended program memory devices.
src    Source accumulator (A or B)
dst    Destination accumulator (A or B)
PA     16-bit port (I/O) immediate address (0 - 65,535)
[Figure: addressing options. Recoverable labels: Bit-Reversed, Pre-modify, BK, Absolute.]
Indirect Addressing - *
LD   *AR1+,A
STL  A,*AR2+   ;...
Indirect addressing allows sequential access to arrays
The 8 auxiliary registers (AR0-7) can be used as 16-bit pointers to data
ARs can be optionally modified
How do we initialize the ARs?
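The pointer behavior of the LD/STL pair can be mimicked in Python. This is a simplified model, not C54x code: memory is a dictionary, the accumulator is a plain integer, and the start addresses 0400h and 0800h are illustrative values chosen here.

```python
def run_copy(memory, ar1, ar2, count):
    """Model `count` repetitions of  LD *AR1+,A  followed by  STL A,*AR2+."""
    for _ in range(count):
        acc = memory[ar1]            # LD  *AR1+,A : read the word AR1 points to
        ar1 = (ar1 + 1) & 0xFFFF     # post-increment, wrapping at 16 bits
        memory[ar2] = acc & 0xFFFF   # STL A,*AR2+ : store the low 16 bits of A
        ar2 = (ar2 + 1) & 0xFFFF     # post-increment the destination pointer
    return ar1, ar2


# Illustrative data: three words starting at 0400h, copied to 0800h.
memory = {0x0400: 11, 0x0401: 22, 0x0402: 33}
ar1, ar2 = run_copy(memory, 0x0400, 0x0800, 3)
print(hex(ar1), hex(ar2), [memory[0x0800 + i] for i in range(3)])
```

Each access uses the pointer first and modifies it afterwards, which is why the two registers walk through their arrays without any separate increment instruction.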
STM (STore to Memory-mapped register) stores an immediate value to the specified MMR or scratch-pad RAM (SPRAM) address.
STM writes the value to the register in the access phase of the pipeline to avoid latencies (more later).
;A or B
Short immediate instructions are 1 word, 1 cycle.
All other immediate constants are 16 bits and require 2 words, 2 cycles.
Direct Addressing - @
Direct addressing allows random, single-cycle access to 128 locations positively offset from a base address
The direct 16-bit address is formed by concatenating the base address (DP) with the 7-bit offset contained in the instruction:
Instruction:        opcode | 7-bit offset
Address (16 bits):  9-bit DP | 7-bit offset
How is the Data Page (DP) initialized?
DP = 0000 0000 1 (Data Page 1, Base Addr = 80h)
0000 0000 1 | 000 0101 = 85h
0000 0000 1 | 000 0110 = 86h
0000 0000 1 | 000 0111 = 87h
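The concatenation of the 9-bit data page with the 7-bit offset is a shift and an OR. The following Python sketch reproduces the worked values for Data Page 1; the function name is chosen here for illustration.

```python
def direct_address(dp, offset):
    """Form the 16-bit data address from a 9-bit DP and a 7-bit offset."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in 7 bits")
    return (dp << 7) | offset  # DP supplies the upper 9 bits of the address


for offset in (5, 6, 7):
    print(f"{direct_address(1, offset):02X}h")
```

Because the offset is only 7 bits wide, any access outside the current 128-word page needs DP to be reloaded first.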
Absolute Addressing - *( )
Guarantees access to any location in the memory map by supplying the entire 16-bit address
Uses the indirect hardware to generate the address, hence the asterisk (*)
Always a minimum of 2 words, 2 cycles
Modifiers: BK + AR0
Since the only index offered is circular, a regular index is only accessible if BK is set to 0 or made very large, e.g., FFFFh.
20-Tap FIR Implementation
$$ y_0 = \sum_{n=0}^{19} a_n \, x_n $$
Our goal is to compute one output (y0).
First, let's set up the link.cmd file and memory sections...
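Before moving to assembly, the target computation can be written as a short reference model in Python. The coefficient pattern 1,2,3,4,5 repeated four times follows the init_a table in FIR.asm; the input samples are made up here purely for illustration.

```python
def fir_output(a, x):
    """Compute one FIR output y0 as the sum of a[n] * x[n] over all taps."""
    if len(a) != len(x):
        raise ValueError("coefficient and sample arrays must have equal length")
    acc = 0
    for a_n, x_n in zip(a, x):
        acc += a_n * x_n  # one multiply/accumulate per tap
    return acc


a = [1, 2, 3, 4, 5] * 4   # 20 coefficients
x = [1] * 20              # illustrative input samples, not from the slides
print(fir_output(a, x))
```

A model like this is handy for checking the value the DSP code leaves in y once it runs on the same data.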
Coding Environment
Link.cmd
lab1.obj
-o lab1.out
-m lab1.map
MEMORY {
  PAGE 1: /* Data Memory */
    SPRAM:  org=00060h len=0020h
    InRAM:  org=00400h len=0400h
    OutRAM: org=00800h len=0400h
  PAGE 0: /* Program Memory */
    ROM:    org=0F000h len=0F80h
}
SECTIONS {
  code   :>
  init   :>
  input  :>
  output :>
  coeff  :>
}
[Figure labels: Overview, ROM]
Section  code  init_a[20]  x  a  y
Page     0     0           1  1  1
FIR.asm
        .mmregs
x       .usect  "input",20
a       .usect  "coeff",20
y       .usect  "output",1
        .sect   "init"
init_a  .int    1,2,3,4,5
        .int    1,2,3,4,5
        .int    1,2,3,4,5
        .int    1,2,3,4,5
        .sect   "code"
Processing Loop
FIR.asm
fir:
math:   MAC   *AR2+,*AR3+,A
done:
Multiply/Accumulate: MAC *AR2+,*AR3+,A
Initialize Pointers
FIR.asm
fir:    STM   #a,AR2
        STM   #x,AR3
math:   MAC   *AR2+,*AR3+,A
done:
STM: Stores #value to the MMR early in the pipeline to avoid latencies. 2 words, 2 cycles.
Load Accumulator
FIR.asm
fir:    STM   #a,AR2
        STM   #x,AR3
        LD    #0,A
math:   MAC   *AR2+,*AR3+,A
done:
Accumulator: H (bits 31-16), L (bits 15-0)
LD: Loads dst[15:0] by default. May be 1 or 2 cycles.
Store Result
FIR.asm
fir:    STM   #a,AR2
        STM   #x,AR3
        LD    #0,A
math:   MAC   *AR2+,*AR3+,A
        STL   A,*(y)
done:
Accumulator A: G (bits 39-32), H (bits 31-16), L (bits 15-0)
Streamline Loops
FIR.asm
fir:    STM   #a,AR2
        STM   #x,AR3
        LD    #0,A
math:   RPT   #(20-1)
        MAC   *AR2+,*AR3+,A
        STL   A,*(y)
done:
Copy Coefficients
FIR.asm
fir:    STM   #a,AR2
        RPT   #(20-1)
        MVPD  #init_a,*AR2+
        STM   #a,AR2
        STM   #x,AR3
        LD    #0,A
math:   RPT   #(20-1)
        MAC   *AR2+,*AR3+,A
        STL   A,*(y)
done:
Program Flow
FIR.asm
fir:    STM   #a,AR2
        RPT   #(20-1)
        MVPD  #init_a,*AR2+
        STM   #a,AR2
        STM   #x,AR3
        LD    #0,A
math:   RPT   #(20-1)
        MAC   *AR2+,*AR3+,A
        STL   A,*(y)
        RET
- or -
done:   B     done
Implementing a subroutine requires:
        CALL  fir     2w, 4c
        RET           1w, 4c
Conditions: 3 max with restrictions, ANDed
Ex: CC fir, AEQ, AOV
A/B: EQ, NEQ, LEQ, GEQ, LT, GT, OV, NOV
TC, NTC, C, NC, BIO, NBIO
9 x 9 = 81
Accumulator (A or B): Guard | High | Low
Use guard bits (allow at least 128 signed summations)
In a non-gain system, temporary overflow is permitted. The output is guaranteed to remain bounded by the input.
In a system with gain, the output is not guaranteed to remain bounded (i.e., the result is larger than 32 bits).
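The value of the guard bits can be illustrated with a rough worst-case count in Python. This is a simplified model that assumes every term is the largest possible 16 x 16 signed product and considers only positive overflow; the function name is made up for illustration.

```python
def max_safe_summations(term, acc_bits):
    """Largest number of copies of `term` whose sum still fits a signed accumulator."""
    acc_max = 2 ** (acc_bits - 1) - 1  # largest positive value of the accumulator
    return acc_max // term


WORST_PRODUCT = (-32768) * (-32768)  # largest 16 x 16 signed product, 2**30

print(max_safe_summations(WORST_PRODUCT, 32))  # no guard bits
print(max_safe_summations(WORST_PRODUCT, 40))  # 8 guard bits
```

Under these assumptions a 32-bit accumulator overflows on the second worst-case product, while the 40-bit accumulator absorbs several hundred, which is consistent with the "at least 128" figure above.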
Fractional Multiplication
.9 x .9 = .81 (value times value yields double-size result)
.8 (result to be stored)
FRCT shifts multiply results left by 1
The tools do not support fractions
To store 0.707 use:
a0 .int 32768*707/1000
32767 = 7FFFh = ~1
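The scaling trick and the effect of the 1-bit shift can be tried out in Python. This sketch assumes the assembler evaluates 32768*707/1000 with truncating integer arithmetic, and it only handles non-negative operands; the function names are illustrative.

```python
def to_q15(numerator, denominator):
    """Integer stored for the fraction numerator/denominator, scaled by 32768."""
    return (32768 * numerator) // denominator  # assumed truncating division


def multiply(a_q15, b_q15, frct):
    """16 x 16 multiply; with frct set, the product is shifted left by 1."""
    product = a_q15 * b_q15
    return product << 1 if frct else product


def high_word(acc):
    """Upper 16 bits of a 32-bit result (bits 31-16)."""
    return (acc >> 16) & 0xFFFF


a0 = to_q15(707, 1000)
with_shift = high_word(multiply(a0, a0, frct=True))
without_shift = high_word(multiply(a0, a0, frct=False))
print(a0, with_shift, with_shift / 32768, without_shift)
```

With the shift, the high word read as a fraction of 32768 lands close to 0.707 x 0.707; without it the high word is half that, which is why the stored result needs the FRCT shift to stay in the same fractional format.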
MAC *AR2+,*AR3+0%,A
[Figure: input buffer, and coefficients a0, a1, a2, ... addressed by AR2.]
The % modifier indicates circular addressing, which is available for all ARs
Why was +0% used?
Because we are forced to use +0%, how do we make it look like +%?
STM #1,AR0
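How a pointer wraps under +0% with AR0 = 1 can be sketched in Python. This is a simplified model of circular addressing as understood here, assuming the buffer starts where the low N bits of the address are zero (N being the smallest value with 2^N > BK) and that the step is no larger than BK. The start address 0420h is a made-up example.

```python
def circular_step(ar, step, bk):
    """Advance a 16-bit pointer by `step` inside a circular buffer of size `bk`."""
    if bk < 1:
        raise ValueError("bk must be at least 1")
    n_bits = bk.bit_length()       # smallest N such that 2**N > bk
    mask = (1 << n_bits) - 1
    base = ar & ~mask & 0xFFFF     # buffer start: low N bits cleared
    index = ar & mask              # position inside the buffer
    if index >= bk:
        raise ValueError("pointer is outside the circular buffer")
    return base | ((index + step) % bk)  # wrap back to the start at bk


ar3 = 0x0420                       # hypothetical buffer start, aligned to 32
for _ in range(20):
    ar3 = circular_step(ar3, 1, 20)  # models *AR3+0% with AR0 = 1
print(hex(ar3))
```

After 20 steps of size 1 the pointer is back at the start of the buffer, so +0% with AR0 = 1 behaves like a circular post-increment.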
Circular buffers must be aligned on the next 2^n boundary greater than BK.
On what boundary should a block size of 20 be aligned?
a .usect "coeff",20
align 32
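The alignment rule is easy to check numerically. The Python helpers below are illustrative names, not part of the TI tools; they compute the smallest power of two strictly greater than the block size and test whether an address sits on that boundary.

```python
def circular_alignment(bk):
    """Smallest power of two strictly greater than the block size `bk`."""
    if bk < 1:
        raise ValueError("bk must be at least 1")
    return 1 << bk.bit_length()


def is_aligned(address, bk):
    """True if `address` lies on the boundary required for block size `bk`."""
    return address % circular_alignment(bk) == 0


print(circular_alignment(20))
print(is_aligned(0x0420, 20), is_aligned(0x0414, 20))
```

Note that a block size that is itself a power of two, such as 32, moves up to the next boundary (64), because the boundary must be greater than BK.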
Pipeline Issues
Analysis:
Most 'C54x code requires no special attention
Some MMR writes require care (MMR reads are not a problem)
Latency requirements resolved via Latency Tables
Typical C54x System Code:
C Code: No Problem
ASM Code, CALU Operations: No Problem
ASM Code, MMR Writes: Early Writes
Early: write occurs at least 6 cycles prior to a read
Example: FIR setup code