Seminar Report DSP
Acknowledgements
I would like to express my deep gratitude to Prof. V. M. Gadre for his profound guidance and support, which helped me understand the nuances of the seminar work. I am thankful to him for his timely suggestions, which helped me a lot in completing this report. I would also like to extend my sincere thanks to all the members of the TI-DSP lab for their help and support.
Abstract
Digital signal processing is one of the core technologies in rapidly growing application areas such as wireless communications, audio and video processing, and industrial control. This report presents an overview of DSP processors and the trends and recent developments in the field of digital signal processors. It also discusses the architectural features of two different kinds of DSPs, one floating-point and one fixed-point. The first chapter presents the differences between a traditional microprocessor and a DSP, along with some common architectural features shared by different DSPs for efficient processing of DSP algorithms. The second chapter deals mainly with recent changes in DSP architectures and with the evolution and trends in DSP processor architecture; the new generation of processors with VLIW and superscalar structures, as well as processors with hybrid structures, are studied briefly here. The third and fourth chapters review the two kinds of digital signal processors, namely floating-point and fixed-point processors. The concluding chapter summarizes the report and looks at further trends.
Contents
1. Introduction and Overview of Digital Signal Processors
   1.1 Introduction
   1.2 Difference between DSPs and other Microprocessors
   1.3 Important features of DSPs
       1.3.1 MACs and Multiple Execution Units
       1.3.2 Circular Buffering
       1.3.4 Dedicated Address Generation Units
       1.3.5 Zero Overhead Looping
       1.3.6 Data Formats
       1.3.7 Specialized Instruction Set
2. Digital Signal Processors: Trends and Developments
   2.1 First Generation: Conventional
   2.2 Second Generation: Enhanced Conventional
   2.3 Third Generation: Novel Design
   2.4 Power Consideration
   2.5 Fourth Generation: Hybrids
   2.6 Benchmarking DSPs
3. Architecture and Peripherals of TMS320C67x
   3.1 Introduction
   3.2 Architecture of C67x
       3.2.1 Central Processing Unit
       3.2.2 General-Purpose Register Files
       3.2.3 Functional Units
       3.2.4 Memory System
   3.3 Peripherals of TMS320C67x
       3.3.1 Enhanced DMA
       3.3.2 Host Port Interface
       3.3.3 External Memory Interface
       3.3.4 Multichannel Buffered Serial Port
       3.3.5 Timers
       3.3.6 Multichannel Audio Serial Port
       3.3.7 Power Down Logic
4. Overview of Fixed-Point Processor TMS320C55x
   4.1 Introduction
   4.2 Architectural Features
   4.3 Low Power Enhancements
       4.3.1 Parallelism
       4.3.2 Alternate Computational Hardware
       4.3.3 Memory Access
       4.3.4 Automatic Power Management
       4.3.5 Power Down Flexibility
   4.4 Peripherals of TMS320C55x
       4.4.1 Clock Generator with PLL
       4.4.2 DMA Controller
       4.4.3 Host Port Interface
5. Conclusion
References
Chapter 1 Introduction and Overview of Digital Signal Processors
1.1 Introduction
Digital signal processing is one of the core technologies in rapidly growing application areas such as wireless communications, audio and video processing, and industrial control. The number and variety of products that include some form of digital signal processing has grown dramatically over the last few years. DSP has become a key component in many consumer, communications, medical and industrial products, which implement the signal processing using microprocessors, field programmable gate arrays (FPGAs), custom ICs, etc. Due to the increasing popularity of the above-mentioned applications, the variety of DSP-capable processors has expanded greatly. DSPs are processors or microcomputers whose hardware, software and instruction sets are optimized for high-speed numeric processing applications, which is essential for processing digital data representing analog signals in real time. DSP processors have gained popularity because of advantages such as reprogrammability in the field, cost-effectiveness, speed and energy efficiency.
1.2 Difference between DSPs and other Microprocessors

Traditional microprocessors, such as the Intel Pentium series, are primarily directed at data manipulation, whereas DSPs are designed to perform the mathematical calculations needed in digital signal processing [1]. Data manipulation involves storing and sorting of information. For instance, a word processing program performs the basic tasks of storing, organizing and retrieving information. This is achieved by moving data from one location to another and testing for inequalities (A = B, A < B, etc.). While mathematics is occasionally used in this type of application, it is infrequent and does not significantly affect the overall execution speed. In comparison, the execution speed of most DSP algorithms is limited almost completely by the number of multiplications and additions required. In addition to performing mathematical calculations very rapidly, DSPs must also have a predictable execution time [1]. Most DSPs are used in applications where the processing is continuous, without a defined start or end. The cost, power consumption and design difficulty increase along with the execution speed, which makes accurate knowledge of the execution time critical for selecting the proper device, as well as the algorithms that can be applied. DSPs can also perform tasks in parallel, whereas traditional microprocessors largely execute them serially.
1.3 Important features of DSPs

1.3.1 MACs and Multiple Execution Units

Most DSPs can perform a multiply-accumulate operation (also called a MAC) in a single instruction cycle. The MAC operation is useful in DSP algorithms that involve computing a vector dot product, such as digital filters, correlation, and Fourier transforms. The MAC operation becomes important because DSP applications typically have very high computational requirements in comparison to other types of computing tasks, since they often must execute DSP algorithms (such as FIR filtering) in real time on lengthy segments of signals sampled at 10-100 kHz or higher. To facilitate this, DSP processors often include several independent execution units that are capable of operating in parallel.
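As a rough illustration (not taken from any particular DSP library), the following C sketch computes one FIR output sample as a dot product; each loop iteration is exactly one multiply-accumulate, which is what a single-cycle MAC unit retires per cycle.

#include <stddef.h>

/* One FIR output sample: y = sum over k of h[k] * x[k].
   Each loop iteration is one multiply-accumulate (MAC);
   a DSP with a single-cycle MAC unit retires one tap per cycle. */
float fir_sample(const float *h, const float *x, size_t ntaps)
{
    float acc = 0.0f;                 /* accumulator */
    for (size_t k = 0; k < ntaps; k++)
        acc += h[k] * x[k];           /* multiply-accumulate */
    return acc;
}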
Most DSPs also support register-indirect addressing with post-increment, which is used in situations where a repetitive computation is performed on data stored sequentially in memory. Some processors also support bit-reversed addressing, which increases the speed of certain fast Fourier transform (FFT) algorithms.
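The access pattern these addressing modes accelerate can be sketched in portable C as a delay line traversed with a post-incremented index that wraps around (circular buffering, listed among the DSP features above). The buffer length and names below are illustrative; a DSP's address-generation hardware would perform the wrap-around without the explicit modulo operation.

#define BUF_LEN 64                    /* illustrative delay-line length */

static float delay[BUF_LEN];          /* circular delay line */
static unsigned head;                 /* index of the newest sample */

/* Insert a new sample, overwriting the oldest one.
   Hardware circular addressing performs the wrap-around as a side
   effect of a post-incremented pointer access. */
void delay_push(float sample)
{
    delay[head] = sample;
    head = (head + 1) % BUF_LEN;      /* wrap-around done explicitly here */
}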
Conventional DSP instruction sets keep instruction words short, and hence program memory use low, by restricting which register can be used with which operations and which operations can be combined in an instruction. Some of the latest processors use VLIW (very long instruction word) architectures, wherein multiple instructions are issued and executed per cycle. The instructions in such architectures are short and designed to perform much less work than those of conventional DSPs; the speed advantage comes from issuing several of them in parallel, generally at the cost of larger program memory.
Chapter 2 Digital Signal Processors: Trends and Developments
Even though DSP processors have undergone dramatic changes over the past couple of decades, certain features remain central to most DSP processors, as discussed in the earlier chapter. In this chapter we look at the trends and developments in the field of digital signal processors and their architectures. The general DSP architectures can be divided into four generations, as per their evolution, which are discussed in the following sections.
[Figure: comparison of DSP datapaths across generations — ALUs and adder with two accumulators versus eight accumulators]
Increases in cost and power consumption due to the additional hardware are largely offset by increased performance, allowing these processors to maintain cost-performance and energy consumption similar to those of the previous generation. Additionally, peripheral device interfaces, counters, timer circuitry and other hardware important for data acquisition are also incorporated into the DSP processor. The TMS320C20 from Texas Instruments and the Motorola DSP56002 are members of this generation of processors. The TMS320C20 combines a pipelined architecture with an auxiliary register arithmetic unit (ARAU). In addition, the on-chip RAM can be configured as either data or program memory. The ARAU can provide address manipulation as well as compute 16-bit unsigned arithmetic, thereby reducing some load on the central ALU.
Making effective use of a processor's SIMD capabilities can require significant effort on the part of the programmer. Programmers often must arrange data in memory so that SIMD processing can proceed at full speed, and they may also have to reorganize algorithms to make maximum use of the processor's resources. VLIW processors issue a fixed number of instructions, either as one large instruction or in a fixed instruction packet, and the scheduling of these instructions is performed by the compiler. For VLIW to be effective there must be sufficient parallelism in straight-line code to occupy the operation slots. Parallelism can be improved by unrolling loops to remove branch instructions and by using global scheduling techniques. If the loops cannot be sufficiently unrolled, VLIW suffers from the disadvantage of low code density.
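The following hypothetical dot-product loop, unrolled by a factor of four, shows the kind of straight-line parallelism a VLIW compiler can exploit: the four partial sums are independent, so their multiply-accumulates can be scheduled into separate operation slots. The unroll factor and names are illustrative, and n is assumed to be a multiple of four for brevity.

/* Dot product unrolled by four: the four accumulators carry no
   dependence on one another, so their MACs can fill separate
   operation slots in a VLIW issue packet. n is assumed to be a
   multiple of 4 to keep the sketch short. */
float dot_unrolled4(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}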
Superscalar processors, on the other hand, can issue a varying number of instructions per cycle, scheduled either statically by the compiler or dynamically by the processor itself. Superscalar designs therefore hold code-density advantages over VLIW, because the processor can determine during execution whether subsequent instructions in a program sequence can be issued, and can also run unscheduled programs. Attempts to eliminate the VLIW architecture's code-density and code-compatibility disadvantages incorporated features of CISC and RISC processors into DSP processors, which evolved into the hybrid architectures called explicitly parallel instruction computing (EPIC) and variable-length execution sets (VLES). The latest DSP family from Texas Instruments, the TMS320C64x, combines both VLIW and SIMD in one architecture known as VelociTI. This scheme improves the performance of VLIW by allowing execution packets (EPs) to span 256-bit fetch-packet boundaries. Each EP consists of a group of 32-bit instructions, and EPs can vary in size.
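As a rough sketch of how execution packets are delimited within a fetch packet, the code below groups eight 32-bit instruction words using a parallel bit. It assumes the convention described in TI's C6000 documentation that bit 0 of an instruction word, when set, places the following instruction in the same execution packet; the function and variable names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* Split one 256-bit fetch packet (eight 32-bit instruction words)
   into execution packets. Assumption: bit 0 of each word is the
   p-bit; when set, the *next* word executes in parallel with it. */
void print_execute_packets(const uint32_t fetch_packet[8])
{
    int ep = 0;
    printf("EP %d:", ep);
    for (int i = 0; i < 8; i++) {
        printf(" 0x%08x", (unsigned)fetch_packet[i]);
        int parallel_with_next = fetch_packet[i] & 1u;   /* p-bit */
        if (!parallel_with_next && i < 7)
            printf("\nEP %d:", ++ep);                    /* new packet */
    }
    printf("\n");
}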
In hybrid processors that combine a general-purpose processor core with a DSP core, additional DSP instructions offload the processing from the general-purpose processor core. The first processor in this generation was the SH-DSP from Hitachi. Figure 2.5 shows the simplified SH-DSP family processor architecture. The important point to note is the difference in the bus architectures: the general-purpose processor core implements the von Neumann architecture, whereas the DSP core implements the Harvard architecture. The integer MAC operations in this generation of processors are carried out by the general-purpose processor core, while the DSP core processes the more complex DSP instructions.
The I, X, Y and peripheral buses are the four internal buses through which the cores communicate with the memories and peripheral devices. The I bus comprises a 32-bit address bus and a 32-bit data bus, known as IAB and IDB respectively. This bus is used by both the CPU and the DSP core to access any memory block, i.e. X, Y or external. The X and Y buses are accessible only to the DSP core for the on-chip X and Y memories, and each has a 15-bit address bus and a 16-bit data bus. The address bus is actually padded with a zero in the LSB position, since memory accesses are aligned on word boundaries. Lastly, the peripheral bus transmits bidirectional data to the I bus via the bus state controller (BSC).
Chapter 3 Architecture and Peripherals of TMS320C67x
3.1 Introduction
In the previous chapter we had a glimpse of the general features of the different generations of processors and their architectures. The TMS320C6x are the first processors to use the VelociTI architecture, implementing VLIW. The TMS320C62x is a 16-bit fixed-point processor and the C67x is a floating-point processor with 32-bit integer support. The discussion in this chapter is focused on the TMS320C67x processor; the architecture and peripherals associated with this processor are also discussed. In general, the TMS320C6x devices execute up to eight 32-bit instructions per cycle. The C67x core consists of the C6x CPU, which has the following features:
- Program fetch unit
- Instruction dispatch unit
- Instruction decode unit
- Two data paths, each 32 bits wide and with four functional units
- Two multipliers and six ALUs among the functional units
- Thirty-two 32-bit registers
- Control registers
- Control logic
- Test, emulation, and interrupt logic
- Parallel execution of eight instructions
- 8/16/32-bit data support, providing efficient memory support for a variety of applications
- 40-bit arithmetic options that add extra precision for computationally intensive applications
All instructions except loads and stores operate on the registers. All data transfers between the register files and memory take place only through the two data-addressing units (.D1 and .D2). The CPU also has various control registers, control logic, and test, emulation and interrupt logic. Access to the control registers is provided from data path B.
Functional Unit       Description

.L units (.L1, .L2)   32/40-bit arithmetic and compare operations; leftmost 1 or 0 bit counting for 32 bits; normalization count for 32 and 40 bits; 32-bit logical operations; 32/64-bit IEEE floating-point arithmetic; floating-point/fixed-point conversions

.S units (.S1, .S2)   32-bit arithmetic operations; 32/40-bit shifts and 32-bit bit-field operations; 32-bit logical operations; branching; constant generation; register transfers to/from the control register file; 32/64-bit IEEE floating-point compare operations; 32/64-bit IEEE floating-point reciprocal and square-root reciprocal approximations

.M units (.M1, .M2)   16 x 16-bit multiplies; 32 x 32-bit multiplies; single-precision (32-bit) IEEE floating-point multiplies; double-precision (64-bit) IEEE floating-point multiplies

.D units (.D1, .D2)   32-bit add, subtract, linear and circular address calculation
Figure 3.3 shows the memory structure of the TMS320C67x CPU. The external memory interface (EMIF) connects the CPU and external memory; it is discussed in section 3.3.
Event synchronization: Each channel is initiated by a specific event. Transfers may be synchronized either by element or by frame.
Two data-ordering standards exist in byte-addressable microcontrollers:
- Little-endian ordering, in which bytes are ordered from right to left, the most significant byte having the highest address.
- Big-endian ordering, in which bytes are ordered from left to right, the most significant byte having the lowest address.
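A short, portable way to observe which ordering a given processor uses is to examine the byte stored at the lowest address of a known 32-bit value, as in the following sketch.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t word = 0x11223344u;
    uint8_t first_byte = *(uint8_t *)&word;   /* byte at the lowest address */

    /* Little-endian: least significant byte (0x44) comes first.
       Big-endian:    most significant byte (0x11) comes first. */
    printf("%s-endian\n", first_byte == 0x44 ? "little" : "big");
    return 0;
}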
The EMIF reads and writes both big- and little-endian devices. There is no distinction between ROM and asynchronous interface. For all memory types, the address is internally shifted to compensate for memory widths of less than 32 bits.
Figure 3.4 shows the basic block diagram of the McBSP unit. Data communication between the McBSP and the interfaced devices takes place via two different pins: data transmit (DX) for transmission and data receive (DR) for reception. Control information in the form of clocking and frame synchronization is communicated via CLKX, CLKR, FSX, and FSR. The McBSP communicates with the CPU and DMA controller through 32-bit-wide control registers accessible via the internal peripheral bus. The CPU or DMA controller writes the data to be transmitted to the data transmit register (DXR), which is shifted out on DX via the transmit shift register (XSR). Similarly, receive data on the DR pin is shifted into the receive shift register (RSR) and copied into the receive buffer register (RBR). RBR is then copied to the data receive register (DRR), which can be read by the CPU or the DMA controller. This allows internal data movement and external data communication to occur simultaneously.
3.3.5 Timers
The C62x/C67x has two 32-bit general-purpose timers that can be used to:
- Time events
- Count events
- Generate pulses
- Interrupt the CPU
- Send synchronization events to the DMA controller
The timer works in one of two signaling modes, depending on whether it is clocked by an internal or an external source. The timer has an input pin (TINP) and an output pin (TOUT). The TINP pin can be used as a general-purpose input, and the TOUT pin can be used as a general-purpose output. When an internal clock is used, the timer generates timing sequences to trigger peripheral or external devices, such as the DMA controller or an A/D converter. When an external clock is used, the timer can count external events and interrupt the CPU after a specified number of events.
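For example, the count loaded into a timer to obtain a desired interrupt rate is simply the timer input clock divided by that rate. The helper below is a generic sketch and deliberately avoids actual C6x timer register names.

#include <stdint.h>

/* Number of timer input clocks per desired interrupt (tick).
   Example: a 100 MHz timer clock and a 1 kHz tick rate give a
   period count of 100000. Generic sketch, no device registers. */
uint32_t timer_period_count(uint32_t timer_clock_hz, uint32_t tick_rate_hz)
{
    return timer_clock_hz / tick_rate_hz;
}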
Chapter 4 Overview of Fixed-Point Processor TMS320C55x
4.1 Introduction
The previous chapter covered a brief discussion of the architecture and peripherals of an important family of digital signal processors from Texas Instruments, the TMS320C67x. The important feature of these processors is their high performance due to floating-point data type support, but an important disadvantage is that they are less power efficient. In this chapter we take a brief overview of one more important class of digital signal processors from Texas Instruments, the TMS320C55x fixed-point processors. These processors are less efficient than the earlier ones as regards performance, but are highly power efficient as they support only integer data types. These devices are also cheaper than their floating-point counterparts. The C55x family of processors is optimized for power efficiency, low system cost, and best-in-class performance for tight power budgets [9]. The C55x core delivers twice the cycle efficiency of its predecessor, the C54x family, through a dual-MAC (multiply-accumulate) architecture with parallel instructions, additional accumulators, ALUs, and data registers. Due to its high power efficiency, the processor family finds immense application in wireless handsets, portable audio players, digital cameras, personal medical devices (e.g. hearing aids), etc.
4.2 Architectural Features

The C55x memory architecture supports up to three data reads and two data writes in a single cycle. In parallel, the DMA controller can perform up to two data transfers per cycle independently of CPU activity. The C55x CPU provides two multiply-accumulate (MAC) units, each capable of a 17-bit x 17-bit multiplication in a single cycle. A central 40-bit arithmetic/logic unit (ALU) is supported by an additional 16-bit ALU. Use of the ALUs is under instruction set control, providing the ability to optimize parallel activity and power consumption. These resources are managed in the address unit (AU) and data unit (DU) of the C55x CPU. The C55x DSP generation supports a variable-byte-width instruction set for improved code density. The instruction unit (IU) performs 32-bit program fetches from internal or external memory and queues instructions for the program unit (PU). The program unit decodes the instructions, directs tasks to AU and DU resources, and manages the fully protected pipeline. Predictive branching capability avoids pipeline flushes on execution of conditional instructions. The 5510/5510A also includes a 24-Kbyte instruction cache to minimize external memory accesses, improving data throughput and conserving system power. Table 4.1 shows the key architectural features and benefits of the C55x family of processors, and Figure 4.1 shows the simplified architecture of the C55x [9].

Table 4.1 Key architectural features and benefits of the C55x

- A 32 x 16-bit instruction buffer queue: buffers variable-length instructions and implements efficient block-repeat operations.
- Two 17-bit x 17-bit MAC units: execute dual MAC operations in a single cycle.
- One 40-bit ALU: performs high-precision arithmetic and logical operations.
- One 40-bit barrel shifter: can shift a 40-bit result up to 31 bits to the left, or 32 bits to the right.
- One 16-bit ALU: performs simpler arithmetic in parallel with the main ALU.
- Four 40-bit accumulators: hold the results of computations and reduce the required memory traffic.
- Twelve independent buses (three data read buses, two data write buses, five data address buses, one program read bus, one program address bus): provide the instructions to be processed as well as the operands for the various computational units in parallel, to take advantage of the C55x parallelism.
- User-configurable IDLE domains: improve the flexibility of low-activity power management.
Figure 4.1 Simplified architecture of the C55x CPU

The following is a brief description of the main blocks.

1) Instruction buffer unit: This unit buffers and decodes the instructions that make up the application program. In addition, it includes the decode logic that interprets the variable-length instructions of the C55x. The instruction buffer unit increases the efficiency of the DSP by maintaining a constant stream of tasks for the various computational units to perform.

2) Program flow unit: The program flow unit keeps track of the execution point within the program being executed. This unit includes the hardware used for efficient looping as well as dedicated hardware for speculative branching, conditional execution, and pipeline protection. This hardware is vital to the processing efficiency of the C55x, as it helps reduce the number of processor cycles needed for program-control changes such as branches and subroutine calls.

3) Address data flow unit: This unit provides the address pointers for data accesses during program execution. The efficient addressing modes of the C55x are made possible by the components of the address data flow unit. Dedicated hardware for managing the five data buses keeps data flowing to the various computational units. The address data flow unit further increases the instruction-level parallelism of the C55x
architecture by providing an additional general-purpose ALU for simple arithmetic operations.

4) Data computation unit: This unit is the heart of the DSP and performs the arithmetic computations on the data being processed. It includes the MACs, the main ALU, and the accumulator registers. Additional features include a barrel shifter, rounding and saturation control, and dedicated hardware for efficiently performing the Viterbi algorithm, which is commonly used in error-control coding schemes. The instruction-level parallelism provided by this unit is key to the processing efficiency of the C55x.
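The dedicated Viterbi hardware mentioned above accelerates the add-compare-select (ACS) step of the algorithm. The plain-C sketch below shows one ACS operation so the reader can see what the hardware collapses into a single step; the metric types and names are illustrative.

#include <stdint.h>

/* One add-compare-select step of the Viterbi algorithm: two
   candidate path metrics are formed and the survivor (the smaller
   metric) is kept, along with a decision bit. */
typedef struct {
    int32_t metric;    /* surviving path metric */
    int     decision;  /* which predecessor survived (0 or 1) */
} acs_result_t;

acs_result_t viterbi_acs(int32_t metric0, int32_t branch0,
                         int32_t metric1, int32_t branch1)
{
    int32_t cand0 = metric0 + branch0;   /* add */
    int32_t cand1 = metric1 + branch1;   /* add */
    acs_result_t r;
    if (cand0 <= cand1) {                /* compare */
        r.metric = cand0;                /* select survivor */
        r.decision = 0;
    } else {
        r.metric = cand1;
        r.decision = 1;
    }
    return r;
}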
4.3 Low Power Enhancements

4.3.1 Parallelism
The C55x family of processors provides higher performance and lower power dissipation through increased parallelism. This is achieved by including two MAC units, two ALUs and multiple read/write buses. These enhancements allow processing of two data streams, or one stream at twice the speed, without the need to read coefficient values twice. Memory access for a given task is thereby minimized, improving power efficiency and performance.
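The benefit of the dual-MAC datapath can be visualized with a block FIR that produces two output samples per pass, so that each coefficient is fetched once and used in two multiply-accumulates. The plain-C sketch below only models this data flow; the names are illustrative, and the input array is assumed to hold ntaps + 1 samples.

/* Two FIR outputs per pass: coefficient h[k] is read once and used
   in two multiply-accumulates, mirroring how a dual-MAC datapath
   halves coefficient memory traffic. x must hold ntaps + 1 samples. */
void fir_two_outputs(const float *h, const float *x, int ntaps,
                     float *y0, float *y1)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int k = 0; k < ntaps; k++) {
        float coef = h[k];            /* single coefficient fetch */
        acc0 += coef * x[k];          /* MAC for output sample n   */
        acc1 += coef * x[k + 1];      /* MAC for output sample n+1 */
    }
    *y0 = acc0;
    *y1 = acc1;
}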
The C55x instruction set is variable-byte-length, which means that each 32-bit word fetched actually retrieves more than one instruction. The variable-length instructions improve code density and conserve power by scaling the instruction to the amount of information needed.
4.4 Peripherals of TMS320C55x

4.4.1 Clock Generator with PLL

The DSP clock generator supplies the DSP with a clock signal based on an input clock signal connected at the CLKIN pin. Included in the clock generator is a digital phase-locked loop (PLL), which can be enabled or disabled. The clock generator can be configured to create a CPU clock signal of the desired frequency and can be operated in one of two modes. In the bypass mode, the PLL is bypassed, and the frequency of the output clock signal is equal to the frequency of the input clock signal divided by 1, 2, or 4; because the PLL is disabled, this mode can be used to save power. In the lock mode, the input frequency can be both multiplied and divided to produce the desired output frequency, and the output clock signal is phase-locked to the input clock signal. The lock mode is entered when the PLL ENABLE bit of the clock mode register is set and the phase-locking sequence is complete. During the phase-locking sequence, the clock generator is kept in the bypass mode.
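The relationship between the input clock and the generated CPU clock reduces to a simple ratio, as in the generic helper below; the multiplier and divider fields and their legal ranges are device-specific and are not reproduced here.

#include <stdint.h>

/* CPU clock produced by the clock generator:
   - bypass mode: input clock divided by 1, 2, or 4;
   - lock mode:   input clock multiplied and divided by the PLL
                  settings (device-specific ranges). */
uint32_t cpu_clock_hz(uint32_t clkin_hz, int lock_mode,
                      uint32_t multiplier, uint32_t divider)
{
    if (!lock_mode)
        return clkin_hz / divider;    /* divider is 1, 2, or 4 */
    return (uint32_t)(((uint64_t)clkin_hz * multiplier) / divider);
}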
4.4.2 DMA Controller

The DMA controller has the following important features:
- Operation that is independent of the CPU.
- Four standard ports, one for each data resource: internal dual-access RAM (DARAM), internal single-access RAM (SARAM), external memory, and peripherals.
- An auxiliary port to enable certain transfers between the host port interface (HPI) and memory.
- Six channels, which allow the DMA controller to keep track of the context of six independent block transfers among the standard ports.
- Bits for assigning each channel a low priority or a high priority.
- Event synchronization: DMA transfers in each channel can be made dependent on the occurrence of selected events.
- An interrupt for each channel: each channel can send an interrupt to the CPU on completion of certain operational events.
- Software-selectable options for updating addresses for the sources and destinations of data transfers.
- A dedicated idle domain: the DMA controller can be put into a low-power state by turning off this domain. Each multichannel buffered serial port (McBSP) on the C55x DSP can temporarily take the DMA domain out of this idle state when the McBSP needs the DMA controller.
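To tie these features together, the structure below sketches the kind of per-channel context a programmer sets up before starting a transfer. It is purely illustrative, with hypothetical field names rather than the actual C55x DMA register map.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-channel DMA configuration; the real C55x DMA is
   programmed through memory-mapped registers, not this struct. */
typedef struct {
    uint32_t src_addr;        /* source start address                 */
    uint32_t dst_addr;        /* destination start address            */
    uint32_t element_count;   /* number of elements to move           */
    bool     high_priority;   /* low/high channel priority bit        */
    int      sync_event;      /* event that triggers each transfer    */
    bool     interrupt_cpu;   /* raise an interrupt on completion     */
    int      addr_update;     /* post-increment/decrement/fixed, etc. */
} dma_channel_cfg_t;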
4.4.3 Host Port Interface (HPI)

The HPI provides a 16-bit-wide parallel port through which an external host processor (host) can directly access the memory of the DSP. The host and the DSP can exchange information via memory internal or external to the DSP and within the address reach of the HPI. The HPI uses 20-bit addresses, where each address is assigned to a 16-bit word in memory. The DMA controller handles all HPI accesses. Through the DMA controller, one of two HPI access configurations can be chosen: in one configuration, the HPI shares internal memory with the DMA channels; in the other, the HPI has exclusive access to the internal memory. The HPI cannot directly access the registers of other peripherals. If the host requires data from other peripherals, that data must be moved to memory first, either by the CPU or by activity in one of the six DMA channels. Likewise, data from the host must be transferred to memory before being transferred to other peripherals. Figure 4.2 shows the position of the HPI in the host-DSP system.
Chapter 5 Conclusion
There are many applications for which a digital signal processor becomes an ideal choice, as it provides the best possible combination of performance, power and cost. Most DSP applications can be reduced to multiplications and additions, so the MAC formed the main functional unit in early DSP processors. Designers later incorporated more features, such as pipelining, SIMD and VLIW, to deliver improved performance. There has been a drive to develop new benchmarking schemes, as improvements in processor architecture made the earlier schemes obsolete and less reliable. Power issues are gaining importance as DSP processors are incorporated into handheld, mobile and portable devices; this has led to the development of an important class of DSP processors, namely fixed-point processors. Based on the current trends in DSP processor development, we may predict that manufacturers will follow the path of general-purpose processors. With new IC manufacturing technologies available, we may expect to see more on-chip peripherals and memory; in fact, the system on chip may not be too far away.
References
[1] Steven W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, Second Edition, California Technical Publishing, 1999.
[2] Berkeley Design Technology, Inc., The Evolution of DSP Processors, http://www.bdti.com/articles/evolution.pdf, Nov. 2006.
[3] Berkeley Design Technology, Inc., Choosing a Processor: Benchmark and Beyond, http://www.bdti.com/articles/20060301_TIDC_Choosing.pdf, Nov. 2006.
[4] University of Rochester, DSP Architectures: Past, Present and Future, http://www.ece.rochester.edu/research/wcng/papers/CAN_r1.pdf, Nov. 2006.
[5] Gene Frantz, Digital Signal Processor Trends, IEEE Micro, Vol. 20, No. 6, 2000, pp. 52-59.
[6] Texas Instruments, TMS320VC5510/5510A Fixed-Point Digital Signal Processors, Data Manual, Dallas, TX, July 2006.
[7] Texas Instruments, TMS320C62x/C67x Programmer's Guide, Dallas, TX, May 1999.
[8] Texas Instruments, TMS320C6000 Peripherals Reference Guide, Dallas, TX, March 2001.
[9] Texas Instruments, TMS320C55x Technical Overview, Dallas, TX, Feb. 2000.
[10] Texas Instruments, TMS320C6713B Floating-Point Digital Signal Processor, Data Sheet, Dallas, TX, June 2006.
[11] Texas Instruments, TMS320C55x DSP Peripherals Overview Reference Guide, Dallas, TX, April 2006.