Tics On Embedded Systems
INTRODUCTION
Bioinformatics applications are an increasingly important class of workloads. Their characteristics and their implications for the underlying hardware design, however, are largely unknown. Currently, biological data processing relies almost exclusively on high-end systems equipped with expensive, general-purpose processors. The next generation of bioinformatics requires more flexible and cost-effective computing platforms to serve its rapidly growing market. Programmable, application-specific embedded systems appear to be an attractive solution in terms of ease of programming, design cost, power, portability and time-to-market. The first step towards such systems is to characterize bioinformatics applications on the target architecture. Such studies help in understanding the design issues and trade-offs in specializing hardware and software systems to meet the needs of the bioinformatics market.
This paper evaluates several representative bioinformatics tools on VLIW-based embedded systems. We investigate the basic characteristics of the benchmarks, the impact of function units, the efficiency of VLIW execution, cache behavior and the impact of compiler optimizations. The architectural implications observed in this study can be applied to design optimizations. To the best of our knowledge, this is one of the first such studies.
EXPERIMENTAL METHODOLOGY
Simulation Framework
Our experimental framework is based on the Trimaran system, which is designed for research in instruction-level parallelism [10]. Trimaran uses the IMPACT compiler [11] as its front end. The IMPACT compiler performs C parsing, code profiling, block formation and traditional optimizations [12]. It also exploits support for speculation and predicated execution using superblock [13] and hyperblock [14] optimizations. The Trimaran back end, ELCOR, performs instruction selection, register allocation and machine-dependent code optimizations for the specified machine architecture. The Trimaran simulator generator produces a simulator targeted at a parameterized VLIW microprocessor architecture.
Machine Configuration
The simulated machine architecture comprises a VLIW microprocessor core and a two-level memory hierarchy. The VLIW processor relies on the compiler to exploit instruction-level parallelism, achieving higher instruction throughput with minimal hardware. The core has 64 general-purpose registers, 64 floating-point registers, 64 predicate registers, 64 control registers and 16 branch registers. Unlike a superscalar architecture, it has no support for register renaming. Predicate registers are special 1-bit registers that hold a true or false value. Comparison operations write their results to predicate registers. The core can execute up to eight operations per cycle, one on each of its eight functional units: 4 integer units, 2 floating-point units, 1 memory unit and 1 branch unit. The memory unit performs load/store operations. The branch unit performs branch, call and comparison operations. The level-one (L1) memory is organized as separate instruction and data caches. The processor's level-two (L2) cache is unified.
Machine configuration
VLIW Core
  Issue Width: 8
  General Purpose Registers: 64, 32-bit
  Floating-Point Registers: 64, 64-bit
  Predicate Registers: 64, 1-bit (used to store the Boolean values of instructions using predication)
  Control Registers: 64, 32-bit (containing the internal state of the processor)
  Branch Registers: 16, 64-bit (containing target address and static predictions of branches)
  Integer Units: 4, most integer arithmetic operations 1 cycle, integer multiply 3 cycles, integer divide 8 cycles
  Floating-Point Units: 2, floating point multiply 3 cycles, floating point divide 8 cycles
  Memory Units: 1
  Branch Units: 1, 1 cycle latency
Memory Hierarchy
  L1 Instruction Cache: 8KB, direct map, 32 byte/line, cache hit 1 cycle
  L1 Data Cache: 8KB, 2-way, 32 byte/line, cache hit 1 cycle
  L2 Unified Cache: 64KB, 4-way, 64 byte/line, L2 hit 5 cycles, 35 cycles external memory latency
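As a rough illustration of how the function unit mix in the machine configuration limits what the compiler can schedule, the following sketch computes a resource-only lower bound on schedule length. It is not part of Trimaran: the function `min_cycles`, the operation labels, and the example block are hypothetical, and the model assumes independent operations with each unit accepting one operation per cycle (dependences and multi-cycle latencies are ignored).

```python
from collections import Counter
from math import ceil

# Function unit counts from the machine configuration:
# 4 integer, 2 floating-point, 1 memory, 1 branch.
UNITS = {"int": 4, "fp": 2, "mem": 1, "branch": 1}


def min_cycles(op_kinds, units=UNITS):
    """Resource-only lower bound on schedule length (hypothetical helper).

    Assumes every operation is independent and every unit accepts one
    operation per cycle, so the busiest unit type sets the bound.
    """
    counts = Counter(op_kinds)
    return max((ceil(n / units[kind]) for kind, n in counts.items()), default=0)


# Hypothetical block: 8 integer operations, 4 loads/stores, 1 branch.
block = ["int"] * 8 + ["mem"] * 4 + ["branch"]

baseline = min_cycles(block)                        # single memory unit
more_mem = min_cycles(block, {**UNITS, "mem": 2})   # assumed second memory unit
```

In this made-up block the single memory unit is the bottleneck: the four memory operations need four cycles even though the integer operations fit in two, and an assumed second memory unit would halve the bound.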
RESULTS
Impact of Function Units
On VLIW processors, the number and type of function units determine the resources available to the compiler for scheduling operations. Having several instances of a given function unit allows the compiler to schedule several operations on that unit type in the same cycle. This subsection investigates the impact of the integer and memory units on benchmark performance.
Cache Performance
Direct-mapped instruction caches yield high miss rates on nearly all of the studied benchmarks. Conflict misses caused by the lack of associativity dominate the cache misses. The instruction cache miss rate drops significantly as associativity increases: a 4-way set-associative, 8KB L1 I-cache shows a miss rate of less than 1%. This indicates that bioinformatics applications usually have small code footprints.
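The effect of associativity on conflict misses can be reproduced with a small simulation. The sketch below is an assumption-laden toy, not the simulator used in this study: `miss_rate` is a hypothetical function modeling an 8KB cache with 32-byte lines and LRU replacement (the replacement policy is assumed), and the address trace is invented so that two hot code lines map to the same set of a direct-mapped cache.

```python
from collections import OrderedDict


def miss_rate(addresses, size_bytes=8 * 1024, line_bytes=32, ways=1):
    """Miss rate of a set-associative cache with LRU replacement (toy model)."""
    if not addresses:
        return 0.0
    num_sets = size_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict the least recently used line
            entries[tag] = True
    return misses / len(addresses)


# Invented trace: two code lines exactly 8KB apart, fetched alternately.
trace = [0x0000, 0x2000] * 1000

direct = miss_rate(trace, ways=1)    # the two lines evict each other every time
four_way = miss_rate(trace, ways=4)  # both lines fit in one set
```

With this trace every access misses in the direct-mapped cache, while the 4-way cache of the same capacity misses only on the first touch of each line.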
A small, highly associative instruction cache can attain good performance on bioinformatics applications.
Compiler Optimizations
The IMPACT compiler provides a set of classic optimizations such as constant propagation, copy propagation, constant folding, and strength reduction. These optimizations do not require any additional microarchitectural support. The IMPACT compiler offers five optimization levels: level 0 applies no optimization, level 1 applies local optimizations, level 2 adds global optimizations, level 3 adds jump optimizations, and level 4 adds loop optimizations.
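To make the classic optimizations named above concrete, the following hand-written before/after pair shows what constant propagation, copy propagation, constant folding and strength reduction do to a small loop. This is only an illustration in Python: the functions `scale_naive` and `scale_optimized` are hypothetical, and the transformation was applied by hand rather than produced by IMPACT.

```python
def scale_naive(values):
    """Unoptimized form, written the way a front end might emit it."""
    out = []
    for v in values:
        width = 4            # constant assignment
        stride = width       # copy of a constant
        factor = stride * 2  # constant expression, foldable to 8
        out.append(v * factor)
    return out


def scale_optimized(values):
    """Hand-optimized form.

    Constant and copy propagation replace `stride` and `width` with 4,
    constant folding turns 4 * 2 into 8, and strength reduction replaces
    the multiply by 8 with a left shift by 3 (valid for integers).
    """
    return [v << 3 for v in values]


sample = list(range(-5, 6))
same = scale_naive(sample) == scale_optimized(sample)
```

Both versions compute the same result for integer inputs; the optimized one simply executes fewer and cheaper operations, which is why such transformations need no extra hardware support.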
The hyperblock optimization results in speedups ranging from 1.1X to 2.0X. On average, superblock and hyperblock optimizations improve performance by factors of 1.3X and 1.5X, respectively. These speedups present an opportunity for improving the efficiency of VLIW execution on bioinformatics applications.
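Hyperblock formation relies on predicated execution: a branch is removed by computing a predicate and guarding the operations of both paths with it. The sketch below models this idea in Python for a single if/else. It is a conceptual illustration only; the function names are hypothetical, and a real hyperblock compiler works on machine operations and profile data, not on source-level code like this.

```python
def abs_diff_branching(a, b):
    """Original control flow: one of two paths is taken at run time."""
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def abs_diff_predicated(a, b):
    """If-converted form: straight-line code guarded by predicates.

    The comparison writes a 1-bit predicate and its complement. Each
    guarded operation commits its result only when its predicate is
    true; otherwise the old value of r is kept.
    """
    p_true = a > b
    p_false = not p_true
    r = 0
    r = (a - b) if p_true else r   # r = a - b   <p_true>
    r = (b - a) if p_false else r  # r = b - a   <p_false>
    return r


pairs = [(3, 7), (7, 3), (5, 5), (-2, 4)]
agree = all(abs_diff_branching(a, b) == abs_diff_predicated(a, b) for a, b in pairs)
```

Because the predicated form contains no branch, both guarded operations can be scheduled in the same cycle on different integer units, which is the kind of scheduling freedom a wide VLIW core can exploit.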
CONCLUSION
Classic compiler optimizations provide a 1.0X to 1.15X performance improvement. More aggressive compiler optimizations such as superblock and hyperblock optimizations provide an additional 1.1X to 2.0X improvement, suggesting that they are important for a VLIW machine to sustain the desired performance on bioinformatics applications.