HLS Introduction Gajski Design and Test
High-Level Synthesis
An Introduction to High-Level Synthesis
Philippe Coussy, Université de Bretagne-Sud, Lab-STICC
Michael Meredith, Forte Design Systems
0740-7475/09/$26.00 c 2009 IEEE Copublished by the IEEE CS and the IEEE CASS IEEE Design & Test of Computers
Authorized licensed use limited to: KTH THE ROYAL INSTITUTE OF TECHNOLOGY. Downloaded on March 30,2010 at 08:00:46 EDT from IEEE Xplore. Restrictions apply.
July/August 2009 9
This first step traditionally includes several code optimizations such as dead-code elimination, false data dependency elimination, and constant folding and loop transformations.

The formal model produced by the compilation classically exhibits the data and control dependencies between the operations. Data dependencies can be easily represented with a data flow graph (DFG) in which every node represents an operation and the arcs between the nodes represent the input, output, and temporary variables [12]. A pure DFG models data dependencies only. In some cases, it is possible to get this model by removing the control dependencies of the initial specification from the model at compile time. To do so, loops are completely unrolled by converting to noniterative code blocks, and conditional assignments are resolved by creating multiplexed values. The resulting DFG explicitly exhibits all the intrinsic parallelism of the specification. However, the required transformations can lead to a large formal representation that requires considerable memory to be stored during synthesis. Moreover, this representation does not support loops with unbounded iteration count and nonstatic control statements such as goto. This limits the use of pure DFG representations to a few applications.

The DFG model has been extended by adding control dependencies: the control and data flow graph (CDFG) [12-15]. A CDFG is a directed graph in which the edges represent the control flow. The nodes in a CDFG are commonly referred to as basic blocks and are defined as a straight-line sequence of statements that contain no branches or internal entrance or exit points. Edges can be conditional to represent if and switch constructs. A CDFG exhibits data dependencies inside basic blocks and captures the control flow between those basic blocks.

CDFGs are more expressive than DFGs because they can represent loops with unbounded iterations. However, the parallelism is explicit only within basic blocks, and additional analysis or transformations are required to expose parallelism that might exist between basic blocks. Such transformations include for example loop unrolling, loop pipelining, loop merging, and loop tiling. These techniques, by revealing the parallelism between loops and between loop iterations, are used to optimize the latency or the throughput and the size and number of memory accesses. These transformations can be realized automatically [14], or they can be user-driven [10]. In addition to control dependencies, data dependencies between basic blocks can be added to the CDFG model as shown in the hierarchical task graph representation used in the SPARK tool [14, 16].

Allocation
Allocation defines the type and the number of hardware resources (for instance, functional units, storage, or connectivity components) needed to satisfy the design constraints. Depending on the HLS tool, some components may be added during scheduling and binding tasks. For example, the connectivity components (such as buses or point-to-point connections among components) can be added before or after binding and scheduling tasks. The components are selected from the RTL component library. It’s important to select at least one component for each operation in the specification model. The library must also include component characteristics (such as area, delay, and power) and its metrics to be used by other synthesis tasks.

Scheduling
All operations required in the specification model must be scheduled into cycles. In other words, for each operation such as a = b op c, variables b and c must be read from their sources (either storage components or functional-unit components) and brought to the input of a functional unit that can execute operation op, and the result a must be brought to its destinations (storage or functional units). Depending on the functional component to which the operation is mapped, the operation can be scheduled within one clock cycle or scheduled over several cycles. Operations can be chained (the output of an operation directly feeds an input of another operation). Operations can be scheduled to execute in parallel provided there are no data dependencies between them and there are sufficient resources available at the same time.

Binding
Each variable that carries values across cycles must be bound to a storage unit. In addition, several variables with nonoverlapping or mutually exclusive lifetimes can be bound to the same storage units. Every operation in the specification model must be bound to one of the functional units capable of executing the operation.
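The scheduling and storage-binding tasks described above can be illustrated with a minimal Python sketch. It assumes a tiny, made-up data flow graph, single-cycle operations, no chaining, and unlimited functional units. It computes an as-soon-as-possible (ASAP) schedule, derives variable lifetimes, and then shares registers among variables with nonoverlapping lifetimes using a left-edge style greedy packing. The graph, the lifetime convention, and the helper names (asap_schedule, lifetimes, bind_registers) are assumptions made for this example and do not come from any HLS tool.

```python
# Hypothetical DFG: each operation produces one variable from its operands.
# Primary inputs (a, b, c, d) are assumed to be available before cycle 1.
OPS = {
    "t1": ("a", "b"),    # t1 = a + b
    "t2": ("c", "d"),    # t2 = c * d
    "t3": ("t1", "t2"),  # t3 = t1 - t2
    "t4": ("t3", "a"),   # t4 = t3 + a
}

def asap_schedule(ops):
    """Assign each operation the earliest cycle allowed by its data dependencies."""
    cycle = {}

    def visit(name):
        if name not in ops:          # primary input: ready at cycle 0
            return 0
        if name not in cycle:
            cycle[name] = 1 + max(visit(src) for src in ops[name])
        return cycle[name]

    for name in ops:
        visit(name)
    return cycle

def lifetimes(ops, cycle):
    """A value is live from the cycle that produces it to its last consuming cycle."""
    last_use = {}
    for name, srcs in ops.items():
        for src in srcs:
            if src in ops:
                last_use[src] = max(last_use.get(src, 0), cycle[name])
    # Assumption: values that are never consumed (outputs) stay live one extra cycle.
    return {v: (cycle[v], last_use.get(v, cycle[v] + 1)) for v in ops}

def bind_registers(life):
    """Left-edge style packing: visit variables by start cycle and reuse the
    first register whose previous value has already been consumed."""
    free_at = []     # free_at[i] = cycle from which register i can be rewritten
    binding = {}
    for var, (start, end) in sorted(life.items(), key=lambda kv: kv[1]):
        for i, cycle_free in enumerate(free_at):
            if cycle_free <= start:
                free_at[i] = end
                binding[var] = i
                break
        else:
            free_at.append(end)
            binding[var] = len(free_at) - 1
    return binding
```

With this graph, t1 and t2 are scheduled in the same cycle because they have no mutual dependency, and the four variables fit in two registers because t3 and t4 can reuse the register that held t1.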
If there are several units with such capability, the binding algorithm must optimize this selection. Storage and functional-unit binding also depend on connectivity binding, which requires that each transfer from component to component be bound to a connection unit such as a bus or a multiplexer (see, for example, http://www-labsticc.univ-ubs.fr/www-gaut/). Ideally, high-level synthesis estimates the connectivity delay and area as early as possible so that later HLS steps can better optimize the design. An alternative approach is to specify the complete architecture during allocation so that initial floorplanning results can be used during binding and scheduling (see http://www.cecs.uci.edu/~nisc).

Generation
Once decisions have been made in the preceding tasks of allocation, scheduling, and binding, the goal of the RTL architecture generation step is to apply all the design decisions made and generate an RTL model of the synthesized design.

Architecture. The RTL architecture is implemented by a set of register-transfer components. It usually includes a controller and a data path (see Figure 2). A data path consists of a set of storage elements (such as registers, register files, and memories), a set of functional units (such as ALUs, multipliers, shifters, and other custom functions), and interconnect elements (such as tristate drivers, multiplexers, and buses). All these register-transfer components can be allocated in different quantities and types and connected arbitrarily through buses. Each component can take one or more clock cycles to execute, can be pipelined, and can have input or output registers. In addition, the entire data path and controller can be pipelined in several stages.

[Figure 2: controller and data path, showing control inputs, control outputs, status signals, and Bus 3]

Primary input and output ports of the design interface with the external world to transfer both data and control (used for interface protocol handshaking and synchronization). Data inputs and outputs are connected to the data path, and control inputs and outputs are connected to the controller. There are also control signals from the controller to the data path and status signals from the data path to the controller. However, some architectures may not have all the connectivity just described, and in general some of the controller functions may be implemented as part of the data path (for example, a counter plus other logic in the data path that generates control signals).

The controller is a finite state machine that orchestrates the flow of data in the data path by setting the values of control signals (also called control
word) such as the select inputs of functional units, registers, and multiplexers. The inputs to the controller may come from primary inputs (control inputs) or from the data path components such as comparators and so on (status signals). The controller consists of a state register (SR), next-state logic, and output logic. The SR stores the present state of the processor, which is equal to the present state of the finite-state machine (FSM) model describing the controller’s operation. The next-state logic computes the next state to be loaded into the SR, whereas the output logic generates the control signals and the control outputs.

The controller of a simple dedicated coprocessor is classically implemented with hardwired logic gates. On the other hand, a controller can be programmable with a read-write or read-only program memory for a specific custom processor. In this case, the program memory can store instructions or just control words, which are longer but require no decoding. In such a circumstance, SR is called a program counter, the next-state logic is an address generator, and the output logic is RAM or ROM.

Output model. According to the decisions made in the binding tasks, the description of the architecture can be written on RTL with different levels of detail (that is, without binding or with partial or complete binding). For example, a = b + c executing in state (n) can be written as Figure 3 indicates:

Without any binding:
    state (n): a = b + c;
               go to state (n + 1);

With storage binding:
    state (n): RF(1) = RF(3) + RF(4);
               go to state (n + 1);

With functional-unit binding:
    state (n): a = ALU1 (+, b, c);
               go to state (n + 1);

With storage and functional-unit binding:
    state (n): RF(1) = ALU1 (+, RF(3), RF(4));
               go to state (n + 1);

With storage, functional-unit, and connectivity binding:
    state (n): Bus1 = RF(3); Bus2 = RF(4);
               Bus3 = ALU1 (+, Bus1, Bus2);
               RF(1) = Bus3;
               go to state (n + 1);

Figure 3. RTL description written with different binding details.

When the RTL description includes only partial binding of resources, the logic synthesis step that follows HLS must perform the binding task and the associated optimization. Leaving components unbound in the generated RTL provides the RTL and physical synthesis the flexibility to optimize the bindings on the basis of updated timing estimates that take into account wire loads due to physical (floorplanning and place-and-route) considerations.

Several design flows
Allocation, scheduling, and binding can be performed simultaneously or in specific sequence depending on the strategy and algorithms used. However, they are all interrelated. If they are performed together, the synthesis process becomes too complex to be applied to realistic examples. The order in which they are realized depends on the design constraints and the tool’s objectives. For example, allocation will be performed first when scheduling tries to minimize the latency or to maximize the throughput under a resource constraint. Allocation will be determined during scheduling when scheduling tries to minimize the area under timing constraints [17]. Resource-constrained approaches are used when a designer wants to define the data path architecture [18, 19], or wants to accelerate an application by using an FPGA device with a limited amount of available resources [20]. Time-constrained approaches are used when the objective is to reduce a circuit’s area while meeting an application’s throughput requirements, as in multimedia or telecommunication applications [21].

In practice, the resource-constrained problem can be solved by using a time-constrained approach or tool (and vice versa). In this case, the designer relaxes the timing constraints until the provided circuit area is acceptable. Latency, throughput, resource count, and area are now well-known constraints and objectives. However, recent work has considered features such as clock period, memory bandwidth, memory mapping, power consumption, and so forth that make the synthesis problem even more difficult to solve [10, 22-24].

Another example of how synthesis tasks can be ordered concerns the allocation and the binding steps. The types and numbers of resources determined in the allocation task are taken as input for the binding task. In practical HLS tools, however, resources are often allocated only partially.
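To make the most detailed variant of Figure 3 concrete (storage, functional-unit, and connectivity binding), the following Python sketch steps a toy controller and data path through its states. It is only an illustration under assumptions: the register-file contents, the ALU_OPS table, the control-word fields, the second state, and the step and run helpers are invented here, and a real RTL model would be written in a hardware description language.

```python
import operator

# Hypothetical functional unit: ALU1 supports a few operations.
ALU_OPS = {"+": operator.add, "-": operator.sub}

# Hypothetical control store: one control word per state.
# Fields: registers driven onto Bus1 and Bus2, the ALU operation,
# the destination register loaded from Bus3, and the next state.
CONTROL_WORDS = {
    0: {"bus1_src": 3, "bus2_src": 4, "alu_op": "+", "dst": 1, "next": 1},
    1: {"bus1_src": 1, "bus2_src": 3, "alu_op": "-", "dst": 2, "next": 2},
}

def step(state, rf):
    """Execute one state: drive the buses, run ALU1, write back, pick next state."""
    word = CONTROL_WORDS[state]
    bus1 = rf[word["bus1_src"]]                   # Bus1 = RF(src1)
    bus2 = rf[word["bus2_src"]]                   # Bus2 = RF(src2)
    bus3 = ALU_OPS[word["alu_op"]](bus1, bus2)    # Bus3 = ALU1(op, Bus1, Bus2)
    new_rf = list(rf)
    new_rf[word["dst"]] = bus3                    # RF(dst) = Bus3
    return word["next"], new_rf                   # go to the next state

def run(rf, state=0):
    """Run until the controller reaches a state that has no control word."""
    while state in CONTROL_WORDS:
        state, rf = step(state, rf)
    return state, rf
```

State 0 mirrors the figure's RF(1) = ALU1 (+, RF(3), RF(4)). The state variable plays the role of the state register, the "next" field stands in for the next-state logic, and the remaining fields are the control word produced by the output logic.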
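The resource-constrained flow described above (allocate first, then schedule for minimum latency) can be sketched with a simple list scheduler. This is a hedged illustration: the dependency graph, the single-cycle operations, the single class of functional unit, and the list_schedule and latency helpers are assumptions made for the example. List scheduling is one common heuristic and is not presented here as the algorithm of any particular tool.

```python
# Hypothetical acyclic DFG: operation -> operations it depends on.
# Every operation is assumed to take one cycle on the same kind of unit.
DEPS = {
    "m1": [], "m2": [], "m3": [], "m4": [],
    "a1": ["m1", "m2"],
    "a2": ["m3", "m4"],
    "a3": ["a1", "a2"],
}

def list_schedule(deps, num_units):
    """Greedy list scheduling: each cycle, start up to num_units ready operations.
    Assumes deps is acyclic; priority is simply the order of the dictionary."""
    if num_units < 1:
        raise ValueError("at least one functional unit must be allocated")
    done_at = {}      # operation -> cycle in which it executes
    cycle = 0
    while len(done_at) < len(deps):
        cycle += 1
        # Ready: not yet scheduled, and every predecessor ran in an earlier cycle.
        ready = [op for op in deps
                 if op not in done_at
                 and all(done_at.get(p, cycle) < cycle for p in deps[op])]
        for op in ready[:num_units]:
            done_at[op] = cycle
    return done_at

def latency(schedule):
    """Number of cycles used by the schedule."""
    return max(schedule.values())
```

Calling list_schedule(DEPS, n) for increasing n shows the trade-off the text describes: allocating more units shortens the schedule until the data dependencies become the limiting factor.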
Additional resources are allocated during the binding step according to the design constraints and objectives. These additional resources can be of any type: functional units, multiplexers, or registers. Hence, functional units can be first allocated to schedule and bind the operations. Both registers and multiplexers can then be allocated (created) during the variable-to-register binding step.

The synthesis tasks can be performed manually or automatically. Obviously, many strategies are possible, as exemplified by available EDA tools: these tools might perform each of the aforementioned tasks only partially in automatic fashion and leave the rest to the designer [10].

Industrial tools
Here, we take a brief look first at commercially available HLS tools for specifying the input description, and then we more carefully examine two state-of-the-art industrial HLS tools.

Input languages and tools
The input specification must capture the intended design functionality at a high abstraction level. Rather than coding low-level implementation details, the designer uses the automation provided by the HLS tool to guide the design decisions, which heavily depend on performance goals and the target technology. For instance, if there is parallelism in an HLS specification, it is extracted using dataflow analysis in accordance with the target technology’s capabilities and performance goals (on a slow technology, more parallelism is required to achieve the performance goal).

The latest generation of HLS tools, in most cases, uses either ANSI C, C++, or languages such as SystemC that are based on C or C++ and add hardware-specific constructs such as timing, hardware hierarchy, interface ports, signals, explicit specification of parallelism, and others. Some HLS tools that support C or C++ or derivatives are Mentor’s Catapult C (C, C++), Forte’s Cynthesizer (SystemC), NEC’s CyberWorkbench (C with hardware extensions), Synfora’s PICO (C), and Cadence’s C-to-Silicon (SystemC). Other languages used for high-level modeling are MathWorks’ Matlab and Simulink. Tools that support Matlab or Simulink are Xilinx’ AccelDSP (Matlab) and Synopsys’ Simplify-DSP (Simulink); both use an IP approach to generate the hardware implementation. An alternative approach for generating an implementation is to use a configurable processor approach, which is the approach of both CoWare’s Processor Designer and Tensilica’s Xtensa.

Other languages, not based on C or C++ but which are tailored to specific domains, have also been proposed. Esterel is a synchronous language for the development of reactive systems. Esterel Studio can generate either software or hardware. Bluespec’s BSV is a language targeted for specifying concurrency with rule-based atomic transactions.

The input specification to HLS tools must be written with some hardware implementation in mind to get the best results. For example, video line buffering must be coded as part of the algorithm to generate high-throughput designs [25]. Ideally, such code restructuring still preserves much of the abstraction level. It is beyond the scope of this article to provide an overview of the specific writing styles required in the specification for various tools. As tools’ capabilities evolve further, fewer modifications will be required to convert an algorithm written for software into an algorithm that is suitable as input to HLS. An analogous evolution has occurred with RTL synthesis (for example, with the optimization of arithmetic expressions).

Catapult C synthesis
Catapult takes an algorithm written in ANSI C++ and a set of user directives as input and generates an RTL that is optimized for the specified target technology. The input can be compiled by any standard compiler compliant with C++; pragmas and directives do not change the functional behavior of the input specification.

Synthesis input. The input specification is sequential and does not include any notion of time or explicit parallelism: it does not hardcode the interface or the design’s architecture. Keeping the input abstract is essential because any hard-coding of interface and architectural details significantly limits the range of designs that HLS could generate. Required directives specify the target technology (component library) and the clock period. Optional directives control interface synthesis, array-to-memory mappings, amount of parallelism to uncover by loop unrolling, loop pipelining, hardware hierarchy and block communication, scheduling (latency or cycle) constraints, allocation directives to constrain the number or type of hardware resources, and so on.

Native C++ integer types as well as C++ bit-accurate integer and fixed-point data types are supported for synthesis.
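Bit-accurate types matter because the hardware only has the declared number of bits. As a rough illustration (written in Python rather than C++, and not modeling the API of any particular data types library), the sketch below shows a signed integer of a given width that wraps on overflow and a fixed-point quantization that truncates extra fractional bits. The helper names wrap_signed and to_fixed, and the choice of wrap and truncate behavior, are assumptions made for this example.

```python
import math

def wrap_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement integer."""
    if width < 1:
        raise ValueError("width must be positive")
    value &= (1 << width) - 1            # keep only the low `width` bits
    sign_bit = 1 << (width - 1)
    return value - (1 << width) if value & sign_bit else value

def to_fixed(x, width, int_bits):
    """Quantize x to signed fixed point with `width` total bits, of which
    `int_bits` (including the sign bit) are integer bits.
    Assumed behavior: truncate toward minus infinity, wrap on overflow."""
    if not 0 < int_bits <= width:
        raise ValueError("int_bits must be in the range 1..width")
    frac_bits = width - int_bits
    raw = math.floor(x * 2 ** frac_bits)  # drop fractional bits that do not fit
    raw = wrap_signed(raw, width)         # drop integer bits that do not fit
    return raw / 2 ** frac_bits
```

For instance, to_fixed(1.3, 8, 4) keeps four fractional bits and returns 1.25, while to_fixed(9.0, 8, 4) overflows the four integer bits and wraps to -7.0. A software model that used plain floating point would hide both effects.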
The generated RTL faithfully reflects the bit-accurate behavior specified in the source. Publicly available integer and fixed-point data types provided by Mentor Graphics’ (http://www.mentor.com/esl) Algorithmic C data types library (ANSI C++ header files) as well as the synthesizable subset of the SystemC integer and fixed-point data types are supported for synthesis. The support of C++ language constructs meets and exceeds the requirements stated in the most current draft of the Synthesis Subset OSCI standard (http://www.systemc.org).

Generating hardware from ANSI C or C++. One of the advantages of keeping the source untimed is that a very wide range of interfaces and architectures can be generated without changing the input source specification. Another advantage of an untimed source is that it avoids errors resulting from manual coding of architectural details. The interface and the architecture of the generated hardware are all under the control of the designer via synthesis directives. Catapult’s GUI provides an interactive environment with both the control and the analysis tools to enable efficient exploration of the design space.

Interface synthesis makes it possible to map the transfer of data that is implied by passing of C++ function arguments to various hardware interfaces such as wires, registers, memories, buses, or more complex user-defined interfaces. All the necessary signals and timing constraints are generated during the synthesis process so that the generated RTL conforms and is optimized to the desired interfaces.

For example, an array in the C source might result in a hardware interface that streams the data or transfers the data through a memory, a register bank, and so forth. Selection of a streaming interface implies that the environment provides data in sequential index order whereas selection of a memory interface implies that the environment provides data by writing the array into memory. The granularity of the transfer size (for example, number of array elements provided as a stream transfer or as a memory word) is also specifiable as a user directive.

Hierarchy (block-level concurrency) can be specified by user directives. For example, a C function can be synthesized as a separate hardware block instead of being inlined. Hierarchy can also be specified in a style that corresponds to the Kahn process network computation model (still in sequential ANSI C++). The blocks are connected with the appropriate communication channels, and the required handshaking interfaces are generated to guarantee the correct execution of the specified behavior. The blocks can be synthesized to be driven by different clocks. The clock-domain-crossing logic is generated by Catapult. Communication is optimized according to user directives to enable maximal block-level concurrency of execution of the blocks using FIFO buffers (for streamed data) and ping-pong memories to enable block-level pipelining and thus improve the throughput of the overall design.

All the HLS steps consider accurate component area and timing numbers for the target ASIC or FPGA technology for the designer’s RTL synthesis tool of choice. Accurate timing and area numbers for components are essential to generate RTL that meets timing and is optimized for area. During synthesis, Catapult queries the component library so that it can allocate various combinational or pipelining components with different performance and area trade-offs. The queried component library is precharacterized for the target technology and the target RTL synthesis tool. Component libraries can also be built by the designer to incorporate specific characterization for memories, buses, I/O interfaces, or other units of functionality such as pipelined components.

Verification and power estimation flows. The synthesis process generates the required verification infrastructure in SystemC so that the input stimuli from the original C++ testbench can be applied to the generated RTL to verify its functionality against the (golden) source C++ specification using simulation. The synthesis process also generates the required verification infrastructure (wrappers and scripts) to enable the use of sequential equivalence checking between the source C++ specification and the generated RTL. Automatic generation of the verification infrastructure is essential because the interface of the generated hardware heavily depends on interface synthesis.

Power estimation flows with third-party tools let the designer gather switching activity for the design and obtain RTL and gate-level power estimates. By exploring various architectures, the designer can rapidly converge to a low-power design that meets the required performance and area goals.

Catapult has been successfully used in more than 200 ASIC tape-outs and several hundred FPGA designs. Typical applications include computation-intensive
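The block-level hierarchy described earlier, in which blocks exchange streamed data through FIFO buffers, can be mimicked in software. The sketch below is an assumption-laden illustration in Python, not Catapult input: two hypothetical blocks, scale and accumulate, run as threads and communicate only through a bounded queue. Reads block on an empty FIFO, as in a Kahn process network, and, because the FIFO here is bounded, writes block on a full one. The block names, the END marker, and the run_pipeline helper are invented for the example.

```python
import queue
import threading

END = object()   # hypothetical end-of-stream marker

def scale(samples, out_fifo, gain):
    """Producer block: streams gain * sample in index order."""
    for s in samples:
        out_fifo.put(gain * s)       # blocks while the FIFO is full
    out_fifo.put(END)

def accumulate(in_fifo, result):
    """Consumer block: emits the running sum of everything it reads."""
    total = 0
    while True:
        item = in_fifo.get()         # blocks while the FIFO is empty
        if item is END:
            break
        total += item
        result.append(total)

def run_pipeline(samples, gain=2, fifo_depth=2):
    """Connect the two blocks with one FIFO and run them concurrently."""
    fifo = queue.Queue(maxsize=fifo_depth)
    result = []
    blocks = [
        threading.Thread(target=scale, args=(samples, fifo, gain)),
        threading.Thread(target=accumulate, args=(fifo, result)),
    ]
    for block in blocks:
        block.start()
    for block in blocks:
        block.join()
    return result
```

Because each block touches only its own state and the FIFO, the result does not depend on how the two threads interleave or on the FIFO depth. The depth only affects how far the producer can run ahead of the consumer.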
option of using the gates for implementation or of using RTL representations of the data path components for logic synthesis.

Synthesis output. Cynthesizer produces RTL Verilog for use with logic synthesis tools provided by EDA vendors for ASIC and FPGA technology.

The RTL consists of an FSM and a set of explicitly instantiated data path components such as multipliers, adders, and multiplexers. More-complex custom data path components that implement arithmetic expressions used in the design are automatically created, and the designer can specify sections of C++ code to be implemented as data path components. The multiplexers directing the dataflow through the data path components and registers are controlled by a conventional FSM consisting of a binary-encoded or one-hot state register and next-state logic implemented in Verilog always blocks.

Strengths of the SystemC flow. SystemC is a good fit for HLS because it supports a high level of abstraction and can directly describe hardware. It combines the high-level and object-oriented features of C++ with hardware constructs that let a designer directly represent structural hierarchy, signals, ports, clock edges, and so on. This combination of characteristics provides a very efficient design and verification flow in which behavioral models of multiple modules can be concurrently simulated to verify their combined algorithm and interface behavior. Most functional errors can be found and eliminated at this high-speed behavioral level, which eliminates the need for time-consuming RTL simulation to validate interfaces and system-level operation, and substantially reduces the overall number of slow RTL simulations required. Once the behavior is functionally correct, the models that were simulated are used directly for synthesis, eliminating opportunities for mistakes or misunderstanding.

ONE INDICATOR OF HOW FAR HLS has come since its early days in the late 1970s is the preponderance of different HLS tools that are available, both academically and commercially. However, many features must yet be added before these tools become as widely adopted as layout and logic synthesis tools. Moreover, many specific embedded-system applications need particular attention. As HLS moves from block-level to subsystem to full-system design, the interaction of hardware and software becomes both a challenge and an opportunity for further automation. Additional research is vital, because we are still a long way from HLS that automatically searches the design space without the designer’s guidance and delivers optimal results for different design constraints and technologies.

References
1. A. Sangiovanni-Vincentelli, “The Tides of EDA,” IEEE Design & Test, vol. 20, no. 6, 2003, pp. 59-75.
2. D. MacMillen et al., “An Industrial View of Electronic Design Automation,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, 2000, pp. 1428-1448.
3. D.W. Knapp, Behavioral Synthesis: Digital System Design Using the Synopsys Behavioral Compiler, Prentice Hall, 1996.
4. J.P. Elliot, Understanding Behavioral Synthesis: A Practical Guide to High-Level Design, Kluwer Academic Publishers, 1999.
5. W. Wolf, “A Decade of Hardware/Software Co-design,” Computer, vol. 36, no. 4, 2003, pp. 38-43.
6. H. Chang et al., Surviving the SoC Revolution: A Guide to Platform-Based Design, Kluwer Academic Publishers, 1999.
7. M. Keating and P. Bricaud, Reuse Methodology Manual for System-on-a-Chip Designs, Kluwer Academic Publishers, 1998.
8. D. Gajski et al., SpecC: Specification Language and Methodology, Kluwer Academic Publishers, 2000.
9. B. Bailey, G. Martin, and A. Piziali, ESL Design and Verification: A Prescription for Electronic System Level Methodology, Morgan Kaufmann Publishers, 2007.
10. P. Coussy and A. Morawiec, eds., High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2008.
11. D. Ku and G. De Micheli, High Level Synthesis of ASICs under Timing and Synchronization Constraints, Kluwer Academic Publishers, 1992.
12. D. Gajski et al., High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
13. R.A. Walker and R. Camposano, eds., A Survey of High-Level Synthesis Systems, Springer, 1991.
14. S. Gupta et al., SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits, Kluwer Academic Publishers, 2004.
15. A. Orailoglu and D.D. Gajski, “Flow Graph Representation,” Proc. 23rd Design Automation Conf. (DAC 86), IEEE Press, pp. 503-509.
16. M. Girkar and C.D. Polychronopoulos, “Automatic Extraction of Functional Parallelism from Ordinary Programs,” IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, 1992, pp. 166-178.
17. P. Paulin and J.P. Knight, “Algorithms for High-Level Synthesis,” IEEE Design and Test, vol. 6, no. 6, 1989, pp. 18-31.
18. M. Reshadi and D. Gajski, “A Cycle-Accurate Compilation Algorithm for Custom Pipelined Datapaths,” Proc. Int’l Symp. Hardware/Software Codesign and System Synthesis (CODES+ISSS 05), ACM Press, 2005, pp. 21-26.
19. I. Auge and F. Petrot, “User Guided High Level Synthesis,” High-Level Synthesis: From Algorithm to Digital Circuit, P. Coussy and A. Morawiec, eds., Springer, 2008, pp. 171-196.
20. D. Chen, J. Cong, and P. Pan, FPGA Design Automation: A Survey, Now Publishers, 2006.
21. W. Geurts et al., Accelerator Data-Path Synthesis for High-Throughput Signal Processing Applications, Kluwer Academic Publishers, 1996.
22. L. Zhong and N.K. Jha, “Interconnect-Aware Low-Power High-Level Synthesis,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 3, 2005, pp. 336-351.
23. M. Kudlur, K. Fan, and S. Mahlke, “Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines,” Proc. Int’l Conf. Hardware/Software Codesign and System Synthesis (CODES+ISSS 06), ACM Press, pp. 270-275.
24. M.C. Molina et al., “Area Optimization of Multi-cycle Operators in High-Level Synthesis,” Proc. Design, Automation and Test in Europe Conf. (DATE 07), IEEE CS Press, 2007, pp. 1-6.
25. G. Stitt, F. Vahid, and W. Najjar, “A Code Refinement Methodology for Performance-Improved Synthesis from C,” Proc. IEEE/ACM Int’l Conf. Computer-Aided Design (ICCAD 06), ACM Press, 2006, pp. 716-723.

Philippe Coussy is an associate professor in the Lab-STICC at the Université de Bretagne-Sud, France, where he leads the high-level synthesis (HLS) research group. His research interests include system-level design and methodologies, HLS, CAD for SoCs, embedded systems, and low-power design for FPGAs. He has a PhD in electrical and computer engineering from the Université de Bretagne-Sud. He is a member of the IEEE and the ACM.

Daniel D. Gajski holds the Henry Samueli Endowed Chair in Computer System Design at the University of California, Irvine, where he is also the director of the Center for Embedded Computer Systems. His research interests include embedded systems and information technology, design methodologies and e-design environments, specification languages and CAD software, and the science of design. He has a PhD in computer and information sciences from the University of Pennsylvania. He is a life member and Fellow of the IEEE.

Michael Meredith is the vice president of technical marketing at Forte Design Systems and serves as president of the Open SystemC Initiative. His research interests include development of printed-circuit board layouts, schematic capture, timing-diagram entry, verification, and high-level synthesis tools.

Andres Takach is chief scientist at Mentor Graphics. His research interests include high-level synthesis, low-power design, and hardware-software codesign. He has a PhD from Princeton University in Electrical and Computer Engineering. He is chair of the OSCI Synthesis Working Group.

Direct questions and comments about this article to Philippe Coussy, Lab-STICC, Centre de recherche, rue de saint maude, BP 92116, Lorient 56321, France; philippe.coussy@univ-ubs.fr.

For further information about this or any other computing topic, please visit our Digital Library at http://www.computer.org/csdl.