High-Level Synthesis

An Introduction to
High-Level Synthesis
Philippe Coussy, Université de Bretagne-Sud, Lab-STICC
Michael Meredith, Forte Design Systems
Daniel D. Gajski, University of California, Irvine
Andres Takach, Mentor Graphics

Editor's note:
High-level synthesis raises the design abstraction level and allows rapid generation of optimized RTL hardware for performance, area, and power requirements. This article gives an overview of state-of-the-art HLS techniques and tools.
Tim Cheng, Editor in Chief

THE GROWING CAPABILITIES of silicon technology and the increasing complexity of applications in recent decades have forced design methodologies and tools to move to higher abstraction levels. Raising the abstraction levels and accelerating automation of both the synthesis and the verification processes have for this reason always been key factors in the evolution of the design process, which in turn has allowed designers to explore the design space efficiently and rapidly.

In the software domain, for example, machine code (binary sequence) was once the only language that could be used to program a computer. In the 1950s, the concept of assembly language (and assembler) was introduced. Finally, high-level languages (HLLs) and associated compilation techniques were developed to improve software productivity. HLLs, which are platform independent, follow the rules of human language with a grammar, a syntax, and a semantics. They thus provide flexibility and portability by hiding details of the computer architecture. Assembly language is today used only in limited scenarios, primarily to optimize the critical parts of a program when there is an absolute need for speed, code compactness, or both. However, with the growing complexity of both modern system architectures and software applications, using HLLs and compilers clearly generates better overall results. No one today would even think of programming a complex software application solely by using an assembly language.

In the hardware domain, specification languages and design methodologies have evolved similarly [1, 2]. Until the late 1960s, ICs were designed, optimized, and laid out by hand. Simulation at the gate level appeared in the early 1970s, and cycle-based simulation became available by 1979. Techniques introduced during the 1980s included place-and-route, schematic circuit capture, formal verification, and static timing analysis. Hardware description languages (HDLs), such as Verilog (1986) and VHDL (1987), have enabled wide adoption of simulation tools. These HDLs have also served as inputs to logic synthesis tools, leading to the definition of their synthesizable subsets. During the 1990s, the first generation of commercial high-level synthesis (HLS) tools became available [3, 4]. Around the same time, research interest in hardware-software codesign (including estimation, exploration, partitioning, interfacing, communication, synthesis, and cosimulation) gained momentum [5]. The concept of IP core and platform-based design started to emerge [6-8]. In the 2000s, there has been a shift to an electronic system-level (ESL) paradigm that facilitates exploration, synthesis, and verification of complex SoCs [9]. This includes the introduction of languages with system-level abstractions, such as SystemC (http://www.systemc.org), SpecC (http://www.cecs.uci.edu/~specc), or SystemVerilog (IEEE 1800-2005; http://standards.ieee.org), and the introduction of transaction-level modeling (TLM). The ESL paradigm shift caused by the rise of system complexities, a multitude of components in a product (hundreds of processors in a car, for instance),

0740-7475/09/$26.00 © 2009 IEEE. Copublished by the IEEE CS and the IEEE CASS. IEEE Design & Test of Computers.


a multitude of versions of a chip (for better product differentiation), and an interdependency of component suppliers forced the market to focus on hardware and software productivity, dependability, interoperability, and reusability. In this context, processor customization and HLS have become necessary paths to efficient ESL design [10]. The new HLS flows, in addition to reducing the time for creating the hardware, also help reduce the time to verify it as well as facilitate other flows such as power analysis.

Raising the hardware design's abstraction level is essential to evaluating system-level exploration for architectural decisions such as hardware and software design, synthesis and verification, memory organization, and power management. HLS also enables reuse of the same high-level specification, targeted to accommodate a wide range of design constraints and ASIC or FPGA technologies.

Typically, a designer begins the specification of an application that is to be implemented as a custom processor, a dedicated coprocessor, or any other custom hardware unit (such as an interrupt controller, bridge, arbiter, interface unit, or special function unit) with a high-level description that captures the desired functionality, using an HLL. This first step thus involves writing a functional specification (an untimed description) in which a function consumes all its input data simultaneously, performs all computations without any delay, and provides all its output data simultaneously. At this abstraction level, variables (structure and array) and data types (typically floating point and integer) are related neither to the hardware design domain (bits, bit vectors) nor to the embedded software. Realistic hardware implementation thus requires conversion of floating-point and integer data types into bit-accurate data types of specific length (not a standard byte or word size, as in software) with acceptable computation accuracy, while generating an optimized hardware architecture starting from this bit-accurate specification.

HLS tools transform an untimed (or partially timed) high-level specification into a fully timed implementation [10-13]. They automatically or semiautomatically generate a custom architecture to efficiently implement the specification. In addition to the memory banks and the communication interfaces, the generated architecture is described at the RTL and contains a data path (registers, multiplexers, functional units, and buses) and a controller, as required by the given specification and the design constraints.

Key concepts
Starting from the high-level description of an application, an RTL component library, and specific design constraints, an HLS tool executes the following tasks (see Figure 1):
1. compiles the specification,
2. allocates hardware resources (functional units, storage components, buses, and so on),
3. schedules the operations to clock cycles,
4. binds the operations to functional units,
5. binds variables to storage elements,
6. binds transfers to buses, and
7. generates the RTL architecture.

Figure 1. High-level synthesis (HLS) design steps. (Blocks, top to bottom: Specification; Compilation; Formal model; Allocation, Scheduling, and Binding, with a Library input; Generation; RTL architecture; Logic synthesis.)

Tasks 2 through 6 are interdependent, and for a designer to achieve the optimal solution, they would ideally be optimized in conjunction. To handle real-world designs, however, the tasks are commonly executed in sequence to manage the computational complexity of synthesis. The particular order of some of the synthesis tasks, as well as a measure of how well the interdependencies are estimated and accounted for, significantly impacts the generated design's quality. More details are available elsewhere [10-13].

Compilation and modeling
HLS always begins with the compilation of the functional specification. This first step transforms the input description into a formal representation.
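As a rough illustration of tasks 1 through 4 in the list under "Key concepts", the following Python sketch compiles a few straight-line assignments into a data-dependency graph, takes a fixed allocation as given, and then schedules and binds the operations with a simple greedy heuristic. Everything in it is an assumption made for illustration: the `dst = src1 op src2` statement format, the function names `compile_dfg` and `schedule_and_bind`, the single-cycle functional units, and the allocation dictionary are hypothetical and do not correspond to any tool discussed in this article.

```python
def compile_dfg(statements):
    """Task 1: parse 'dst = src1 op src2' lines into operation nodes."""
    producer = {}   # variable name -> index of the operation that wrote it
    ops = []
    for text in statements:
        dst, expr = (part.strip() for part in text.split("="))
        src1, op, src2 = expr.split()
        # Data dependencies: earlier operations that produce an operand.
        deps = sorted({producer[s] for s in (src1, src2) if s in producer})
        ops.append({"dst": dst, "op": op, "deps": deps})
        producer[dst] = len(ops) - 1
    return ops


def schedule_and_bind(ops, allocation):
    """Tasks 3 and 4 under a fixed allocation (task 2 is taken as input).

    `allocation` maps an operator symbol to the number of single-cycle
    functional units of that kind (a simplifying assumption).
    Returns {operation index: (cycle, unit name)}.
    """
    missing = [n["op"] for n in ops if allocation.get(n["op"], 0) < 1]
    if missing:
        raise ValueError(f"no functional unit allocated for: {missing}")
    placed = {}
    cycle = 0
    while len(placed) < len(ops):
        free = dict(allocation)          # units still unused in this cycle
        for index, node in enumerate(ops):
            if index in placed:
                continue
            # Ready only if every producer finished in an earlier cycle
            # (no operation chaining in this sketch).
            ready = all(d in placed and placed[d][0] < cycle
                        for d in node["deps"])
            if ready and free[node["op"]] > 0:
                unit = allocation[node["op"]] - free[node["op"]]
                free[node["op"]] -= 1
                placed[index] = (cycle, f"{node['op']}{unit}")
        cycle += 1
    return placed


# Hypothetical four-operation specification: y = (a * b) + (c * d) + e
spec = ["t1 = a * b", "t2 = c * d", "t3 = t1 + t2", "y = t3 + e"]
dfg = compile_dfg(spec)
one_multiplier = schedule_and_bind(dfg, {"*": 1, "+": 1})
two_multipliers = schedule_and_bind(dfg, {"*": 2, "+": 1})
```

With one multiplier the two multiplications are serialized and the example needs four cycles; with two multipliers they run in the same cycle and it needs three. This is a small instance of the latency-versus-resources trade-off that makes tasks 2 through 6 interdependent.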

July/August 2009 9


This first step traditionally includes several code optimizations such as dead-code elimination, false data dependency elimination, and constant folding and loop transformations.

The formal model produced by the compilation classically exhibits the data and control dependencies between the operations. Data dependencies can be easily represented with a data flow graph (DFG) in which every node represents an operation and the arcs between the nodes represent the input, output, and temporary variables [12]. A pure DFG models data dependencies only. In some cases, it is possible to get this model by removing the control dependencies of the initial specification from the model at compile time. To do so, loops are completely unrolled by converting to noniterative code blocks, and conditional assignments are resolved by creating multiplexed values. The resulting DFG explicitly exhibits all the intrinsic parallelism of the specification. However, the required transformations can lead to a large formal representation that requires considerable memory to be stored during synthesis. Moreover, this representation does not support loops with unbounded iteration count and nonstatic control statements such as goto. This limits the use of pure DFG representations to a few applications.

The DFG model has been extended by adding control dependencies: the control and data flow graph (CDFG) [12-15]. A CDFG is a directed graph in which the edges represent the control flow. The nodes in a CDFG are commonly referred to as basic blocks and are defined as a straight-line sequence of statements that contain no branches or internal entrance or exit points. Edges can be conditional to represent if and switch constructs. A CDFG exhibits data dependencies inside basic blocks and captures the control flow between those basic blocks.

CDFGs are more expressive than DFGs because they can represent loops with unbounded iterations. However, the parallelism is explicit only within basic blocks, and additional analysis or transformations are required to expose parallelism that might exist between basic blocks. Such transformations include, for example, loop unrolling, loop pipelining, loop merging, and loop tiling. These techniques, by revealing the parallelism between loops and between loop iterations, are used to optimize the latency or the throughput and the size and number of memory accesses. These transformations can be realized automatically [14], or they can be user-driven [10]. In addition to control dependencies, data dependencies between basic blocks can be added to the CDFG model as shown in the hierarchical task graph representation used in the SPARK tool [14, 16].

Allocation
Allocation defines the type and the number of hardware resources (for instance, functional units, storage, or connectivity components) needed to satisfy the design constraints. Depending on the HLS tool, some components may be added during scheduling and binding tasks. For example, the connectivity components (such as buses or point-to-point connections among components) can be added before or after binding and scheduling tasks. The components are selected from the RTL component library. It's important to select at least one component for each operation in the specification model. The library must also include component characteristics (such as area, delay, and power) and its metrics to be used by other synthesis tasks.

Scheduling
All operations required in the specification model must be scheduled into cycles. In other words, for each operation such as a = b op c, variables b and c must be read from their sources (either storage components or functional-unit components) and brought to the input of a functional unit that can execute operation op, and the result a must be brought to its destinations (storage or functional units). Depending on the functional component to which the operation is mapped, the operation can be scheduled within one clock cycle or scheduled over several cycles. Operations can be chained (the output of an operation directly feeds an input of another operation). Operations can be scheduled to execute in parallel provided there are no data dependencies between them and there are sufficient resources available at the same time.

Binding
Each variable that carries values across cycles must be bound to a storage unit. In addition, several variables with nonoverlapping or mutually exclusive lifetimes can be bound to the same storage units. Every operation in the specification model must be bound to one of the functional units capable of executing the operation. If there are several


units with such capability, the binding algorithm must optimize this selection. Storage and functional-unit binding also depend on connectivity binding, which requires that each transfer from component to component be bound to a connection unit such as a bus or a multiplexer (see, for example, http://www-labsticc.univ-ubs.fr/www-gaut/). Ideally, high-level synthesis estimates the connectivity delay and area as early as possible so that later HLS steps can better optimize the design. An alternative approach is to specify the complete architecture during allocation so that initial floorplanning results can be used during binding and scheduling (see http://www.cecs.uci.edu/~nisc).

Generation
Once decisions have been made in the preceding tasks of allocation, scheduling, and binding, the goal of the RTL architecture generation step is to apply all the design decisions made and generate an RTL model of the synthesized design.

Architecture. The RTL architecture is implemented by a set of register-transfer components. It usually includes a controller and a data path (see Figure 2). A data path consists of a set of storage elements (such as registers, register files, and memories), a set of functional units (such as ALUs, multipliers, shifters, and other custom functions), and interconnect elements (such as tristate drivers, multiplexers, and buses). All these register-transfer components can be allocated in different quantities and types and connected arbitrarily through buses. Each component can take one or more clock cycles to execute, can be pipelined, and can have input or output registers. In addition, the entire data path and controller can be pipelined in several stages.

Figure 2. Typical architecture. (A controller, made of a state register (SR), next-state logic, and output logic, with control inputs and control outputs, sends control signals to and receives status signals from a data path that contains an RF/scratch pad, an ALU, a MUL, memory, and Buses 1 through 3.)

Primary input and output ports of the design interface with the external world to transfer both data and control (used for interface protocol handshaking and synchronization). Data inputs and outputs are connected to the data path, and control inputs and outputs are connected to the controller. There are also control signals from the controller to the data path and status signals from the data path to the controller. However, some architectures may not have all the connectivity just described, and in general some of the controller functions may be implemented as part of the data path (for example, a counter plus other logic in the data path that generates control signals).

The controller is a finite state machine that orchestrates the flow of data in the data path by setting the values of control signals (also called control


word) such as the select inputs of functional units, registers, and multiplexers. The inputs to the controller may come from primary inputs (control inputs) or from the data path components such as comparators and so on (status signals). The controller consists of a state register (SR), next-state logic, and output logic. The SR stores the present state of the processor, which is equal to the present state of the finite-state machine (FSM) model describing the controller's operation. The next-state logic computes the next state to be loaded into the SR, whereas the output logic generates the control signals and the control outputs.

The controller of a simple dedicated coprocessor is classically implemented with hardwired logic gates. On the other hand, a controller can be programmable with a read-write or read-only program memory for a specific custom processor. In this case, the program memory can store instructions or just control words, which are longer but require no decoding. In such a circumstance, the SR is called a program counter, the next-state logic is an address generator, and the output logic is RAM or ROM.

Output model. According to the decisions made in the binding tasks, the description of the architecture can be written at the RTL with different levels of detail (that is, without binding or with partial or complete binding). For example, a = b + c executing in state (n) can be written as Figure 3 indicates:

Without any binding:
    state (n): a = b + c;
               go to state (n + 1);

With storage binding:
    state (n): RF(1) = RF(3) + RF(4);
               go to state (n + 1);

With functional-unit binding:
    state (n): a = ALU1 (+, b, c);
               go to state (n + 1);

With storage and functional-unit binding:
    state (n): RF(1) = ALU1 (+, RF(3), RF(4));
               go to state (n + 1);

With storage, functional-unit, and connectivity binding:
    state (n): Bus1 = RF(3); Bus2 = RF(4);
               Bus3 = ALU1 (+, Bus1, Bus2);
               RF(1) = Bus3;
               go to state (n + 1);

Figure 3. RTL description written with different binding details.

When the RTL description includes only partial binding of resources, the logic synthesis step that follows HLS must perform the binding task and the associated optimization. Leaving components unbound in the generated RTL provides the RTL and physical synthesis the flexibility to optimize the bindings on the basis of updated timing estimates that take into account wire loads due to physical (floorplanning and place-and-route) considerations.

Several design flows
Allocation, scheduling, and binding can be performed simultaneously or in a specific sequence, depending on the strategy and algorithms used. However, they are all interrelated. If they are performed together, the synthesis process becomes too complex to be applied to realistic examples. The order in which they are realized depends on the design constraints and the tool's objectives. For example, allocation will be performed first when scheduling tries to minimize the latency or to maximize the throughput under a resource constraint. Allocation will be determined during scheduling when scheduling tries to minimize the area under timing constraints [17]. Resource-constrained approaches are used when a designer wants to define the data path architecture [18, 19], or wants to accelerate an application by using an FPGA device with a limited amount of available resources [20]. Time-constrained approaches are used when the objective is to reduce a circuit's area while meeting an application's throughput requirements, as in multimedia or telecommunication applications [21].

In practice, the resource-constrained problem can be solved by using a time-constrained approach or tool (and vice versa). In this case, the designer relaxes the timing constraints until the provided circuit area is acceptable. Latency, throughput, resource count, and area are now well-known constraints and objectives. However, recent work has considered features such as clock period, memory bandwidth, memory mapping, power consumption, and so forth that make the synthesis problem even more difficult to solve [10, 22-24].

Another example of how synthesis tasks can be ordered concerns the allocation and the binding steps. The types and numbers of resources determined in the allocation task are taken as input for the binding task. In practical HLS tools, however, resources are often allocated only partially. Additional resources


are allocated during the binding step according to the design constraints and objectives. These additional resources can be of any type: functional units, multiplexers, or registers. Hence, functional units can be first allocated to schedule and bind the operations. Both registers and multiplexers can then be allocated (created) during the variable-to-register binding step.

The synthesis tasks can be performed manually or automatically. Obviously, many strategies are possible, as exemplified by available EDA tools: these tools might perform each of the aforementioned tasks only partially in automatic fashion and leave the rest to the designer [10].

Industrial tools
Here, we first take a brief look at commercially available HLS tools for specifying the input description, and then we examine two state-of-the-art industrial HLS tools more carefully.

Input languages and tools
The input specification must capture the intended design functionality at a high abstraction level. Rather than coding low-level implementation details, the designer uses the automation provided by the HLS tool to guide the design decisions, which heavily depend on performance goals and the target technology. For instance, if there is parallelism in an HLS specification, it is extracted using dataflow analysis in accordance with the target technology's capabilities and performance goals (on a slow technology, more parallelism is required to achieve the performance goal).

The latest generation of HLS tools, in most cases, uses ANSI C, C++, or languages such as SystemC that are based on C or C++ and add hardware-specific constructs such as timing, hardware hierarchy, interface ports, signals, explicit specification of parallelism, and others. Some HLS tools that support C or C++ or derivatives are Mentor's Catapult C (C, C++), Forte's Cynthesizer (SystemC), NEC's CyberWorkbench (C with hardware extensions), Synfora's PICO (C), and Cadence's C-to-Silicon (SystemC). Other languages used for high-level modeling are MathWorks' Matlab and Simulink. Tools that support Matlab or Simulink are Xilinx's AccelDSP (Matlab) and Synopsys' Synplify DSP (Simulink); both use an IP approach to generate the hardware implementation. An alternative approach for generating an implementation is to use a configurable processor approach, which is the approach of both CoWare's Processor Designer and Tensilica's Xtensa.

Other languages, not based on C or C++ but tailored to specific domains, have also been proposed. Esterel is a synchronous language for the development of reactive systems. Esterel Studio can generate either software or hardware. Bluespec's BSV is a language targeted for specifying concurrency with rule-based atomic transactions.

The input specification to HLS tools must be written with some hardware implementation in mind to get the best results. For example, video line buffering must be coded as part of the algorithm to generate high-throughput designs [25]. Ideally, such code restructuring still preserves much of the abstraction level. It is beyond the scope of this article to provide an overview of the specific writing styles required in the specification for various tools. As tools' capabilities evolve further, fewer modifications will be required to convert an algorithm written for software into an algorithm that is suitable as input to HLS. An analogous evolution has occurred with RTL synthesis (for example, with the optimization of arithmetic expressions).

Catapult C synthesis
Catapult takes an algorithm written in ANSI C++ and a set of user directives as input and generates an RTL that is optimized for the specified target technology. The input can be compiled by any standard compiler compliant with C++; pragmas and directives do not change the functional behavior of the input specification.

Synthesis input. The input specification is sequential and does not include any notion of time or explicit parallelism: it does not hard-code the interface or the design's architecture. Keeping the input abstract is essential because any hard-coding of interface and architectural details significantly limits the range of designs that HLS could generate. Required directives specify the target technology (component library) and the clock period. Optional directives control interface synthesis, array-to-memory mappings, amount of parallelism to uncover by loop unrolling, loop pipelining, hardware hierarchy and block communication, scheduling (latency or cycle) constraints, allocation directives to constrain the number or type of hardware resources, and so on.

Native C++ integer types as well as C++ bit-accurate integer and fixed-point data types are supported


for synthesis. The generated RTL faithfully reflects the bit-accurate behavior specified in the source. Publicly available integer and fixed-point data types provided by Mentor Graphics' (http://www.mentor.com/esl) Algorithmic C data types library (ANSI C++ header files) as well as the synthesizable subset of the SystemC integer and fixed-point data types are supported for synthesis. The support of C++ language constructs meets and exceeds the requirements stated in the most current draft of the Synthesis Subset OSCI standard (http://www.systemc.org).

Generating hardware from ANSI C or C++. One of the advantages of keeping the source untimed is that a very wide range of interfaces and architectures can be generated without changing the input source specification. Another advantage of an untimed source is that it avoids errors resulting from manual coding of architectural details. The interface and the architecture of the generated hardware are all under the control of the designer via synthesis directives. Catapult's GUI provides an interactive environment with both the control and the analysis tools to enable efficient exploration of the design space.

Interface synthesis makes it possible to map the transfer of data that is implied by passing of C++ function arguments to various hardware interfaces such as wires, registers, memories, buses, or more complex user-defined interfaces. All the necessary signals and timing constraints are generated during the synthesis process so that the generated RTL conforms and is optimized to the desired interfaces.

For example, an array in the C source might result in a hardware interface that streams the data or transfers the data through a memory, a register bank, and so forth. Selection of a streaming interface implies that the environment provides data in sequential index order whereas selection of a memory interface implies that the environment provides data by writing the array into memory. The granularity of the transfer size (for example, number of array elements provided as a stream transfer or as a memory word) is also specifiable as a user directive.

Hierarchy (block-level concurrency) can be specified by user directives. For example, a C function can be synthesized as a separate hardware block instead of being inlined. Hierarchy can also be specified in a style that corresponds to the Kahn process network computation model (still in sequential ANSI C++). The blocks are connected with the appropriate communication channels, and the required handshaking interfaces are generated to guarantee the correct execution of the specified behavior. The blocks can be synthesized to be driven by different clocks. The clock-domain-crossing logic is generated by Catapult. Communication is optimized according to user directives to enable maximal block-level concurrency of execution of the blocks using FIFO buffers (for streamed data) and ping-pong memories to enable block-level pipelining and thus improve the throughput of the overall design.

All the HLS steps consider accurate component area and timing numbers for the target ASIC or FPGA technology for the designer's RTL synthesis tool of choice. Accurate timing and area numbers for components are essential to generate RTL that meets timing and is optimized for area. During synthesis, Catapult queries the component library so that it can allocate various combinational or pipelining components with different performance and area trade-offs. The queried component library is precharacterized for the target technology and the target RTL synthesis tool. Component libraries can also be built by the designer to incorporate specific characterization for memories, buses, I/O interfaces, or other units of functionality such as pipelined components.

Verification and power estimation flows. The synthesis process generates the required verification infrastructure in SystemC so that the input stimuli from the original C++ testbench can be applied to the generated RTL to verify its functionality against the (golden) source C++ specification using simulation. The synthesis process also generates the required verification infrastructure (wrappers and scripts) to enable the use of sequential equivalence checking between the source C++ specification and the generated RTL. Automatic generation of the verification infrastructure is essential because the interface of the generated hardware heavily depends on interface synthesis.

Power estimation flows with third-party tools let the designer gather switching activity for the design and obtain RTL and gate-level power estimates. By exploring various architectures, the designer can rapidly converge to a low-power design that meets the required performance and area goals.

Catapult has been successfully used in more than 200 ASIC tape-outs and several hundred FPGA designs. Typical applications include computation-intensive

algorithms in communications and video and image processing.

Cynthesizer SystemC synthesis

Cynthesizer takes a SystemC module containing hierarchy, multiple processes, interface protocols, and algorithms and produces RTL Verilog optimized to a specific target technology and clock speed. The target technology is specified by a user-provided .lib file or, for an FPGA implementation, by the user's identifying the targeted Xilinx or Altera part.

Synthesis input. The input to the HLS flow used with Cynthesizer is a pin- and protocol-accurate SystemC model. Because SystemC is a C++ class library, no language translation is required to reuse algorithms written in C++. The synthesizable subset is quite broad, including classes and structures, operator overloading, and C++ template specialization. Constructs that are not supported for synthesis include dynamic allocation (malloc, free, new, and delete), pointer arithmetic, and virtual functions.

The designer puts untimed high-level C++ into a hardware context using SystemC to represent the hardware elements such as ports, clock edges, structural hierarchy, bit-accurate data types, and concurrent processes. As Figure 4 shows, a synthesizable SC_MODULE can contain multiple SC_CTHREAD instances and multiple SC_METHOD instances along with submodule instances and signals for internal connections. I/O ports are signal-level SystemC sc_in and sc_out ports. SC_MODULE instances are C++ classes, so they can also contain C++ member functions and data members (variables), which represent the module behavior and local storage respectively.

Figure 4. SystemC input for synthesis. (Diagram labels: SC_MODULE; ports; clock and reset, required for SC_CTHREAD; SC_CTHREAD and SC_METHOD processes; signal-level ports for reading data; signal-level ports for writing data; submodules; signals; member functions; data members (storage).)

Clocked thread processes implemented as SystemC SC_CTHREAD instances are used for the majority of the module functionality. They contain an infinite loop that implements the bulk of the functionality along with reset code that initializes I/O ports and variables. The SystemC clocked thread construct provides the needed reset semantics. Within a thread, the designer can combine untimed computation code with cycle-accurate protocol code. The designer determines the protocol by writing SystemC code containing port I/O statements and wait() statements. Cynthesizer uses a hybrid scheduling approach in which the protocol code is scheduled in a cycle-accurate way, honoring the clock edges specified by the designer as SystemC wait() statements. The computation code is written without any wait() statements and scheduled by the tool to satisfy latency, pipelining, and other constraints given by the designer.

Triggered methods implemented as SystemC SC_METHOD instances can also be used to implement behaviors that are triggered by activity on signals in a sensitivity list, similar to a Verilog always block. This allows a mix of high- and low-level coding styles to be used, if needed.

Complex subsystems are built and verified by combining modules using structural hierarchy, just as in Verilog or VHDL. The high-level models used as the input to synthesis can be simulated directly to validate both the algorithms and the way the algorithm code interacts with the interface protocol code. Multiple modules are simulated together to validate that they interoperate correctly to implement the functionality of the hierarchical subsystem.

Targeting a specific process technology. To ensure that the synthesized RTL meets timing requirements at a given clock rate using a specific foundry and process technology, the HLS tool requires accurate estimates of the timing characteristics of each operation. Cynthesizer uses an internal data path optimization engine to create a library of gate-level adders, multipliers, and so on. This takes a few hours for a specific process technology and clock speed and can be performed by the designer given any library file. Cynthesizer uses the timing and area characteristics of these components to make trade-offs and optimize the RTL. Designers have the
option of using the gates for implementation or of using RTL representations of the data path components for logic synthesis.

Synthesis output. Cynthesizer produces RTL Verilog for use with logic synthesis tools provided by EDA vendors for ASIC and FPGA technology.

The RTL consists of an FSM and a set of explicitly instantiated data path components such as multipliers, adders, and multiplexers. More-complex custom data path components that implement arithmetic expressions used in the design are automatically created, and the designer can specify sections of C++ code to be implemented as data path components. The multiplexers directing the dataflow through the data path components and registers are controlled by a conventional FSM consisting of a binary-encoded or one-hot state register and next-state logic implemented in Verilog always blocks.

Strengths of the SystemC flow. SystemC is a good fit for HLS because it supports a high level of abstraction and can directly describe hardware. It combines the high-level and object-oriented features of C++ with hardware constructs that let a designer directly represent structural hierarchy, signals, ports, clock edges, and so on. This combination of characteristics provides a very efficient design and verification flow in which behavioral models of multiple modules can be concurrently simulated to verify their combined algorithm and interface behavior. Most functional errors can be found and eliminated at this high-speed behavioral level, which eliminates the need for time-consuming RTL simulation to validate interfaces and system-level operation, and substantially reduces the overall number of slow RTL simulations required. Once the behavior is functionally correct, the models that were simulated are used directly for synthesis, eliminating opportunities for mistakes or misunderstanding.

One indicator of how far HLS has come since its early days in the late 1970s is the preponderance of different HLS tools that are available, both academically and commercially. However, many features must yet be added before these tools become as widely adopted as layout and logic synthesis tools. Moreover, many specific embedded-system applications need particular attention. As HLS moves from block-level to subsystem to full-system design, the interaction of hardware and software becomes both a challenge and an opportunity for further automation. Additional research is vital, because we are still a long way from HLS that automatically searches the design space without the designer's guidance and delivers optimal results for different design constraints and technologies.

References
1. A. Sangiovanni-Vincentelli, "The Tides of EDA," IEEE Design & Test, vol. 20, no. 6, 2003, pp. 59-75.
2. D. MacMillen et al., "An Industrial View of Electronic Design Automation," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, 2000, pp. 1428-1448.
3. D.W. Knapp, Behavioral Synthesis: Digital System Design Using the Synopsys Behavioral Compiler, Prentice Hall, 1996.
4. J.P. Elliot, Understanding Behavioral Synthesis: A Practical Guide to High-Level Design, Kluwer Academic Publishers, 1999.
5. W. Wolf, "A Decade of Hardware/Software Co-design," Computer, vol. 36, no. 4, 2003, pp. 38-43.
6. H. Chang et al., Surviving the SoC Revolution: A Guide to Platform-Based Design, Kluwer Academic Publishers, 1999.
7. M. Keating and P. Bricaud, Reuse Methodology Manual for System-on-a-Chip Designs, Kluwer Academic Publishers, 1998.
8. D. Gajski et al., SpecC: Specification Language and Methodology, Kluwer Academic Publishers, 2000.
9. B. Bailey, G. Martin, and A. Piziali, ESL Design and Verification: A Prescription for Electronic System Level Methodology, Morgan Kaufman Publishers, 2007.
10. P. Coussy and A. Morawiec, eds., High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2008.
11. D. Ku and G. De Micheli, High Level Synthesis of ASICs under Timing and Synchronization Constraints, Kluwer Academic Publishers, 1992.
12. D. Gajski et al., High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
13. R.A. Walker and R. Camposano, eds., A Survey of High-Level Synthesis Systems, Springer, 1991.
14. S. Gupta et al., SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits, Kluwer Academic Publishers, 2004.
15. A. Orailoglu and D.D. Gajski, "Flow Graph Representation," Proc. 23rd Design Automation Conf. (DAC 86), IEEE Press, pp. 503-509.
16. M. Girkar and C.D. Polychronopoulos, "Automatic Extraction of Functional Parallelism from Ordinary Programs," IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, 1992, pp. 166-178.
17. P. Paulin and J.P. Knight, "Algorithms for High-Level Synthesis," IEEE Design and Test, vol. 6, no. 6, 1989, pp. 18-31.
18. M. Reshadi and D. Gajski, "A Cycle-Accurate Compilation Algorithm for Custom Pipelined Datapaths," Proc. Int'l Symp. Hardware/Software Codesign and System Synthesis (CODES+ISSS 05), ACM Press, 2005, pp. 21-26.
19. I. Auge and F. Petrot, "User Guided High Level Synthesis," High-Level Synthesis: From Algorithm to Digital Circuit, P. Coussy and A. Morawiec, eds., Springer, 2008, pp. 171-196.
20. D. Chen, J. Cong, and P. Pan, FPGA Design Automation: A Survey, Now Publishers, 2006.
21. W. Geurts et al., Accelerator Data-Path Synthesis for High-Throughput Signal Processing Applications, Kluwer Academic Publishers, 1996.
22. L. Zhong and N.K. Jha, "Interconnect-Aware Low-Power High-Level Synthesis," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 3, 2005, pp. 336-351.
23. M. Kudlur, K. Fan, and S. Mahlke, "Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines," Proc. Int'l Conf. Hardware/Software Codesign and System Synthesis (CODES+ISSS 06), ACM Press, pp. 270-275.
24. M.C. Molina et al., "Area Optimization of Multi-cycle Operators in High-Level Synthesis," Proc. Design, Automation and Test in Europe Conf. (DATE 07), IEEE CS Press, 2007, pp. 1-6.
25. G. Stitt, F. Vahid, and W. Najjar, "A Code Refinement Methodology for Performance-Improved Synthesis from C," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design (ICCAD 06), ACM Press, 2006, pp. 716-723.

Philippe Coussy is an associate professor in the Lab-STICC at the Université de Bretagne-Sud, France, where he leads the high-level synthesis (HLS) research group. His research interests include system-level design and methodologies, HLS, CAD for SoCs, embedded systems, and low-power design for FPGAs. He has a PhD in electrical and computer engineering from the Université de Bretagne-Sud. He is a member of the IEEE and the ACM.

Daniel D. Gajski holds the Henry Samueli Endowed Chair in Computer System Design at the University of California, Irvine, where he is also the director of the Center for Embedded Computer Systems. His research interests include embedded systems and information technology, design methodologies and e-design environments, specification languages and CAD software, and the science of design. He has a PhD in computer and information sciences from the University of Pennsylvania. He is a life member and Fellow of the IEEE.

Michael Meredith is the vice president of technical marketing at Forte Design Systems and serves as president of the Open SystemC Initiative. His research interests include development of printed-circuit board layouts, schematic capture, timing-diagram entry, verification, and high-level synthesis tools.

Andres Takach is chief scientist at Mentor Graphics. His research interests include high-level synthesis, low-power design, and hardware-software codesign. He has a PhD from Princeton University in Electrical and Computer Engineering. He is chair of the OSCI Synthesis Working Group.

Direct questions and comments about this article to Philippe Coussy, Lab-STICC, Centre de recherche, rue de saint maude, BP 92116, Lorient 56321, France; philippe.coussy@univ-ubs.fr.

For further information about this or any other computing topic, please visit our Digital Library at http://www.computer.org/csdl.
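
As a supplementary illustration of the Cynthesizer synthesis output described in this article (an FSM whose state register steers data through multipliers, adders, and registers), the structure can be sketched as a cycle-level software model. This is a minimal sketch in plain C++, not SystemC and not actual tool output: the names DotProductModel, run_model, and dot_product, the three-state encoding, and the one-multiply-accumulate-per-cycle schedule are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical FSM states; a real tool would choose its own encoding
// (binary-encoded or one-hot).
enum class State { Idle, Mac, Done };

// Illustrative cycle-level model: one multiplier, one adder, an accumulator
// register, and an index counter, all steered by the FSM state.
template <std::size_t N>
struct DotProductModel {
    static_assert(N > 0, "vector length must be positive");

    State state = State::Idle;  // state register
    std::size_t index = 0;      // loop counter register
    std::int32_t acc = 0;       // accumulator register

    // Each call models one rising clock edge.
    void clock(bool start,
               const std::array<std::int16_t, N>& a,
               const std::array<std::int16_t, N>& b) {
        switch (state) {
        case State::Idle:
            if (start) {
                acc = 0;    // the accumulator input selects the constant 0
                index = 0;
                state = State::Mac;
            }
            break;
        case State::Mac: {
            const std::int32_t product = std::int32_t(a[index]) * b[index];  // multiplier
            acc += product;  // the accumulator input selects the adder output
            ++index;
            if (index == N) state = State::Done;  // next-state logic
            break;
        }
        case State::Done:
            break;  // hold the result
        }
    }
};

// Untimed reference algorithm, in the style of the source specification.
template <std::size_t N>
std::int32_t dot_product(const std::array<std::int16_t, N>& a,
                         const std::array<std::int16_t, N>& b) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < N; ++i) sum += std::int32_t(a[i]) * b[i];
    return sum;
}

// Clocks the model from Idle until it reaches Done; returns the cycle count.
template <std::size_t N>
std::size_t run_model(DotProductModel<N>& m,
                      const std::array<std::int16_t, N>& a,
                      const std::array<std::int16_t, N>& b) {
    assert(m.state == State::Idle);
    std::size_t cycles = 0;
    m.clock(true, a, b);  // start pulse
    ++cycles;
    while (m.state != State::Done) {
        m.clock(false, a, b);
        ++cycles;
    }
    return cycles;
}
```

Under these assumptions, a 4-element input takes five clock calls (one start cycle plus four multiply-accumulate cycles), and the accumulator ends with the same value that the untimed dot_product returns. Comparing the two mirrors, in miniature, the check of generated RTL against the source specification.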
