Intel High Level Synthesis Compiler Pro Edition: Best Practices Guide
Contents
3. FPGA Concepts................................................................................................................ 7
3.1. FPGA Architecture Overview.................................................................................... 7
3.1.1. Adaptive Logic Module (ALM).......................................................................8
3.1.2. Digital Signal Processing (DSP) Block..........................................................10
3.1.3. Random-Access Memory (RAM) Blocks........................................................ 10
3.2. Concepts of FPGA Hardware Design........................................................................ 10
3.2.1. Maximum Frequency (fMAX)....................................................................... 10
3.2.2. Latency.................................................................................................. 11
3.2.3. Pipelining................................................................................................11
3.2.4. Throughput............................................................................................. 12
3.2.5. Datapath................................................................................................ 12
3.2.6. Control Path............................................................................................ 12
3.2.7. Occupancy.............................................................................................. 13
3.3. Methods of Hardware Design................................................................................. 14
3.3.1. How Source Code Becomes a Custom Hardware Datapath............................. 14
3.3.2. Scheduling.............................................................................................. 17
3.3.3. Mapping Parallelism Models to FPGA Hardware............................................. 23
3.3.4. Memory Types......................................................................................... 32
4. Interface Best Practices................................................................................................ 39
4.1. Choose the Right Interface for Your Component....................................................... 40
4.1.1. Pointer Interfaces.................................................................................... 41
4.1.2. Avalon Memory Mapped Host Interfaces...................................................... 43
4.1.3. Avalon Memory Mapped Agent Memories.....................................................46
4.1.4. Avalon Memory Mapped Agent Registers..................................................... 48
4.1.5. Avalon Streaming Interfaces......................................................................48
4.1.6. Pass-by-Value Interface............................................................................ 50
4.2. Control LSUs For Your Variable-Latency MM Host Interfaces ...................................... 53
4.3. Avoid Pointer Aliasing........................................................................................... 54
5. Loop Best Practices.......................................................................................................56
5.1. Reuse Hardware By Calling It In a Loop.................................................................. 57
5.2. Parallelize Loops.................................................................................................. 58
5.2.1. Pipeline Loops......................................................................................... 59
5.2.2. Unroll Loops............................................................................................ 60
5.2.3. Example: Loop Pipelining and Unrolling....................................................... 61
5.3. Construct Well-Formed Loops................................................................................ 63
5.4. Minimize Loop-Carried Dependencies...................................................................... 64
5.5. Avoid Complex Loop-Exit Conditions....................................................................... 65
5.6. Convert Nested Loops into a Single Loop.................................................................66
5.7. Place if-Statements in the Lowest Possible Scope in a Loop Nest.............................. 67
5.8. Declare Variables in the Deepest Scope Possible.......................................................67
5.9. Raise Loop II to Increase fMAX................................................................................68
5.10. Control Loop Interleaving.................................................................................... 68
B. Document Revision History for Intel HLS Compiler Pro Edition Best Practices Guide.. 102
The default Intel Quartus Prime Design Suite installation location depends on your
operating system:
Windows: C:\intelFPGA_pro\21.4
Linux: /home/<username>/intelFPGA_pro/21.4
Documentation for the Intel HLS Compiler Pro Edition is split across a few publications.
Use the following table to find the publication that contains the Intel HLS Compiler Pro
Edition information that you are looking for:
Table 1. Intel High Level Synthesis Compiler Pro Edition Documentation Library
2. Best Practices for Coding and Compiling Your Component
As you optimize your component, apply the best practices techniques in the following
areas, roughly in the order listed. Also, review the example designs and tutorials
provided with the Intel High Level Synthesis (HLS) Compiler to see how some of these
techniques can be implemented.
For an example of the full Intel HLS Compiler design flow, watch the HLS Walkthrough
series at the Intel FPGA channel on YouTube or complete the full-design tutorial
found in <quartus_installdir>/hls/examples/tutorials/usability.
• Understand FPGA Concepts on page 7
A key best practice to help you get the most out of the Intel HLS Compiler is to
understand important concepts about FPGAs. With an understanding of FPGA
architecture, and some FPGA hardware design concepts and methods, you can
create better designs that take advantage of your target FPGA devices.
• Interface Best Practices on page 39
With the Intel High Level Synthesis Compiler, your component can have a variety
of interfaces: from basic wires to the Avalon Streaming and Avalon Memory-
Mapped Host interfaces. Review the interface best practices to help you choose
and configure the right interface for your component.
• Loop Best Practices on page 56
The Intel High Level Synthesis Compiler pipelines your loops to enhance
throughput. Review these loop best practices to learn techniques to optimize your
loops to boost the performance of your component.
• Memory Architecture Best Practices on page 72
The Intel High Level Synthesis Compiler infers efficient memory architectures (like
memory width, number of banks and ports) in a component by adapting the
architecture to the memory access patterns of your component. Review the
memory architecture best practices to learn how you can get the best memory
architecture for your component from the compiler.
• System of Tasks Best Practices on page 87
Using a system of HLS tasks in your component enables a variety of design
structures that you can implement including executing multiple loops in parallel
and sharing an expensive compute block.
Related Information
Intel FPGA channel on YouTube
3. FPGA Concepts
A key best practice to help you get the most out of the Intel HLS Compiler is to
understand important concepts about FPGAs. With an understanding of FPGA
architecture, and some FPGA hardware design concepts and methods, you can create
better designs that take advantage of your target FPGA devices.
FPGAs occupy a unique computational niche relative to other compute devices, such
as central and graphics processing units (CPUs and GPUs), and custom accelerators,
such as application-specific integrated circuits (ASICs). CPUs and GPUs have a fixed
hardware structure to which a program maps, while ASICs and FPGAs can build
custom hardware to implement a program.
While a custom ASIC generally outperforms an FPGA on a specific task, ASICs take
significant time and money to develop. FPGAs are a cheaper off-the-shelf alternative
that you can reprogram for each new application.
The total number of ALMs, DSP blocks, and RAM blocks used by a design is often
referred to as the FPGA area or area that the design uses.
(Figure: Adaptive Logic Modules (ALMs) connected by programmable routing switches)

3.1.1. Adaptive Logic Module (ALM)
A simplified ALM consists of a lookup table (LUT) and an output register from which
the compiler can build any arbitrary Boolean logic circuit.
(Figure: simplified ALM consisting of a LUT feeding a multiplexer and an output register)
3.1.1.2. Register
A register is the most basic storage element in an FPGA. It has an input (in), an
output (out), and a clock signal (clk). It is synchronous, that is, it synchronizes
output changes to a clock. In an ALM, a register may store the output of the LUT.
(Figure: register with input in, output out, and clock signal clk)
Note: The clock signal is implied and not shown in some figures.
The input data propagates to the output on every clock cycle. The output remains
unchanged between clock cycles.
3.2.1. Maximum Frequency (fMAX)
The physical propagation delay of the signal across Boolean logic between two
consecutive register stages limits the clock speed. This propagation delay is a function
of the complexity of the combinational logic in the path.
The path with the most combinational logic elements (and the highest delay) limits the
speed of the entire circuit. This speed limiting path is often referred to as the critical
path.
The fMAX is calculated as the inverse of the critical path delay. In the absence of other bottlenecks, a higher fMAX results in higher performance.
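Restated as a formula, with an illustrative delay value that is not from the text: $f_{MAX} = \frac{1}{t_{\text{critical path}}}$, so a 10 ns critical path would give $f_{MAX} = 1/(10\,\text{ns}) = 100\,\text{MHz}$.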
3.2.2. Latency
Latency is the measure of how long it takes to complete one or more operations in a
digital circuit. You can measure latency at different granularities. For example, you can
measure the latency of a single operation or the latency of the entire circuit.
You can measure latency in time (for example, microseconds) or in clock cycles.
Typically, clock cycles are the preferred unit because they express latency independently of your circuit's clock frequency, which makes it easier to discern the true impact of circuit changes on performance.
You may want to have low latency, but lowering latency might result in decreased
fMAX.
3.2.3. Pipelining
Pipelining is a design technique used in synchronous digital circuits to increase fMAX.
Pipelining involves adding registers to the critical path, which decreases the amount of
logic between each register. Less logic takes less time to execute, which enables an
increase in fMAX.
The critical path in a circuit is the path between any two consecutive registers with the
highest latency. That is, the path between two consecutive registers where the
operations take the longest to complete.
Pipelining is especially useful when processing a stream of data. A pipelined circuit can
have different stages of the pipeline operating on different input stream data in the
same clock cycle, which leads to better data processing throughput.
Example
Consider a simple circuit with operations A and B on the critical path. If operation A
takes 5 ns to complete and operation B takes 15 ns to complete, then the time delay
on the critical path is 20 ns. This results in an fMAX of 50 MHz (1/max_delay).
Figure 1. Unpipelined Logic Block with an fMAX of 50 MHz and Latency of Two Clock Cycles
(Figure: in → REG → A (5 ns) → B (15 ns) → REG → out; 20 ns of combinational delay between registers)
If a pipeline register is added between A and B, the critical path changes. The delay on the critical path is now 15 ns. Pipelining this block results in an fMAX of 66.67 MHz, and the maximum delay between two consecutive registers is 15 ns.
Figure 2. Pipelined Logic Block with an fMAX of 66.67 MHz and Latency of Three Clock Cycles
(Figure: in → REG → A (5 ns) → pipeline REG → B (15 ns) → REG → out)
While pipelining generally results in a higher fMAX, it increases latency. In the previous
example, the latency of the block containing A and B increases from two to three clock
cycles after pipelining.
Related Information
Pipeline Loops on page 59
3.2.4. Throughput
Throughput of a digital circuit is the rate at which data is processed.
In the absence of other bottlenecks, higher fMAX results in higher throughput (for
example, samples/second).
3.2.5. Datapath
A datapath is a chain of registers and Boolean logic in a digital circuit that performs
computations.
For example, the datapath in Pipelining on page 11 consists of all of the elements
shown, from the input register to the last output register.
Memory blocks, and the reads and writes to them, are considered outside the datapath.
3.2.6. Control Path

The control path is the hardware that controls the flow of data through your component's datapath. Control path elements include:
• Loop control
Loop control logic manages the flow of data through the hardware generated for loops in your code, including any loop-carried dependencies.
• Branch control
Branch controls implement conditional statements in your code. Branch control
can include parallelizing parts of conditional statements to improve performance.
The control path also consumes FPGA area, and the compiler uses techniques like
clustering the datapath to help reduce the control path and save area. To learn about
clustering, refer to Clustering the Datapath on page 18.
3.2.7. Occupancy
The occupancy of a datapath at a point in time refers to the proportion of the datapath
that contains valid data.
The occupancy of a circuit over the execution of a program is the average occupancy
over time from the moment the program starts to run until it has completed.
Unoccupied portions of the datapath are often referred to as bubbles. A bubble is analogous to a "no operation" (no-op) instruction on a CPU: it has no effect on the final output.
3.3. Methods of Hardware Design
Higher levels of abstraction can reduce the design time and increase the portability of your design.
The sections that follow discuss how the Intel HLS Compiler maps high-level languages to a hardware datapath.
3.3.1. How Source Code Becomes a Custom Hardware Datapath
For fixed architectures, such as CPUs and GPUs, a compiler compiles code into a set of
instructions that run on functional units that have a fixed functionality. For these fixed
architectures to be useful in a broad range of applications, some of their available
functional units are not useful to every program. Unused functional units mean that
your program does not fully occupy the fixed architecture hardware.
FPGAs are not subject to these restrictions of fixed functional units. On an FPGA, you
can synthesize a specialized hardware datapath that can be fully occupied for an
arbitrary set of instructions, which means you can be more efficient with the silicon
area of your chip.
By implementing your algorithm in hardware, you can fill your chip with custom
hardware that is always (or almost always) working on your problem instead of having
idle functional units.
The Intel HLS Compiler maps statements from the source code to individual
specialized hardware operations, as shown in the example in the following image:
c = a + b;
(Figure: the statement maps to an adder with inputs a and b producing c)
In general, each instruction maps to its own unique instance of a hardware operation.
However, a single statement can map to more than one hardware operation, or
multiple statements can combine into a single hardware operation when the compiler
finds that it can generate hardware that is more efficient.
The compiler takes these hardware operations and connects them into a graph based
on their dependencies. When operations are independent, the compiler automatically
infers parallelism by executing those operations simultaneously in time.
The following figure shows a dependency graph created for the hardware datapath.
The dependency graph shows how the instruction is mapped to hardware operations
and how the hardware operations are connected based on their dependencies. The
loads in this example instruction are independent of each other and can therefore run
simultaneously.
c0 = a0 / b0;
c1 = a1 / b1;
return c0 + c1;
(Figure: dependency graph for this code; the two independent divides execute simultaneously, feeding an add and a store of the return value)
The datapath interacts with memory through load/store units (LSUs), which are inferred from array accesses in the source code.
The following figure illustrates a simple example of mapping arrays and their accesses
to hardware:
int a[1024];
a[i] = …; // store to memory
(Figure: the 1024-element array maps to a 32-bit wide RAM block; the assignment maps to a store LSU with a write port into the RAM, connected to the datapath)
A RAM can have a limited number of read ports and write ports, but a datapath can
have many LSUs. When the number of LSUs does not match the available number of
read and write ports, the compiler uses techniques like replication, double pumping,
sharing, and arbitration. For descriptions of these techniques, refer to Component
Memory on page 32.
For more details about configuring memories, refer to the following topics:
• Memory Architecture Best Practices on page 72
• Component Memories (Memory Attributes) in the Intel High Level Synthesis
Compiler Reference Manual.
FPGAs provide specialized hardware block RAMs that you can configure and combine
to match the size of your arrays. Customizing your memory configuration for your
design can provide terabytes-per-second of on-chip memory bandwidth because each
of these memories can interact with the datapath simultaneously.
Arrays might also be implemented in your component datapath. In this case, the array
contents are stored as registers in the datapath when your algorithm is pipelined (as
discussed in Pipelining on page 25). Storing array contents as registers in the
datapath can improve performance in some cases, but it is a design decision whether
to implement an array as registers or as memories.
When you access an array that is implemented as registers, LSUs are not used. The
compiler might choose to use a select or a barrel shifter instead.
(Figure: an array implemented as datapath registers; writes use index i to select a register, and reads select a value through a multiplexer)
3.3.2. Scheduling
Scheduling refers to the process of determining the clock cycles at which each
operation in the datapath executes.
Related Information
Pipelining on page 11
The Intel HLS Compiler generates pipelined datapaths that are dynamically scheduled.
A dynamically scheduled portion of the datapath does not pass data to its successor
until its successor signals that it is ready to receive it.
This signaling is accomplished using handshaking control logic. For example, a variable
latency load from memory may refuse to accept its predecessors' data until the load is
complete.
Handshaking helps remove bubbles in the pipeline, which increases occupancy. For
more information about bubbles, refer to Occupancy.
(Figure 4: dynamically scheduled operations A, B, and C with handshaking between them)
Dynamically scheduling all operations adds overhead in the form of additional FPGA
area needed to implement the required handshaking control logic.
To reduce this overhead, the compiler groups fixed latency operations into clusters. A
cluster of fixed latency operations, such as arithmetic operations, needs fewer
handshaking interfaces, thereby reducing the area overhead.
(Figure 5: operations A, B, and C grouped into a single cluster with one handshaking interface)
If A, B, and C from Figure 4 on page 18 do not contain variable latency operations, the
compiler can cluster them together, as illustrated in Figure 5 on page 19.
Clustering the logic reduces area by removing stall signals and other handshaking logic within the cluster.
Cluster Types
The Intel HLS Compiler can create the following types of clusters:
• Stall-Enable Cluster (SEC): This cluster type passes the handshaking logic to
every pipeline stage in the cluster in parallel. If the cluster is stalled by logic from
further down in the datapath, all logic in the SEC stalls at the same time.
(Figure: stall-enable cluster; the enable signal fans out to every pipeline stage in parallel, with oValid and iStall handshaking at the cluster boundary)
• Stall-Free Cluster (SFC): This cluster type adds a first in, first out (FIFO) buffer
to the end of the cluster that can accommodate at least the entire latency of the
pipeline in the cluster. This FIFO is often called an exit FIFO because it is attached
to the exit of the cluster datapath.
Because of this FIFO, the pipeline stages in the cluster do not require any
handshaking logic. The stages can run freely and drain into the exit FIFO,
even if the cluster is stalled from logic further down in the datapath.
(Figure: stall-free cluster; unstalled pipeline stages drain into the exit FIFO, which asserts a full signal back to the cluster entry, with oValid and iStall handshaking at the cluster exit)
Cluster Characteristics

The exit FIFO of the stall-free cluster results in some tradeoffs:
• Area: Because an SEC does not use an exit FIFO, it can save FPGA area compared
to an SFC.
If you have a design with many small, low-latency clusters, you can save a
substantial amount of area by asking the compiler to use SECs instead of SFCs
with the hls_use_stall_enable_clusters component attribute (see the sketch after
this list). For details, refer to hls_use_stall_enable_clusters Component
Attribute in the Intel HLS Compiler Reference Manual.
• Latency: Logic that uses SFCs might have a larger latency than logic that uses
SECs because of the write-read latency of the exit FIFO.
• fMAX: In an SFC, the oStall signal has less fanout than in an SEC.
For a cluster with many pipeline stages, you can improve your design fMAX by
using an SFC.
• Handshaking: The exit FIFO in SFCs allows them to take advantage of hyper-
optimized handshaking between clusters. For more information, refer to Hyper-
Optimized Handshaking.
SECs do not support this capability.
• Bubble Handling: SECs remove only leading bubbles. A leading bubble is a
bubble that arrives before the first piece of valid data arrives in the cluster. SECs
do not remove any bubbles that arrive afterward.
SFCs can use the exit FIFO to remove all bubbles from the pipeline if the SFC gets
a downstream stall signal.
• Stall Behavior: When an SEC receives a downstream stall, it stalls any logic
upstream of it within one clock cycle.
When an SFC receives a downstream stall, the exit FIFO allows it to consume
additional valid data depending on how deep the exit FIFO is and how many
bubbles are in the cluster datapath.
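A hedged sketch of the hls_use_stall_enable_clusters attribute placement (the component body is illustrative):

// Ask the compiler to favor stall-enable clusters for this component.
hls_use_stall_enable_clusters component int scale(int x) {
  // Small, low-latency logic where SECs can save the area of exit FIFOs.
  return x * 3 + 1;
}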
Hyper-Optimized Handshaking

When data passes between two clusters, handshaking signals travel between them. If the distance across the FPGA between the two clusters is large, this handshaking can become the critical path that limits peak fMAX in the design.
To improve these cases, the Intel HLS Compiler can add pipelining registers to the
stall/valid protocol to ease the critical path and improve fMAX. This enhanced
handshaking protocol is called hyper-optimized handshaking.
(Figure: the stall/valid signals between Cluster 1 (iStall_1, oValid_1) and Cluster 2 (oStall_2, iValid) pass through pipelining registers)
The following timing diagram illustrates an example of upstream cluster 1 and
downstream cluster 2 with two pipelining registers inserted in-between:
(Timing diagram: oValid_1, iStall_1, oStall_2, and iData_2 over several clock cycles)
Restriction: Hyper-optimized handshaking is currently available only for the Intel Agilex™ and Intel
Stratix® 10 device families.
3.3.3. Mapping Parallelism Models to FPGA Hardware
The following image illustrates an example of an adder and a multiplier, which are
scheduled to execute simultaneously while operating on separate inputs:
(Figure: an adder and a multiplier executing simultaneously on separate inputs)
Unrolling Loops
You can unroll loops in your design by using the unroll pragma. Loop unrolling decreases the number of iterations executed, at the expense of the additional hardware resources needed to execute multiple iterations of the loop simultaneously.
The Intel HLS Compiler never attempts to unroll any loops in your source code
automatically. You must always control loop unrolling by using the corresponding
pragma. For details, refer to Loop Unrolling (unroll Pragma) in the Intel High Level
Synthesis Compiler Reference Manual.
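For example, a hedged sketch of a partially unrolled loop (the names and trip count are illustrative):

component int sum16(int* data) {
  int sum = 0;
  // Unroll by a factor of 4: hardware for four iterations is instantiated,
  // and the loop body executes 16/4 = 4 times.
  #pragma unroll 4
  for (int i = 0; i < 16; ++i) {
    sum += data[i];
  }
  return sum;
}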
Conditional Statements

In the following example, the function foo can be run unconditionally. The code that cannot be run unconditionally, like the memory assignment, retains a condition.
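A hedged sketch of the kind of code described, assuming foo is side-effect free (the original example is not reproduced; names are illustrative):

int res = foo(x);   // no side effects, so foo runs unconditionally
if (cond) {
  mem[idx] = res;   // the memory assignment retains its condition
}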
Related Information
Unroll Loops on page 60
3.3.3.1.2. Pipelining
Similar to the implementation of a CPU with multiple pipeline stages, the compiler
generates a deeply-pipelined hardware datapath. For more information, refer to
Concepts of FPGA Hardware Design and How Source Code Becomes a Custom
Hardware Datapath.
Pipelining allows for many data items to be processed concurrently (in the same clock
cycle) while making efficient use of the hardware in the datapath by keeping it
occupied.
Consider the following statement:

Mem[100] += 42 * Mem[101];

(Figure: datapath for this statement; loads from Mem[100] and Mem[101] feed a multiply by 42 and an add, followed by a store to Mem[100])
When this code runs on a CPU, multiple invocations are not pipelined: an invocation completes its output before inputs are passed to the next invocation of the code.
On an FPGA device, this kind of unpipelined invocation results in poor throughput and
low occupancy of the datapath because many of the operations are sitting idle while
other parts of the datapath are operating on the data. The following figure shows what
throughput and occupancy of invocations looks like in this scenario:
Figure 14. Unpipelined Execution Resulting in Low Throughput and Low Occupancy
(Figure: invocations t0, t1, and t2 pass through the datapath one at a time over many clock cycles)
The Intel HLS Compiler pipelines your design as much as possible. New inputs can be
sent into the datapath each cycle, giving you a fully occupied datapath for higher
throughput, as shown in the following figure:
Figure 15. Pipelining the Datapath Results in High Throughput and High Occupancy
(Figure: invocations t0 through t5 flow through the load, multiply-by-42, add, and store stages in successive clock cycles, keeping every stage busy)
You can gain even further throughput by vectorizing the pipelined hardware.
Vectorizing the hardware improves throughput, but requires more FPGA area for the
additional copies of the pipelined hardware:
Figure 16. Vectorizing the Pipelined Datapath Resulting in High Throughput and High Occupancy
(Figure: three parallel copies of the pipelined datapath process invocations t0 through t8, three per clock cycle)
Understanding where the data you need to pipeline is coming from is key to achieving
high performance designs on the FPGA. You can use the following sources of data to
take advantage of pipelining:
• Components
• Loop iterations
When the Intel HLS Compiler pipelines a loop, it attempts to schedule the loop
execution such that the next iteration of the loop enters the pipeline before the
previous iteration has completed. This pipelining of loop iterations can lead to higher
throughput.
The number of clock cycles between iterations of the loop is called the Initiation
Interval (II).
For the highest performance, a loop iteration would start every clock cycle, which
corresponds to an II of 1.
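If your loop can tolerate it, you can request a specific II with the ii pragma; a hedged sketch (the value is illustrative):

// Request that a new loop iteration launch every clock cycle.
#pragma ii 1
for (int i = 0; i < N; ++i) {
  acc += data[i];
}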
Data dependencies that are carried from one loop iteration to another can prevent an II of 1. These dependencies are called loop-carried dependencies.
The II of a loop must be high enough to accommodate all loop-carried dependencies.
Tip: The II required to satisfy this constraint is a function of the fMAX of the design. If the
fMAX is lower, the II might also be lower. Conversely, if the fMAX is higher, a higher II
might be required.
The Intel HLS Compiler automatically identifies these dependencies and tries to build
hardware to resolve them while minimizing the II, subject to the target fMAX.
Naively generating hardware for the code in Figure 17 on page 29 results in two
loads: one from memory b and one from memory c. Because the compiler knows that
the access to c[i-1] was written to in the previous iteration, the load from c[i-1]
can be optimized away.
(Figure 17: datapath for a loop with a loop-carried dependency; the value stored to c is forwarded along a dependency edge to the next iteration's use of c[i-1] instead of being reloaded)
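The code for Figure 17 is not reproduced above; a hedged reconstruction consistent with the description (one load from b, and a loop-carried use of c):

for (int i = 1; i < N; ++i) {
  c[i] = c[i - 1] + b[i];  // c[i-1] was stored by the previous iteration, so
                           // the compiler forwards it instead of reloading it
}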
For additional information about pipelining loops, refer to Pipeline Loops on page 59.
When the Intel HLS Compiler cannot initially achieve II of 1, it chooses from several
optimization strategies:
• Interleaving on page 29
• Speculative Execution on page 30
These optimizations are applied automatically by the Intel HLS Compiler, and
additionally can be controlled through pragma statements in the design.
Interleaving
When a loop nest has an inner loop II that is greater than 1, the Intel HLS Compiler
can attempt to interleave iterations of the outer loop into iterations of the inner loop to
better utilize the hardware resources and achieve higher throughput.
Without interleaving    With interleaving
i0,j0                   i0,j0
i0,j1                   i1,j0
i0,j2                   i0,j1
i1,j0                   i1,j1
i1,j1                   i0,j2
i1,j2                   i1,j2
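As a hedged illustration (names are illustrative), interleaving applies to loop nests like the following, where a loop-carried accumulation leaves the inner loop with II = 2:

for (int i = 0; i < M; ++i) {     // outer loop
  float sum = 0.0f;
  for (int j = 0; j < N; ++j) {   // inner loop: the dependency on sum
    sum += a[i][j];               // leaves every other pipeline slot empty
  }
  out[i] = sum;
}
// The compiler can fill the empty inner-loop slots with iterations from the
// next outer iteration, as in the right-hand column above.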
Speculative Execution

Speculated iterations are loop iterations that begin executing before the loop exit condition has been evaluated. If it is determined that the exit condition was already satisfied, the effects of these extra iterations are suppressed.
Typically, the exit condition for a loop iteration must be evaluated before the program
determines whether to start the next loop iteration or continue into the rest of the
function. This requirement means that the loop initiation interval (II) cannot be lower
than the number of cycles required to compute the exit condition. Speculated
iterations can help lower the loop II because operations within the loop can occur in
the function pipeline at the same time as the exit condition is evaluated.
This speculative execution can achieve lower II and higher throughput, but it can incur
additional overhead between loop invocations (equivalent to the number of speculated
iterations). A larger loop trip count helps to minimize this overhead.
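You can bound the number of speculated iterations with the speculated_iterations pragma; a hedged sketch (the value is illustrative):

// Allow at most two speculated iterations for this loop.
#pragma speculated_iterations 2
for (int i = 0; i < N; ++i) {
  out[i] = process(in[i]);
}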
For any speculated iteration, instructions with side effects outside of the loop (like
writing to memory or a stream) are not completed until the loop exit condition for the
iteration has been evaluated. For loop iterations that are in flight but incomplete when
the loop exit condition is met, side effect data is discarded.
While speculated iterations can improve loop II, they occupy the pipeline until they are
completed. A new loop invocation cannot start until all of the speculated iterations
have completed. For example, the next iteration of an outer loop cannot start until all
the speculated iterations of an inner loop have completed.
(Figure: loop orchestration with and without speculation. Without speculation, iterations i0 through i5 launch every three cycles because the loop exit calculation takes 3 clock cycles, limiting the loop to II = 3. With speculation, iterations launch every cycle, and the extra speculated iterations i6 and i7 become non-operative.)
i7 non-operative
Related Information
"Loop Iteration Speculation (speculated_iterations Pragma)" in the Intel High
Level Synthesis Compiler Reference Manual
The Intel HLS Compiler attempts to schedule the execution of component invocations
such that the next invocation of a component enters the pipeline before the previous
invocation has completed.
For larger code structures to execute in parallel with each other, you must write them
as separate components or tasks that launch simultaneously. These components or
tasks then run independently, and synchronize and communicate using pipes or
streams, as shown in the following figure:
// Producer component or task
for (i = 0 .. N) {
  …
  mypipe::write(x);
  …
}

// Consumer component or task
for (i = 0 .. N) {
  …
  y = mypipe::read();
  …
}
For details, see Systems of Tasks in the Intel High Level Synthesis Compiler Pro
Edition Reference Manual.
3.3.4. Memory Types

If you declare an array inside your component, the Intel HLS Compiler creates component memory in hardware. Component memory is sometimes referred to as local memory or on-chip memory because it is created from memory resources (such as RAM blocks) available on the FPGA.
The following source code snippet results in the creation of a component memory
system, an interface to an external memory system, and access to these memory
systems:
#include <HLS/hls.h>
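// The original snippet is truncated after the include; what follows is a
// hedged sketch (names and sizes are illustrative, not from the original):
component int dut(int* external_data, int idx) {
  int local_buffer[1024];                // array declared in the component:
                                         // becomes component (on-chip) memory
  for (int i = 0; i < 1024; ++i) {
    local_buffer[i] = external_data[i];  // pointer argument: accessed through
                                         // an external memory interface
  }
  return local_buffer[idx];              // access to the component memory
}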
To learn more about controlling memory system architectures, review the following
topics:
• Component Memories (Memory Attributes) in the Intel High Level Synthesis
Compiler Pro Edition Reference Manual
• Memory Architecture Best Practices on page 72
As much as possible, the Intel HLS Compiler tries to create stall-free memory systems
for your component.
(Figure: component memory systems for arrays A, B, and C; memory A has dedicated ports, memory B shares a read port between two LSUs, and memory C arbitrates its accesses)
Replicate: A memory bank replicate is a copy of the data in the memory bank with its own ports. All replicates in a bank contain the same data, and each replicate can be accessed independently of the others.

Private copy: A private copy is a copy of the data in a replicate that is created for nested loops to enable concurrent iterations of the outer loop.
The following figure illustrates the relationship between banks, replicates, ports, and
private copies:
(Figure: Schematic Representation of Local Memories Showing the Relationship between Banks, Replicates, Ports, and Private Copies; panels include C) a memory with two replicates and D) a memory with two private copies, each 4 bytes wide)
Intel High Level Synthesis Compiler Pro Edition: Best Practices Guide Send Feedback
36
3. FPGA Concepts
683152 | 2021.12.13
The compiler uses a variety of strategies to ensure that concurrent accesses are stall-free (see the code sketch after this list), including:
• Adjusting the number of ports the memory system has. This can be done either by
replicating the memory to enable more read ports or by clocking the RAM block at
twice the component clock speed, which enables four ports per replicate instead of
two.
Clocking the RAM block at twice the component clock speed to double the number
of available ports to the memory system is called double pumping.
All of a replicate's physical access ports can be accessed concurrently.
• Partitioning memory content into one or more banks, such that each bank
contains a subset of the data contained in the original memory (corresponds to
the top-right box of Schematic Representation of Local Memories Showing the
Relationship between Banks, Replicates, Ports, and Private Copies).
The banks of a component memory can be accessed concurrently by the datapath.
• Replicating a bank to create multiple coherent replicates (corresponds to the
bottom-left box of Schematic Representation of Local Memories Showing the
Relationship between Banks, Replicates, Ports, and Private Copies). Each replicate
in a bank contains identical data.
The replicates are loaded concurrently.
• Creating private copies of an array that is declared inside of a loop nest
(corresponds to the bottom-right box of Schematic Representation of Local
Memories Showing the Relationship between Banks, Replicates, Ports, and Private
Copies).
These private copies enable loop pipelining because each pipeline-parallel loop
iteration accesses its own private copy of the array declared within the loop body.
Private copies are not expected to contain the same data.
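You can also steer these strategies explicitly with memory attributes; a hedged sketch (the attribute choices and sizes are illustrative):

component int lookup(int idx0, int idx1) {
  // Request a double-pumped memory divided into two banks.
  hls_memory hls_doublepump hls_numbanks(2)
  static int table[1024];
  // Two reads in the same cycle can then be served without stalling.
  return table[idx0] + table[idx1];
}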
Despite the compiler’s best efforts, the component memory system can still be
stallable. This might happen due to resource constraints or memory attributes defined
in your source code. In that case, the compiler tries to minimize the hardware
resources consumed by the arbitrated memory system.
If the component accesses memory outside of the component, the compiler creates a
hardware interface through which the datapath accesses this external memory. The
interface is described using a pointer or Avalon® memory-mapped host interface as a
function argument to the component. One interface is created for every pointer or
memory-mapped host interface component argument.
Unlike component memory, the compiler does not define the structure of the external
memory. The compiler instantiates a specialized LSU for each access site based on the
type of interface and the memory access patterns.
The compiler also tries various strategies to maximize the efficient use of the available
memory interface bandwidth such as eliminating unnecessary accesses and statically
coalescing contiguous accesses.
4. Interface Best Practices
Each interface type supported by the Intel HLS Compiler Pro Edition has different
benefits. However, the system that surrounds your component might limit your
choices. Keep your requirements in mind when determining the optimal interface for
your component.
The Intel HLS Compiler Pro Edition comes with a number of tutorials that illustrate
important Intel HLS Compiler concepts and demonstrate good coding practices.
Review the following tutorials to learn about different interfaces as well as best
practices that might apply to your design:
You can find these tutorials in the following location on your Intel Quartus Prime system:
<quartus_installdir>/hls/examples/tutorials
Related Information
• Avalon Memory-Mapped Interface Specifications
• Avalon Streaming Interface Specifications
4.1. Choose the Right Interface for Your Component

The best interface for your component might not be immediately apparent, so you might need to try different interfaces to achieve the optimal quality of results (QoR). Take advantage of the rapid component compilation time provided by the Intel HLS Compiler Pro Edition and the resulting High Level Design reports to determine which interface gives you the optimal QoR for your component.
This section uses a vector addition example to illustrate the impact of changing the
component interface while keeping the component algorithm the same. The example
has two input vectors, vector a and vector b, and stores the result to vector c. The
vectors have a length of N (which could be very large).
The Intel HLS Compiler Pro Edition extracts the parallelism of this algorithm by
pipelining the loops if no loop dependency exists. In addition, by unrolling the loop (by
a factor of 8), more parallelism can be extracted.
Ideally, the generated component has a latency of N/8 cycles. In the examples in the
following section, a value of 1024 is used for N, so the ideal latency is 128 cycles
(1024/8).
The following sections present variations of this example that use different interfaces.
Review these sections to learn how different interfaces affect the QoR of this
component.
You can work your way through the variations of these examples by reviewing the
tutorial available in <quartus_installdir>/hls/examples/tutorials/
interfaces/overview.
4.1.1. Pointer Interfaces

The vector addition component example with pointer interfaces can be coded as follows:
component void vector_add(int* a,
int* b,
int* c,
int N) {
#pragma unroll 8
for (int i = 0; i < N; ++i) {
c[i] = a[i] + b[i];
}
}
The following diagram shows the Function View in the Graph Viewer that is generated
when you compile this example. Because the loop is unrolled by a factor of 8, the
diagram shows that vector_add.B2 has 8 loads for vector a, 8 loads for vector b,
and 8 stores for vector c. In addition, all of the loads and stores are arbitrated on the
same memory, resulting in inefficient memory accesses.
Figure 24. Graph Viewer Function View for vector_add Component with Pointer
Interfaces
The following Loop Analysis report shows that the component has an undesirably high
loop initiation interval (II). The II is high because vectors a, b, and c are all accessed
through the same Avalon-MM Host interface. The Intel HLS Compiler Pro Edition uses
stallable arbitration logic to schedule these accesses, which results in poor
performance and high FPGA area use.
In addition, the compiler cannot assume there are no data dependencies between loop
iterations because pointer aliasing might exist. The compiler cannot determine that
vectors a, b, and c do not overlap. If data dependencies exist, the Intel HLS Compiler
cannot pipeline the loop iterations effectively.
Compiling the component with an Intel Quartus Prime compilation flow targeting an Intel Arria® 10 device results in the following QoR metrics: high ALM usage, high latency, high II, and low fMAX, all of which are undesirable properties in a component.
ALMs: 15593.5
DSPs: 0
RAMs: 30
(1) The compilation flow used to calculate the QoR metrics used Intel Quartus Prime Pro Edition Version 17.1.
(2) The fMAX measurement was calculated from a single seed.
4.1.2. Avalon Memory Mapped Host Interfaces

You can configure the Avalon MM host interface for the vector addition component example using the ihc::mm_host class as follows:
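A hedged reconstruction of that configuration (the exact widths, alignments, and address-space numbers are assumptions):

component void vector_add(
    ihc::mm_host<int, ihc::aspace<1>,
                 ihc::dwidth<8*8*sizeof(int)>,   // 256-bit data bus
                 ihc::align<8*sizeof(int)>>& a,  // 32-byte alignment
    ihc::mm_host<int, ihc::aspace<2>,
                 ihc::dwidth<8*8*sizeof(int)>,
                 ihc::align<8*sizeof(int)>>& b,
    ihc::mm_host<int, ihc::aspace<3>,
                 ihc::dwidth<8*8*sizeof(int)>,
                 ihc::align<8*sizeof(int)>>& c,
    int N) {
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}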
The memory interfaces for vector a, vector b, and vector c have the following
attributes specified:
• The vectors are each assigned to different address spaces with the ihc::aspace
attribute, and each vector receives a separate Avalon MM host interface.
With the vectors assigned to different physical interfaces, the vectors can be
accessed concurrently without interfering with each other, so memory arbitration
is not needed.
• The width of the interfaces for the vectors is adjusted with the ihc::dwidth
attribute.
• The alignment of the interfaces for the vectors is adjusted with the ihc::align
attribute.
The following diagram shows the Function View in the System Viewer that is
generated when you compile this example.
Figure 25. System Viewer Function View for vector_add Component with Avalon MM
Host Interface
The diagram shows that vector_add.B2 has two loads and one store. The default
Avalon MM Host settings used by the code example in Pointer Interfaces on page 41
had 16 loads and 8 stores.
By expanding the width and alignment of the vector interfaces, the original pointer
interface loads and stores were coalesced into one wide load each for vector a and
vector b, and one wide store for vector c.
Also, the memories are stall-free because the loads and stores in this example access
separate memories.
Compiling this component with an Intel Quartus Prime compilation flow targeting an
Intel Arria 10 device results in the following QoR metrics:
DSPs: 0 (pointer), 0 (Avalon MM host)
RAMs: 30 (pointer), 0 (Avalon MM host)
(1) The compilation flow used to calculate the QoR metrics used Intel Quartus Prime Pro Edition Version 17.1.
(2) The fMAX measurement was calculated from a single seed.
All QoR metrics improved by changing the component interface to a specialized Avalon
MM Host interface from a pointer interface. The latency is close to the ideal latency
value of 128, and the loop initiation interval (II) is 1.
Important: This change to a specialized Avalon MM Host interface from a pointer interface
requires the system to have three separate memories with the expected width. The
initial pointer implementation requires only one system memory with a 64-bit wide
data bus. If the system cannot provide the required memories, you cannot use this
optimization.
4.1.3. Avalon Memory Mapped Agent Memories

Agent memories are owned by the component and expose an MM agent interface for an MM Host to read from and write to.
When you allocate an agent memory, you must define its size, which limits how large a value of N the component can process. In this example, the RAM size is 1024 words, so N can be at most 1024.
The vector addition component example can be coded with an Avalon MM agent
interface as follows:
component void vector_add(
hls_avalon_agent_memory_argument(1024*sizeof(int)) int* a,
hls_avalon_agent_memory_argument(1024*sizeof(int)) int* b,
hls_avalon_agent_memory_argument(1024*sizeof(int)) int* c,
int N) {
#pragma unroll 8
for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}
The following diagram shows the Function View in the System Viewer that is
generated when you compile this example.
Figure 26. System Viewer Function View of vector_add Component with Avalon MM
Agent Interface
Compiling this component with an Intel Quartus Prime compilation flow targeting an
Intel Arria 10 device results in the following QoR metrics:
DSPs: 0 (pointer), 0 (MM host), 0 (agent memory)
RAMs: 30 (pointer), 0 (MM host), 48 (agent memory)
(1) The compilation flow used to calculate the QoR metrics used Intel Quartus Prime Pro Edition Version 17.1.
(2) The fMAX measurement was calculated from a single seed.
The QoR metrics show that changing the ownership of the memory from the system to the component reduces both the number of ALMs used by the component and the component latency. The fMAX of the component increases as well. The number of RAM blocks used by the component is greater because the memory is implemented in the component and not the system. The total system RAM usage (not shown) should not increase because RAM usage shifted from the system to the FPGA RAM blocks.
For more information about the control-and-status register, refer to Agent Interfaces
in the Intel High Level Synthesis Compiler Pro Edition: Reference Manual.
4.1.5. Avalon Streaming Interfaces

The vector addition example can be coded with an Avalon ST interface as follows:
struct int_v8 {
int data[8];
};
component void vector_add(
ihc::stream_in<int_v8>& a,
ihc::stream_in<int_v8>& b,
ihc::stream_out<int_v8>& c,
int N) {
for (int j = 0; j < (N/8); ++j) {
int_v8 av = a.read();
int_v8 bv = b.read();
int_v8 cv;
#pragma unroll 8
for (int i = 0; i < 8; ++i) {
cv.data[i] = av.data[i] + bv.data[i];
}
c.write(cv);
}
}
An Avalon ST interface has a data bus, and ready and busy signals for handshaking.
The struct is created to pack eight integers so that eight operations at a time can
occur in parallel to provide a comparison with the examples for other interfaces.
Similarly, the loop count is divided by eight.
The following diagram shows the Function View in the System Viewer that is
generated when you compile this example.
Figure 27. System Viewer Function View of vector_add Component with Avalon ST
Interface
The main difference from other versions of the example component is the absence of
memory.
The streaming interfaces are stallable from the upstream sources and the downstream
output. Because the interfaces are stallable, the loop initiation interval (II) is
approximately 1 (instead of exactly 1). If the component does not receive any bubbles
(gaps in data flow) from upstream or stall signals from downstream, then the
component achieves the desired II of 1.
If you know that the stream interfaces will never stall, you can further optimize this
component by taking advantage of the usesReady and usesValid stream
parameters.
Compiling this component with an Intel Quartus Prime compilation flow targeting an
Intel Arria 10 device results in the following QoR metrics:
DSPs: 0 (pointer), 0 (MM host), 0 (agent memory), 0 (streaming)
RAMs: 30 (pointer), 0 (MM host), 48 (agent memory), 0 (streaming)
(1) The compilation flow used to calculate the QoR metrics used Intel Quartus Prime Pro Edition Version 17.1.
(2) The fMAX measurement was calculated from a single seed.
4.1.6. Pass-by-Value Interface

The vector addition example can be coded to pass the vector array elements by value as follows. A struct is used to pass the entire array (of 8 data elements) by value. When you use a struct to pass the array, you must also do the following things:
• Define element-wise copy constructors.
• Define element-wise copy assignment operators.
• Add the hls_register memory attribute to all struct members in the
definition.
struct int_v8 {
hls_register int data[8];
//copy assignment operator
int_v8& operator=(const int_v8& org) {
#pragma unroll
for (int i=0; i< 8; i++) {
data[i] = org.data[i] ;
}
return *this;
}
//copy constructor
int_v8 (const int_v8& org) {
#pragma unroll
for (int i=0; i< 8; i++) {
data[i] = org.data[i] ;
}
}
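};

// The component itself is not shown above; a hedged sketch consistent with
// the surrounding description (one 8-element chunk in, one chunk out):
component int_v8 vector_add(int_v8 a, int_v8 b) {
  int_v8 c;
  #pragma unroll
  for (int i = 0; i < 8; ++i) {
    c.data[i] = a.data[i] + b.data[i];
  }
  return c;
}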
This component takes and processes only eight elements of vector a and vector b, and
returns eight elements of vector c. To compute 1024 elements for the example, the
component needs to be called 128 times (1024/8). While in previous examples the
component contained loops that were pipelined, here the component is invoked many
times, and each of the invocations are pipelined.
The following diagram shows the Function View in the System Viewer that is
generated when you compile this example.
Figure 28. System Viewer Function View of vector_add Component with Pass-By-Value
Interface
The latency of this component is one, and it has a loop initiation interval (II) of one.
Compiling this component with an Intel Quartus Prime compilation flow targeting an
Intel Arria 10 device results in the following QoR metrics:
DSPs: 0 in all five versions
RAMs: 30 (pointer), 0 (MM host), 48 (agent memory), 0 (streaming), 0 (pass-by-value)
(1) The compilation flow used to calculate the QoR metrics used Intel Quartus Prime Pro Edition Version 17.1.
(2) The fMAX measurement was calculated from a single seed.
The QoR metrics for the vector_add component with a pass-by-value interface show fewer ALMs used, a high component fMAX, and optimal values for latency and II.
In this case, the II is the same as the component invocation interval. A new invocation
of the component can be launched every clock cycle. With an initiation interval of 1,
128 component calls are processed in 128 cycles so the overall latency is 128.
4.2. Control LSUs For Your Variable-Latency MM Host Interfaces

To see if you need to use LSU controls, review the High-Level Design Reports for your
component, especially the Function Memory Viewer, to see if the memory access
pattern (and its associated LSUs) inferred by the Intel HLS Compiler Pro Edition match
your expected memory access pattern. If they do not match, consider controlling the
LSU type, LSU coalescing, or both.
The Intel HLS Compiler Pro Edition creates either burst-coalesced LSUs or pipelined
LSUs.
In general, use burst-coalesced LSUs when an LSU is expected to process many load/
store requests to memory words that are consecutive. The burst-coalesced LSU
attempts to "dynamically coalesce" the requests into larger bursts in order to utilize
memory bandwidth more efficiently.
The pipelined LSU consumes significantly less FPGA area, but processes load/store
requests individually without any coalescing. This processing is useful when your
design is tight on area or when the accesses to the variable-latency MM Host interface
are not necessarily consecutive.
The following code example shows both types of LSU being implemented for a
variable-latency MM Host interface:
component void
dut(mm_host<int, dwidth<128>, awidth<32>, aspace<4>, latency<0>> &Buff1,
    mm_host<int, dwidth<32>, awidth<32>, aspace<5>, latency<0>> &Buff2) {
  int Temp[SIZE];

  // The loops below are an assumed completion (the original listing is cut
  // off at a page break): the consecutive reads get a burst-coalesced LSU
  // and the writes get a pipelined LSU.
  for (int i = 0; i < SIZE; i++) {
    #pragma hls_force_lsu BURST_COALESCED
    Temp[i] = Buff1[i];
  }

  for (int i = 0; i < SIZE; i++) {
    #pragma hls_force_lsu PIPELINED
    Buff2[i] = Temp[i];
  }
}
Static coalescing is typically beneficial because it reduces the total number of LSUs in your design by statically combining multiple load/store operations into wider load/store operations.

However, there are cases where static coalescing leads to unaligned accesses, which you might not want to occur. There are also cases where multiple loads/stores get coalesced even though you intended for only a subset of them to be operational at a time. In these cases, consider disabling static coalescing for the load/store operations that you did not want to be coalesced.
For the following code example, the Intel HLS Compiler does not statically coalesce
the two load operations into one wide load operation:
component int
dut(mm_host<int, dwidth<256>, awidth<32>, aspace<1>, latency<0>> &Buff1,
    int i, bool Cond1, bool Cond2) {
  // Assumed completion (the original listing is cut off at a page break):
  // only one of the two loads is active at a time, so coalescing them into
  // one wide load is not beneficial.
  int Val = 0;
  if (Cond1) {
    Val = Buff1[i];
  }
  if (Cond2) {
    Val = Buff1[i + 1];
  }
  return Val;
}
Related Information
Avalon Memory-Mapped Host Interfaces and Load-Store Units
Consider a loop where each iteration reads data from one array, and then it writes
data to another array in the same physical memory. Without adding the restrict type-
qualifier to these pointer arguments, the compiler must assume that the two arrays
might overlap. Therefore, the compiler must keep the original order of memory
accesses to both arrays, resulting in poor loop optimization or even failure to pipeline
the loop that contains the memory accesses.
You can also use the restrict type-qualifier with Avalon memory-mapped (MM) host
interfaces.
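For example, a minimal sketch (component shape assumed) where restrict promises the compiler that the two pointers never overlap, letting it pipeline the loop freely:

component void copy_incr(int * restrict in, int * restrict out, int n) {
  for (int i = 0; i < n; i++) {
    out[i] = in[i] + 1; // loads and stores can be reordered: no aliasing
  }
}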
For more details, review the parameter aliasing tutorial in the following location:
<quartus_installdir>/hls/examples/tutorials/best_practices/parameter_aliasing
5. Loop Best Practices
The Intel HLS Compiler Pro Edition lets you know if there are any dependencies that
prevent it from optimizing your loops. Try to eliminate these dependencies in your
code for optimal component performance. You can also provide additional guidance to
the compiler by using the available loop pragmas.
The Intel HLS Compiler Pro Edition comes with a number of tutorials that illustrate
important Intel HLS Compiler concepts and demonstrate good coding practices.
Review the following tutorials to learn about loop best practices that might apply to
your design:
Tutorial Description
You can find these tutorials in the following location on your Intel Quartus Prime system:
<quartus_installdir>/hls/examples/tutorials
best_practices/loop_fusion Demonstrates the latency and resource utilization improvements of loop fusion.
loop_controls/max_interleaving   Demonstrates a method to reduce the area utilization of a loop that meets the following conditions:
• The loop has an II > 1
• The loop is contained in a pipelined loop
• The loop execution is serialized across the invocations of the pipelined loop
best_practices/optimize_ii_using_hls_register   Demonstrates how to use the hls_register attribute to reduce loop II and how to use hls_max_concurrency to improve component throughput.
best_practices/parallelize_array_operation   Demonstrates how to improve fMAX by correcting a bottleneck that arises when performing operations on an array in a loop.
best_practices/relax_reduction_dependency   Demonstrates a method to reduce the II of a loop that includes a floating point accumulator, or other reduction operation that cannot be computed at high speed in a single clock cycle.
best_practices/resource_sharing_filter   Demonstrates the following versions of a 32-tap finite impulse response (FIR) filter design:
• optimized-for-throughput variant
• optimized-for-area variant
For example, the following code example results in multiple hardware copies of the
function foo in the component myComponent because the function foo is inlined:
#include <math.h>

int foo(int a)
{
  return 4 + sqrt(a);
}
component
void myComponent()
{
...
int x = 0;
x += foo(0);
x += foo(1);
x += foo(2);
...
}
If you place the function foo in a loop, the hardware for foo can be reused for each
invocation. The function is still inlined, but it is inlined only once.
component
void myComponent()
{
...
int x = 0;
#pragma unroll 1
for (int i = 0; i < 3; i++)
{
x += foo(i);
}
...
}
You could also use a switch/case block if you want to pass your reusable function
different values that are not related to the loop induction variable i:
component
void myComponent()
{
...
int x = 0;
#pragma unroll 1
for (int i = 0; i < 3; i++)
{
int val = 0;
switch(i)
{
case 0:
val = 3;
break;
case 1:
val = 6;
break;
case 2:
val = 1;
break;
}
x += foo(val);
}
...
}
You can learn more about reusing hardware and minimizing inlining by reviewing the resource sharing tutorial available in <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter.
You can take advantage of the spatial compute structure to accelerate the loops by
having multiple iterations of a loop executing concurrently. To have multiple iterations
of a loop execute concurrently, unroll loops when possible and structure your loops so
that dependencies between loop iterations are minimized and can be resolved within
one clock cycle.
These practices show how to parallelize different iterations of the same loop. If you
have two different loops that you want to parallelize, consider using a system of tasks.
For details, see System of Tasks Best Practices on page 87.
Consider the following basic loop with three stages and three iterations. A loop stage
is defined as the operations that occur in the loop within one clock cycle.
Figure 29. Basic loop with three stages and three iterations
If each stage of this loop takes one clock cycle to execute, then this loop has a latency
of nine cycles.
The following figure shows the pipelining of the loop from Figure 29 on page 59.
Figure 30. Pipelined loop with three stages and four iterations
The pipelined loop has a latency of five clock cycles for three iterations (and six cycles
for four iterations), but there is no area tradeoff. During the second clock cycle, Stage 1 of the pipelined loop is processing iteration 2, Stage 2 is processing iteration 1, and Stage 3 is inactive.
This loop is pipelined with a loop initiation interval (II) of 1. An II of 1 means that
there is a delay of 1 clock cycle between starting each successive loop iteration.
The Intel HLS Compiler Pro Edition attempts to pipeline loops by default, and loop
pipelining is not subject to the same constant iteration count constraint that loop
unrolling is.
Not all loops can be pipelined as well as the loop shown in Figure 30 on page 59,
particularly loops where each iteration depends on a value computed in a previous
iteration.
For example, consider if Stage 1 of the loop depended on a value computed during
Stage 3 of the previous loop iteration. In that case, the second (orange) iteration
could not start executing until the first (blue) iteration had reached Stage 3. This type
of dependency is called a loop-carried dependency.
In this example, the loop would be pipelined with II=3. Because the II is the same as
the latency of a loop iteration, the loop would not actually be pipelined at all. You can
estimate the overall latency of a loop with the following equation:

latency_loop = (number of iterations − 1) × II + latency_body

where latency_loop is the number of cycles the loop takes to execute and latency_body is the number of cycles a single loop iteration takes to execute.
The Intel HLS Compiler Pro Edition supports pipelining nested loops without unrolling
inner loops. When calculating the latency of nested loops, apply this formula
recursively. This recursion means that having II>1 is more problematic for inner loops
than for outer loops. Therefore, algorithms that do most of their work on an inner loop
with II=1 still perform well, even if their outer loops have II>1.
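For example, an inner loop with II = 1, a body latency of 10 cycles, and 100 iterations takes (100 − 1) × 1 + 10 = 109 cycles, while the same loop with II = 3 takes (100 − 1) × 3 + 10 = 307 cycles. The high trip count multiplies the II, which is why a high II costs more on inner loops than on outer loops.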
Related Information
Speculative Execution on page 30
Consider the following basic loop with three stages and three iterations. Each stage
represents the operations that occur in the loop within one clock cycle.
Figure 31. Basic loop with three stages and three iterations
If each stage of this loop takes one clock cycle to execute, then this loop has a latency
of nine cycles.
The following figure shows the loop from Figure 31 on page 60 unrolled three times.
Figure 32. Unrolled loop with three stages and three iterations
Three iterations of the loop can now be completed in only three clock cycles, but three
times as many hardware resources are required.
You can control how the compiler unrolls a loop with the #pragma unroll directive,
but this directive works only if the compiler knows the trip count for the loop in
advance or if you specify the unroll factor. In addition to replicating the hardware, the
compiler also reschedules the circuit such that each operation runs as soon as the
inputs for the operation are ready.
For an example of using the #pragma unroll directive, see the best_practices/
resource_sharing_filter tutorial.
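As a sketch of both forms (names assumed), a constant-trip-count inner loop can be fully unrolled, while a variable-trip-count outer loop needs an explicit factor:

#pragma unroll 2          // partial unroll: trip count n is not constant
for (int i = 0; i < n; i++) {
    #pragma unroll        // full unroll: constant trip count of 4
    for (int j = 0; j < 4; j++) {
        acc[j] += data[i][j];
    }
}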
(Figure: Input Matrix and Result Matrix contents at iterations i = 1, 2, 3; entries shown as 0 are not yet computed.)
You can improve the performance of this component by unrolling the loops that iterate
across each entry of a particular column. If the loop operations are independent, then
the compiler executes them in parallel.
Floating-point operations typically must be carried out in the same order that they are
expressed in your source code to preserve numerical precision. However, you can use
the -ffp-reassociate compiler flag to relax the ordering of floating-point
operations. With the order of floating-point operations relaxed, the following
conditions occur in this loop:
• The multiplication operations can occur in parallel.
• The addition operations can be composed into an adder tree instead of an adder
chain.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops
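A hedged command-line sketch (device flag and file names are assumptions):

i++ -march=Arria10 -ffp-reassociate --component dut dut.cpp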
You can improve the throughput by unrolling the j-loop at line 11, but to allow the
compiler to unroll the loop, you must ensure that it has constant bounds. You can
ensure constant bounds by starting the j-loop at j = 0 instead of j = i + 1. You
must also add a predication statement to prevent r_matrix from being assigned with
invalid data during iterations 0,1,2,…i of the j-loop.
01: #define ROWS 4
02: #define COLS 4
03:
04: component void dut(...) {
05:     float a_matrix[COLS][ROWS]; // store in column-major format
06:     float r_matrix[ROWS][COLS]; // store in row-major format
07:
08:     // setup...
09:
10:     for (int i = 0; i < COLS; i++) {
11:
12: #pragma unroll
13:         for (int j = 0; j < COLS; j++) {
14:             float dotProduct = 0;
15:
16:             // Lines 16 onward are an assumed completion; the original
17:             // listing is cut off at a page break.
18: #pragma unroll
19:             for (int k = 0; k < ROWS; k++) {
20:                 dotProduct += a_matrix[i][k] * a_matrix[j][k];
21:             }
22:             // Predication: iterations 0,1,2,...,i of the j-loop must
23:             // not write invalid data to r_matrix
24:             if (j > i) {
25:                 r_matrix[i][j] = dotProduct;
26:             }
27:         }
28:     }
29: }
Now the j-loop is fully unrolled. Because its iterations do not have any dependencies, all four iterations run at the same time.

You could continue and also unroll the loop at line 10, but unrolling this loop would result in the area increasing again. By allowing the compiler to pipeline this loop instead of unrolling it, you can avoid increasing the area and pay only about four more clock cycles, assuming that the i-loop has an II of 1. If the II is not 1, the Details pane of the Loops Analysis page in the high-level design report (report.html) gives you tips on how to improve it.
The following factors can typically affect loop II:
• loop-carried dependencies
See the tutorial at <quartus_installdir>/hls/examples/tutorials/
best_practices/loop_memory_dependency
• long critical loop path
• inner loops with a loop II > 1
Well-formed nested loops can also help maximize the performance of your component.
for (int i = 0; i < N; i++)
{
    //statements
    for (int j = 0; j < M; j++)
    {
        //statements
    }
    //statements
}
The loop structure that follows has a loop-carried dependency because each loop
iteration reads data written by the previous iteration. As a result, each read operation
cannot proceed until the write operation from the previous iteration completes. The
presence of loop-carried dependencies reduces the pipeline parallelism that the Intel
HLS Compiler Pro Edition can achieve, which reduces component performance.
for(int i = 1; i < N; i++)
{
A[i] = A[i - 1] + i;
}
The Intel HLS Compiler Pro Edition performs a static memory dependency analysis on loops to determine the extent of parallelism that it can achieve. If the Intel HLS Compiler Pro Edition cannot determine that there are no loop-carried dependencies, it assumes that loop-carried dependencies exist. The compiler's ability to test for loop-carried dependencies is impeded by unknown variables at compilation time or by array accesses in your code that involve complex addressing.
The compiler can perform range analysis effectively when loops have constant bounds.
You can place an if-statement inside your loop to control in which iterations the loop
body executes.
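For example (a sketch; names assumed):

for (int i = 0; i < MAX_N; i++) { // constant bound the compiler can analyze
    if (i < n) {                  // runtime guard selects the active iterations
        out[i] = in[i] * 2;
    }
}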
If there are no implicit memory dependencies across loop iterations, you can use the
ivdep pragma to tell the Intel HLS Compiler Pro Edition to ignore possible memory
dependencies.
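A minimal sketch (that the idx values never repeat is the assumption the pragma asserts):

#pragma ivdep
for (int i = 0; i < N; i++) {
    A[idx[i]] += 1; // idx values never repeat, so iterations are independent
}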
For details about how to use the ivdep pragma, see Loop-Carried Dependencies
(ivdep Pragma) in the Intel High Level Synthesis Compiler Pro Edition Reference
Manual.
Use the speculated_iterations pragma to specify how many cycles the loop exit
condition can take to compute.
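For example (a sketch; the exit condition is assumed to need extra cycles to compute):

#pragma speculated_iterations 2
while (m * i < limit) { // the multiply in the exit condition takes extra cycles
    i++;
}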
Related Information
Loop Iteration Speculation (speculated_iterations Pragma)
The following code examples illustrate the conversion of a nested loop into a single
loop:
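A minimal sketch of such a conversion (N, M, and sum are assumed):

// Nested loop
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum[i][j] += i + j;

// Converted single loop
for (int k = 0; k < N * M; k++)
    sum[k / M][k % M] += (k / M) + (k % M);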
You can also specify the loop_coalesce pragma to coalesce nested loops into a
single loop without affecting the loop functionality. The following simple example
shows how the compiler coalesces two loops into a single loop when you specify the
loop_coalesce pragma.
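A loop nest of the following shape (a sketch consistent with the coalesced form shown below) is coalesced by placing the pragma on the outer loop:

#pragma loop_coalesce
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum[i][j] += i + j;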
The compiler coalesces the two loops together so that they run as if they were a single
loop written as follows:
int i = 0;
int j = 0;
while(i < N){
sum[i][j] += i+j;
j++;
if (j == M){
j = 0;
i++;
}
}
For more information about the loop_coalesce pragma, see "Loop Coalescing
(loop_coalesce Pragma)" in Intel High Level Synthesis Compiler Pro Edition
Reference Manual.
These conditions can cause the outer loop to take different paths (divergent loops), which can reduce the QoR of your component because these conditions prevent the Intel HLS Compiler from pipelining the loops.
For more details, review the divergent loops tutorial available in the following location:
<quartus_installdir>/hls/examples/tutorials/best_practices/divergent_loops
The array a requires more resources to implement than the array b. To reduce
hardware usage, declare array a outside the inner loop unless it is necessary to
maintain the data through iterations of the outer loop.
Tip: Overwriting all values of a variable in the deepest scope possible also reduces the
resources necessary to represent the variable.
Example
Consider a case where your component has two distinct sequential pipelineable loops:
an initialization loop with a low trip count and a processing loop with a high trip count
and no loop-carried memory dependencies. In this case, the compiler does not know
that the initialization loop has a much smaller impact on the overall throughput of your
design. If possible, the compiler attempts to pipeline both loops with an II of 1.
Because the initialization loop has a loop-carried dependence, it has a feedback path in the generated hardware. To achieve an II of 1 with such a feedback path, some clock frequency might be sacrificed. Depending on the feedback path in the main loop, the rest of your design could have run at a higher operating frequency.
If you specify #pragma ii 2 on the initialization loop, you tell the compiler that it can be less aggressive in optimizing II for this loop. Less aggressive optimization allows the compiler to pipeline the path limiting the fMAX and could allow your overall component design to achieve a higher fMAX.

The initialization loop takes longer to run with its new II. However, the decrease in the running time of the long-running loop due to the higher fMAX compensates for the increased running time of the initialization loop.
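A sketch of the scenario (names assumed):

#pragma ii 2
for (int i = 0; i < TAPS; i++) {      // short initialization loop
    state = update(state);            // loop-carried feedback path
    coeff[i] = state;
}
for (int i = 0; i < N_SAMPLES; i++) { // long-running loop, pipelined at II=1
    out[i] = filter(coeff, in[i]);
}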
With loop interleaving, the dynamic II of a loop can be approximated by the static II of
the loop divided by the degree of interleaving, that is, by the number of concurrent
invocations of the loop that are in flight.
Interleaving allows the iterations of more than one invocation of a loop to execute in
parallel, provided that the static II of that loop is greater than 1. By default, the
maximum amount of interleaving for a loop is equal to the static II of that loop.
Review the following tutorial to learn more about loop interleaving and how to control it: <quartus_installdir>/hls/examples/tutorials/loop_controls/max_interleaving.
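For instance (a sketch), capping interleaving on a serialized inner loop with II > 1 trades a little throughput for area:

#pragma max_interleaving 1
for (int j = 0; j < M; j++) {
    acc = f(acc, data[j]); // reduction keeps this loop at II > 1 (assumed)
}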
6. fMAX Bottleneck Best Practices
The Intel HLS Compiler Pro Edition comes with a number of tutorials that illustrate
important Intel HLS Compiler concepts and demonstrate good coding practices.
Review the following tutorials to learn about fMAX bottleneck best practices that might
apply to your design:
Tutorial Description
You can find these tutorials in the following location on your Intel Quartus Prime system:
<quartus_installdir>/hls/examples/tutorials
best_practices/fpga_reg   Demonstrates how manually adding pipeline registers can increase fMAX.
best_practices/overview   Demonstrates how fMAX can depend on the interface used in your component.
best_practices/parallelize_array_operation   Demonstrates how to improve fMAX by correcting a bottleneck that arises when performing operations on an array in a loop.
best_practices/reduce_exit_fifo_width   Demonstrates how to improve fMAX by reducing the width of the FIFO belonging to the exit node of a stall-free cluster.
best_practices/relax_reduction_dependency   Demonstrates how fMAX can depend on the loop-carried feedback path.
The fMAX target is a strong suggestion: the compiler does not error out if it cannot achieve this fMAX, whereas #pragma ii triggers an error if the compiler cannot achieve the requested II. The fMAX achieved for each block of code is shown in the Loops report.
The following table outlines the behavior of the scheduler in the Intel HLS Compiler:

Explicitly Specify fMAX?   Explicitly Specify II?   Compiler Behavior
No    Yes   Best effort to achieve the II for the corresponding loop (may not achieve the best possible fMAX).
Yes   No    Best effort to achieve the fMAX specified (may not achieve the best possible II).
Yes   Yes   Best effort to achieve the fMAX specified at the given II. The compiler errors out if it cannot achieve the requested II.

Note: If you are using an fMAX target on the command line or for a component, use #pragma ii <N> for performance-critical loops in your design.
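A combined sketch (the command line, file, and component names are assumptions):

// i++ --clock 300MHz -march=Arria10 --component dut dut.cpp
#pragma ii 1
for (int i = 0; i < N; i++) {
    out[i] = in[i] + 1; // performance-critical loop pinned to II=1
}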
7. Memory Architecture Best Practices
In most cases, you can optimize the memory architecture by modifying the access
pattern. However, the Intel HLS Compiler Pro Edition gives you some control over the
memory architecture.
The Intel HLS Compiler Pro Edition comes with a number of tutorials that illustrate
important Intel HLS Compiler concepts and demonstrate good coding practices.
Review the following tutorials to learn about memory architecture best practices that
might apply to your design:
You can find these tutorials in the following location on your Intel Quartus Prime system:
<quartus_installdir>/hls/examples/tutorials/component_memories
attributes_on_mm_agent_arg Demonstrates how to apply memory attributes to Avalon Memory Mapped (MM)
agent arguments.
exceptions Demonstrates how to use memory attributes on constants and struct members.
memory_bank_configuration Demonstrates how to control the number of load/store ports of each memory
bank and optimize your component area usage, throughput, or both by using one
or more of the following memory attributes:
• hls_max_replicates
• hls_singlepump
• hls_doublepump
• hls_simple_dual_port_memory
memory_geometry Demonstrates how to control the number of load/store ports of each memory
bank and optimize your component area usage, throughput, or both by using one
or more of the following memory attributes:
• hls_bankwidth
• hls_numbanks
• hls_bankbits
non_trivial_initialization Demonstrates how to use the C++ keyword constexpr to achieve efficient
initialization of read-only variables.
The following code examples demonstrate how you can use the following memory
attributes to override coalesced memory to conserve memory blocks on your FPGA:
• hls_bankwidth(N)
• hls_numbanks(N)
• hls_singlepump
• hls_max_replicates(N)
The original code coalesces two memory accesses, resulting in a memory system that
is 256 locations deep by 64 bits wide (256x64 bits) (two on-chip memory blocks):
component unsigned int mem_coalesce_default(unsigned int raddr,
unsigned int waddr,
unsigned int wdata){
unsigned int data[512];
data[2*waddr] = wdata;
data[2*waddr + 1] = wdata + 1;
unsigned int rdata = data[2*raddr] + data[2*raddr + 1];
return rdata;
}
The following images show how the 256x64 bit memory for this code sample is structured, as well as how the component memory structure is shown in the high-level design report (report.html).
(Figure: the 256-word x 64-bit memory accessed by two 32-bit LSUs; locations 256 through 511 of the original 512-word array are shown empty.)
The modified code implements a single on-chip memory block that is 512 words deep
by 32 bits wide with stallable arbitration:
component unsigned int mem_coalesce_override(unsigned int raddr,
unsigned int waddr,
unsigned int wdata){
//Attributes that stop memory coalescing
hls_bankwidth(4) hls_numbanks(1)
//Attributes that specify a single-pumped single-replicate memory
hls_singlepump hls_max_replicates(1)
unsigned int data[512];
data[2*waddr] = wdata;
data[2*waddr + 1] = wdata + 1;
unsigned int rdata = data[2*raddr] + data[2*raddr + 1];
return rdata;
}
The following images show how the 512x32 bit memory with stallable arbitration for this code sample is structured, as well as how the component memory structure is shown in the high-level design report (report.html).
(Figure: the 512-word x 32-bit memory implemented as a single bank (Bank 0, words 0 through 511), with two 32-bit LSUs sharing access through stallable arbitration.)
While it might appear that you save hardware area by reducing the number of RAM blocks needed for the component, the introduction of stallable arbitration increases the amount of hardware needed to implement the component. In the following table, you can compare the number of ALMs and FFs required by the components.
The following code examples demonstrate how you can use the following memory
attributes to override banked memory to conserve memory blocks on your FPGA:
• hls_bankwidth(N)
• hls_numbanks(N)
• hls_singlepump
• hls_doublepump
The original code creates two banks of single-pumped on-chip memory blocks that are
16 bits wide:
component unsigned short mem_banked(unsigned short raddr,
                                    unsigned short waddr,
                                    unsigned short wdata){
  unsigned short data[1024];
  data[2*waddr] = wdata;
  data[2*waddr + 9] = wdata + 1;
  // Assumed completion (the read was lost at a page break); the reads
  // mirror the write addresses.
  unsigned short rdata = data[2*raddr] + data[2*raddr + 9];
  return rdata;
}
To save banked memory, you can implement one bank of double-pumped 32-bit-wide
on-chip memory block by adding the following attributes before the declaration of
data[1024]. These attributes fold the two half-used memory banks into one fully-
used memory bank that is double pumped, so that it can be accessed as quickly as the
two half-used memory banks.
hls_bankwidth(2) hls_numbanks(1)
hls_doublepump
unsigned short data[1024];
When you merge memories, multiple component variables share the same memory
block. You can merge memories by width (width-wise merge) or depth (depth-wise
merge). You can merge memories where the data in the memories have different
datatypes.
The following diagram shows how four memories can be merged width-wise and
depth-wise.
(Figure: four 64-word memories merged either depth-wise into one 256-word memory or width-wise into one 64-bit-wide, 64-word memory.)
All variables with the same <mem_name> label set in their hls_merge attributes are
merged.
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
  int a[128];
  int b[128];
  int rdata;
  // Mutually exclusive write (assumed reconstruction; the start of this
  // listing was lost at a page break)
  if (use_a) {
    a[waddr] = wdata;
  } else {
    b[waddr] = wdata;
  }
  // Mutually exclusive read
  if (use_a) {
    rdata = a[raddr];
  } else {
    rdata = b[raddr];
  }
  return rdata;
}
The code instructs the Intel HLS Compiler Pro Edition to implement local memories a
and b as two on-chip memory blocks, each with its own load and store instructions.
(Figure: memories a and b each implemented as a 128-word x 32-bit block with its own store (St) and load (Ld) units.)
Because the load and store instructions for local memories a and b are mutually
exclusive, you can merge the accesses, as shown in the example code below. Merging
the memory accesses reduces the number of load and store instructions, and the
number of on-chip memory blocks, by half.
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
  int a[128] hls_merge("mem","depth");
  int b[128] hls_merge("mem","depth");
  int rdata;
  // Mutually exclusive write (assumed reconstruction, as above)
  if (use_a) {
    a[waddr] = wdata;
  } else {
    b[waddr] = wdata;
  }
  // Mutually exclusive read
  if (use_a) {
    rdata = a[raddr];
  } else {
    rdata = b[raddr];
  }
  return rdata;
}
(Figure: memories a and b merged depth-wise into one 256-word block with one store (St) and one load (Ld) unit.)
There are cases where merging local memories with respect to depth might degrade memory access efficiency. Before you decide whether to merge the local memories with respect to depth, refer to the HLD report (<result>.prj/reports/report.html) to ensure that the merge produces the expected memory configuration with the expected number of load and store instructions. In the example below, the Intel HLS Compiler Pro Edition should not merge the accesses to local memories a and b because the load and store instructions to each memory are not mutually exclusive.
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
int a[128] hls_merge("mem","depth");
int b[128] hls_merge("mem","depth");
int rdata;
a[waddr] = wdata;
b[waddr] = wdata;
rdata = a[raddr];
rdata += b[raddr];
return rdata;
}
In this case, the Intel HLS Compiler Pro Edition might double pump the memory system to provide enough ports for all the accesses. Otherwise, the accesses must share ports, which prevents stall-free accesses.
Figure 38. Local Memories for Component depth_manual with Non-Mutually Exclusive
Accesses
(Figure: the 256-word memory system double-pumped (2x clk) to provide 2 store units and 2 load units with 1 M20K block.)
All variables with the same <mem_name> label set in their hls_merge attributes are
merged.
component short width_manual(int raddr, int waddr, short wdata) {
  short a[256];
  short b[256];
  short rdata = 0;
  // Assumed reconstruction (the start and body of this listing were lost
  // at a page break): both memories are accessed at the same address.
  a[waddr] = wdata;
  rdata += a[raddr];
  b[waddr] = wdata;
  rdata += b[raddr];
  return rdata;
}
(Figure: memories a and b each implemented as a 256-word x 16-bit block with its own store and load units.)
In this case, the Intel HLS Compiler Pro Edition can coalesce the load and store
instructions to local memories a and b because their accesses are to the same
address, as shown below.
component short width_manual (int raddr, int waddr, short wdata) {
  short a[256] hls_merge("mem","width");
  short b[256] hls_merge("mem","width");
  short rdata = 0;
  // Assumed reconstruction, as above: the accesses to a and b share the
  // same address, so their loads and stores can be coalesced.
  a[waddr] = wdata;
  rdata += a[raddr];
  b[waddr] = wdata;
  rdata += b[raddr];
  return rdata;
}
The (b0, b1, ..., bn) arguments refer to the local memory address bit positions that the Intel HLS Compiler Pro Edition should use for the bank-selection bits. Specifying the hls_bankbits(b0, b1, ..., bn) attribute implies that the number of banks equals 2^(number of bank bits).
Table 8. Example of Local Memory Addresses Showing Word and Bank Selection Bits
This table of local memory addresses shows an example of how a local memory might be addressed. The memory attribute is set as hls_bankbits(3,4). The memory bank selection bits (bits 3 and 4) are shown in bold text, and the word selection bits (bits 0-2) are shown in italic text.
Restriction: Currently, the hls_bankbits(b0, b1, ..., bn) attribute supports only
consecutive bank bits.
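For instance (a sketch; the array name is assumed), the following declaration selects among 2^2 = 4 banks using element-address bits 4 and 5:

hls_bankbits(4, 5)
int a[4][4][4];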
(1)
// Assumed completion: the start of this listing was cut off at a page
// break (the component name is an assumption). hls_numbanks(1) prevents
// the compiler from automatically splitting the memory into banks.
component int bank_arbitration(int raddr, int waddr, int wdata) {
  #define DIM_SIZE 4
  hls_numbanks(1)
  int a[DIM_SIZE][DIM_SIZE][DIM_SIZE];
  // initialize array a…
  int result = 0;
  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      a[dim1][waddr&(DIM_SIZE-1)][dim3] = wdata;
  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      result += a[dim1][raddr&(DIM_SIZE-1)][dim3];
  return result;
}
As illustrated in the following figure, this code example generates multiple load and
store instructions, and therefore multiple load/store units (LSUs) in the hardware. If
the memory system is not split into multiple banks, there are fewer ports than
memory access instructions, leading to arbitrated accesses. This arbitration results in
a high loop initiation interval (II) value. Avoid arbitration whenever possible because it
increases the FPGA area utilization of your component and impairs the performance of
your component.
(1)
For this example, the initial component was generated with the hls_numbanks attribute set to
1 (hls_numbanks(1)) to prevent the compiler from automatically splitting the memory into
banks.
By default, the Intel HLS Compiler Pro Edition splits the memory into banks if it
determines that the split is beneficial to the performance of your component. The
compiler checks if any bits remain constant between accesses, and automatically
infers bank-selection bits.
// Assumed completion: the start of this listing was cut off at a page
// break (the component name is an assumption). hls_bankbits(4, 5) makes
// bits 4 and 5 of the element address the bank-selection bits.
component int bank_no_arbitration(int raddr, int waddr, int wdata) {
  hls_bankbits(4, 5)
  int a[DIM_SIZE][DIM_SIZE][DIM_SIZE];
  // initialize array a…
  int result = 0;
  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      a[dim1][waddr&(DIM_SIZE-1)][dim3] = wdata;
  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      result += a[dim1][raddr&(DIM_SIZE-1)][dim3];
  return result;
}
The following diagram shows that this example code creates a memory configuration
with four banks. Using bits 4 and 5 as bank selection bits ensures that each load/store
access is directed to its own memory bank.
In the Function Memory Viewer (in the High-Level Design Reports), the Address bit information shows the bank selection bits as b6 and b7, instead of b4 and b5:
This difference occurs because the address bits reported in the Function Memory
Viewer are based on byte addresses and not element addresses. Because every
element in array a is four bytes in size, bits b4 and b5 in element address bits
correspond to bits b6 and b7 in byte addressing.
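In general, byte_bit = element_bit + log2(bytes per element); here that is 4 + log2(4) = 6 and 5 + log2(4) = 7.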
8. System of Tasks Best Practices
After you implement a system of tasks, you might need to balance the capacity of
your task functions to improve performance. For details, review the advice in
Balancing Capacity in a System of Tasks on page 88.
For example, in the following code sample, the first and second loops can be executing
different invocations of the component foo() if the invocations can be pipelined by
the Intel HLS Compiler Pro Edition:
component void foo() {
// first loop
for (int i = 0; i < n; i++) {
// Do something
}
// second loop
for (int i = 0; i < m; i++) {
// Do something else
}
}
However, the same invocation of the component foo() cannot execute the two loops in parallel. A system of tasks provides a way to achieve this parallelism by moving the loops into asynchronous tasks. With the first loop in an asynchronous task, the second loop can run concurrently with the first loop.
void first_loop() {
for (int i = 0; i < n; i++) {
// Do something
}
}
void second_loop() {
for (int i = 0; i < m; i++) {
// Do something else
}
}
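The calling component can then launch the first loop as an asynchronous task and run the second loop while it executes (a sketch following the ihc::launch/ihc::collect form used later in this chapter):

component void foo() {
    ihc::launch<first_loop>();  // starts running asynchronously
    second_loop();              // executes concurrently with first_loop
    ihc::collect<first_loop>(); // waits for the task to finish
}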
To allow for calls from multiple places to a task, the Intel HLS Compiler Pro Edition generates arbitration logic for the called task function. This arbitration logic can increase the area utilization of the component. However, if the shared logic is large, the trade-off can help you save FPGA resources. The savings can be especially noticeable when your component has a large compute block that is not always active.
If you do not use a system of tasks, function calls in your HLS component are in-lined
and optimized together with the calling code, which can be detrimental in some
situations. Use a system of tasks to prevent smaller blocks of your design from being
affected by the rest of the system.
The hierarchical design pattern implemented by using a system of tasks can give you
the following benefits:
• Modularity similar to what a hardware description language (HDL) might provide
• Unpipelineable or poorly pipelined loops can be isolated so that they do not affect
an entire loop nest.
Typically, these performance issues are caused by a lack of capacity in the datapath of the function that calls your task functions with the ihc::launch and ihc::collect calls. You can improve system throughput in these cases by adding a buffer to the explicit streams to account for the latency of the task functions.
Review the following tutorials to learn more about avoiding potential performance
issues in a component that uses a system of tasks:
• <quartus_installdir>/hls/examples/tutorials/system_of_tasks/
balancing_pipeline_latency
• <quartus_installdir>/hls/examples/tutorials/system_of_tasks/
balancing_loop_delay
• <quartus_installdir>/hls/examples/tutorials/system_of_tasks/
launch_and_collect_capacity
The Intel HLS Compiler Pro Edition emulator models the size of the buffer attached to a stream. However, the emulator does not fully account for hardware latencies, so your component might exhibit different behavior between simulation and emulation in these cases.

In addition to the techniques outlined in the tutorials, follow these practices to maximize the data throughput of your design.
8.4.1. Enable the Intel HLS Compiler to Infer Data Path Buffer Capacity Requirements
In many situations, the Intel HLS Compiler can add buffer capacity automatically to
the data path in a system of tasks design to achieve maximum throughput for your
design. Follow a few best practices to help the Intel HLS Compiler effectively add data
path buffer capacity to your design when needed.
As an example, consider the following design that runs two independent tasks. This
kind of structure can be generated by code like the following example:
component foo() {
// Parse/compute data for tasks
ihc::launch<task1>(data1);
ihc::launch<task2>(data2);
auto r1 = ihc::collect<task1>();
auto r2 = ihc::collect<task2>();
// Usage of r1, r2
}
The following diagram shows the state of the system of tasks at the start of the third
invocation of the component, and the location of data in the overall pipeline from
previous invocations.
Figure 43. Data Flow of Multiple Component Invocations Through a System of Tasks
The circles represent pipelined stages of the component, and the numbers indicate the location of data from different invocations of component foo. This diagram shows three invocations of the component underway.
(Diagram: Entry feeds Task1 and Task2, and Exit collects from both. Invocation 3 is at Entry, invocation 2 is in the tasks, invocation 1 is at Exit, and the shorter task shows underutilized pipeline stages.)
In this diagram, Entry represents the two independent launch calls, and the Exit
represents the two independent collect calls.
Entry provides work to both tasks only if both tasks can take in data (that is, both tasks have available buffer capacity). Similarly, Exit consumes the results only when both results are available.
If Task1 and Task2 have the same number of pipeline stages, then the data path
performs at full throughput. Some data path buffer capacity is needed in the caller
function to ensure that the caller can continue issuing launch calls while the
collect calls wait for the task functions to complete. The compiler adds this data
path buffer capacity automatically.
If the two tasks have different pipeline depths, then the design encounters a
bottleneck because the task with the smaller pipeline depth lacks the buffer capacity
to store finished results while waiting for the other task to finish. In this case, you can
add buffer capacity to either launch or the collect call of the task with the smaller
pipeline depth. For details about adding launch/collect buffer capacity, see Explicitly
Add Buffer Capacity to Your Design When Needed on page 91.
The Intel HLS Compiler tries to balance data path buffer capacity automatically, but it
can only add data path capacity automatically when your design follows certain
practices.
Use the following best practices to obtain the maximum throughput for your system of
tasks design:
Consider specifying the capacity parameter of the ihc::launch call if you see stall
patterns in your simulation waveforms that indicate an imbalance between the
following things:
• Any back-pressure introduced by the task function
• How often the caller launches the task function
Consider specifying the capacity parameter of the ihc::collect call if you see stall patterns in your design waveforms that indicate a difference between the following things:
• The cadence of data production in the task function
• The cadence at which the caller function reads that data
• Do any tasks spend cycles stalled waiting for input data to reach them?
If yes, these tasks require launch capacity equal to the number of cycles it takes
input data to reach the task.
Adding launch capacity allows stalled tasks to buffer their start signals so that
they do not stall the other tasks that are scheduled to launch on the same cycle.
Because the consumer task depends on data from the producer task function, the
consumer task stalls until data from the producer task reaches it, so you should
add capacity to the ihc::launch call for the consumer task.
• Do any tasks finish their first execution before the slowest task in the
design (that is, the task that produces its initial return signal last)
finishes its first execution?
If yes, the tasks that finish before the slowest task in the design finishes require
additional collect capacity.
Add capacity equal to the number of cycles between when the task produces its
first return signal and when the slowest task in the design produces its first
return signal.
When you add collect capacity, tasks can buffer their return signals. Buffering
the return signal consumes it from the task, which allows the task to produce
more return signals without stalling.
Typically, tasks that communicate only through a return value do not require buffer
capacity. The top-level component handles the synchronization of the communication
of task functions, as the following figure shows:
(Diagram: component foo launches Task A and Task B through ihc::launch calls and gathers their results through ihc::collect calls; the component function handles the synchronization.)
When task functions communicate via streams, you might need to add capacity when
there is a chance that a task might be launched before its inputs are ready. In the
following diagram, you might need to add launch capacity to the consumer task:
(Diagram: a producer task function streams data to a consumer task function through an internal stream; the caller launches both tasks and collects their results, and the consumer reads the stream with internal.read().)
For an example of adding buffer capacity to a design, refer to the following tutorial: <quartus_installdir>/hls/examples/tutorials/system_of_tasks/launch_and_collect_capacity
9. Datatype Best Practices
After you optimize the algorithm bottlenecks of your design, you can fine-tune some
datatypes in your component by using arbitrary precision datatypes to shrink data
widths, which reduces FPGA area utilization. The Intel HLS Compiler Pro Edition
provides debug functionality so that you can easily detect overflows in arbitrary
precision datatypes.
The Intel HLS Compiler Pro Edition comes with a number of tutorials that illustrate
important Intel HLS Compiler concepts and demonstrate good coding practices.
Review the following tutorials to learn about datatype best practices that might apply
to your design:
Tutorial Description
You can find these tutorials in the following location on your Intel Quartus Prime system:
<quartus_installdir>/hls/examples/tutorials
best_practices/ac_datatypes   Demonstrates the effect of using the ac_int datatype instead of the int datatype.
ac_datatypes/ac_fixed_constructor   Demonstrates the use of the ac_fixed constructor where you can get a better QoR by using minor variations in coding style.
best_practices/single_vs_double_precision_math   Demonstrates the effect of using single precision literals and functions instead of double precision literals and functions.
hls_float/1_reduced_double   Demonstrates how your applications can benefit from changing the underlying type from double to hls_float<11,44> (reduced double).
hls_float/3_conversions   Demonstrates when conversions appear in designs that use the hls_float data type and how to take advantage of different conversion modes to generate compile-time constants using hls_float types.
Using this option helps you avoid inadvertently having conversions between double-
precision and single-precision values when double-precision variables are not needed.
In FPGAs, using double-precision variables can negatively affect the data transfer rate,
the latency, and resource utilization of your component.
Additionally, constants are treated as signed int or signed double. If you want
efficient operations with narrower constants, cast constants to other, narrower data
types like ac_int<> or float.
If you use the Algorithmic C (AC) arbitrary precision datatypes, pay attention to the
type propagation rules.
9.2. Avoid Negative Bit Shifts When Using the ac_int Datatype
The ac_int datatype differs from the integer types of other languages, including C and Verilog, in how it handles bit shifting. By default, if the shift amount is of a signed datatype, ac_int allows negative shifts.
In hardware, this negative shift results in the implementation of both a left shifter and
a right shifter. The following code example shows a shift amount that is a signed
datatype.
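// int14 and uint14 are assumed ac_int typedefs, for example:
//   typedef ac_int<14, true>  int14;  // 14-bit signed
//   typedef ac_int<14, false> uint14; // 14-bit unsigned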
int14 shift_left(int14 a, int14 b) {
return (a << b);
}
If you know that the shift is always in one direction, to implement an efficient shift
operator, declare the shift amount as an unsigned datatype as follows:
int14 efficient_left_only_shift(int14 a, uint14 b) {
return (a << b);
}
10. Advanced Troubleshooting
Components kept in directories that result in a long path name might not compile
properly or fail in simulation. Check your compilation log or simulation log (debug.log) to determine whether the path length is a cause of the failures. Errors indicating that a file could not be found can mean that your paths are too long.
The Intel HLS Compiler uses the component name in many of the directories that it
creates. Long component names can introduce long path issues even if your
component is in a relatively shallow location in your directory structure.
Use an epsilon when comparing floating-point results in the testbench. Floating-point results from the RTL hardware can differ from those of the x86 emulation flow.
The #pragma ivdep compiler pragma can cause functional incorrectness in your component if your component has a memory dependency that you attempted to ignore with the pragma. You can use the safelen modifier to specify how many loop iterations can safely execute before a memory dependency occurs.
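For example (a sketch), safelen(4) asserts that dependent accesses are always at least four iterations apart:

#pragma ivdep safelen(4)
for (int i = 4; i < N; i++) {
    A[i] = A[i - 4] + 1; // the dependency distance is genuinely 4
}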
Many coding practices can result in behavior that is undefined by the C++
specification. Sometimes this undefined behavior works one way in emulation and a
different way in simulation.
A common example of this situation occurs when your design reads from uninitialized
variables, especially uninitialized struct variables.
Check your code for uninitialized values with the -Wuninitialized compiler flag, or debug your emulation testbench with the valgrind debugging tool. The -Wuninitialized compiler flag does not show uninitialized struct variables.
You can also check for misbehaving variables by using one or more stream interfaces
as debug streams. You can add one or more ihc::stream_out interfaces to your
component to have the component write out its internal state variables as it executes.
By comparing the output of the emulation flow and the simulation flow, you can see
where the RTL behavior diverges from the emulator behavior.
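A minimal sketch of a debug stream (the component shape and compute() are assumptions):

component int dut(int in, ihc::stream_out<int> &debug_state) {
    int state = compute(in);  // compute() stands in for the real datapath
    debug_state.write(state); // compare this trace between emulation and simulation
    return state;
}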
If you have a non-blocking stream access (for example, tryRead()) from a stream
with a FIFO (that is, the ihc::depth<> template parameter), then the first few
iterations of tryRead() might return false in simulation, but return true in
emulation.
In this case, invoke your component a few extra times from the testbench to
guarantee that it consumes all data in the stream. These extra invocations should not
cause functional problems because tryRead() returns false.
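For example (a testbench sketch; the counts are assumptions):

for (int i = 0; i < NUM_ITEMS + EXTRA_CALLS; i++) {
    dut(); // the extra calls are harmless when tryRead() returns false
}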
The information in this section describes some common sources of stallable arbitration
nodes or excess RAM utilization.
By default, the Intel HLS Compiler Pro Edition tries to optimize your component for the
best throughput by trying to maximize the maximum operating frequency (fMAX).
A way to reduce area consumption is to relax the fMAX requirements by setting a target
fMAX value with the --clock i++ command option or the
hls_scheduler_target_fmax_mhz component attribute. The HLS compiler can
often achieve a higher fMAX than you specify, so when you set a target fMAX to a lower
value than you need, your design might still achieve an acceptable fMAX value, and a
design that consumes less area.
To learn more about the behavior of fMAX target value control see the following
tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/
set_component_target_fmax
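For instance (a sketch), capping the scheduler target at 240 MHz:

hls_scheduler_target_fmax_mhz(240)
component void dut(int *data) {
    // ... datapath scheduled against the relaxed 240 MHz target
}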
If you specify a target fMAX, the compiler might conservatively increase II to achieve your target fMAX.
If you specify a target fMAX and require II=1, you should use #pragma ii 1 on your
loops that require II=1. For more details, refer to Balancing Target fMAX and Target II
on page 70.
See Memory Architecture Best Practices on page 72 for details about how to configure
efficient memory systems.
In some cases, if you try to access different arrays of struct variables with a conditional operator, the Intel HLS Compiler Pro Edition merges the arrays into the same RAM block. You might see stallable arbitration in the Function Memory Viewer because there are not enough load/store sites on the memory system.
For example, the following code examples show an array of struct variables, a
conditional operator that results in stallable arbitration, and a workaround that avoids
stallable arbitration.
struct MyStruct {
  float a;
  float b;
};

MyStruct array1[64];
MyStruct array2[64];
The following conditional operator that uses these arrays of struct variables causes
stallable arbitration:
MyStruct value = (shouldChooseArray1) ? array1[idx] : array2[idx];
You can avoid the stallable arbitration that the conditional operator causes here by
removing the operator and using an explicit if statement instead.
MyStruct value;
if (shouldChooseArray1) {
    value = array1[idx];
} else {
    value = array2[idx];
}
Cluster Logic
Your design might consume more RAM blocks than you expect, especially if you store
many array variables in large registers. The Area Analysis of System report in the
high-level design report (report.html) can help you find this issue.
For example, in a design that intentionally stores three matrices in RAM blocks, the
RAM blocks for the matrices might account for less than half of the RAM blocks that
the component consumes.
If you look further down the report, you might see that many of those RAM blocks
are consumed by Cluster logic or State variables. You might also see that some of the
array values that you intended to be stored in registers were instead stored in large
numbers of RAM blocks.
In some cases, you can reduce this RAM block usage with the following techniques:
• Pipeline loops instead of unrolling them.
• Store local variables in local RAM blocks (hls_memory memory attribute) instead
of in large registers (hls_register memory attribute), as shown in the sketch after
this list.
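A minimal sketch of the second technique (the array size and initialization are
illustrative):

#include "HLS/hls.h"

component int lookup(int idx) {
    // hls_memory asks the compiler to implement this array in
    // on-chip RAM blocks; hls_register would instead force it into
    // (potentially very large) register storage.
    hls_memory int table[256];
    // Pipelined (not unrolled) initialization loop, so the loop
    // body hardware is not replicated.
    for (int i = 0; i < 256; i++) {
        table[i] = i * i;
    }
    return table[idx & 0xFF];
}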
If your component contains a system of tasks, you might need to add capacity to
your launch/collect calls. For details, see Explicitly Add Buffer Capacity to Your Design
When Needed on page 91.
B. Document Revision History for Intel HLS Compiler Pro Edition Best Practices Guide

Document Version | Intel Quartus Prime Version | Changes
2021.12.13 | 21.4
• Added references to the full-design tutorial and HLS Walkthrough video series to
Best Practices for Coding and Compiling Your Component on page 5.
• Fixed a typo in a figure on page 17.

2021.06.21 | 21.2
• Revised references to Avalon interfaces to align with new Avalon Interconnect
terminology. Avalon master interfaces are now Avalon host interfaces, and Avalon
slave interfaces are now Avalon agent interfaces.

2021.03.29 | 21.1
• Added Deciding When To Specify The capacity Parameter on page 91 to Explicitly
Add Buffer Capacity to Your Design When Needed on page 91.
• Replaced references to the Graph Viewer with references to the System Viewer to
reflect the new name of the viewer.

2020.12.14 | 20.4
• Added Place if-Statements in the Lowest Possible Scope in a Loop Nest on page 67.
• Revised Clustering the Datapath on page 18 to improve the description of stall-free
clusters and their exit FIFOs.
• Added Balancing Target fMAX and Target II on page 70. This content was available
previously in the "Loop Analysis Report" section of the Intel HLS Compiler User Guide.
2020.04.13 | 20.1
• Added new tutorials to Loop Best Practices on page 56.
• Added new tutorials to Interface Best Practices on page 39.
• Updated Executing Multiple Loops in Parallel on page 87 to reflect the new syntax
of the ihc::launch and ihc::collect functions.

2019.12.16 | 19.4
• Removed information about Intel HLS Compiler Standard Edition. For best practices
information for the Intel HLS Compiler Standard Edition, see the Intel HLS Compiler
Standard Edition Best Practices Guide.
• Added information to Example: Specifying Bank-Selection Bits for Local Memory
Addresses on page 81 to explain the difference between the element-address
bank-selection bits selected with the hls_bankbits attribute and the byte-address
bank-selection bits reported in the Function Memory Viewer in the High-Level Design
Reports.
• Replaced references to the Component Viewer with references to the Function View
of the Graph Viewer.
• Replaced references to the Component Memory Viewer with references to the
Function Memory Viewer.
Document Revision History for Intel HLS Compiler Best Practices Guide

Previous versions of the Intel HLS Compiler Best Practices Guide contained information
for both Intel HLS Compiler Standard Edition and Intel HLS Compiler Pro Edition.

Document Version | Intel Quartus Prime Version | Changes
2019.09.30 | 19.3
• Added Control LSUs For Your Variable-Latency MM Host Interfaces on page 53.
• Updated Memory Architecture Best Practices on page 72 to list updated and
improved tutorials and new memory attributes.
• Split the memory architecture examples for overriding coalesced memory
architectures and overriding banked memory architectures into the following sections:
— Example: Overriding a Coalesced Memory Architecture on page 73
— Example: Specifying Bank-Selection Bits for Local Memory Addresses on page 81

2019.04.01 | 19.1
• Added a new chapter to cover best practices when using HLS tasks in System of
Tasks Best Practices on page 87.
• Moved some content from Loop Best Practices on page 56 into a new section called
Reuse Hardware By Calling It In a Loop on page 57.
• Revised Component Uses More FPGA Resource Than Expected on page 98 to
include information about the hls_scheduler_target_fmax_mhz component attribute.
2018.12.24 | 18.1
• Updated Loop Best Practices on page 56 to include information about function
inlining in components and using loops to minimize the resulting hardware
duplication.

2018.09.24 | 18.1
• The Intel HLS Compiler has a new front end. For a summary of the changes
introduced by this new front end, see Improved Intel HLS Compiler Front End in the
Intel High Level Synthesis Compiler Version 18.1 Release Notes.
• The --promote-integers flag and the best_practices/integer_promotion tutorial are
no longer supported in Pro Edition because integer promotion is now done by default.
References to these items were adjusted to indicate that they apply to Standard
Edition only in the following topics:
— Component Fails Only In Simulation on page 96
— Datatype Best Practices on page 94

2018.07.02 | 18.0
• Added a new chapter, Advanced Troubleshooting on page 96, to help you
troubleshoot when your component behaves differently in simulation and emulation,
and when your component has unexpectedly poor performance, resource utilization,
or both.

2018.05.07 | 18.0
• Starting with Intel Quartus Prime Version 18.0, the features and devices supported
by the Intel HLS Compiler depend on what edition of Intel Quartus Prime you have.
Intel HLS Compiler publications now use icons to indicate content and features that
apply only to a specific edition.

2017.12.22 | 17.1.1
• Added the Choose the Right Interface for Your Component on page 40 section to
show how changing your component interface affects your component QoR even
when the algorithm stays the same.
• Added an interface overview tutorial to the list of tutorials in Interface Best
Practices on page 39.