A Driving Assistance System With Hardware Acceleration
Master of Science Thesis in Computer Systems and Networks
GONGPEI CUI
Chalmers University of Technology
University of Gothenburg
Department of Computer Science and Engineering
Gothenburg, Sweden, January 2015
The Author grants to Chalmers University of Technology and University of Gothenburg the non-exclusive right to publish the Work electronically and, for a non-commercial purpose, make it accessible on the Internet. The Author warrants that he/she is the author of the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law.

The Author shall, when transferring the rights of the Work to a third party (for example a publisher or a company), inform the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet.
Gongpei Cui
Nowadays, active safety has become a hot research topic in the vehicle industry. Active safety systems play an increasingly important role in warning drivers about an imminent collision, avoiding it, or mitigating the consequences of an accident. The increased computational complexity imposes a great challenge on the development of advanced active safety applications using traditional Electronic Control Units (ECUs). One way to tackle this challenge is hardware offloading, which can exploit massive parallelism and accelerate such applications. A hardware accelerator combined with software running on a general-purpose processor composes a hardware/software hybrid system.
Model Based Development (MBD) is a common development scheme that reduces development time and time-to-market. In this project, we evaluate different MBD workflows for hardware/software co-design and propose a general workflow for MATLAB/Simulink models. We investigate key techniques for hybrid system design and identify three factors that assist hardware/software partitioning. Moreover, several essential techniques for hardware logic implementation, such as pipelining, loop unrolling, and stream transmission, are analyzed in terms of system throughput and hardware resources.

This project describes the workflow for hardware/software co-design based on MBD and finds methods to improve system throughput by combining hardware accelerators and software. Using the proposed profiling methods and partitioning rules, a matrix multiplication function is selected to be implemented by a hardware accelerator. Having optimized the hardware implementation scheme of the accelerator, a 5.4x speedup is achieved on a Zynq evaluation board.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my examiner and supervisor Prof. Yiannis Sourdis and Dr. Vassilis Papaefstathiou. Their guidance helped me a great deal throughout the project and the writing of this thesis. In addition, I warmly thank Dr. Anders Svensson at Volvo ATR for his kind support, guidance and advice during this project. Fruitful discussions with him helped me find better and better solutions, and his great enthusiasm for research was my source of inspiration. A special acknowledgement is given to Henrik Lönn, who offered me opportunities at Volvo and made me feel at home. Finally, I am particularly grateful for the assistance given by Serkan Karakis, Hans Blom, Hamid Yhr, Mattias Wallander, Håkan Berglund, Daniel Karlsson and all the colleagues at Volvo GTT. I owe my deepest gratitude to my parents, who supported my studies abroad while living alone in China. Last but not least, my sincere thanks go to my wife, Jingya. Thank you for your positive mind, your prayers and all your efforts. I wish you all the best.
Contents

1 Introduction
1.1 Introduction
1.2 Problem Statement
1.3 Thesis Objectives
1.4 Overview of the Report

2 Background
2.1 Application
2.2 Hardware/Software Co-design
2.3 Model Based Development
2.4 Hardware Platform
2.5 Conclusion

5 Evaluation
5.1 Implementation Results
5.1.1 Performance Analysis
5.1.2 Hardware Utilization Analysis
5.2 Conclusion

6 Conclusion
6.1 Contributions
6.2 Discussion and Suggestions
6.3 Future Works and Directions

Bibliography

A Generated C Code
1
Introduction
1.1 Introduction
Over the past few decades, vehicles such as cars, buses and trucks have dramatically improved people's lives, and they have become an essential part of everyday life. They have made it easier and faster for people to get from place to place. They bring convenience, but at the same time they can endanger people's lives: surveys show that about 50,000 Europeans were killed in car accidents in 2001 [1]. Nowadays, customers select vehicles based on two major factors, (a) safety and (b) fuel consumption, cited by 54% and 53% of customers respectively.
The automotive industry is continuously working to improve vehicle safety. The safety systems in vehicles can be divided into two categories: passive safety systems and active safety systems. Passive safety technology decreases the harm to the driver and passengers in an accident. For example, seat-belts hold passengers in place so that they are not thrown forward or ejected from the car, and airbags provide a cushion to protect the driver and passengers during a crash. These passive safety systems have saved thousands of lives and are milestones in the automotive industry. Active safety refers to systems that help keep a car under control and use an understanding of the state of the vehicle to predict and avoid accidents. For example, anti-lock brakes prevent the wheels from locking up when the driver brakes, enabling the driver to steer while braking. An Advanced Driving Assistance System (ADAS) can alert the driver to potential problems, or avoid collisions by implementing safeguards and taking over control of the vehicle.
In current vehicles, increasingly many mechanical components are replaced by Electronic Control Units (ECUs), sensors, and actuators. More sophisticated active safety systems, such as Anti-lock Brake Systems (ABS) and Electronic Stability Control (ESC), are deployed in vehicles. According to the eSafety effects database, ABS has prevented 10-33% of single-vehicle accidents and ESC has reduced all kinds of crashes by 19.3% [2].
It is expected that active safety systems will play an increasing role in collision avoidance and mitigation in the future. However, as these systems grow more sophisticated, their deployment becomes more challenging.
As mentioned before, ADAS is an example of an active safety application. Figure 1.1 shows three typical ADAS applications. Figure 1.1(a) presents an example of predictive pedestrian protection. Based on video streams from cameras installed on the vehicle, a pedestrian detection algorithm marks the location of pedestrians and their distance from the vehicle. If the distance is shorter than a threshold, ADAS stops the vehicle automatically. Figure 1.1(b) depicts a lane assist system. By analyzing the lane markings, ADAS algorithms detect whether the vehicle is crossing lane markings without a reasonable cause; this function indicates whether the driver is concentrating on driving. Figure 1.1(c) shows an example of a radar system that measures the distance to the vehicle in front. When this distance becomes shorter than a safe-distance threshold, ADAS triggers an emergency braking system to stop the vehicle. Besides these three applications, new ADAS applications have been developed recently, such as surround view and automatic parking.
Figure 1.1: Applications of the Advanced Driving Assistance System: (a) predictive pedestrian protection, (b) lane assist systems, (c) emergency braking system [3]
Data fusion is an essential module for ADAS. If each application supported by ADAS required its own private sensor(s), then adding new applications would require more sensors, which is inefficient. Figure 1.2 illustrates an example of a vehicle with several ADAS applications, where multiple sensors are needed to support different services. There may be significant overlap between the coverage of different sensors. Moreover, different sensors may have different observation capabilities and various detection properties. The goal of a data fusion application is to reduce duplicated sensors and organize different sensors so as to achieve higher observation accuracy and efficiency. This thesis focuses on the implementation problems of the data fusion application.
have their own advantages and disadvantages. For instance, some algorithms are more
suitable for pure software environments while other algorithms are more efficient when
executed by specialized hardware logic. An improper implementation of an algorithm
can result in low throughput and unnecessarily high resource utilization. The communication between processors and hardware accelerators is also a key factor that limits the
performance of hybrid systems. The total execution time of a hybrid system consists
of two parts: the overhead caused by the communication between the processor and
the accelerator, and hardware execution latency. Therefore, in order to reduce the total
execution time, both the communication scheme and the hardware implementation need
to be optimized. Note that for hardware implementations, a parallel design can achieve
high system throughput, while a serial structure might use resources more efficiently. For developers, finding the proper tradeoff between performance and resource consumption in hardware design is time-consuming.
Round-Trip Engineering (RTE) is a traditional development flow for the design of hybrid systems [5]. As shown in Figure 1.5, three development roles are involved in the cycle. First, system designers define system models in specialized tools, such as MATLAB/Simulink and LabVIEW. Then, software and hardware designers implement their components in other design tools, for example Eclipse SDK, Quartus, and Vivado. The separate software and hardware designs are then combined to form the hybrid system. Finally, the HW/SW co-design is verified against the results of the models. Overall, RTE takes a significant amount of time and is not flexible if a change occurs in any of the development phases.
Model Based Development (MBD) is a method where models are used in all phases
of development. Model designers define systems in modeling tools based on requirement
analyses. Based on the models, prototyping, visualization and testing are conducted. In
the end, executable code is generated from these models automatically. The advantage
of MBD is that the whole development cycle is centered on models, which reduces development time. Ideally, only one development role, the model designer, is necessary, and all other tasks are done by the modeling tools. There are several options for modeling tools and code generators, and this thesis tries to answer the following two questions: (a) Which workflow works best for general models? (b) Does MBD fulfill the expectations of a fully automatic flow?
The first objective is to find a suitable workflow for MBD. A good workflow should shorten the development cycle as much as possible. It should also support the most common source models, including different language syntax and function blocks. The work of the project starts from the second phase of MBD; thus, the inputs of the project are MATLAB code and Simulink models. We use two MathWorks tools, namely Embedded Coder and HDL Coder, as candidates for generating software and hardware code respectively. Xilinx's High-Level Synthesis is another alternative, which can generate hardware code from C code. Therefore, the first goal is to evaluate the different tools and suggest the best workflow for hardware/software co-design of the given application.
The second objective is to investigate methods for efficient hardware/software co-design, by which hybrid systems can achieve higher performance at low cost. Specifically, high performance translates to high processing throughput, so that the design meets its timing constraints. Cost represents the silicon area on the FPGA and the utilization of the processors. In order to achieve this goal, three steps are taken: profiling, partitioning, and optimization of the hardware design. Criteria for partitioning are summarized to help developers select the blocks that should be implemented in hardware. During hardware code generation, different optimization settings, which affect the performance of the generated hardware blocks, should also be considered.
2
Background
This chapter provides the background for this thesis and discusses related work. First, we present the background of ADAS applications and describe the basic processes and algorithms. Next, we introduce the concepts of hardware/software co-design and MBD, and survey the state of the art in both areas. Finally, we present the evaluation platform used in this work.
2.1 Application
The main goal of this project is to implement a data fusion application on a SoC using MBD. Although the application model was designed in previous work, it is still necessary to understand the basic processes so that we can explain the results of the application profiling and the complexity of the application.
Fusion is widely used in nature, where human beings combine the senses of vision, touch and hearing to help them recognize the world [7]. With the development of sensor technology, signal processing and advanced hardware, data fusion is becoming increasingly popular in industry [8, 9]. In general, multisensor data fusion associates information from different sensors and provides a more accurate observation of objects than a single sensor can. Consequently, the better the representation provided, the more precise the decisions that can be made [10, 11].
Specifically, in the vehicle industry, sensor data fusion (sDF) provides a way to use different sensors in the active safety applications of ADAS, as shown in Figure 2.1. Various sensors are used because different sensors have distinct properties; for instance, radars are sensitive to longitudinal distance but have lower accuracy in lateral measurements [12]. Information from cameras is useful for detecting the direction of objects, while it is difficult to measure how far away the objects are. In addition, the rate of traffic accidents is 17 times higher in bad weather conditions than in good conditions. Moreover, laser and ultrasonic sensors are not robust in foggy weather compared to millimeter-wave (mm-W) radar [13].
However, the combined information from mm-W radars and cameras can improve the
object recognition rate by 84% as compared to human judgments.
Sandblom and Sorstedt present the general process of sDF [12], which is summarized in Figure 2.2. Observational data from different sensors is represented in various models, such as the constant acceleration (CA) model and the coordinated turn constant acceleration (CT) model [14]. The first step is to parameterize them into a single model; alignment then follows to compensate for the time delays of the different sensors. The main part of the procedure is state fusion, which consists of four steps. First, data from different sensors is associated using the Global Nearest Neighbor method, where data representing the same objects is clustered [15]. Second, based on the previous results, a local trace management algorithm is executed, e.g. Multiple Hypothesis Tracking [16]. The next two steps are priority selection and state updating. Finally, according to the updated state vector, the objects' movement is estimated and the final results are delivered to the ADAS applications.
ecution on NI’s cluster servers yields very fast results. Other examples include OpenCL:
a programming language supporting parallelism models for CPUs, GPUs, FPGAs and
so on. OpenCL compilers can use C programs to generate RTL code for FPGAs [32].
Recently, OpenCL has been integrated into Altera’s SDK, by which embedded CPUs
can communicate with accelerators implemented in FPGAs [33].
2.5 Conclusion
This chapter introduced the background for this thesis and the main concepts of multisensor data fusion. We discussed the two main areas of the thesis: (i) hardware/software co-design and (ii) MBD. We presented the motivation, gave definitions, and discussed the key issues and the most common solutions. Finally, we presented details about the evaluation platform and the FPGA development board.

Although many MBD methods for hardware/software co-design exist, in this thesis we use predesigned models in MATLAB/Simulink and focus on the associated workflows. For the hardware/software co-design, we focus on and optimize the system throughput, hardware architecture and interconnection schemes. Other key factors, such as design space exploration and energy consumption, are left for future work.
3
Methodologies for MBD
Nowadays, MATLAB/Simulink is one of the most popular modeling tools and is widely used in industry. In this chapter, we present the findings of our study of different workflows. In Section 3.1, we analyze a workflow for hardware/software co-design based on Simulink models. In Section 3.2, a similar workflow based on MATLAB models is described. Moreover, a different workflow for hardware code generation, namely High-Level Synthesis (HLS), is presented. The advantages and limitations of the different workflows are also discussed in this chapter.
HDL code can be generated using the 'HDL Workflow Advisor' of HDL Coder. First, the block that will be implemented in hardware logic is selected from the original Simulink model. This block is used to start the HDL Workflow Advisor. Then, in the advisor, an IP Core for the selected block is generated. In the next step, the generated IP Core, together with a processor, is inserted into an FPGA project. At the same time, the connection between the IP Core and the processor is established automatically. Based on the FPGA project, the advisor generates a bit file, which can be used to program a target board. Finally, a new Simulink model is produced by replacing the selected block with an interface between the Simulink model and the generated IP Core. This new model is used to synthesize C code in the second part. Figure 3.3 gives an overview of the advisor.
The second part of the Simulink workflow is to synthesize C code using Embedded Coder. In the first step, the interface between Simulink and a target processor is configured. The second step is to build the Simulink model and generate the C code. Then, the C code can be downloaded to the target processor. Finally, via Simulink, the C code can be run on the processor. We can easily control input variables and check simulation results in Simulink. This step is called a System-In-the-Loop (SIL) co-simulation. With this feature, developers can verify the generated RTL and C code to confirm that the behavior of the hybrid system matches the behavior of the model.
3.1.1 Discussion
Advantages:
• Convenient HDL Workflow Advisor: The advisor covers the whole process of RTL
code generation. It produces the IP Core and the FPGA project, and replaces the selected block in the original Simulink model with a specific interface. Following the advisor, users can perform all the essential steps for hardware/software co-design.
• Broad block support: HDL Coder and Embedded Coder provide sufficient support for Simulink blocks. Unlike the MATLAB workflow, almost all the Simulink pre-defined blocks can be synthesized to RTL and C code.
Limitations:
• Single top-level block: The advisor supports only one block for hardware implementation. We tried to execute the advisor in a second round, selecting a second block to be implemented in hardware, but the advisor does not work correctly in the second round. Since it is very common for several hardware blocks (hardware accelerators) to exist in one design, users cannot use the Simulink workflow to complete the whole hybrid system design.
Conclusion:
In summary, Embedded Coder and HDL Coder are good code generators for Simulink models. The generated code can be easily verified. However, if a system contains more than one acceleration block, the developers need to integrate the hardware/software co-design manually.
3.2.1 Discussion
Advantages:
• Easy mapping between the m-script and the C code: C programming is very similar to writing an m-script, and therefore it is easy to find a one-to-one mapping between them. This is a good feature that helps developers better understand the generated code.
• Good support from m-script to C code: Embedded Coder has good support for MATLAB functions. Almost all functions included in an m-script can be synthesized to C code.
Limitations:
• Limited function support: Only a small set of the pre-defined functions is supported by HDL Coder [38], and some language syntax is not supported by Embedded Coder [39]. For example, in our real model, data in structure format is used for input and output parameters, but structures are not allowed by HDL Coder. Therefore, we suggest that designers check which MATLAB language features, functions, classes, and System objects are supported by Embedded Coder and HDL Coder before designing the model [38, 39].
• Hard-to-find reference designs: Most of the documents are only available to users with product licenses. For some features, even with a license, it is still difficult to find corresponding reference designs. For instance, HDL Coder supports the following three communication schemes: the basic non-blocking AXI interface, the blocking AXI interface, and the AXI streaming interface, but we could only find a reference design for the basic setting.
Since the MATLAB workflow does not include any interactive design that establishes communication between the modeling tools and the boards, it is not possible to run a co-simulation as in the Simulink workflow.
Figure 3.6 presents one way we propose to verify the generated code. In this case, 'Function1' is the initial top-level function. A 'MainFunction' is created that includes 'Function1', the input data, and the output data. The input and output data are pre-generated in MATLAB, and the output data is used as a reference for verification. The input data is fed to 'Function1' and the result of 'Function1' is compared with the reference output data. Both the target function ('Function1') and the verification data are synthesized when 'MainFunction' is sent to the code generators. By doing so, we can verify the generated code automatically.
Conclusion: For MATLAB models, Embedded Coder and HDL Coder can generate C functions and RTL code. The C code and IP Cores are useful modules for further hybrid system development. However, developers need to do the integration work manually.
3.2.2 C to HDL
Due to the limitations of the MATLAB workflow discussed in Section 3.2, HDL Coder cannot satisfy our requirements. Another potential solution is to use High-Level Synthesis (HLS). HLS is a tool designed by Xilinx that synthesizes RTL code from C code. The C code can be generated by Embedded Coder, as mentioned in the beginning of this chapter.
The way of using HLS is very similar to that of using HDL Coder. A top-level function and a testbench are given to HLS, which then builds the C code, synthesizes RTL code, and exports IP Cores. Similar to HDL Coder, HLS has interface and optimization settings, which can be configured through TCL scripts.
Advantages:
• Good support for SoC chips: HLS is part of the Xilinx design flow, so it has good support for hardware logic implementation; namely, it offers very flexible optimization options and interfaces. The Xilinx website has a comprehensive set of tutorials and documents describing how to use these features. In Chapter 4 we show how to use three types of implementation settings to improve the performance of the RTL code.
Limitations:
4
Application Analysis and
Implementation
In this chapter, the methodologies for system optimization are introduced. First, the profiling methods are presented in Section 4.1, where three different profiling methods are compared. Next, according to the results obtained from profiling, we propose three criteria for selecting the function(s) to be implemented in the programmable logic (PL). Finally, the optimization of the hardware implementation is discussed in Section 4.3, where we evaluate loop pipelining, loop unrolling, and stream-based interface schemes.
4.1 Profiling
Profiling is an important step that needs to be done before an application is partitioned into hardware and software parts. Performance in terms of execution time, memory usage and function call graphs is usually analyzed during the profiling process. In this project, the system throughput is our main criterion. In the following parts, we discuss and compare three different profiling methods, i.e., target profiling, host profiling, and MATLAB profiling, for analyzing the execution time of applications.
are shown in the graph: the names of the functions, the percentage of time taken to run each function (including sub-functions), the percentage of time taken by the function's own code (excluding sub-functions), and the number of calls. From the graphs, it is very convenient to find the relationships between different function blocks and the proportion of the execution time for each function. Moreover, from the colors of the diagram, developers can see which parts require more computational resources.
steps for host profiling are summarized in Appendix B. In general, host profiling uses the same profiling feature as target profiling and produces similar outputs, namely one text file and one call graph. The comparison between host and target profiling is discussed in Section 4.1.5.
and Figure 4.6, where the execution time and memory usage per line are illustrated
respectively.
4.1.5 Comparison
In this section, we compare the three profiling methods. We assume that the results from target profiling represent the true performance on the real system. Then, by comparing against the results of target profiling, we can verify the accuracy of the results provided by host profiling and MATLAB profiling.
Table 4.1 shows the results of target, MATLAB and host profiling. For the test case designed in Section 4.1.1, the rank of each function is listed for each profiling method. The left table gives the results obtained from target profiling. We see that 'f6', 'f5' and 'f3' are the heaviest functions, together taking more than 70% of the execution time, while 'f2', 'f4' and 'f1' are three lightweight functions. The table in the middle shows the results from MATLAB profiling. It can be seen that the ranks of the functions are almost the same as those for target profiling.
However, we observe that 'f5' and 'f3' take a much smaller proportion of the execution time than in target profiling. Note that both 'f5' and 'f3' are matrix multiplications, so one possible explanation is that MATLAB has better optimized algorithms for matrix computations. Finally, the right table presents the results obtained from host profiling. The execution times for all functions except 'f6' are similar to the results from target profiling. However, the most time-consuming function, 'f6' (a matrix division), takes almost zero execution time in host profiling. The reason for this result is not clear. Note also that the total execution proportion is not 100%; this might imply that some operations are performed at the operating system level and cannot be sampled by Eclipse.
T' = (T · P_i) / S_i + T · (1 − P_i)    (4.1)
Equation 4.2 is another form of Equation 4.1, expressing T' as a function of T_i^exe = T · P_i. Assuming that S_i is constant for all candidates and larger than 1, the longer T_i^exe is, the shorter the execution time T' on the new system. Consequently, the first criterion is a long execution time T_i^exe.
T' = f(T_i^exe) = T_i^exe / S_i + T − T_i^exe = T_i^exe · (1/S_i − 1) + T    (4.2)
T' = f(T_i^exe, T_i^tran) = T_i^exe / S_i + T_i^tran + T − T_i^exe    (4.3)
Based on the experience from this project, the communication overhead can easily be estimated from the amount of data, the clock frequency and the selected communication protocol. Let n_i^i and n_o^i denote the number of input and output data items respectively. Then, the communication overhead is (n_i^i + n_o^i) · t_tran. Using the basic AXI protocol, t_tran is 16 clock cycles, so it takes 160 ns to send or receive one data item when the clock frequency is set to 100 MHz.
T' = f(T_i^exe, T_i^tran, S_i) = T_i^exe / S_i + T_i^tran + T − T_i^exe    (4.4)
One property of hardware (FPGAs) is support for massively parallel computation. For example, one Zynq chip contains 220 DSP units, while a processor contains far fewer arithmetic logic units (ALUs); 220 arithmetic calculations can thus execute at the same time on the Zynq, while only a few calculations can run in parallel on the processor. As a result, algorithms that contain a lot of parallelism can achieve high speedup when implemented in hardware. Furthermore, loops with independent iterations can be executed in parallel in hardware to achieve high speedup.
In order to explain what kinds of computation characteristics are suitable for hard-
ware implementation, six examples are examined:
Function 1: A = B';
Function 2: A = B .* B';
Function 3: A = B * B';
Function 4: A = B .* C';
Function 5: A = B * C';
Function 6: A = B^(-1);
In conclusion, blocks that have higher computation demands, less data exchange, and better speedup factors are good candidates for hardware implementation.
According to the previous evaluation of different workflows, HLS is the tool selected to synthesize RTL code. The steps of HLS have been introduced in Section 3.2.2. There are many options by which designers can set and customize the implementation of IP Cores. A list of options and example settings can be found in Table 4.3. A detailed description of all settings can be found in the Xilinx user guide [42].
In this section, two useful settings for loop optimization and an interface scheme,
Direct Memory Access (DMA), will be introduced.
4.3.1 Pipelining
Loops are commonly used in algorithms. In hardware implementations, the input data is fed into a pipeline and results are delivered to the output port iteration by iteration. Figure 4.7 shows a simple example that contains loop iterations. Ideally, each operation in the loop takes 1 clock cycle in the default setting. Therefore, one iteration takes 3 cycles to finish. The next iteration does not start until the previous one is finished; thus, the second iteration starts in cycle 4 and it takes 6 cycles in total to finish 2 iterations.
Assuming that there is no dependence between these two iterations, in pipelined mode the second iteration does not need to wait for the previous iteration to complete all operations. It can start as soon as no resource conflict exists. In Figure 4.8, the iteration latency is still 3 cycles, but the initiation interval between two iterations is reduced from 3 cycles to 1 cycle. The blocks with the same color are executed at different times, so no resource conflict exists. As shown in Figure 4.8, it takes 4 cycles in total to finish 2 iterations, which is 2 cycles faster than the default setting.
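A minimal sketch of how pipelining is requested in a Vivado HLS source file is shown below. The function and array names are illustrative; the pragma asks the tool for an initiation interval (II) of 1, meaning a new iteration starts every clock cycle. An ordinary C compiler simply ignores the pragma, so the function remains testable in software.

```c
/* Illustrative HLS kernel: read, multiply, write form a short
   pipeline; with II=1 a new iteration enters the pipeline each cycle. */
void scale_vector(const float in[15], float out[15], float k)
{
    for (int i = 0; i < 15; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * k;
    }
}
```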
Figure 4.7: An Example Loop And Time Schedule of The Loop in Normal Mode
Figure 4.9: Time Schedule of The Loop in Loop Unrolling Mode (a) and (b)
In fully unrolled mode, the latency of the loop no longer depends on the number of iterations: HLS duplicates the hardware logic so that all iterations execute in parallel. However, if there are dependencies between the operations within an iteration, those operations have to execute sequentially. In this example, the three operations within one iteration (read, calculate and write) need to execute one by one. Therefore, the latency of one iteration is 3 cycles, as shown in Figure 4.9(b). Since the hardware implementation is parallelized across iterations, the loop duration is always 3 cycles no matter how many iterations the loop contains. HLS also allows designers to unroll a loop partially. In that case, the degree of parallel execution is determined by the unrolling factor.
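Loop unrolling is requested in a similar way. In the illustrative sketch below, factor=5 duplicates the loop body five times, so the 15 additions execute as three groups of five parallel operations; omitting the factor unrolls the loop completely. Again, a plain C compiler ignores the pragma, so the behavior is unchanged in software.

```c
/* Illustrative partially-unrolled HLS kernel: names are assumptions,
   not taken from the application code. */
void add_vectors(const float a[15], const float b[15], float c[15])
{
    for (int i = 0; i < 15; i++) {
#pragma HLS UNROLL factor=5
        c[i] = a[i] + b[i];
    }
}
```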
The High Performance (HP) ports and the Accelerator Coherency Port (ACP) are two options for data transfer. Both of them belong to the category of DMA schemes. Using HP, IP Cores can communicate with the DDR memory directly. A hardware accelerator can also exchange data with the L2 cache of the processor via the ACP. The first advantage of DMA schemes is that exchanged data does not need to go through the L1 cache and the processor core. Thus, the time for memory flushes can be saved, especially for large data streams. The second advantage is that one address is enough for transmitting a whole array: since DMA schemes are stream-based, the protocol can access a contiguous memory area given its base address and the size of the region. A detailed tutorial on the DMA connection for Zynq-7000 can be found in [44].
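As an illustrative sketch (the exact pragma syntax varies between Vivado HLS versions, and the function name is an assumption), an IP Core can expose AXI4-Stream ports so that an AXI DMA engine streams the arrays to and from DDR using only a base address and a length:

```c
/* Illustrative HLS kernel with streaming ports. The INTERFACE pragmas
   ask the tool to implement in/out as AXI4-Stream ports suitable for
   an AXI DMA connection; a plain C compiler ignores them. */
void square_stream(const float in[225], float out[225])
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
    for (int i = 0; i < 225; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * in[i];
    }
}
```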
After we implemented the HP connection for our IP Core, the transmission time was reduced from 36 µs to 16 µs. It has been shown that the speedup can be even higher when larger streams are exchanged [45].
4.4 Conclusion
In this chapter, using the workflow suggested in Chapter 3, we discussed how to analyze
applications and implement hardware IP Cores. Hardware/software partitioning and
hardware logic optimization are the two main parts of this chapter.
Profiling is the step before partitioning. We introduced and compared three ways of profiling, i.e., target, host and MATLAB profiling. According to our evaluation, the results of these three profiling methods are similar to each other. Because the system will finally run on the target board, target profiling is recommended, but the other two approaches are also acceptable for developers.
Next, three criteria (execution time, communication overhead, and potential speedup) were proposed to assist designers in deciding which blocks are better implemented in hardware when partitioning hybrid systems.
Finally, the hardware implementation was optimized in order to get better performance, and three common settings for HLS projects were investigated.
5
Evaluation
In this chapter, we use two test cases to evaluate the performance of the proposed MBD
workflow, profiling methods, and optimization schemes. Using the first case, we show
how the performance, in terms of the throughput of the blocks, is affected by the implementation settings. The second case is selected from our data fusion application, where
an AXI DMA scheme is used to reduce the communication overhead. The throughput
performance and hardware resource utilization are compared and analyzed for different
optimization schemes.
for (i = 0; i < 15; i++) {
    d[i] = v[i] * v[i];
}
}
The second case is a function selected from the data fusion application. The C source code of this case is generated from the MATLAB code covariance = srCovariance*srCovariance' and variances = diag(covariance). Here, the input (srCovariance) of the first MATLAB function (the multiplication) is a 15×15 matrix. The output of the first MATLAB function is then used as the input of the second one (diag), which produces a new output, the 15×1 vector "variances". By combining these two MATLAB functions, the number of output elements of the function block shown in Case 2 is reduced to 15, which is much smaller than if only the first MATLAB function were contained in the C function block.
Case 2:
void matrixProduct(const float v[225], float d[15]) {
    char i, j;
    int i111, i112;
    float b_StateVectors[225];
    // covariance = srCovariance * srCovariance';
    loop1: for (i = 0; i < 15; i++) {
        loop2: for (i111 = 0; i111 < 15; i111++) {
            b_StateVectors[i + 15 * i111] = 0.0F;
            loop3: for (i112 = 0; i112 < 15; i112++) {
                b_StateVectors[i + 15 * i111]
                    += v[i + 15 * i112] * v[i111 + 15 * i112];
            }
        }
    }
    // variances = diag(covariance);
    for (j = 0; j < 15; j++) {
        d[j] = b_StateVectors[j << 4];
    }
}
Figure 5.1: Control Steps of Case 1 in Default Setting and in the Pipeline Mode
to 1 clock cycle, as can be seen in Figure 5.2, since each iteration starts 1 clock cycle after its prior iteration. The latency of each individual iteration is still 7 clock cycles. Therefore, the total latency of the block is 1∗15+7+1 = 23 clock cycles.
When the loop is configured in unrolled mode, HLS generates fully parallel hardware logic, so we can see 15 parallel control steps in Figure 5.3. Table 5.1 summarizes the latency and speedup factors for the three different settings. Compared to the default mode, the pipelining scheme speeds up the block by a factor of 4.65 and the unrolling scheme by a factor of 26.75. The reason the unrolled mode achieves more than 15 times speedup is that it not only parallelizes the hardware logic but also saves a lot of time on data multiplexing.
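The latency arithmetic above can be captured in a small helper, following the formula used in the text (total latency = II ∗ trip count + iteration latency + 1, where II is the initiation interval). The function name is illustrative.

```c
/* Total latency of a pipelined loop as used in the text:
   total = II * N + L + 1, with II the initiation interval, N the
   trip count, and L the latency of one iteration. */
int pipelined_loop_latency(int ii, int trip_count, int iter_latency)
{
    return ii * trip_count + iter_latency + 1;
}
```

For the pipelined Case 1 loop (II = 1, 15 iterations, iteration latency 7), this gives the 23 cycles quoted above.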
Case 2:
Initially, the function takes around 10% (314 µs) of the total execution time of the whole ADAS application. We use a hardware accelerator to implement the selected function block given in Case 2. Table 5.2 shows the execution times and speedup factors for different loop configurations. Note that there are three nested loops in Case 2, and the iterations of the different loops have dependencies between each other. Thus, the effect of different loop configurations on the latency is not as straightforward to analyze as in Case 1. For example, if we unroll a loop, the execution time does decrease, but the speedup factor is not equal to the unroll factor. As shown in Table 5.2, although the block needs only 366 clock cycles to execute the computation (4 µs at 100 MHz), the total execution time of the block is 44 µs, because the communication takes 40 µs.
Table 5.2: Execution time for Case 2 without DMA Interface
Speedup/Utilization = Speedup / (Num_BRAM/TotalNum_BRAM + Num_DSP48E/TotalNum_DSP48E + Num_FF/TotalNum_FF + Num_LUT/TotalNum_LUT)    (5.1)
Note that we prefer higher speedup and lower hardware resource utilization. There-
fore, larger Speedup/Utilization is better. The last column of Table 5.5 shows the value of
Speedup/Utilization for the three solutions. The results indicate that the pipelined loop
is the best solution for Case 1, since it achieves the highest value of Speedup/Utilization.
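Equation 5.1 is straightforward to evaluate in code. The sketch below uses illustrative names; each (used, total) pair gives the number of resources of one type consumed by the IP Core and available on the device, respectively.

```c
/* Speedup per unit of normalized resource utilization (Equation 5.1).
   Larger values indicate a more efficient implementation. */
double speedup_per_utilization(double speedup,
                               double bram_used, double bram_total,
                               double dsp_used,  double dsp_total,
                               double ff_used,   double ff_total,
                               double lut_used,  double lut_total)
{
    double utilization = bram_used / bram_total + dsp_used / dsp_total
                       + ff_used / ff_total + lut_used / lut_total;
    return speedup / utilization;
}
```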
Case 2: The resource utilization and speedup of Case 2 are listed in Tables 5.6 and 5.7 for the non-DMA and DMA-based solutions. In non-DMA mode, solutions 'noDMA2' and 'noDMA3' have the same value of Speedup/Utilization. With the DMA configuration, when a parallel hardware structure is set, only the computation time decreases linearly, so the speedup does not increase proportionally. The best Speedup/Utilization value appears when the array is stored in a single block RAM, which means 'DMA1' is the most efficient solution for Case 2. If FPGA resources are available, the second optimized solution, 'DMA2', is also a good candidate due to its high speedup. In the end, the tradeoff between throughput performance and resource utilization needs to be considered in real system design.
5.2 Conclusion
In this chapter, we analyzed the performance and resource utilization of hardware implementations. We found that different hardware implementation configurations in HLS can dramatically affect the system throughput. Using the pipelined loop configuration together with the AXI DMA scheme, a speedup of 10 can be achieved for one block selected from our data fusion application. This implies that hardware/software co-design can improve the overall system performance.
6
Conclusion
6.1 Contributions
This thesis explores a software/hardware hybrid system for ADAS using MBD. An FPGA-based SoC was selected as the platform of the system because standalone processors are not powerful enough to satisfy the high demands of ADAS applications. Using hardware accelerators on a SoC is one potential solution to improve the performance of such applications. Moreover, MBD is a methodology widely used in the vehicle industry, which can improve the development workflow and shorten the time-to-market cycle.
The thesis project has two objectives. First, MBD methodologies are evaluated for MATLAB and Simulink models. The aim is to identify a suitable workflow, which synthesizes C and RTL code automatically, minimizes development time and supports most source models. Second, the key techniques for hardware/software co-design are investigated based on the MBD workflow. We found methods for profiling the hybrid system, criteria for partitioning the system into hardware and software, and ways of optimizing SoC designs to get better system performance.
The workflow for MBD was presented in Chapter 3. Embedded Coder and HDL Coder, two tools from MathWorks, can generate C and RTL code from Simulink models. The SoC design was created using the HDL
workflow advisor. System-In-the-Loop simulation is integrated in the Simulink environment, which is a helpful feature to reduce coding and verification time. From MATLAB models, C code can be synthesized by Embedded Coder, but some limitations were found when RTL code was generated using HDL Coder. Due to these limitations, we selected HLS as an alternative to generate RTL code from C code. Subsequently, we proposed a workflow that combines Embedded Coder and HLS.
In order to develop hardware/software hybrid systems, we proposed three steps: (1) profiling, (2) partitioning, and (3) implementation optimization. For profiling, we evaluated and compared different profiling methods in Section 4.1; developers can profile applications on hosts, on target boards, or in modeling tools. For partitioning, some key factors were listed and explained in Section 4.2, such as execution time, communication overhead, and potential speedup. Using these factors, developers can decide which blocks are candidates to be implemented as hardware accelerators. For implementation optimization, many detailed hardware implementation schemes can be used that affect the performance of hardware accelerators. We investigated loop pipelining, loop unrolling and DMA interfaces in Section 4.3.
After evaluating the MBD workflow and the methods of hardware/software co-design, a real active safety application was implemented. The application was developed in MATLAB, and C code was generated by Embedded Coder. Next, it was executed and profiled on a target board. According to the proposed partitioning criteria, we selected a specific function, which takes around 10% of the total execution time, and migrated it from software to a hardware implementation. After the block was optimized in HLS using pipelining and DMA, we achieved up to 12 times speedup. Section 5.1 analyzed the different settings, speedups and resource utilizations. The ADAS application cannot be improved further because most of the algorithms in the source models are not suitable for hardware implementation. However, the main goals of this thesis have been achieved: we found a suitable workflow for hardware/software co-design based on MBD, and we showed that hardware accelerators can significantly improve system throughput.
by HDL Coder. Although HLS supports struct data, the elements of a struct are split into individual data ports by HLS; therefore, struct data should be avoided in blocks that might be implemented as hardware accelerators.
Bibliography
[2] K. J. Kingsley, Evaluating crash avoidance countermeasures using data from FMCSA/NHTSA's Large Truck Crash Causation Study, in: Proceedings of the 21st International Technical Conference on the Enhanced Safety of Vehicles (ESV), International Congress Center Stuttgart, Germany, 2009.
[4] Texas Instruments, Advanced safety and driver assistance systems paves the way to autonomous driving (Jul. 2014).
URL http://e2e.ti.com/blogs_/b/behind_the_wheel/archive/2014/02/
04/advanced-safety-and-driver-assistance-systems-paves-the-way-to-
autonomous-driving.aspx
[8] D. Hall, J. Llinas, A challenge for the data fusion community I: Research imperatives for improved processing, in: Proc. 7th Natl. Symp. on Sensor Fusion, Vol. 16, 1994.
[9] J. Llinas, D. Hall, A challenge for the data fusion community II: Infrastructure imperatives, in: Proc. 7th Natl. Symp. on Sensor Fusion, Vol. 16, 1994.
[11] L. A. Klein, Sensor and data fusion concepts and applications, Society of Photo-
Optical Instrumentation Engineers (SPIE), 1993.
[12] F. Sandblom, J. Sörstedt, Sensor data fusion for multiple sensor configurations, in: Submitted to the 2014 IEEE Intelligent Vehicles Symposium.
[15] D. P. Bertsekas, The auction algorithm for assignment and other network flow
problems: A tutorial, Interfaces 20 (4) (1990) 133–149.
[16] S. S. Blackman, Multiple hypothesis tracking for multiple target tracking, Aerospace
and Electronic Systems Magazine, IEEE 19 (1) (2004) 5–18.
[17] J. Teich, Hardware/software codesign: The past, the present, and predicting the
future, Proceedings of the IEEE 100 (Special Centennial Issue) (2012) 1411–1430.
[26] D. F. Bacon, R. Rabbah, S. Shukla, FPGA programming for the masses, Communications of the ACM 56 (4) (2013) 56–63.
[30] National Instruments, Getting started with LabVIEW FPGA (Jul. 2014).
URL http://www.ni.com/tutorial/14532/en/
[33] D. Singh, Higher level programming abstractions for FPGAs using OpenCL, in: Workshop on Design Methods and Tools for FPGA-Based Acceleration of Scientific Computing, 2011.
[35] MathWorks, Block support and compatibility checks, supported block libraries
(Jul. 2014).
URL http://www.mathworks.se/help/releases/R2013b/hdlcoder/block-
support-and-compatibility-checks.html
[36] MathWorks, Check for blocks not supported by code generation (Jul. 2014).
URL http://www.mathworks.se/help/rtw/ref/embedded-codersimulink-
coder-checks.html#btpdhno-1
[37] E. Monmasson, M. N. Cirstea, FPGA design methodology for industrial control systems—a review, IEEE Transactions on Industrial Electronics 54 (4) (2007) 1824–1842.
[38] MathWorks, MATLAB language syntax and functions for HDL code generation (Jul. 2014).
URL http://www.mathworks.se/help/releases/R2013b/hdlcoder/matlab-
language-support.html
[39] MathWorks, MATLAB language features, functions, classes, and System objects supported for C and C++ code generation (Jul. 2014).
URL http://www.mathworks.se/help/ecoder/language-supported-for-code-
generation.html
[41] José Fonseca's utilities, Convert profiling output to a dot graph (Jul. 2014).
URL https://code.google.com/p/jrfonseca/wiki/Gprof2Dot
[44] Xilinx, Zynq-7000 All Programmable SoC accelerator for floating-point matrix multiplication using Vivado HLS (Jul. 2014).
URL http://www.xilinx.com/support/documentation/application_notes/
xapp1170-zynq-hls.pdf
[45] M. Sadri, C. Weis, N. Wehn, L. Benini, Energy and performance exploration of accelerator coherency port using Xilinx Zynq, in: Proceedings of the 10th FPGAworld Conference, FPGAworld '13, ACM, New York, NY, USA, 2013, pp. 5:1–5:8.
URL http://doi.acm.org/10.1145/2513683.2513688
A
Generated C Code
Below is the C code generated by Embedded Coder from the MATLAB script. The four functions from Section 4.1.1 are shown, which are useful for explaining the profiling results in Section 4.2.3.
//A1 = VB';
void f1(const emxArray_real_T *VB, emxArray_real_T *A1)
{
  int i1;
  int loop_ub;
  int b_loop_ub;
  int i2;
  i1 = A1->size[0] * A1->size[1];
  A1->size[0] = VB->size[1];
  A1->size[1] = VB->size[0];
  emxEnsureCapacity((emxArray__common *)A1, i1, (int)sizeof(double));
  loop_ub = VB->size[0];
  for (i1 = 0; i1 < loop_ub; i1++) {
    b_loop_ub = VB->size[1];
    for (i2 = 0; i2 < b_loop_ub; i2++) {
      A1->data[i2 + A1->size[0] * i1] = VB->data[i1 + VB->size[0] * i2];
    }
  }
}

//A2 = VB .* VB';
void f2(const emxArray_real_T *VB, emxArray_real_T *A2)
{
  int i3;
  int loop_ub;
  int b_loop_ub;
  int i4;
  i3 = A2->size[0] * A2->size[1];
  A2->size[0] = VB->size[0];
  A2->size[1] = VB->size[1];
  emxEnsureCapacity((emxArray__common *)A2, i3, (int)sizeof(double));
  loop_ub = VB->size[1];
  for (i3 = 0; i3 < loop_ub; i3++) {
    b_loop_ub = VB->size[0];
    for (i4 = 0; i4 < b_loop_ub; i4++) {
      A2->data[i4 + A2->size[0] * i3] = VB->data[i4 + VB->size[0] * i3] *
        VB->data[i3 + VB->size[0] * i4];
    }
  }
}

//A3 = VB * VB';
void f3(const emxArray_real_T *VB, emxArray_real_T *A3)
{
  emxArray_real_T *b;
  int i5;
  int loop_ub;
  int b_loop_ub;
  int i6;
  int c_loop_ub;
  int i7;
  unsigned int unnamed_idx_0;
  unsigned int unnamed_idx_1;
  emxInit_real_T(&b, 2);
  i5 = b->size[0] * b->size[1];
  b->size[0] = VB->size[1];
  b->size[1] = VB->size[0];
  emxEnsureCapacity((emxArray__common *)b, i5, (int)sizeof(double));
  loop_ub = VB->size[0];
  for (i5 = 0; i5 < loop_ub; i5++) {
    b_loop_ub = VB->size[1];
    for (i6 = 0; i6 < b_loop_ub; i6++) {
      b->data[i6 + b->size[0] * i5] = VB->data[i5 + VB->size[0] * i6];
    }
  }
  if ((VB->size[1] == 1) || (b->size[0] == 1)) {
    i5 = A3->size[0] * A3->size[1];
    A3->size[0] = VB->size[0];
    A3->size[1] = b->size[1];
    emxEnsureCapacity((emxArray__common *)A3, i5, (int)sizeof(double));
    loop_ub = VB->size[0];
    for (i5 = 0; i5 < loop_ub; i5++) {
      b_loop_ub = b->size[1];
      for (i6 = 0; i6 < b_loop_ub; i6++) {
        A3->data[i5 + A3->size[0] * i6] = 0.0;
        c_loop_ub = VB->size[1];
        for (i7 = 0; i7 < c_loop_ub; i7++) {
          A3->data[i5 + A3->size[0] * i6] += VB->data[i5 + VB->size[0] * i7] *
            b->data[i7 + b->size[0] * i6];
        }
      }
    }
  } else {
    unnamed_idx_0 = (unsigned int)VB->size[0];
    unnamed_idx_1 = (unsigned int)b->size[1];
    i5 = A3->size[0] * A3->size[1];
    A3->size[0] = (int)unnamed_idx_0;
    emxEnsureCapacity((emxArray__common *)A3, i5, (int)sizeof(double));
    i5 = A3->size[0] * A3->size[1];
    A3->size[1] = (int)unnamed_idx_1;
    emxEnsureCapacity((emxArray__common *)A3, i5, (int)sizeof(double));
    loop_ub = (int)unnamed_idx_0 * (int)unnamed_idx_1;
    for (i5 = 0; i5 < loop_ub; i5++) {
      A3->data[i5] = 0.0;
    }
  }
  emxFree_real_T(&b);
}

//A4 = VB .* VC';
void f4(const emxArray_real_T *VB,
        const emxArray_real_T *VC, emxArray_real_T *A4)
{
  int i8;
  int loop_ub;
  int b_loop_ub;
  int i9;
  i8 = A4->size[0] * A4->size[1];
  A4->size[0] = VB->size[0];
  A4->size[1] = VB->size[1];
  emxEnsureCapacity((emxArray__common *)A4, i8, (int)sizeof(double));
  loop_ub = VB->size[1];
  for (i8 = 0; i8 < loop_ub; i8++) {
    b_loop_ub = VB->size[0];
    for (i9 = 0; i9 < b_loop_ub; i9++) {
      A4->data[i9 + A4->size[0] * i8] = VB->data[i9 + VB->size[0] * i8] *
        VC->data[i8 + VC->size[0] * i9];
    }
  }
}
B
User Guide for Host Profiling
In this thesis, three methods of profiling are proposed. For MATLAB profiling and target profiling, developers can easily find tutorials on the internet. In this section, we present host profiling, the method we put together in this project. The steps to set up the profiling are the following:
Step 1: Download Eclipse (Eclipse Standard 4.3.2).
Step 2: Unzip and start Eclipse.
Step 6: Build and Run the application, get .exe and .out files.
Step 7: Copy .exe and .out into a folder together with gprof2dot.py.
Step 8: Execute the following commands.
:: Process profiling data recalled from the target
gprof ddd.exe gmon.out > gmon.txt
:: Transform into a Dot format