I Introduction
1 Introduction
1.1 Motivation
1.1.1 Problem Description and Challenges
1.1.2 Research Question
1.2 Research Objective
1.3 Research Contributions
1.3.1 Research Map
1.3.2 List of Publications
1.4 Thesis Organization
2 Background
2.1 Reconfigurable Computing
2.2 Programmable Logic, FPGA
2.3 Dependability
2.4 Adaptive Systems
2.5 Autonomic and Organic Computing
2.5.1 Autonomic Computing
2.5.2 Organic Computing
II Included Papers
Chapter 1
Although octopuses and other invertebrate animals can change the color
and pattern of their skin for camouflage or signaling, no one had ever
described a vertebrate species that changes its skin texture until 2015.
In [16], the authors describe a new species of frog named Pristimantis
mutabilis. The “individuals of the new species from the Andes rain forest
are remarkable for their ability to change skin texture within few
minutes (330 s), between two discrete states. This mutable rain frog is the
first documented vertebrate to show such dramatic and rapid phenotypic
plasticity” [16].
Hopefully, this discovery among biological organisms may help us to
visualize how unique and transcendent it is for a non-living organism to
exhibit similar traits: a solid-state integrated circuit whose hardware
functions can be changed almost immediately while it is still powered and
operative. This integrated circuit is known as the Field-Programmable Gate
Array (FPGA). Moreover, this type of hardware plasticity, known as
“partial and run-time reconfiguration (RTR)”, is arguably the most
promising characteristic of current FPGAs and has ignited considerable
interest in several fields, generally as a function-switching mechanism
that accelerates computations in hardware. However, RTR is also becoming
useful in new realms since it enables self-reconfiguration, a fundamental
characteristic that allows hardware adaptation. In turn, this can enable
other bio-inspired properties such as self-healing, self-repairing,
self-optimization, and self-adaptation. However, in animal species
morphing is used for camouflage or mating purposes, and there is a brain
or intelligence controlling its activation, whereas in FPGAs the
application of these bio-inspired properties enabled by RTR is ready to be
explored and the cognitive mechanisms
1.1 Motivation
This thesis presents research that started from the need to find a simplified and
reusable approach to developing reconfigurable embedded SoCs and to explore the
applications of RTR. However, the final result went beyond that goal.
FPGAs provided with partial and RTR combine the flexibility of software with
the high efficiency of hardware. However, their potential cannot be fully exploited
because of the increased complexity of the design process and the intricacy of
generating partial reconfigurations. During this investigation, several challenges,
sources of inspiration, and answers were consolidated at every stage. This section
motivates the reader by briefly introducing topics and sources of inspiration, such
as reconfigurable computing, field-programmable logic, dependability, adaptation,
bio-inspired hardware, organic computing, and machine learning. Then, problem
statements and objectives are summarized. Finally, the contributions based on
publications are reviewed.
Reconfigurable Computing (RC) is the second random access memory (RAM)-based
machine paradigm [19]. It offers a drastic reduction in energy consumption
and speedup factors of up to several orders of magnitude compared to the von
Neumann paradigm, which, according to Hartenstein [19], is beginning to lose its
dominance. This new paradigm opens a disruptive new horizon for affordable
high-performance computing and also changes the practice of scientific and embedded
system applications. Heterogeneous systems, including both paradigms, will need
innovation in parallel programming of manycore architectures and structural
programming, i.e., configware, of Reconfigurable Computing. For that, designers need
to think outside the box [19].
Current research in Reconfigurable Computing has shown considerable speedup
and low power consumption, in addition to its capability to increase dependability.
However, its higher design complexity compared with traditional technologies has
impeded its dominance in strategic areas like personal, embedded, and
high-performance computing. Therefore, to support the new revolution of
reconfigurable computers, new methods, architectures, and tools are needed, and
most importantly flexible, economical, simplified, and reusable approaches, so that
the community can start using it and taking advantage of its benefits.
RTR has also found potential application in the Autonomic Computing (AC)
Initiative and the Organic Computing (OC) Initiative. Both initiatives envision
systems with enough degrees of freedom to adapt dynamically to the changing
requirements of execution environments. Those systems exhibit lifelike
characteristics, the so-called self-x properties, such as self-configuration,
self-healing, self-optimization, and self-protection.
The complexity of computing systems appears to be approaching the limits of human
capability, yet the demand for increased interconnectivity and integration of
independent computing devices seems unstoppable. This situation could become a
critical threat that creates a problem of controllability [27]. Therefore, such systems
must be built to be as robust, safe, flexible, and trustworthy as possible [37]. The AC
Initiative, inspired by biological principles, proposes that the only solution may be to
create computing systems that can manage themselves given high-level objectives from
administrators [27]. This self-management property should be supported by the
so-called self-x properties, that is, self-configuration, self-healing, self-optimization,
and self-protection. Among them, self-configuration may be the most important
enabler for the other properties. A similar proposal was developed
by the OC Initiative, which is not interested in fully autonomous systems but in the
search for concepts to achieve controlled self-organization as a new design paradigm
[51]. In 2004, they addressed the need for fundamental research on the effects of
emergence due to self-organization and the need for a paradigm shift in system
design and tools [36]. They stated that “It is not the question whether self-organized
and adaptive systems will arise, but how they will be designed and controlled” [51].
A number of surveys and technical reports highlight the relevance of the research
fields converging in this thesis, for which RTR becomes a critical enabler. For instance,
the High Performance and Embedded Architecture and Compilation (HIPEAC)
European Network of Excellence presented its vision for advanced
computing for the year 2020 [22]. It identifies three strategic areas (embedded,
mobile, and data center) and three crosscutting challenges (energy efficiency,
system complexity, and dependability). It suggests that to overcome the complexity
and cost of heterogeneous system development, we need tools and techniques that
provide power and performance portability. As described by HIPEAC: “Run-time
adaptation and machine learning has demonstrated a high potential to improve
performance, power consumption, and other relevant metrics. However, these
techniques are still far from widespread” [22]. Similarly, a 2012 report from the
International Technology Roadmap for Semiconductors (ITRS) [1] considers
that the overall design technology challenges are design productivity, power
consumption, manufacturability, interference, and reliability and resilience. For the
last challenge, ITRS highlights autonomic computing, robust design, and software
(SW) reliability as the main issues for technologies beyond 22 nm, and soft-error
correction for ≥ 22 nm. Among other challenges and trends, they identify fault
tolerant architectures and protocols, and re-use of solutions for test and system
management of the dynamic FT schemes)? How to optimize area and power when
using dynamic fault-tolerance (dynFT)? How to make such a system adapt to real
environments that produce highly unpredictable situations? How to complete the
observe and decide phases of an adaptation system so that the whole system be-
comes self-adaptive exhibiting cognitive development, such as self-awareness and
learning so that it can cope with uncertain situations and keep acceptable levels
of performance while using their resources to maintain fault tolerance? So far,
to the best of the author's knowledge, research that explores the self-organization
of groups of cognitive reconfigurable cores towards self-adaptation has either not
been conducted or has been only partially explored, as in [25] [28] [44] [45] [53] [3] [31].
In recent years, the RC community has focused on improving reconfiguration
methods, architectures, and applications. RTR embedded SoCs are ideal
for adaptive embedded applications; however, what is still missing is the adaptation
capability, that is, the observe and decide phases that complement the already
developed act phase of the adaptive cycle. Moreover, advances in technology and
consumer demand are leading to a world flooded with interconnected devices required
to work without human intervention, such as delivery drones or self-driving vehicles.
Here, autonomic, self-adaptive, and cognitive properties are mandatory.
Nevertheless, most of the design methodologies that exist today are mainly
focused on generating the entire system directly. They fail to consider that a partial
reconfiguration can be used to reduce cost and power consumption, by loading
critical parts of the design at run-time. However, adoption of partial reconfiguration
is limited by the intricacy of the process to generate configurations, which implies
time and recursive effort during the layout and physical synthesis stages [Paper I].
FPGAs may be an empty canvas waiting to be imprinted by the designer’s
creativity. By devising the right approach, we would gain greater productivity
and a range of applications that are unthinkable today. Otherwise, we would just be
using another device that computes a bit faster with low power consumption.
The global research question that motivated the investigation addressed
in this thesis is: How can the flexibility, reusability, and productivity in the design
process of FPGA-based partial and run-time reconfigurable embedded systems-on-chip
be improved to enable research and development of novel applications in the
areas of hardware acceleration, dynamic fault-tolerance, self-healing, self-awareness,
and self-adaptation?
Essentially, to address the general research question, the author proposes that
by adopting a modular approach, i.e., hardware, software, and configware, and by
applying SoC design-and-reuse principles, it is possible to alleviate and
speed up the design process of reconfigurations and maximize the productivity in
other advanced topics.
[Figure: motivation example. A long-range mission loses contact with Earth. Its
FPGA-based on-board computer (OBC) may become faulty due to cosmic radiation or
nano-scale integration effects, so it must optimize resources and self-adapt to
unpredictable situations. The ideal embedded FPGA system offers high performance
in a SoC (small footprint, low area and power, high integration, hardware
acceleration); flexible, reusable, and maintainable HW, SW, and CW; self-assembly of
the adequate fault-tolerant hardware scheme; and the ability to deal with
uncertainty by taking optimal and non-programmed decisions.]
[Figure 1.2, recoverable labels. Areas and sources of inspiration: machine learning,
reinforcement learning, cognitive science, psychology, self-awareness, learning-based
policy, self-optimization, the Upset-Fault-Observer. Publications: "Reinforcement
Learning Based Self-Optimization of Dynamic Fault-Tolerant Schemes in
Performance-Aware RecoBlock SoCs" (KTH/ICT/ESY'15); "Towards Cognitive Hardware:
Self-Aware Learning in RTR Fault-Tolerant SoCs" (ReCoSoC'15); "On Providing Scalable
Self-Healing Adaptive Fault-tolerance to RTR SoCs" (ReConFig'14).]
Figure 1.2: Research Map. Illustrates the research timeline and the relationship between areas,
sources of inspiration, publications, and contributions.
Summary of Contributions
The following list summarizes the main contributions derived from the conducted
research and the scientific publications (Part II). The research is a cumulative
body of related work, developed around the first publication and progressively extended
to answer new challenges in different related areas. It is an original work that does
not build on previous work at KTH
accelerator, burst I/O transfers, and RTR pre-fetch [Paper II]. Char-
acterized the FT metrics and scalability of the novel dynFT schemes
[Paper IV].
– Design methodology: Devised a design methodology to develop complex
algorithms modeled in Matlab and accelerate them in the RecoBlock RTR
IP-core, using existing commercial off-the-shelf (COTS) applications that
had not previously been linked [Paper II].
– RTR technique: Improved the RTR method used in the RecoBlock plat-
form by pre-fetching reconfigurations from non-volatile Compact Flash
(CF) memory into double data rate type three SDRAM (DDR3) memory
[Paper II].
– RTR FT schemes: Introduced the concept and guidelines for the
implementation of reconfigurable IP-core based TMR schemes (dynamic
fault-tolerance) based on the RecoBlock IP-core. These are FT schemes
based on hardware redundancy, RTR, and modular IP-cores that can be
reconfigured (self-assembled) during run-time. They enable self-repair
(run-time reconfiguration in the same core) and self-replication
(run-time relocation and reconfiguration in another core). Implemented the
RTR FT schemes [Paper IV].
– Upset-Fault-Observer (UFO): Conceptualized [Paper III] and implemented
[Paper IV] the UFO, an innovative dynamic TMR-with-spare scheme that
multiplexes redundancy in time to maximize the number of fault-tolerant
cores and minimize the scalability cost of hardware redundancy.
– Organic Computing (OC) Management Functions: Conceptualized
[Paper III] and implemented [Paper IV] the adaptive self-x SW model, a
software organization structure based on OC and AC that handles
strategies, schemes, and self-x features, e.g., self-monitoring, service-map,
adaptive, strategy. It complements the underlying hardware and software
to observe and control the system so that it can adapt and exhibit
lifelike properties [Paper IV].
– RTR FT schemes metrics: Characterized, evaluated experimentally,
and analyzed the trade-off of the FT metrics involved in the self-healing
RTR FT schemes (also called dynamic fault-tolerance, dynFT), i.e., recovery
time, detection and assembly latency, throughput reduction, and scalability
[Paper IV].
– Proof of concept: Verified the functionalities and presented detailed
guidelines about how the RTR FT schemes and UFO are implemented
[Paper IV].
lications and main outcomes are organized according to the three stages identified
in Figure 1.2 and reviewed in Chapters 3, 4, and 5. Additional details are found in
the actual papers reprinted in Part II. The following list offers a brief summary of
each enclosed paper and highlights the author’s contributions.
Paper I: Byron Navas, Ingo Sander, and Johnny Öberg. The RecoBlock SoC Plat-
form: A Flexible Array of Reusable Run-Time-Reconfigurable IP-Blocks. In De-
sign, Automation & Test in Europe Conference & Exhibition (DATE), pages
833–838, New Jersey, 2013. IEEE Conference Publications
Paper II: Byron Navas, Johnny Öberg, and Ingo Sander. Towards the generic
reconfigurable accelerator: Algorithm development, core design, and performance
analysis. In Reconfigurable Computing and FPGAs (ReConFig), International
Conference on, pages 1–6, Dec. 2013
This paper investigates how good the RecoBlock SoC Platform can be as
a hardware accelerator in terms of speedup and bandwidth. For that pur-
pose, it first improves the IP-core architecture and design method to generate
configurations from complex algorithms described and verified in high-level
models, which must comply with the architectural specifications of the Re-
coBlock IP-core, particularly with the RP interface to facilitate relocation.
This work also enhances the synchronization and execution mechanism of the
IP-core architecture, so that high-level languages can trigger its execution and
control its periodicity. The method uses available modeling and high-level
synthesis COTS tools to reduce the gap between high-level design approaches
and reconfigurable computing technology. The experiments analyze the performance
of the platform and its built-in acceleration mechanisms (i.e., the core
itself, burst I/O transfers, and reconfiguration pre-fetch in dynamic memory).
Author’s Contribution: The main author performed the problem formulation,
solution, design, implementation, experiments, and wrote the manuscript.
The coauthors provided important feedback, revision of drafts, and correc-
tions to the final manuscript.
Paper III: Byron Navas, Johnny Öberg, and Ingo Sander. The Upset-Fault-Observer:
A Concept for Self-healing Adaptive Fault Tolerance. In 2014 NASA/ESA
Conference on Adaptive Hardware and Systems (AHS-2014), 2014
Paper IV: Byron Navas, Johnny Öberg, and Ingo Sander. On providing scalable
self-healing adaptive fault-tolerance to RTR SoCs. In 2014 International
Conference on ReConFigurable Computing and FPGAs (ReConFig14), pages 1–6. IEEE,
Dec. 2014
This paper investigates how to know when and which dynamic FT-scheme
can be launched without critically affecting the overall performance of a sys-
tem that works under uncertainty. Thus, this paper proposes the concept
of cognitive reconfigurable hardware as an approach to overcome the com-
plexity of reconfigurable multicore embedded systems operating in uncertain
environments, assuming that software execution branches are unpredictable.
This concept outlines a model for cognitive development in hardware consist-
ing of five phases, with a strong emphasis on reconfigurable hardware, i.e.,
self-monitor, self-awareness, self-evaluation, self-learning, and self-adaptation.
Consequently, this paper presents the design of an FPGA-based RTR SoC
that becomes conscious of its monitored hardware and learns to make
decisions that maintain the desired system performance, particularly when
triggering hardware acceleration and dynamic fault-tolerant (FT) schemes on
RTR cores. This work also describes how to achieve self-awareness based
on hardware self-monitoring using hardware performance counters. Experi-
ments propose and implement methods to simulate uncertainty and to adapt
performance based on reinforcement rule-based algorithms and a set of the
most meaningful self-monitored metrics. Existing research on self-awareness
has been concentrated on workstations but is rapidly growing on embedded
SoCs. The cognitive processes defined in this work mark the path for addi-
tional research.
Author’s Contribution: The main author identified the problem, defined the
CRH concept, implemented the system, conducted experiments, and wrote
the manuscript. The coauthors provided important feedback, revision of
drafts, and corrections to the final manuscript.
Paper VI: Byron Navas, Ingo Sander, and Johnny Öberg. Reinforcement Learning
Based Self-Optimization of Dynamic Fault-Tolerant Schemes in Performance-
Aware RecoBlock SoCs. Technical report, KTH Royal Institute of Technology,
School of Information and Communication Technology, Stockholm, Sweden, 2015
This paper recognizes that embedded systems need to make autonomous
decisions, develop cognitive properties, and finally become self-adaptive so that
they can be deployed in the real world. However, classic off-line modeling
and programming methods are inadequate to cope with uncertainty in real-world
environments. Therefore, this paper complements the decision phase of
an adaptive system by introducing a reinforcement learning (RL) based approach.
This article presents an FPGA-based SoC that is self-aware of its monitored
hardware and utilizes an online RL method to self-optimize the decisions that
maintain the desired system performance, particularly when triggering hardware
acceleration and dynamic FT schemes on RTR IP-cores. Moreover, this
article describes the main features of the RecoBlock SoC concept, overviews
the RL theory, shows the Q-learning algorithm adapted for the dynamic fault-
tolerance optimization problem, and presents its simulation in Matlab. Based
on this investigation, the Q-learning algorithm will be implemented and ver-
ified in the RecoBlock SoC platform.
Author’s Contribution: The investigation, conceptualization, design,
implementation, experiments, and writing of the manuscript were conducted by the
main author of this paper. The coauthors provided important feedback,
revision of drafts, and corrections to the final manuscript.
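The Q-learning approach summarized for Paper VI can be illustrated with a minimal tabular sketch. This is not the thesis implementation: the states (coarse load levels), the actions (which dynamic FT scheme to activate), the reward values, and the toy environment are all hypothetical, chosen only to show the standard Q-learning update with epsilon-greedy exploration.

```python
import random

# Hypothetical problem encoding (not taken from the thesis):
# state  = coarse load level observed by hardware self-monitoring
# action = which dynamic FT scheme to activate
STATES = ["low_load", "high_load"]
ACTIONS = ["no_ft", "tmr"]


def toy_reward(state, action):
    """Assumed reward: TMR pays off only when spare performance exists."""
    if state == "low_load":
        return 1.0 if action == "tmr" else 0.2
    return 1.0 if action == "no_ft" else -1.0


def q_learning(steps=5000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on the toy environment; returns the Q-table."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = rng.choice(STATES)
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        reward = toy_reward(state, action)
        # Assumption: the load changes unpredictably, independent of the action.
        next_state = rng.choice(STATES)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Standard Q-learning update rule.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q


def greedy_policy(q):
    """Pick the highest-valued action for every state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
```

In this toy setting the learned greedy policy activates TMR only under low load, which mirrors the idea of launching FT schemes without critically affecting performance.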
This chapter reinforces and briefly extends the introductory background presented
in Chapter 1. It provides some additional concepts needed to understand the following
chapters, particularly for readers not familiar with the basic concepts of reconfigurable
computing (Section 2.1), FPGA technology (Section 2.2), dependability and fault
tolerance (Section 2.3), adaptive systems (Section 2.4), and autonomic and organic
computing, including the self-x properties (Section 2.5).
Research and industrial applications in RC using FPGAs have moved into two
major fields: high performance reconfigurable computing (HPRC) and embedded
Reconfigurable Computing (eRC). Owing to the low power consumption and high
density available in current FPGA technology, HPRC uses FPGAs as accelerators for
supercomputers, providing an additional increase in performance. Large high
performance computing (HPC) vendors are already supplying machines with FPGAs
ready-fitted. However, programmer productivity is a problem yet to be solved
[19]. On the other hand, a growing trend is the use of FPGAs in embedded
systems. Originally, FPGAs were perceived as too slow, power-hungry, and
expensive for many embedded applications. The situation has now changed
thanks to the availability of low-power devices and a wide range of small packages.
FPGAs can be found in the latest handheld portable devices, including smartphones,
cameras, medical devices, industrial scanners, and military radios [19].
The performance of FPGAs, and of most RC architectures, is better than that of
microprocessors or CPUs by several orders of magnitude [18], despite the fact that
FPGAs have lower clock speeds and a huge area overhead caused by reconfigurable
wiring fabrics. The reason for this apparent paradox is the way that data is moved.
CPUs usually move data between memories under the command of executed
instructions, following the von Neumann approach. Besides, instructions must be read
and decoded before their execution inside the CPU. In contrast, in RC, an algorithm is
run by data streams only. No hardwired program counter or instruction sequences are
required. Instead, reconfigurable data counters are used, and the execution in the
reconfigurable architecture or DPU is transport-triggered [19]. This is a major
difference between RC and the classical CPU.
Hardware acceleration, as a subsystem that supports the CPU, is a classical
application for RC, implemented either in a hardwired logic device, e.g., an ASIC, or
in field-programmable logic, e.g., an FPGA.
Figure 2.1: Abstract view of an FPGA internal structure and design flow. (a) Abstract block
diagram of a modern FPGA integrated circuit with typical blocks such as CLB, input-output
blocks (IOB), and digital-clock-manager (DCM), and specialized blocks such as BRAM and DSP.
(b) Simplified design flow from design entry as HDL description until generation of the equivalent
bitstream file, which is loaded in the FPGA configuration memory.
(EPROM) at the cost of being able to compute only Boolean functions in
sum-of-product form. In contrast, FPGAs feature much more flexibility by introducing
CLBs and routable wiring fabrics for interconnecting CLBs (Figure 2.1). As
opposed to FPLAs, the CLB in FPGAs allows, for instance, selecting one of 16
logic functions from simple Look-Up Tables (LUTs). FPLAs, in any case, are not
routable and only permit implementing Boolean functions in sum-of-product form.
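Taken literally, the figure of 16 selectable functions matches a 2-input LUT: its four configuration bits form a truth table, and 2^4 = 16 distinct tables exist. The sketch below assumes a 2-input LUT (the text does not state the LUT size), and the helper names are hypothetical.

```python
from itertools import product


def make_lut(config_bits):
    """Return a 2-input Boolean function defined by 4 configuration bits.

    config_bits[i] is the output for inputs (a, b) with i = 2 * a + b.
    """
    if len(config_bits) != 4:
        raise ValueError("a 2-input LUT needs exactly 4 configuration bits")
    table = tuple(int(bool(bit)) for bit in config_bits)
    return lambda a, b: table[2 * a + b]


# Every possible configuration of a 2-input LUT: 2**4 = 16 functions.
ALL_CONFIGS = list(product((0, 1), repeat=4))

# Two of the 16 functions, selected purely by loading different bits.
and_gate = make_lut((0, 0, 0, 1))
xor_gate = make_lut((0, 1, 1, 0))
```

Reconfiguring the logic function amounts to rewriting the stored bits, which is why a LUT-based CLB is more flexible than a fixed sum-of-products structure.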
Reconfiguration in FPGAs has evolved from full reconfiguration to partial and
run-time reconfiguration. In contrast to the early years of reconfigurable computing,
when a full reconfiguration was loaded at a time in an FPGA, partial reconfiguration
allows the FPGA to be partitioned into reconfigurable regions that can be
independently reconfigured. Besides, run-time reconfiguration permits reconfiguring
any of those regions without any global reset, so the other regions remain operative.
Therefore, the challenge is different nowadays since a multicore MPSoC and
programmable logic can coexist in the same FPGA. Thus, the system itself must be able
to selectively reprogram logic partitions while the software in the processor is still
running. This property makes partial and RTR extremely attractive and challenging
[Paper IV].
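The difference between full and partial run-time reconfiguration can be shown with a toy model. It is only an illustration of the concept: the class, the method names, and the module names are hypothetical and do not correspond to any vendor API or to the RecoBlock platform.

```python
class ReconfigurableFabric:
    """Toy model: one static region plus named reconfigurable partitions (RPs)."""

    def __init__(self, static_design, partitions):
        self.static_design = static_design
        # Each RP starts empty (no module loaded).
        self.partitions = {name: None for name in partitions}
        self.global_resets = 0

    def full_reconfigure(self, static_design, modules):
        """Full reconfiguration: the whole device is reloaded and reset."""
        self.global_resets += 1
        self.static_design = static_design
        self.partitions = {name: modules.get(name) for name in self.partitions}

    def partial_reconfigure(self, partition, module):
        """Run-time partial reconfiguration: only one RP changes, no global reset."""
        if partition not in self.partitions:
            raise KeyError(f"unknown partition: {partition}")
        self.partitions[partition] = module


fabric = ReconfigurableFabric("soc_static", ["rp0", "rp1"])
fabric.partial_reconfigure("rp0", "fir_filter")  # rp1 and the static region are untouched
```

The point of the model is that `partial_reconfigure` leaves the static design and the other partitions as they were, which is the property that lets software keep running during reconfiguration.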
Regarding adaptivity, the improved speed-up, power consumption, reduced size,
and flexibility given by reconfiguration make heterogeneous FPGA-based SoCs
with reconfigurable cores the ideal vehicle for adaptive embedded systems
[Paper V]. For instance, researchers today can implement in hardware previously
unthinkable properties such as self-reconfiguration, self-repairing, self-healing, or
dynamic hardware redundancy for on-line fault mitigation with low area overhead
[6] [57] [39].
Figure 2.2: Partial and run-time reconfiguration (RTR). (a) Main concept. The
logic is divided into static and reconfigurable regions (partitions). Reconfigurable modules (partial
reconfigurations) can be programmed into the reconfigurable partition (RP) while the static region
remains unaffected. (b) Simplified partial reconfiguration design flow. The bitstream generation
is repeated to obtain each partial reconfiguration bitstream. The static region remains constant,
so it is copied (promoted) between designs (Source: Xilinx [63]).
Research in RC witnessed the first steps towards heterogeneous RTR MPSoCs.
However, these are now becoming a mainstream trend due to industry solutions such
as the Xilinx Zynq All Programmable SoC platform [2]. In these environments, the
importance of flexibility through IP-core design-and-reuse [33] is now more evident
than before. Therefore, modular and flexible solutions based on IP-cores that are
also RTR (RTR IP-cores) will become paramount in the next years [Paper IV].
2.3 Dependability
The importance of building dependable computer systems has increased radically.
Today, the industry can integrate silicon devices at nano-scales and provide
sufficient logic density to synthesize complex systems such as multicore
heterogeneous SoCs. However, this integration also increases the probability of
fabrication defects, lifetime faults, and other issues.
Dependability is the ability of a system to deliver its intended level of service
in the presence of faults [11], which is measured in terms of probabilistic
Reliability, Availability, or Safety. Fault Tolerance (FT) provides design techniques
that increase the dependability of systems. Among those FT techniques, hardware
redundancy schemes are used to mask the presence of faults or to detect them. For
instance, triple modular redundancy (TMR) is a widely accepted scheme that combines
three replicas of a module and a majority voter. TMR is the most elementary scheme
able to mask one fault [Paper IV]. Partial or full reconfiguration is typically used to
recover FPGA-based systems from a faulty state.
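The masking behavior of TMR can be sketched in a few lines. This is a generic software model of a majority voter, not the voter used in the RecoBlock platform; the function name and the return convention are assumptions made for illustration.

```python
from collections import Counter


def tmr_vote(outputs):
    """Majority vote over the three replica outputs of a TMR scheme.

    Returns (voted_value, faulty_indices). Raises RuntimeError when no two
    replicas agree, since such a fault cannot be masked by TMR.
    """
    if len(outputs) != 3:
        raise ValueError("TMR needs exactly three replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: fault cannot be masked")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

A single disagreeing replica is masked and its index is reported; in an FPGA-based system that index could be the trigger for reconfiguring the faulty replica, as described above.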
an autonomic system. These self-x properties are described as follows. (1)
Self-configuration: the system sets or resets its internal parameters so as to conform
to initial deployment conditions or to adapt to dynamic environmental conditions,
respectively. In this way, the system continuously meets a set of business
objectives. (2) Self-healing: the system detects, isolates, and repairs failed components
so as to maximize its availability. (3) Self-optimization: the system proactively
strives to optimize its operation so as to improve efficiency on predefined goals.
(4) Self-protection: the system anticipates, identifies, and prevents various types
of threats to preserve its integrity and security [27]. Later, this list has
been progressively extended with other enabling properties such as self-adapting,
self-adjusting, self-aware, and self-diagnosis. It must be noted that self-configuration
may be the most important enabler for the other fundamental properties, e.g.,
self-optimization, self-healing, and self-protection.
Figure 2.4: Autonomic Computing Manager. The MAPE-K model or autonomic manager refer-
ence architecture. (Source: from Lalanda et al. Autonomic Computing [29])
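The MAPE-K reference architecture named in the caption of Figure 2.4 is usually read as a loop of Monitor, Analyze, Plan, and Execute phases over shared Knowledge. The following minimal sketch shows that loop under simplifying assumptions: the managed element is reduced to a single numeric measurement, and the class, the sensor and effector callables, and the action names are all hypothetical.

```python
class AutonomicManager:
    """Minimal MAPE-K loop: Monitor, Analyze, Plan, Execute over shared Knowledge."""

    def __init__(self, sensor, effector, target):
        self.sensor = sensor      # callable returning a measurement
        self.effector = effector  # callable applying an action
        self.knowledge = {"target": target, "history": []}

    def monitor(self):
        value = self.sensor()
        self.knowledge["history"].append(value)
        return value

    def analyze(self, value):
        # Positive error means the measurement falls short of the target.
        return self.knowledge["target"] - value

    def plan(self, error):
        if error > 0:
            return "increase"
        if error < 0:
            return "decrease"
        return None  # goal met, nothing to do

    def execute(self, action):
        if action is not None:
            self.effector(action)

    def step(self):
        """Run one full Monitor-Analyze-Plan-Execute iteration."""
        action = self.plan(self.analyze(self.monitor()))
        self.execute(action)
        return action
```

A manager like this receives only a high-level objective (`target`) and decides the concrete actions itself, which is the sense of self-management used in the text.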
question whether self-organized and adaptive systems will arise but how they will be
designed and controlled” [51].
Similarly, OC also devises technical systems that act more independently,
flexibly, and autonomously, adapting dynamically to the current conditions of their
environment. Organic systems are self-organizing, self-configuring, self-healing,
self-protecting, self-explaining, and context-aware [37]. However, endowing technical
systems with the property of self-organization so that they can adapt to changing
environmental conditions and external goals may result in a phenomenon known
as emergent behavior (e.g., a flock of birds or a school of fish) exhibiting higher
degrees of order. This emergent behavior may be unanticipated or undesired, with
positive or negative effects [37] [51]. Therefore, the vision of autonomic computing
is about computing systems capable of managing themselves without (or with
limited) intervention by human beings [45].
description with respect to a given objective and takes appropriate control actions
to influence the underlying system and meet the system goal [51]. OC allows a
single SuOC but also encourages the distribution of computational intelligence among
large populations of smaller entities. In both cases, the SuOC needs to fulfill basic
requirements: system and environmental observability, performance measurability,
and dynamic reconfigurability of system parameters [54].
Chapter 3
This chapter briefly introduces the RecoBlock concept, its implementation, and its
enhancements for reconfigurable hardware acceleration of complex algorithms,
including performance analysis. The chapter is divided into two sections that
summarize the work in [Paper I] and [Paper II]. Section 3.1 presents the flexible and
reusable RecoBlock concept, describes the research questions that motivated this
investigation and the proposed solution, and gradually discusses the reasons that
motivated the RecoBlock concept. It then summarizes its implementation and
discusses the work in Paper I. Section 3.2 presents a brief overview of the challenges,
motivation, and solution for developing complex algorithms in Matlab and
implementing them in RTR RecoBlock IP-cores. It then summarizes the RecoBlock
platform characterization and its performance analysis as a reconfigurable computing
architecture with execution phases. Finally, it concludes the work in Paper II. The
purpose of this chapter is to give an abstract overview of these two papers from
another perspective and to help the reader understand the work that supported the
rest of the publications and the thesis.
"Architecture starts when you carefully put two bricks together. There
it begins."
3.1.1 Motivation
Run-time reconfigurable FPGAs combine the flexibility of software with the high
efficiency of hardware. Still, their potential cannot be fully exploited because of
the increased complexity of their design process, particularly the long and complex
generation of partial reconfigurations.
To illustrate the initial motivation for the research in this thesis, Figure 3.1
describes a scenario where a system-level designer wants to explore the benefits
of RTR. One of the common strategies to cope with the SoC design complexity
gap [33] is to perform system-level design and design exploration at high levels
of abstraction. Nevertheless, system-level designers may also want to try run-time
reconfiguration in their projects, attracted by the advantages in speed-up and
energy saving offered by reconfigurable computing. Then, some questions start to
arise in the designer's mind, such as: shall I use a fine-grained, a coarse-grained,
or maybe a full high-performance reconfigurable system? Frequently, the answer is
something in the middle, a virtual platform.
Thus, the problem is a new kind of design gap, the one between high-level
design approaches and reconfigurable computing technology. On one hand,
system-level designers forget to consider, or wrongly assume, important details when
they abstract their models. On the other hand, hardware designers are not
making reconfiguration friendly enough. In this sense, the challenge is to investigate
alternative solutions to close this gap and to move people toward the reconfigurable
computing revolution. Hence, to bridge this gap, this thesis devised the RecoBlock
3.1.2 Idea
The following describes the line of reasoning that enabled this thesis to address
the research question presented in Chapter 1. Design-and-reuse through
IP-cores is one of the foundations of any SoC [13] [33], which even outlines an
ideal IP business model. Consequently, the author wonders whether the same strategies
can be used to close the new gap between system design and reconfigurable computing.
This presumption makes sense, particularly considering that in the near future FPGAs
are expected to hold heterogeneous and multicore SoCs, which contrasts with the
old glue-logic vision of FPGA design. In this context, reconfigurable slots have to
communicate with a processor system through an interconnection fabric. For that
Figure 3.1: The system-level design and reconfigurable computing gap. System-level designers
may want to try run-time reconfiguration, attracted by the advantages in speed-up and energy
saving. Sometimes, the design time and cost of a custom implementation, or the selection between
fine-grained, coarse-grained, high-performance reconfigurable systems, or virtual platforms, will
discourage the designer. System-level designers forget important implementation details when
they abstract their models, whereas hardware designers are not making reconfiguration friendly
enough.
reason, any reconfigurable function should be packed inside an IP-core with
appropriate master and slave ports that follow the interconnection fabric protocols.
In this way, the chances of reuse increase and productivity can be maximized. The
idea described above may seem evident, but as simple as it sounds, it has not
always been realized or adopted. The results of this thesis exemplify the advantages
of this approach in flexibility, reusability, and productivity.
Figure 3.2: The RecoBlock Concept. A framework composed of a reusable SoC template, run-time
reconfigurable plug-and-play IP-cores (RecoBlocks), a library of partial reconfiguration bitstreams,
hardware and configware software layer with API and drivers, and a design methodology.
Additionally, highly flexible links selected by software help to define different arrays of reconfigurable
architectures during run-time. The RecoBlock IP-core also acts as an implicit hardware accelerator
with additional built-in performance-oriented features [Paper I].
uration, the IP-core has a built-in buffer that improves performance and alleviates
contention for shared RAM. The concept supports data-driven applications and
context switching during reconfiguration.
The following scenario, illustrated in Figure 3.2, assumes a model of a system in
which one process is resource-hungry. This condition makes it an ideal candidate to
be mapped onto hardware, so its functions are allocated in the IP-core. That is
nothing new: embedded systems designers have long been using FPGAs to accelerate
functions, even with IP-cores. The difference is that, after a while, a second
candidate is identified, and the system can reconfigure that function in the embedded
Reconfigurable Partition (RP) of the same IP-core at run-time, while the processor is
running. Besides, since there is no coarse-grained reconfigurable custom architecture
in the partition, the processes are mapped almost directly without any particular
compiler, provided that the partial reconfigurations are available beforehand. The
VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL)
description could become available via a high-level synthesis tool such as C2H or
HDL Coder [24].
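The scenario can be pictured with a small behavioral model, given here only as a sketch. The class `RecoBlockModel`, the library keys `f_a.bit` and `f_b.bit`, and the two arithmetic functions are hypothetical stand-ins; real partial bitstreams are binary files, not Python callables.

```python
from typing import Callable, Dict, Optional

# Hypothetical library of partial reconfigurations, prepared beforehand.
# Each "bitstream" is modeled as the function it would implement in the RP.
LIBRARY: Dict[str, Callable[[int], int]] = {
    "f_a.bit": lambda x: 2 * x + 1,   # first resource-hungry candidate
    "f_b.bit": lambda x: x ^ 0xFF,    # second candidate, identified later
}


class RecoBlockModel:
    """Behavioral stand-in for one IP-core with an embedded reconfigurable partition."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.loaded: Optional[str] = None
        self._function: Optional[Callable[[int], int]] = None
        self.reconfig_count = 0

    def reconfigure(self, bitstream: str) -> None:
        # Only configurations that already exist in the library can be loaded.
        if bitstream not in LIBRARY:
            raise KeyError(f"{bitstream} is not in the library")
        self._function = LIBRARY[bitstream]
        self.loaded = bitstream
        self.reconfig_count += 1

    def execute(self, x: int) -> int:
        if self._function is None:
            raise RuntimeError("no function loaded in the RP")
        return self._function(x)
```

The same object serves both functions in turn, which mirrors the point of the scenario: the IP-core is instantiated once and only the content of its partition changes while the rest of the system keeps running.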
Regarding the choices for implementation, the RecoBlock concept is supposed
to be general and not limited to any FPGA technology, hence its importance.
However, the concept has been demonstrated using Xilinx devices (i.e., Virtex-6)
and tools (i.e., ISE, XPS and PlanAhead, versions 13.x to 14.x) for the following
reasons. By the time this concept was conceived, Altera did not have devices
supporting full partial and run-time reconfiguration; their limited reconfiguration
only allowed the switching of I/O standards (transceivers) during run-time [4]. After
a few years, they launched devices with RTR capability, i.e., Stratix V, and
gradually introduced tool support [4]. In contrast, Xilinx had been leading RTR
technologies for years, including tool support. During the last years of the research
presented in this thesis, Zynq devices, Vivado tools, and the All Programmable and
EPP approach became available [63]. The "All Programmable SoC" concept also puts
emphasis on reusability via extensible platforms and all-in-one solutions so that
software designers can handle the complexity of reconfigurable logic, and it is a
complete solution with strong company support. It remains exciting to investigate
the implementation of RecoBlock arrays in the Zynq and Vivado environment.
Hardware As a consequence, the next question is which type of platform can
support the RecoBlock Concept. Figure 3.3 illustrates the RecoBlock platform based
on a Xilinx Virtex-6 that supports run-time partial reconfiguration [59]. At first
glance, it looks like any other SoC platform because part of the research philosophy
is to keep solutions simplified and economical while enabling advanced research.
This classical Xilinx FPGA-based processor system (SoC) is called the RecoBlock
template, over which custom enhancements have been progressively added. Among
the special enhancements to the RecoBlock template, the most important are (a)
the interconnect fabric and (b) the RecoBlock array. First, the interconnection
adopted early on is now the de facto standard for embedded applications: the AMBA
AXI4 interconnect [60]. This custom AXI4 interconnect is configured in sparse
crossbar mode with a shared-address multiple-data (SAMD) topology. The purpose
is to allow independent and concurrent transfers, which in turn enables link
flexibility. Any RecoBlock's master can transfer data to any other RecoBlock's
slave (even its own) without passing through memory. Although they can also transfer
data to memory, no Direct Memory Access (DMA) logic is enabled inside the
IP-core architecture, to keep resource consumption low and to avoid memory
contention. Secondly, the RecoBlock array is flexible: its architecture can be
modified via software thanks to the flexible links, and its functions are loaded and
executed at run-time via software thanks to its custom internal structure, described
as follows.
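A simplified way to reason about the link flexibility described above is to treat the sparse crossbar as a set of permitted master-slave pairs. The sketch below is an assumption-laden model, not the behavior of the Xilinx AXI interconnect itself: it only checks data-path conflicts and ignores address-channel arbitration. The port names (`rb0`, `rb1`, `mem`, `cpu`) are hypothetical.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[str, str]  # (master port, slave port)

# Hypothetical sparse connectivity: only these master-slave pairs are wired.
CONNECTIVITY: Set[Link] = {
    ("rb0", "rb0"), ("rb0", "rb1"), ("rb0", "mem"),
    ("rb1", "rb0"), ("rb1", "rb1"), ("rb1", "mem"),
}


def can_run_concurrently(transfers: Iterable[Link],
                         connectivity: Set[Link] = CONNECTIVITY) -> bool:
    """Return True if every transfer uses a wired link and no port is shared.

    Simplifying assumption: two transfers conflict only when they use the same
    master or the same slave; address arbitration is not modeled.
    """
    busy_masters: Set[str] = set()
    busy_slaves: Set[str] = set()
    for master, slave in transfers:
        if (master, slave) not in connectivity:
            return False            # link does not exist in the sparse crossbar
        if master in busy_masters or slave in busy_slaves:
            return False            # port already used by another transfer
        busy_masters.add(master)
        busy_slaves.add(slave)
    return True
```

Under this model, two RecoBlocks can exchange data in both directions at once, whereas two simultaneous transfers to memory contend for the same slave, which is the contention the text seeks to avoid.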
Figure 3.3: The basic RecoBlock platform. It is composed of a RecoBlock template and custom
enhancements like the RecoBlock array and customized AXI4-interconnect. The template is
basically a classical Xilinx FPGA-based processor (SoC) architecture. The RecoBlock concept looks
for simplified, economical, and reusable approaches but enables advanced research on RTR
computing [Paper I]. This platform has been gradually improved. Paper V presents the current
state that includes ten RecoBlock IP-cores, self-monitoring, and other additions.
capable of single transactions, whereas the AXI4 master is capable of single and
burst transactions, which improves the general performance. Slave registers
provide status, control bits, and the input data. In every RTR system, the initial
state in the RP is unknown after reconfiguration, so a software reset is needed
to reset the function loaded in the partition. This requirement is a design
challenge because the context of the process (e.g., status bits, results) must be
saved between reconfigurations, whereas using the global reset resets the whole
system. The execution-and-decoupling module disengages the outputs of the RP during
reconfiguration (recommended in any reconfigurable system) and controls the start
of execution. Finally, the first-in first-out (FIFO) module holds results and
works as a built-in buffer for inter-process communication, decoupling in streaming
models, and context switching. This summary describes the main components that
enable hardware and configware (CW). In addition, a software structure is also
required to control the platform during run time.
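The interplay between reconfiguration, software reset, and the built-in FIFO can be sketched as a small behavioral model. The class `RecoBlockCore`, its attribute names, and the FIFO depth are assumptions chosen for illustration; they are not the signal or register names of the actual IP-core.

```python
from collections import deque
from typing import Callable, Deque, Optional


class RecoBlockCore:
    """Behavioral sketch: the FIFO sits in the static region, the function in the RP."""

    def __init__(self, fifo_depth: int = 4) -> None:
        self.fifo_depth = fifo_depth
        self.fifo: Deque[int] = deque()      # survives reconfiguration
        self.function: Optional[Callable[[int], int]] = None
        self.rp_state_valid = False          # RP state is unknown until a soft reset
        self.decoupled = False               # True while RP outputs are disengaged

    def reconfigure(self, function: Callable[[int], int]) -> None:
        self.decoupled = True                # disengage RP outputs during reconfiguration
        self.function = function
        self.rp_state_valid = False          # new logic starts in an unknown state
        self.decoupled = False

    def soft_reset(self) -> None:
        # Resets only the logic in the RP; the FIFO content (context) is kept.
        self.rp_state_valid = True

    def execute(self, x: int) -> None:
        if self.function is None or self.decoupled or not self.rp_state_valid:
            raise RuntimeError("RP not ready: load a function and issue a soft reset")
        if len(self.fifo) >= self.fifo_depth:
            raise OverflowError("built-in FIFO is full")
        self.fifo.append(self.function(x))   # result is held in the built-in buffer
```

A global reset would be modeled by clearing `fifo` as well, which is exactly why the text prefers a software reset scoped to the partition: results written before a reconfiguration are still available afterwards.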
Software Figure 3.5 describes the abstraction layer stack and the supporting
application programming interface (API) functions. As in other SoCs, the abstraction
layer stack includes the HW components at the lowest level and the drivers,
libraries, and API at higher levels. The left side identifies the layers generated by
vendor tools, i.e., the (Xilinx) Board Support Package (BSP). The right side shows the
equivalent layers that were developed for the reconfigurable features. The API
functions are divided into two types, reconfiguration and transaction. First, the
Figure 3.4: The RecoBlock RTR IP-core. The Reconfigurable Partition (RP) is the only
reconfigurable region; the rest is part of the static region [Paper I].
reconfiguration-oriented functions control features like software reset, execution
start, and configuration loading. Secondly, the transaction-oriented functions control
data movement between RecoBlocks and memory. Using this layer, the configurable links
and array structure are actually defined. Transactions can be single between two
RecoBlocks and either single or burst between one RecoBlock and memory.
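The transaction rule stated above is compact enough to be written as a predicate. The following is only a sketch: the enumerations and the function `is_valid_transaction` are hypothetical names and are not part of the RecoBlock API.

```python
from enum import Enum


class Endpoint(Enum):
    RECOBLOCK = "recoblock"
    MEMORY = "memory"


class Kind(Enum):
    SINGLE = "single"
    BURST = "burst"


def is_valid_transaction(src: Endpoint, dst: Endpoint, kind: Kind) -> bool:
    """Encode the rule: RecoBlock-to-RecoBlock is single only;
    RecoBlock-to-memory (in either direction) may be single or burst."""
    if src is Endpoint.RECOBLOCK and dst is Endpoint.RECOBLOCK:
        return kind is Kind.SINGLE
    if Endpoint.RECOBLOCK in (src, dst) and Endpoint.MEMORY in (src, dst):
        return True
    # Memory-to-memory moves are outside the scope of this layer in the sketch.
    return False
```

A software layer that validates requests this way before touching the hardware would reject a burst between two RecoBlocks early, instead of leaving the error to surface on the bus.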
Design Flow Xilinx offers a design flow for embedded systems development [61]
and another for reconfiguration [59] [62]; however, they are not tailored to any
custom platform, and in particular they are not aware of specific features such as
those in the RecoBlock platform and IP-core. Therefore, a reliable design flow for the
RecoBlock concept has been identified and organized based on Xilinx tools; it is
depicted in Figure 3.6. Finding an adequate design flow can be compared to solving a
puzzle: most of the pieces are provided by the company, but they are not
aware of the custom RecoBlock platform.
The center regions in Figure 3.6 show the design flow for the reconfigurable
system designer. The left area is the system-level design model where the project
specification starts (this area is suggested for implementation; Matlab models have
been used in later publications). First, in Xilinx Platform Studio (XPS),
the basic RecoBlock Platform project is opened and, depending on the system
requirements in the high-level model, more RecoBlocks can be added from the IP
library and AXI connections are established. The XML description of the complete
system is generated and taken to the software flow. Then in (Xilinx) Software Development
Figure 3.5: Abstraction layer stack (software, hardware, configware). Abstraction layer and types
of RecoBlock IP-core transactions.
Kit (SDK), the libraries and drivers are generated as a BSP. A new application in
C is created based on this BSP, and the software functions are imported from the
RecoBlock API.
While the application is running and a reconfiguration is started by one of the
API functions, a partial reconfiguration file is taken from the library, parsed, and
downloaded to the specified RecoBlock. Further details of this design flow are
presented in [Paper I] and [Paper II].
Case Study One of the aims of this work is to enable flexibility and reusability in
run-time reconfigurable architectures. These characteristics are evaluated through
the case study illustrated in Figure 3.7. The RecoBlock platform is used to
implement a model of function adaptivity [48] [67], which abstracts an encoder-decoder
streaming application. In summary, there are four processes in the model: source
process (S), destination process (D), encoder (P0 ), and decoder (P1 ). S and D are
implemented by software that runs on the CPU. P0 and P1 are implemented by run-time
reconfiguration in two RecoBlock IP-cores. Data is sent from S to D, but along the
way, data is encoded and then decoded in P0 and P1 respectively. Since data
rates can be different, decoupling buffers FS and FD are required. The buffer F01
in the middle alleviates interprocess communication. F01 and FD are implemented
with the built-in FIFO of each IP-core. FS is implemented in the Synchronous
Dynamic Random Access Memory (SDRAM). Several encoder e() and
Figure 3.6: Basic RecoBlock Design Flow. Tools domains are vertical. Central and Run-Time
sections describe the straightforward flows to reuse the platform. Right section belongs to the
offline recurrent flow when configurations are not in library. Upper-left section is a recommended
design entry [Paper I]. A complementary design flow is presented in [Paper II].
decoder d() functions can be applied. In this application, function adaptivity is not
realized by setting control bits, changing input variables, or switching circuits, but
with actual hardware run-time reconfiguration. During implementation, processes P0
and P1 are mapped into RPs in two RecoBlock IP-cores. Buffers are mapped into
built-in buffers in the IP-cores or standard memory buffers in SDRAM. Finally, S
and D are mapped to MicroBlaze.
The experiments implement the encoder-decoder application using two RecoBlocks
and several pairs of encoder-decoder functions [Paper I]. The flexibility
of the array is demonstrated in each experiment, which combines a reconfiguration
schedule with an architecture array, i.e., spatial or phased. Spatial and phased
are typical structures in reconfigurable computing [21]. Spatial uses independent
resources for each function, whereas phased optimizes space by multiplexing the
execution in time. Phased uses fewer resources but requires methods to
save the context of the current execution for the next one; it therefore shows the
context-switch capability.
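The difference between the two array structures can be sketched in a few lines. The functions `run_spatial` and `run_phased`, the bookkeeping dictionaries, and the sample encoder/decoder pair are assumptions for illustration; they abstract away all timing and only show where the context has to be kept.

```python
from typing import Callable, Dict, List, Tuple

Fn = Callable[[int], int]


def run_spatial(data: List[int], e: Fn, d: Fn) -> Tuple[List[int], Dict[str, int]]:
    """Spatial array: encoder and decoder each occupy their own RecoBlock."""
    out = [d(e(x)) for x in data]            # data streams through both blocks
    return out, {"recoblocks": 2, "reconfigurations": 2, "context_words": 0}


def run_phased(data: List[int], e: Fn, d: Fn) -> Tuple[List[int], Dict[str, int]]:
    """Phased array: a single RecoBlock is time-multiplexed between e() and d()."""
    context = [e(x) for x in data]           # phase 1: encoder loaded, results buffered
    out = [d(y) for y in context]            # phase 2: decoder loaded, buffer consumed
    return out, {"recoblocks": 1, "reconfigurations": 2,
                 "context_words": len(context)}


# Hypothetical encoder/decoder pair used only to exercise the sketch.
def encode(x: int) -> int:
    return x + 3


def decode(y: int) -> int:
    return y - 3
```

Both variants produce the same output; phased halves the number of RecoBlocks at the price of holding the intermediate results between the two phases, which is the role the text assigns to context switching.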
Figure 3.7: Case study: A function adaptivity model implemented on RecoBlock. An encoder-
decoder application where there are four processes in the model, source (S), destination (D),
encoder (P0 ), and decoder (P1 ). S and D are implemented by software that runs on CPU. P0 and
P1 are implemented by run-time reconfiguration in two RecoBlock IP-cores, where several encoder
e() and decoder d() functions can be reconfigured. The experiments combine some reconfiguration
schedules with architecture arrays i.e., spatial or phased. Phased uses only one RecoBlock and
requires context-switching [Paper I].
Results Overview
In Paper I the reader can find extended information about the following results.
(1) The physical layout of the FPGA after generation of partial configurations in
PlanAhead [62]. (2) Size and resources assigned to each RP also known as pblock
in PlanAhead terminology. (3) Resource utilization (i.e., FFD, LUTs, DSP48E1) of
each encoder and decoder function compared to the RP. (4) Maximum supported
frequency Fmax without slack after implementation in PlanAhead. (5) Reconfigu-
ration time for each experiment of the case study, and size of bitstream files.
3.1.4 Conclusion
To conclude, this section summarized Paper I and introduced a simplified approach
to designing run-time reconfigurable systems. Paper I presented the design of the
flexible and reusable RecoBlock SoC Platform, characterized mainly by RTR
(plug-and-play) IP-cores, functions loaded at run-time from the library, inter-block
communication configured by SW, and built-in buffers for data-driven operation,
interprocess communication, and context-switching. This work also identified a
suitable design flow and proved flexibility and reusability using a case study of
function adaptivity that combines different encoder and decoder functions, arrays,
and schedules that are reconfigured at run time.
Based on literature research and hands-on experience, this work also defined
a set of recommendations to improve flexibility and reusability and used them to
implement the RecoBlock Concept. The case study verifies the concept and
consequently confirms these recommendations. For instance, an IP-core with an embedded
Reconfigurable Partition (RP) increases flexibility and reusability, since the same
IP-core is reused to instantiate several RecoBlocks whose functions are not static
but flexible due to RTR. Similarly, fixed I/O interfaces in the RP improve reusability
since they facilitate the relocation of the same function to another RecoBlock
(although independent reconfigurations are needed for each RecoBlock). The custom
configuration of the interconnection fabric provides links reconfigurable by software
and, therefore, enables a reconfigurable architecture array, as the case study
demonstrates.
Throughout the entire investigation consolidated in this thesis, these
recommendations were also validated and extended in different realms. For instance,
after concluding Paper V, the author considers that by providing learning capability
to an adaptive embedded system, the system can make decisions that were not explicitly
programmed. Therefore, learning could increase flexibility and reusability since it
reduces the code required for decision making (decide phase) and reuses the rest of
the code that executes functions (act phase), all in different, unexpected situations.
This section summarized the initial state of the RecoBlock concept presented
in detail in Paper I. Chapter 5 provides a summary of the final state. Details of
all incremental enhancements can be found in the papers reprinted in Part II.
The question in this paper was: how good can the performance of the RecoBlock
SoC platform be as a HW accelerator in terms of speed-up and bandwidth? In that
sense, the work in Paper II can be seen as a test drive of the RecoBlock SoC platform
for high-performance computing.
Consequently, some improvements were needed before answering this question
since Paper I demonstrated the RecoBlock concept to load functions and to assemble
architectures, but the library of reconfigurations had simple mathematical and logic
functions, which were useful for demonstrative purposes but not for generating
computational load.
Therefore, an improved but economical method to create configurations for com-
plex algorithms described and verified on high-level models was required. Besides,
the IP-core architecture had to be revised for enabling this new approach and en-
hancing the synchronization and execution mechanism from software.
The research in Paper II is also important in an era in which omnipresent multicore
heterogeneous embedded systems are becoming real. As Figure 3.8 illustrates,
Hartenstein argues that the personal supercomputer is near [20] and that
high-performance computing (HPC) is becoming affordable and feasible [19]. Our mobile
devices already have the computational power of the supercomputers of past decades.
Additionally, in a heterogeneous computing survey [8], the authors predicted that the
efficient use of symmetric multiprocessing is unlikely, whereas a hundred accelerator
cores with a dozen CPU cores is a more sustainable scenario. This prediction
raises another question: does the industry need to redesign the hardware and
software of hundreds of accelerators for each application? What if the cores could
additionally be RTR?
Thus, combining run-time reconfigurability and high performance in multicore
embedded systems will dramatically increase their chances of success, market flex-
ibility, and durability. Besides, this will enable a dream of self-organizing, cooper-
ative, and high-performance cores, which is envisioned in this thesis.
To illustrate the importance of a generic accelerator, an RTR SoC can no longer be
seen as a blank chip where the whole space is used to accelerate one algorithm.
New challenges arise when a heterogeneous RTR SoC is implemented in an FPGA,
where functionalities in the IP-cores can be switched in and out on request. As
Figure 3.9 illustrates, one of the main problems is that classic approaches tend to
increase the design complexity, design time, and resources and to reduce reusability,
maintainability, and portability. But in RTR design there may be a more important
reason, which is relocation. Without standardization of inputs and outputs (I/O),
relocation becomes highly improbable. Figure 3.9 explains one of the reasons to
look for a single one-fit-all IP-core.
3.2.2 Challenges
The feedback received about Paper I motivated the investigation conducted in
Paper II, which aims at answering the research questions illustrated in Figure 3.10.
The challenges associated with these issues are briefly discussed in the following.
Figure 3.10: Challenges regarding algorithm development, a one-fit-all IP-core architecture, and
performance analysis, which are addressed in Paper II.
First, when accelerating complex algorithms in FPGAs with the goal of finding a
reusable method that targets a custom partial and RTR core, the traditional
requirement of a description in a high-level language and a High-level Synthesis (HLS)
tool becomes secondary. This seems hard to believe, given the availability
of complex algorithms already described in C or HDL and the current research
and industry interest in HLS. The problem is that HLS tools are not, and could
not be, aware of the specific architectural requirements of every design and
particularly of RTR design constraints. Additionally, algorithm verification is not
always performed; in the best case, it is done at a high level, in general terms,
sometimes by other parties, or without architectural details. Another concern is the
complexity and extended time required to generate partial reconfigurations (e.g., 30
to 50 minutes). Therefore, in this scenario, it is not wise to start the generation
process of a partial reconfiguration just to discover that, once it is loaded in the
RTR core, the functionality is not what was expected.
Secondly, regarding the IP-core architecture, this work realized that the
configurations need to comply with the RecoBlock architecture, and particularly with
the RP interface, to facilitate relocation, but simultaneously be flexible enough to
accept different algorithms with different I/O specifications. Additionally, the
development method has to be repeatable and use available HLS tools to facilitate its
adoption by other researchers, since the generation of configurations is complex by
itself. To exemplify the reason for this architectural concern, consider the challenge
of I/O interfaces in RTR design. Given that every algorithm has its own requirements
for the number and size of input and output variables (e.g., f (x, y) = (a, b, c)),
each RP in the IP-core would require that specific I/O configuration. To load a
different reconfiguration during run-time, the I/O settings of the function and the RP
should match. With this premise, designers would need to create a different IP-core
and partial reconfiguration for every function. Thus, this approach without
a standard interface would escalate the design complexity and the resources needed,
and take us back to the early years of FPGA design.
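One way to picture a standard interface is an adapter that exposes every algorithm through the same port shape, regardless of its own signature. The sketch below assumes a port that carries a list of 32-bit words in each direction; the helper `wrap_fixed_port` and both sample functions are hypothetical and do not describe the actual RP interface.

```python
from typing import Callable, List

MASK32 = 0xFFFFFFFF  # assumed word width of the fixed port


def wrap_fixed_port(fn: Callable[..., object], n_inputs: int) -> Callable[[List[int]], List[int]]:
    """Adapt a function of any arity to a fixed 'list of words in, list of words out' port."""
    def fixed_port(words: List[int]) -> List[int]:
        if len(words) != n_inputs:
            raise ValueError(f"expected {n_inputs} input words, got {len(words)}")
        result = fn(*words)
        if not isinstance(result, tuple):
            result = (result,)               # single outputs become one-word lists
        return [int(r) & MASK32 for r in result]
    return fixed_port


# Two hypothetical algorithms with different I/O shapes.
def f(x: int, y: int):
    return x + y, x - y, x * y               # f(x, y) = (a, b, c)


def g(x: int) -> int:
    return x * x


port_f = wrap_fixed_port(f, 2)
port_g = wrap_fixed_port(g, 1)
```

Because `port_f` and `port_g` share one calling convention, either could occupy the same slot; without the adapter, each signature would need its own container, which is the escalation the text warns about.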
Finally, no standard benchmarking tools are available for conducting performance
analysis in reconfigurable computing or RTR. Therefore, the architecture
must first be analyzed as a reconfigurable computing architecture to identify the
sequential phases involved in reconfiguration and algorithm execution. These phases
determine the overall acceleration performance following Amdahl's law.
Additionally, custom instrumentation methods are required, at least time measurement
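For reference, Amdahl's law and the effect of an extra sequential phase can be written as two short functions. The first is the standard formula; the second is a simplified extension that treats reconfiguration time as a purely sequential overhead, which is an assumption of this sketch and not the phase model of Paper II. All numbers used with it are illustrative.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Amdahl's law: overall speed-up when a fraction p of the run time is accelerated by s."""
    if not 0.0 <= p <= 1.0 or s <= 0.0:
        raise ValueError("p must be in [0, 1] and s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def speedup_with_reconfiguration(t_total: float, p: float, s: float,
                                 t_reconfig: float) -> float:
    """Speed-up when a sequential reconfiguration phase is added to the accelerated run.

    Assumption: reconfiguration cannot overlap with execution, so its time is
    added to the accelerated run time as a whole.
    """
    t_accelerated = (1.0 - p) * t_total + (p * t_total) / s + t_reconfig
    return t_total / t_accelerated
```

With half of a 100 ms run accelerated by a factor of two, a 25 ms reconfiguration phase cancels the whole gain, which shows why the sequential phases have to be identified and measured before any speed-up figure is meaningful.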
3.2.3 Idea
This work proposes that the answer to the first two challenges already exists, but
there is no way to connect and unify the existing flows. Figure 3.11 shows individual
development flows that exist today but cover only part of the whole process. There are
tools for modeling, hardware and software co-design, and configware design.
However, two issues prevent their utilization: first, they are not connected, and
secondly, they are not aware of custom reconfigurable architectures. Hence, the
missing link is based on a revised RecoBlock IP-core architecture that combines the
existing flows, as Figure 3.11 shows.
Figure 3.11: Unconnected flows in RTR SoC design. Individual development flows that exist
today but cover only part of the whole process. They are not connected and are not aware of
custom reconfigurable architectures such as the RTR RecoBlock IP-core.
For instance, regarding HLS and HW-SW co-design, Matlab tools for HW and
SW code generation, i.e., HDL Coder and Matlab Coder respectively, provide
automated flows for a limited set of third-party FPGA boards [24]. Most of these
flows run C code on an embedded processor or, in the best case, implement HDL code on
standalone FPGA boards. Only in recent years have tools gradually
added support for some SoC implementations [58]. However, there is still no complete
support for RTR systems and particularly for custom modular designs based on
multiple RTR IP-cores, such as the RecoBlock concept [Paper II]. On the other
hand, during the configware flow, the VHDL description should first be merged
with the reusable static part of the IP-core architecture before the partial
reconfigurations can be generated. Besides, verification and test-vector generation,
from the modeling stages to on-board testing, avoid disappointing surprises when a
reconfiguration has been created after a long process but its behavior does not match
the original algorithm. To clarify, using HLS tools to generate VHDL files for the
RecoBlock platform might look like a straightforward solution. However, it is far
from simple. Although there is much interest in developing High-level Synthesis
(HLS) tools, few of them are reliable. Most importantly, HLS covers only one
stage of the design process.
In summary, generating reconfigurations from complex algorithms is a complex
process. Hence, flexibility, reusability, portability, and verification must be
considered at all times. For a successful utilization of the RecoBlock IP-core for the
acceleration of complex algorithms, HW, SW, CW, and even test vectors need to be
generated using different tools. The following sections introduce the devised
development flow and the modified IP-core, which are thoroughly presented in Paper II.
3.2.4 Summary
Algorithm Development Flow
Figure 3.12: Simplified flow for algorithm and reconfiguration development in RTR RecoBlock
IP-core. The modified IP-core, with an additional layer structure, unifies the flows and allows the
algorithm (fx) transformation from model to bitstream file while keeping the execution logic and
the rest of the IP-core constant for all algorithms.
The devised development flow uses COTS tools to encourage the adoption of
RTR and avoid dealing with the lack of support when using open source solutions.
The flow uses Xilinx and Mathworks tools, but the concept could be generalized.
Detailed information is found in Paper II.
The algorithm, named fx for simplicity, is described, simulated, and verified in
Matlab. HDL Coder [24] performs automatic verification and generates the HDL
description. Matlab Coder creates the C description, which is useful for running the
algorithm on the embedded processor. In this way, the application designer can
choose to run the same algorithm either on the CPU (software) or in a RecoBlock core
(hardware) while still being able to correlate results using the same test vectors.
The key to this unified flow is how the IP-core is transformed along the different
tools in this production line, which is enabled by the new layer structure. In the
VHDL abstraction, the IP-core now has four layers: core (AXI interfaces), user (user
registers, buffer), rp (the partial reconfiguration partition), and fx (the algorithm
obtained from HDL Coder). The innovation is the sublayer (fx.vhd), which separates the
algorithm from the rest of the execution logic (fire-generator) that is kept constant
although it lies in the RP logic. Figure 3.12 shows how this fx description is
processed: it is wrapped with the rp file, then converted into netlist files and
later into a bitstream file. The other two layers (core and user) represent the custom
IP-core that is available in the XPS IP library and remains constant for any
algorithm. A semi-automated script helps to process these files.
actual algorithm and is the only part extracted from HDL Coder [24], the rest (e.g.,
control-path, memory) are not suitable for a SoC environment and are dropped or
replaced with custom components (e.g., fire-generator, built-in FIFO). No DMA is
enabled to avoid memory contention, race conditions, arbitration and scheduling
[Figure 3.13 diagram: "One-fit-all RecoBlock architecture". Blocks: AXI4-Lite Slave IPIF, AXI4 Burst Master IPIF, Master Regs, Slave Regs, Burst & Cmd, RP (RcBk2), I/O compatible with Matlab function. Signals: clk, clk_fsm, fire_clk, fifo_wr, trg_fsm, Soft Reset, reset, y_out(31:0). Annotation: "CONSTANT: to increase HW/SW reusability and maintainability".]
Figure 3.13: The RecoBlock IP-core as a universal function container. The Reconfigurable
Partition (RP) is now a wrapper that holds the FX-box that contains the actual algorithm, an
I/O interface compatible with the Matlab model and the RecoBlock, and the Fire-generator that
allows synchronization and execution controlled by software.
API consists of only four files (*.c and *.h) and the Xilinx-generated xparameters.h.
In this way, the application programmer reuses the same functions to control
hardware features such as software reset, algorithm execution, data write and read,
burst read, time stamps for performance analysis, and run-time reconfiguration.
Further details are described in Paper II.
Figure 3.14: Abstract view of the RTR RecoBlock SoC as a reconfigurable computing
architecture. The SoC can be visualized as a partition of three flows or paths, i.e., control, data,
and reconfiguration path. For performance analysis, the algorithm execution is decomposed into
phases or sequential segments, i.e., reconfiguration of the algorithm bitstream in the IP-core (RTR),
write data (WR), execution in the IP-core (EXE), and read results (RD).
Path includes the source (Xin[ ]) and destination (Yout[ ]) arrays in SDRAM that hold
the input and output data streams, the data input (DIN) and output (DOUT) slave
registers, the algorithm holder (FX) and the inter-process communication (FIFO).
Finally, (3) the Reconfiguration Path includes the pre-fetched buffer (CONFIG
BS), the internal configuration access port (ICAP), and the FPGA Configuration
Memory (CM). This decomposition is fundamental to conducting the performance analysis.
different targets (DUT), the MicroBlaze and the RecoBlock IP-core. It also verifies
the influence that burst transfers, reconfiguration pre-fetch, and the type of IP-core
implementation (reconfigurable or non-reconfigurable) have on performance. In
summary, the results show the following. Burst reads increase both speedup and
throughput by 2x. The total speedup (including acceleration and burst
reads) ranges from 2 to 3 orders of magnitude compared to execution on the MicroBlaze
soft processor. The speedup obtained from burst reads increases with the size
of the data. Reconfiguration pre-fetch reduces the RTR time by 80% (a 5x speedup
compared to direct reconfiguration from CF memory). Finally, the
acceleration is greatly affected by the sequential calls to the phases of algorithm
execution and by the SW overhead. Further details are presented in Paper II.
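The link between an 80% reduction in RTR time and a 5x speedup follows from speedup = t_old / t_new = 1 / (1 - 0.8). The following minimal sketch illustrates this, and also how the sequential phases (RTR, WR, EXE, RD) add up to the total time. The phase times are hypothetical values chosen only for illustration; they are not measurements from Paper II.

```python
def speedup_from_reduction(reduction):
    """Speedup t_old / t_new when a time is reduced by `reduction` (0..1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def total_time(phases):
    """Phases run sequentially, so the total is the sum of the phase times."""
    return sum(phases.values())


# Hypothetical phase times in milliseconds (illustrative, not measured).
direct = {"RTR": 50.0, "WR": 4.0, "EXE": 2.0, "RD": 4.0}
# Pre-fetch shortens only the RTR phase by 80%; burst reads halve the RD phase.
improved = dict(direct, RTR=direct["RTR"] * (1 - 0.80), RD=direct["RD"] / 2)

rtr_speedup = speedup_from_reduction(0.80)
overall_speedup = total_time(direct) / total_time(improved)
print(f"RTR speedup: {rtr_speedup:.1f}x, overall: {overall_speedup:.2f}x")
```

Because the phases are called sequentially, the overall speedup in this sketch stays below the speedup of the RTR phase alone, which is consistent with the observation that sequential phase calls limit the achievable acceleration.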
3.2.5 Conclusion
This section introduced the research in Paper II, which presents a methodology
to develop and verify complex algorithms and to generate highly generic RTR
IP-cores to accelerate them. The accelerators can also be used as static non-RTR
functions. The methodology increases controllability and enables data-flow driven
execution, relocation, reuse, and portability, almost independently of the accelerated
function. The workflow uses commonly available tools with a good cost-performance
trade-off. Experiments show total speedups ranging from 2 to 3 orders of magnitude
compared to soft processor execution.
Chapter 4
Self-healing Run-time
Reconfigurable Fault-tolerance
"The fossil record implies trial and error, the inability to anticipate
the future, features inconsistent with a Great Designer (though not a
Designer of a more remote and indirect temperament.)"
4.1 Motivation
A system can be highly reliable but still not capable of tolerating, masking, or
recovering from faults when they occur. Hence, fault-tolerance is important even
when current FPGA fabrication is highly reliable. This section introduces the idea
of having a group of reconfigurable cores that self-organize during run-time to
create hardware redundant structures and consequently to provide fault-tolerance to
the functions accelerated on those cores. These dynamically reconfigurable structures
are called run-time reconfigurable (RTR) FT schemes [Paper III] [Paper IV], or
alternatively dynamic fault-tolerance (dynFT). The Upset-Fault-Observer (UFO) is an
advanced case of the concept of dynFT.
This approach is the first step to enable the design of systems that overcome the
scalability cost of hardware redundancy and adaptively trigger the best self-healing
strategy according to execution environment metrics or mission state. The importance
of this kind of system was illustrated in Chapter 1 (Figure 1.1). SEUs are bit
fluctuations in the configuration memory of SRAM-based FPGAs, where their logic and
interconnections are defined. SEUs are not considered permanent faults, but they
represent a dependability issue. In such a scenario, if cosmic radiation produces SEUs
in the configuration memory of the FPGA-based on-board computer and there is no
earth control, the computer must adaptively select the best FT structure, i.e., the
one that produces optimal energy and resource consumption. This sort of
self-optimization could not be enabled with classical rigid FT structures such as
those shown in Figure 4.1, which in the best case would allow reconfiguration of the
function but not of the structure itself.
4.2 Challenges
As stated in Chapter 1, the dependability of computing systems has received
increased interest in recent years due to issues brought about by integration
technology reaching atomic scales and problems affecting specific technologies such
as FPGAs. Besides, embedded systems are becoming deeply immersed in life-critical
applications. At atomic scales, logic states will be stored in smaller transistors.
With fewer atoms available, bits of information will become less stable
during lifetime, and fabrication defects will yield less reliable chips. Cosmic,
atmospheric and electromagnetic radiation can produce SEUs in system memories
and configuration memories of FPGAs. Design complexity and human errors may
also increase failure rates. Therefore, dependability or the ability of a system to
deliver its intended level of service in the presence of faults [11] is every day more
State-of-the-art studies in fault-tolerance design also expose some issues. Among
other FT design techniques, a combination of hardware redundancy and RTR is
typically applied at different granularity and abstraction levels (e.g., from gate
level to component or even board level). As illustrated in Figure 4.1, most state-of-the-art
[Figure 4.1 diagram: "Trends in Fault tolerance (Hardware redundancy + Reconfiguration)". Recoverable panel labels: (c) Application specific designs with fixed buses, no processor, maybe reconfigurable; (d) Partial RTR FPGA SoC, but fixed PR structure and FT logic. Problem: poor scalability, programmability, low reusability, small performance-cost ratio (e.g., area, power, resources).]
Figure 4.1: Issues in fault-tolerant approaches that combine hardware redundancy and run-time
reconfiguration. In summary, there is poor scalability, low programmability, low reusability, and
small performance-cost ratio.
approaches succeed in their purposes, but they ignore the above-mentioned
implications of integration scales. Furthermore, they fail to consider that
application-specific designs with fixed buses have poor scalability and
programmability, that board or component approaches have low performance relative
to their power and area utilization, and that designs exploiting RTR regions to
create modular redundancy are still rigid since they offer function reconfiguration
but not FT structure reconfiguration. These technological and design issues demand
paradigm shifts, e.g., life-time management of fabrication defects and variability
of logic states, embedded test, on-line monitoring, and post-fabrication
reconfigurability [10].
[Figure 4.2 diagram: (a) hardware-accelerated functions in RTR IP-cores; (b) TMR, TMR-with-spare, and Duplication With Comparison (DWC).]
Figure 4.2: Reconfigurable TMR schemes based on RTR IP-cores. In (a), an FPGA-based SoC
with RTR IP-cores. Functions F, G, and H are accelerated in hardware but they are not
fault-tolerant. (b) Reconfigurable fault-tolerant schemes can be strategically assembled during
run-time to provide fault-tolerance to those functions, i.e., Triple-Modular Redundancy (TMR),
TMR-with-spare (TMRwS), and DwC. An economic scheme (DwC) can be selected to optimize
resources and performance, although fault-tolerance is decreased (i.e., detect only). TMRwS
provides better fault-tolerance (i.e., locate, repair, and replace), but consumes more resources.
Hardware redundancy schemes are based on several identical units that operate
with common inputs and ideally produce the same outputs. A majority voter
masks faults and a comparator detects faults. TMR schemes mask only one fault,
whereas DwC cannot mask faults but detects one fault and saves resources. A
spare is used to replace faulty units, but it requires an improved voter that detects
and locates the faulty unit. Selecting a scheme represents a trade-off between the
level of dependability achieved (i.e., detection, location, masking) and resource
consumption. Additional information is found in [11] [15].
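The difference between masking and detection can be sketched in a few lines. The following is a minimal software illustration of a majority voter and a comparator, not the voting function of the thesis. The function names and return conventions are assumptions made for this example.

```python
from collections import Counter


def majority_vote(outputs):
    """TMR-style voter: return (voted_value, indices_of_disagreeing_units).

    With three replicas, a single wrong output is masked and located.
    Returns (None, all indices) when no strict majority exists.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


def compare(a, b):
    """DwC comparator: detects a mismatch but cannot tell which unit failed."""
    return a == b


print(majority_vote([7, 7, 9]))  # unit 2 disagrees; its fault is masked
print(compare(7, 9))             # mismatch detected, faulty unit unknown
```

The voter returns the indices of the disagreeing units, which is the location information a scheme with a spare needs in order to decide which unit to replace.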
Figure 4.2 illustrates the basic idea of the proposed solution. In this example,
functions F, G, and H are accelerated in hardware cores. Next, in order to provide
fault-tolerance to those functions, different FT schemes are strategically assembled
during run-time, i.e., Triple-Modular Redundancy (TMR),
Triple-modular-redundancy-with-spare (TMRwS), and DwC.
[Figure 4.3 diagram: panels (a) Initial State, (b) Self-assembly and normal execution, (c) Service-Map. Recoverable labels: cores RcBk0, RcBk1, RcBk2 holding function F, spare S, VOTER, Processor, Memory, data chunks, self-repair. Service-map values shown: state; 0, 1, ..., fault-threshold; scheme (none, DWC, TMR, TMRwS, UFO); function (none, F, G, H, ..., Spare).]
Figure 4.3: Overview of run-time reconfigurable (RTR) fault-tolerant (FT) schemes based on
RecoBlock IP-cores.
Figure 4.3 provides a brief overview of the management and processes that
enable the RTR FT schemes and also the UFO. Figure 4.3 (a) describes the initial
state of the SoC with several RTR IP-cores. Some functions, e.g., F, are accelerated
in the cores while other RecoBlocks are free. Figure 4.3 (b) shows how a
TMR-with-spare (TMRwS) scheme is assembled during run-time by cloning the function
F in two empty cores and reserving another core as a spare. After a fault is detected
and located by the voting-function, self-repair is performed by reconfiguring F, and
self-replication is performed by relocating F to the spare. During normal execution,
the three replicas compute the same data stream from memory and store their results
in different memory buffers. The voting-function implemented on the processor
compares all results, detects faults, and triggers the appropriate self-repair
process or, in the case of permanent faults, the self-replication process. To return
to the previous state after healing, the system computes data in chunks and manages
checkpoint and roll-back recovery algorithms. Figure 4.3 (c) illustrates how the
service-map, which is a data structure in software whose fields represent each
hardware core, keeps a record of core status, fault count, fault threshold,
associated group number and scheme, and function identifier. In this way, a set of
functions and pointers operating over the service-map can dynamically manage the
processes governing the behavior of all schemes.
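The role of the service-map can be made concrete with a small model. The following sketch is a hypothetical Python rendering, not the software of Paper IV. The field names, the status strings, and the rule that a core is treated as permanently faulty once its fault count reaches the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoreEntry:
    """One service-map record per hardware core (field names are assumed)."""
    status: str = "free"            # "free", "active", "spare", or "bad"
    fault_count: int = 0
    fault_threshold: int = 2
    group: Optional[int] = None     # FT group the core belongs to
    scheme: str = "none"            # "none", "DWC", "TMR", "TMRwS", "UFO"
    function: Optional[str] = None  # e.g. "F", "G", "H"


def assemble_tmr_with_spare(smap: List[CoreEntry], func: str, group: int) -> List[int]:
    """Clone `func` into two free cores and reserve a third one as spare."""
    owner = next(i for i, c in enumerate(smap) if c.function == func)
    free = [i for i, c in enumerate(smap) if c.status == "free"]
    if len(free) < 3:
        raise RuntimeError("not enough free cores for TMRwS")
    replicas = [owner] + free[:2]
    for i in replicas:
        smap[i].status, smap[i].function = "active", func
    smap[free[2]].status = "spare"
    for i in replicas + [free[2]]:
        smap[i].group, smap[i].scheme = group, "TMRwS"
    return replicas


def handle_fault(smap: List[CoreEntry], core: int) -> str:
    """Repair in place until the threshold is hit, then move to the spare."""
    entry = smap[core]
    entry.fault_count += 1
    if entry.fault_count < entry.fault_threshold:
        return "self-repair"        # reconfigure the same core
    spare = next(i for i, c in enumerate(smap)
                 if c.status == "spare" and c.group == entry.group)
    smap[spare].status, smap[spare].function = "active", entry.function
    entry.status, entry.function = "bad", None
    return "self-replicate"         # function relocated to the spare


smap = [CoreEntry() for _ in range(5)]
smap[0].status, smap[0].function = "active", "F"
replicas = assemble_tmr_with_spare(smap, "F", group=1)
first = handle_fault(smap, 1)
second = handle_fault(smap, 1)
print(replicas, first, second)
```

Keeping all scheme state in one table means that assembling, repairing, and replicating reduce to updates of table entries, which is the property the service-map is described as providing.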
Nevertheless, the design complexity is moved towards other directions: (1) software
architecture, (2) interconnection, (3) physical synthesis, and (4) characterization
of FT metrics. These challenges are summarized in the following paragraphs as
an introduction to the detailed implementation presented in Paper IV. Extended
conceptual information is also offered in Paper III.
First, an efficient software model is required because it has to manage all
processes involved in the organization and behavior of reconfigurable schemes, such
as voting, comparison, self-assembly, normal execution, TMR and Array-scan cycles,
cloning, self-repairing, self-replicating, check-pointing, and roll-back recovery.
These features are combined with the reconfiguration management, communication, and
acceleration API functions already discussed in Chapter 3. As the software
complexity grows, these features cannot simply be added to the API set. Therefore, the Operating-
Lastly, characterizing a custom RTR system that also has a custom FT approach
requires custom solutions as well. The performance evaluation of RTR architectures
is still complicated since standard benchmarking tools are still missing in
reconfigurable computing, and it is not difficult to realize why. The challenge is
even greater for new approaches such as the RecoBlock platform and particularly
for the reconfigurable FT schemes. Thus, another contribution in this thesis is
the characterization and evaluation of FT metrics adapted to the unique platform
functionalities. A reconfigurable FT system can be characterized in terms of time
complexity and space complexity. Thus, this work identified, defined and evaluated
the following metrics: assembly-latency (the time to assemble an FT scheme),
recovery-latency (from fault location by the voting function until repair or
replication), detection-latency (from the upset event until fault location by
comparing buffers in memory), and throughput-reduction (due to the CPU's processing
overhead caused by a TMR cycle). Finally, a comparative analysis of the scalability
and core usage of each FT scheme illustrates the advantages of the initial solution
proposed in this chapter, the dynamic creation of reconfigurable FT schemes in a
system with RTR IP-cores.
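The four metrics can be written as simple differences between event timestamps and a throughput ratio. The sketch below is only one possible formalization. The dictionary keys and the definition of throughput-reduction as a fraction of the baseline throughput are assumptions, not definitions taken from Paper IV.

```python
def ft_metrics(t, baseline_throughput, ft_throughput):
    """Derive the four metrics from event timestamps (seconds) in dict `t`.

    Expected keys (hypothetical names): assemble_start, assemble_end,
    upset, located, recovered.
    """
    return {
        "assembly_latency": t["assemble_end"] - t["assemble_start"],
        "detection_latency": t["located"] - t["upset"],
        "recovery_latency": t["recovered"] - t["located"],
        "throughput_reduction": 1.0 - ft_throughput / baseline_throughput,
    }


# Illustrative timestamps and throughputs, not measured values.
events = {"assemble_start": 0.0, "assemble_end": 0.5,
          "upset": 2.0, "located": 2.25, "recovered": 3.0}
metrics = ft_metrics(events, baseline_throughput=100.0, ft_throughput=75.0)
print(metrics)
```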
4.5 Conclusion
In conclusion, the work summarized in this chapter provides a motivation for
Paper III and Paper IV. This chapter introduced the novel idea of a self-healing
adaptive FT SoC that reuses RTR IP-cores to assemble different TMR schemes during
run-time. Paper IV demonstrates the feasibility of the main concepts introduced in
Paper III, particularly the IP-core based RTR TMR schemes. The advanced scheme
UFO is a run-time self-test and recovery strategy that delivers fault-tolerance over
functions accelerated in RTR cores, and that saves resources by running
TMR-scan-cycles periodically. The characterization results in Paper IV demonstrate
that the UFO has better scalability and cost optimization compared to the almost
geometrical trend of the TMRwS and TMR schemes. However, the UFO detection-latency
and throughput are compromised by the duration of the Array-scan-cycle
and the number of cores scanned in the loop. Therefore, it would be useful to
provide embedded-test and economic and scalable lifetime self-healing in RTR SoCs
designed without redundancy.
This work aims to bring a new perspective on the dependability improvement of
FPGA-based SoCs. Fault-tolerance techniques, run-time-reconfiguration, and the
advantages of the RecoBlock concept are the vehicle to fulfill this purpose and to
materialize concepts borrowed from emerging fields such as bio-inspired hardware,
organic computing, and adaptive systems.
On the other hand, this research raises a set of questions, such as what happens
with the fault tolerance in the other FPGA regions. This method protects only
the functions in the reconfigurable cores, but who is protecting the cores? Using
the processor as voter represents a single point of failure and produces overhead.
Answers to these questions can be found in Paper III, which also suggests a roadmap
for future research on adaptation and learning developed in Paper V and Paper VI.
Regarding adaptation, this investigation and implementation enabled the
self-healing and self-assembly autonomic properties and defined an organization model
and road-map for the complete implementation of a self-adaptive system. The
system can provide dynamic FT and self-reconfiguration, which means that it
already enables hardware adaptation. However, to become self-adaptive, the system
must also observe (self-monitoring), decide (analysis, evaluation, and decision
making), and learn. The solution to this new requirement is covered in Chapter 5.
Chapter 5
5.1 Motivation
This section introduces the author's vision of cognitive reconfigurable hardware
systems as a pathway to designing adaptive systems and overcoming the design
complexity of current reconfigurable heterogeneous multicore SoCs, which are
increasingly required to work in uncertain environments and to simultaneously
optimize their resources. The relevance of this concept is illustrated with the
following discussion. Last year, the community was impressed by the announcement of
the first commercial drone-based delivery projects [34], an example of autonomic
systems. Surprisingly, a few days later the news media reported that a delivery
mission was canceled not because of a system failure but due to something
unexpected, weather conditions. This embedded system was unable to re-route the
mission, and the batteries were running out. In this example, people may conclude
that the programmer did not write the code to deal with this situation. However, a
more adequate conclusion is that designers cannot predict, model, or write code for
all unexpected situations. In this context and given that current embedded systems
have enough computational power, it would be valuable if these systems became
self-aware and capable of finding their own solutions. However, challenges like this
require new ways of developing adaptive embedded systems.
5.2 Challenges
[Figure residue: plot of Battery Level, Work Load, and Bandwidth over samples 1 to 10 (scale 0 to 100); panels a) First case and b) Second case, with replicas F, voter V, and Voter (CPU). Annotation: "Is this a good time to accelerate in hardware or enable fault-tolerance (hardware redundancy)?"]
how can uncertainty be simulated with similar computational loads as those found
on embedded systems in real environments.
Regarding previous research, Paper IV enabled the system that dynamically
triggers the RTR FT schemes and conducted a comparative analysis of their
performance and scalability (Chapter 4). This system was implemented following the
software organization model based on autonomic computing, which was conceived
in Paper III. Nevertheless, these papers did not present (1) a solution for
self-monitoring, (2) an analysis of performance metrics and self-awareness, and
(3) the decision mechanism to trigger those FT schemes. Therefore, the work described
in the current chapter supplies the missing parts of the autonomic self-adaptive
system design that is addressed in this thesis.
Figure 5.2: Cognitive Reconfigurable Hardware (CRH) and its relationship with human develop-
ment in psychology. A path for designing self-adaptive and learning reconfigurable systems.
and adaptation to reduce the stress of not failing into standards. During this pro-
cess, we also learn based on our experience. By analogy, Figure 5.2 illustrates that
these processes have their counterpart in hardware. In this way, the author real-
izes that hardware and living beings can share similar cognitive development and
consequently proposes the concept of Cognitive Reconfigurable Hardware (CRH)
that is explained in more detail in Paper V. There is an emphasis on reconfigurable
systems since the author believes that reconfiguration will be fundamental in near
future adaptive embedded FPGAs. Besides, reconfiguration embraces all current
interpretations such as the actual hardware RTR on FPGA, and also parameter
configuration or mode switching on CPU and other IP-cores.
In classic control theory, both perception and action functions are typically
focused on the external (context), whereas in CRH these basic functions are ad-
ditionally concentrated on the internal (self). In CRH, embedded monitoring is
an essential point to enable self-awareness and the subsequent processes. Unfortu-
nately, run-time monitoring for embedded FPGAs is still a work in progress [23].
[Figure 5.3 diagram: RecoBlock IP-cores RcBk 0 to RcBk 7, axi2axi, MDM, GPIO, and Hardware Performance Monitor; regions labeled fixed and run-time (DYNAMIC).]
• The platform illustrated in Figure 5.3 builds on the previous RecoBlock
platform instantiation in Paper II. In hardware, the APM core is added to
enable self-monitoring. In software, the improved code and API functions allow,
among other features, control and configuration of the APM, self-awareness,
decision-making, rule-based learning, and each phase in the sequence-diagram
of the adaptive cycle.
Two significant challenges were experienced. First, the instantiation and
configuration of the APM required several iterations. The APM is compatible
with later versions of the ISE tools and Vivado. It is fully supported for
Zynq-based boards, but not for earlier devices that are still used in the
community, such as Virtex-6 boards. Second, timing closure became a serious
issue for the generation of partial reconfigurations. This issue required
time-consuming exploration of different implementation strategies and constraints
in PlanAhead [62]. Apparently, the high number of AXI signals connected
in each monitoring slot complicated the mapping and routing towards the
reconfigurable blocks, even though resource utilization is not significant.
avoided. Then, after practical exploration, a set of six signals and agents (AXI
cores) were selected since they provided the most representative information
about the system activity. More information about the configuration is found in
Paper V. Figure 5.4 reproduces the plot that illustrates the self-monitored
signal metrics and agents.
Among the reasons for a careful selection of monitored metrics is that slots
are not software configurable. In contrast to slot assignment, counters and
metrics are software controlled. However, up to ten counters can be assigned
to one slot and only one metric to each counter. To exemplify the decision
criteria: the SDRAM carries most of the traffic in a shared-memory system,
the RecoBlock ports show result data from the accelerated functions, and the
Hardware ICAP port shows run-time reconfiguration activity, which is rarely
monitored in current research but is important in reconfigurable computing.
To enable software control, the rbAPMReadReset() function was created to
work as a wall-clock timer that measures elapsed time and simultaneously reads
the counters.
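The read-and-reset behavior described above can be modeled in software as follows. This is a hypothetical Python stand-in intended only to show the idea of returning the elapsed time together with a counter snapshot. It is not the rbAPMReadReset() implementation, and the class name, method names, and metric names are assumptions.

```python
import time


class MonitorModel:
    """Software stand-in for a performance monitor with metric counters."""

    def __init__(self, metrics, clock=time.perf_counter):
        self._clock = clock
        self._counters = {name: 0 for name in metrics}  # one metric per counter
        self._t0 = clock()

    def record(self, metric, nbytes):
        """Simulate bus activity being counted by the hardware."""
        self._counters[metric] += nbytes

    def read_reset(self):
        """Return (elapsed_seconds, counter snapshot), then restart both."""
        now = self._clock()
        elapsed, snapshot = now - self._t0, dict(self._counters)
        self._counters = {name: 0 for name in self._counters}
        self._t0 = now
        return elapsed, snapshot


# A fake clock makes the example deterministic.
ticks = iter([0.0, 2.0, 3.0])
mon = MonitorModel(["rd_sdram", "wr_recoblock"], clock=lambda: next(ticks))
mon.record("rd_sdram", 1024)
mon.record("rd_sdram", 512)
elapsed1, counts1 = mon.read_reset()
elapsed2, counts2 = mon.read_reset()
print(elapsed1, counts1, elapsed2, counts2)
```

Reading and resetting in one call means that each snapshot covers exactly the interval since the previous call, so byte counts can be divided by the elapsed time to obtain a rate per interval.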
[Figure 5.4 plot: byte count (MB) per monitored agent; recoverable legend entries: rd MicroBlaze, wr MicroBlaze, rd SDRAM, wr RecoBlock.]
5.5 Conclusion
The work introduced in this chapter aims at completing the missing observe and
decide phases of the self-healing adaptive system presented in Chapter 4 so that it
becomes self-aware (observe) and learns, using a reinforcement learning Q-algorithm,
to self-optimize the decisions that trigger the self-healing RTR FT schemes. In this
way, the RecoBlock SoC becomes self-adaptive.
A significant contribution in this chapter is the Cognitive Reconfigurable
Hardware (CRH), which defines a model for the cognitive development of reconfigurable
hardware systems consisting of five phases, i.e., self-monitoring, self-awareness,
self-evaluation, self-learning, and self-adaptation. This model helps to overcome the
complexity of current reconfigurable heterogeneous multicore SoCs that have to
work under unpredictable situations such as drone-based delivery or robotic space
exploration. The CRH model borrows concepts from psychology and autonomic
computing. In this regard, three models have been defined at the end of this
research: (1) the OSMF (Chapter 4), which guides the hardware, configware,
and particularly the software organization; (2) the CRH (Chapter 5), which
defines a path to design advanced self-adaptive and learning reconfigurable embedded
systems; and (3) the sequence-diagram (Chapter 5), which delineates the interaction
of processes between threads of software managers and hardware modules during
the training and testing loops. These three models consolidate the guidelines to
design self-adaptive and learning reconfigurable embedded systems.
This research formally defined the experience, performance, and task of a learn-
ing problem that launches the most suitable FT scheme to keep the system per-
formance under an ideal range of memory traffic. This problem is implemented
[Figure 5.5 plots (Matlab simulation): state b versus iteration k, with current (black), next (red), and goal (green) states; action f versus iteration k; reward r versus iteration k; optimal action for each state, with FT schemes F=(0,1,2,3) over byte count intervals B=(0,1,2,...,9); Q-table over the actions 0 none, 1 DC, 2 TMR, 3 UFO.]
Figure 5.5: Q-algorithm applied to learn an optimal policy to select dynamic FT schemes
according to memory transactions performance (Matlab simulation). The simulation shows that
the algorithm finds an optimal policy to select dynamic FT schemes according to the current
state of memory transactions and a performance goal. The discrete state space has ten states b.
The discrete action space has four actions f that trigger any of the four available dynamic FT
schemes: none, DC, TMR, UFO. The optimal policy is shown on the plot "RESULTS". For example,
for state b = 1 the UFO scheme (f = 3) is optimal and maintains the next state exactly in the goal
interval b̄goal = 4 (green line). The "Q-TABLE" illustrates the returns (accumulated rewards) for
all actions (columns) in each state (rows) at the end of the simulation. Higher returns (dark cells)
indicate the best policy for each state (i.e., in state 1, choose UFO). The plot "STATE" shows
current states (black), the goal interval (green), and next states (red) produced after applying an
action f (plot "ACTION"), which in turn generates a reward r (plot "REWARD"). This process
repeats during the learning experience (100 iterations), which is divided into an initial exploration
phase (random selection) and a final exploitation phase (actions associated with higher returns
are selected). Detailed theory and information are presented in [Paper VI].
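The learning loop described in the caption can be sketched as tabular Q-learning with ten states, four actions, and a goal interval of 4. The sketch below is not the Matlab simulation of Paper VI. The environment is a toy assumption in which scheme f adds f intervals of memory traffic, the current state is drawn at random to mimic an uncertain workload, the reward is the negative distance to the goal, and the learning rate, discount factor, and number of iterations are arbitrary choices.

```python
import random

N_STATES, N_ACTIONS, GOAL = 10, 4, 4   # intervals b, schemes f, goal interval
SCHEMES = ["none", "DC", "TMR", "UFO"]


def step(b, f):
    """Toy environment (assumption): scheme f adds f intervals of traffic."""
    nxt = min(b + f, N_STATES - 1)
    reward = -abs(nxt - GOAL)          # best reward (0) when the goal is hit
    return nxt, reward


def q_learning(iterations=2000, alpha=0.5, gamma=0.5, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for k in range(iterations):
        b = rng.randrange(N_STATES)                # current traffic interval
        if k < iterations // 2:                    # exploration: random action
            f = rng.randrange(N_ACTIONS)
        else:                                      # exploitation: greedy action
            f = max(range(N_ACTIONS), key=lambda a: q[b][a])
        nxt, r = step(b, f)
        # Standard one-step Q-learning update.
        q[b][f] += alpha * (r + gamma * max(q[nxt]) - q[b][f])
    return q


q_table = q_learning()
policy = [max(range(N_ACTIONS), key=lambda a: q_table[b][a]) for b in range(N_STATES)]
print({b: SCHEMES[f] for b, f in enumerate(policy)})
```

Under this toy environment, the learned policy for state b = 1 selects the UFO scheme (f = 3), which moves the next state into the goal interval, in line with the example given in the caption.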
Chapter 6
6.1 Conclusion
This thesis began motivated by the need to find a simplified and reusable approach
to developing reconfigurable embedded SoCs and explore the applications of run-
time reconfiguration (RTR). This thesis investigated how the flexibility, reusability,
and productivity in the design process of FPGA-based partial and run-time re-
configurable embedded systems-on-chip can be improved to enable research and
development of novel applications in areas such as hardware acceleration, dynamic
fault-tolerance, self-healing, self-awareness, and self-adaptation.
This thesis demonstrated that the proposed approach based on modular RTR
IP-cores and design-and-reuse principles helps to overcome the design complexity,
maximizes the productivity of RTR embedded SoCs, and enables research in
initially unexpected fields.
Topics and sciences such as reconfigurable computing, dependability and fault-
tolerance, complex adaptive systems, bio-inspired hardware, organic and autonomic
computing, psychology, and machine learning provided inspiration to this investi-
To sum up, some relevant contributions in this thesis are:
• The RecoBlock SoC concept and platform with its flexible and reusable array
of RTR IP-cores.
• The concept and the implementation of the self-healing RTR FT schemes,
especially the UFO, that reuse available RTR IP-cores to self-assemble hardware
redundancy during run-time.
• An adaptive self-aware and fault-tolerant RTR SoC that learns to adapt the
RTR FT schemes to performance goals under uncertainty using rule-based
decision making.
This investigation has also helped to visualize immediate additions that will
bring more benefits or overcome observed drawbacks. (1) A multi-thread OS: the
self-adaptive RecoBlock has API support for reconfiguration, acceleration control,
self-healing management, self-awareness, and learning. However, the system so far
lacks a multi-threading OS with or without a multi-core processor. This addition
will enable the parallelization of tasks assigned to the RecoBlock or the threads
described in the sequence-diagram of Paper V. Otherwise, the software execution
remains sequential, although significant speedups were reported in Paper II for the
acceleration in the RecoBlock architecture. (2) Multi-core processor: a second
soft processor (i.e., MicroBlaze) or a hard multi-core processor (e.g., ARM) will be
used as autonomic-manager to control all reconfiguration, self-healing (e.g., voting
and comparing functions), self-x, and learning processes. (3) Internal thermal
monitoring: this option is typically used to explore self-optimization but has not
been implemented, in order to prioritize less explored and more expressive
monitoring such as the hardware-performance-monitor implemented in Paper V.
Furthermore, some less immediate improvements and future work are suggested in the
next section.
Finally, to overcome the design complexity of modern RTR SoCs and increase
their productivity particularly in real life uncertain environments, this thesis rec-
• adequate system organization at all levels, i.e., hardware, software, and con-
figware, probably based on autonomic computing.
6.2 Outlook
The discovery of new knowledge is one of the rewards of scientific research. However,
knowledge soon leads the researcher to realize that more can still be done, and new
fields can yet be explored. In this thesis, the benefits of a modular approach based
on flexible and reusable RTR cores have been explored in several fields, determined by
an analysis of the state-of-the-art and a perception of future trends. Examples are
the need for cognitive and bio-inspired properties to address the complexity of
emerging and ubiquitous embedded systems, and the advantages of RTR to enable
their implementation in hardware systems. Similarly, the following briefly suggests
some ideas that can be explored, not all of them conventional.
