
IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems 1

Fault-Resilient PCIe Bus with Real-time Error Detection and Correction

Mostafa Darvishi, Ph.D., Member, IEEE,


Independent High-Performance Computing Researcher

ABSTRACT

This paper presents a novel IP design for real-time fault/error detection and recovery on a peripheral component interconnect express (PCIe) bus which interfaces a host system (here a PC) to a slave design, including a processing system and memory transactions, implemented on a Zynq Ultrascale Xilinx Kintex FPGA board (KCU105). The proposed IP design is capable of detecting and correcting different types of PCIe errors on-the-fly.

Index Terms—PCI-Express, Xilinx Kintex Ultrascale FPGA, Error detection and correction.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) are an attractive solution for implementing systems for high-performance computing applications. Among the different protocols for interfacing a host system (e.g., a computer) to a slave high-speed design (e.g., an FPGA design), peripheral component interconnect express (PCIe) is dedicated to interfacing high-speed components and has been utilized for various applications in the past decade [1-5]. PCIe slots are available in different physical configurations, i.e., x1, x4, x8, x16, x32. The number after the x determines how many lanes (how data travels to and from the PCIe card) that PCIe slot has. A PCIe x1 slot has one lane and can move data at one bit per cycle. A PCIe x2 slot has two lanes and can move data at two bits per cycle [6, 7]. Like any high-speed digital data transmission protocol, PCIe is not exempt from bit errors at either the transmitter (TX) or the receiver (RX) end [8].

Several techniques have been proposed in the literature to test the vulnerability of PCIe [9-18]. A versatile hardware MitM architecture capable of interfacing with PCIe bus communications was presented in [9], which is amenable to a small range of applications. Most of the error detection techniques for PCIe rely on off-line and delayed correction algorithms implemented in software. The number of clock cycles needed to process the error detection algorithms will significantly affect the optimum slack for the error correction cycle. The drawbacks of the technique presented in [18] are its application dependency, the exhaustive work needed to find the error, and its inability to detect and correct errors in almost real-time mode. Moreover, increasing the design overhead will significantly affect the performance and usability of the proposed techniques.

A recently proposed method called “Jintide”, which is especially suitable for distributed large-scale CPU clusters, can amortize operation overheads. This scheme is effective in detecting pervasive hardware security issues, including vulnerabilities, backdoors, and hardware Trojans. However, the implementation overhead of the algorithm itself makes it unsuitable for multi-core CPU systems such as Zynq Ultrascale SoCs, which include real-time and application ARM Cortex cores.

The main contribution of this paper is evaluating the sensitivity of the PCIe bus to different types of errors by means of an in-situ error detection and correction mechanism. The system also benefits from a real-time fault injection core for testability and evaluation purposes. The injected error is stored in a DDR memory module which is continuously accessible by the Zynq processing system, the software processor (MicroBlaze), and the host computer. The data discrepancy between the Zynq processor read-out mechanism and the host computer (via PCIe) is handled for correction by the MicroBlaze, and the original non-faulty data is recovered at the expense of only one clock cycle. Instead of employing several separately configured error monitoring algorithms, using a novel integrated IP for error detection and correction on-the-fly allows the whole system to continue operating with no data transmission delay.
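
The read-out comparison described above can be pictured with a minimal Python sketch. It is not the author's implementation: it assumes the shared DDR region can be viewed as a list of 32-bit words, and the names find_discrepancies and flipped_bits, the word width, and the sample data are all hypothetical. The sketch compares the processing-system view with the host (PCIe) view and reports which words and bit positions differ.

```python
from typing import List, Tuple

WORD_BITS = 32  # assumed word width; the paper does not specify one


def find_discrepancies(ps_view: List[int], host_view: List[int]) -> List[Tuple[int, int]]:
    """Return (word_index, xor_mask) for every word that differs between the two read-outs."""
    if len(ps_view) != len(host_view):
        raise ValueError("read-outs must cover the same number of words")
    return [(i, a ^ b) for i, (a, b) in enumerate(zip(ps_view, host_view)) if a != b]


def flipped_bits(mask: int) -> List[int]:
    """List the bit positions set in an XOR mask (bit 0 is the least significant bit)."""
    return [bit for bit in range(WORD_BITS) if (mask >> bit) & 1]


# Hypothetical data: the host view has bit 3 of word 2 flipped.
ps_view = [0xDEADBEEF, 0x00000000, 0x12345678, 0xFFFFFFFF]
host_view = list(ps_view)
host_view[2] ^= 1 << 3

report = find_discrepancies(ps_view, host_view)
for index, mask in report:
    print(f"word {index}: differing bits {flipped_bits(mask)}")
```

The XOR mask is used here because it localizes the fault to individual bits, which is the granularity a single-bit fault injector would disturb; a real design would perform this comparison in hardware rather than in software.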
Several experiments were performed to identify the different types of errors which are common in the PCIe bus and to exercise an in-situ correction mechanism. Due to the page limitation of this paper, results are presented for only a few types of errors; the extended list of results will be shown at the time of presentation.

This paper is structured as follows. The different types of common errors in the PCIe bus are classified and explained in Section II. Section III presents the motivation and methodology of employing an in-situ error detection and correction mechanism. Experiments and the results obtained on a Zynq Ultrascale SoC FPGA are presented in Section IV and, finally, we conclude in Section V.

II. PCIE ERRORS CLASSIFICATION

PCI Express has become the backbone of high-speed systems design. PCIe is a third-generation high performance I/O bus which is used to interconnect peripheral devices in applications such as computing and communication platforms specifically tailored to high-speed applications. It provides the connections between motherboard peripherals, such as graphics cards and Ethernet cards, and the CPU and main memory. Investigation of PCIe error handling on SoC devices has become crucial because of its application dependency. PCIe provides a rich set of mechanisms for error recognition and handling, where error handling may involve only hardware, device-specific software, or even the system software [19]. This paper describes the errors associated with the PCIe interface and the errors that occur during delivery of transactions between the transmitter (host computer) and the receiver (design implemented on the Zynq Ultrascale FPGA). Details of the errors associated with each layer of PCIe, advanced error reporting (AER), advisory errors, and recommendations for multiple error handling are described as follows.

A. PCIe Errors Associated to Each Layer

PCIe is a packet-based serial bus which provides a secure channel for interconnecting high-speed devices together [19] while ensuring a high-speed, high-performance, point-to-point [18], dual simplex, and differential signaling link [16] between a source point (in our design, the host computer) and a destination point (in our design, the targeted Zynq Ultrascale implementation). As shown in Fig. 1, PCIe has a three-layer architecture for communication between the source and destination device points, namely the “Transaction layer”, the “Data Link layer”, and the “Physical layer”. The errors associated with each architectural layer are described as follows [15-19].

Fig. 1. PCIe bus layers architecture established between a host and a target device. In this paper, the host device is a computer, and the target device is a Zynq Ultrascale FPGA evaluation board (Xilinx KCU105).

1. Transaction Layer Errors

Transaction layer (TL) is the first and upper layer, where the packet is formed. The transaction layer only checks the completion of data transfer for end-to-end device interconnections for the occurrence of the following errors:

- ECRC failure,
- Corrupted TLP, i.e., an error occurred in the packet format,
- Time-out failure during a separate packet transaction,
- Flow control protocol error,
- Unsupported requests for data transaction,
- Data corruption, i.e., affected packets,
- Host system abort,
- Unexpected transaction completion, i.e., the slave handshakes for a data receive completion while the transmitter is still transferring data remaining from the same packet,
- Receiver slave device overflow, i.e., the receiver stack fills before the end of the transaction sequence and the slave does not handshake for the transaction completion.

2. Data Link Layer Errors

Data Link layer (DLL) is the middle layer, responsible for packet error and response handling. DLL will check the occurrence of the following errors in the requester, the switch link and the completer:

- LCRC failure for TL packets
- Sequence number check for TL packets
- LCRC failure for DLL packets
- Time-outs
- DLL protocol error

3. Physical Layer Errors

Physical layer (PL) is the third layer, which is responsible for link training and transaction handling at the interface level. PL will check the occurrence of the following errors in the requester, the switch link and the completer:

- Receiver errors, i.e., the receiver reports the receipt of an incomplete or corrupted packet due to an error,
- Link errors, i.e., the receiver receives a corrupted packet due to a broken or affected link between the transmitter and receiver or even between the layer links.

B. Severity of PCIe Errors

Depending on the severity of the PCIe errors and how they affect the data transaction between a source point and a destination point, the errors can be classified as follows [11-17]:

1. Correctible Errors

Correctible errors are those errors which impact the performance of the data transaction between a source point and a destination point, for example through bandwidth reduction or transmission latency. Correctible errors do not impact the data packet and data is not lost. The PCIe hardware remains reliable and functional when correctible errors occur. Correctible errors are handled and fixed by the hardware, and no software intervention is required. Bad DLL packet is one of the correctible errors, and it is handled by the DL layer itself.

2. Non-correctible Errors

Non-correctible errors are considered either “fatal” or “non-fatal”. In the event of a fatal error, system software intervention is required to handle the error and rewrite the packet content fully or partially to repair the data. Non-fatal errors, however, can be fixed by the intervention of device-specific software written by the user. In terms of overhead, system software intervention is heavier than user-specified software intervention. The number of clock cycles required to repair a fatal error might be significantly higher than that needed for a non-fatal error repair on the same host and target device. Depending on the performance of the host computer and target device, the number of clock cycles for a fatal error repair could be an order of magnitude higher than that needed to fix a non-fatal error.

Non-correctible fatal errors impact the integrity of the PCIe hardware established between the host (in our design, the host computer) and the target device (in our design, the Zynq Ultrascale FPGA). The PCIe data transmission link becomes unstable, and data is lost in such an event. That is why the system software needs to intervene to handle this type of error. The system software handles and repairs non-correctible fatal errors by restarting both the target device and the PCIe link. Malformed TLP Error [16], Link Training Error [11], DLL Protocol Error [19], Receiver Overflow [17], and Flow Control
Protocol Error [14] are examples of non-correctible fatal errors. In this paper, we will present how this type of error is handled by our proposed IP design implemented on the target device, i.e., the Zynq Ultrascale FPGA, without the need to restart the PCIe link and the target component. In our proposed IP design, the PCIe link, as well as the non-correctible fatal error, is recovered on-the-fly thanks to the Partial Reconfiguration (PR) feature added to the IP. Details of this procedure will be presented in the following sections.

Non-correctible non-fatal errors do not impact the integrity of the PCIe hardware established between the host (in our design, the host computer) and the target device (in our design, the Zynq Ultrascale FPGA). However, data is lost due to the occurrence of these errors. Non-correctible non-fatal errors corrupt the content of the data packet, which cannot be repaired by the PCIe link at the hardware level. It is noted that, despite the occurrence of non-correctible non-fatal errors, the PCIe link remains reliable and fully functional and continues to operate properly. In this case, subsequent data transactions will be successful and safe, and only part of the data in the preceding packet will be affected. Recovery from non-correctible non-fatal errors is handled by the intervention of a user-specified software algorithm which initiates a new transaction by the requester. Corrupted received TL packet [12], Unsupported Request (UR) [14, 19], Completion Timeout (CTO) [15], Completer Abort (CA) [13, 19], and Unexpected Completion (UC) [17-19] are examples of non-correctible non-fatal errors. In this paper, we will discuss how this type of error is handled by our proposed IP design implemented on the target device, i.e., the Zynq Ultrascale FPGA, which includes a specific user-defined software algorithm. To conclude the PCIe error classification in this section, Table I summarizes the different types of errors associated with the PCIe bus, their severity, examples of each error type, and the corresponding PCIe layer in which the error occurs [11, 19].

TABLE I. ERROR TYPES ASSOCIATED TO PCIE BUS WITH EXAMPLES AND POINT OF OCCURRENCE

Type of error               Severity   Example                        PCIe layer in which the error occurs
Correctible                 Low        RX error (host & target)       PL
Correctible                 Low        Bad TL packet                  DL
Correctible                 Low        Bad DLL packet                 DL
Correctible                 Low        Time-out                       DL
Non-correctible non-Fatal   Medium     Corrupted RX TL packet         TL
Non-correctible non-Fatal   Medium     ECRC failure                   TL
Non-correctible non-Fatal   Medium     Unsupported request            TL
Non-correctible non-Fatal   Medium     Completion time-out            TL
Non-correctible non-Fatal   Medium     Completer Abort                TL
Non-correctible non-Fatal   Medium     Unexpected Completion          TL
Non-correctible Fatal       High       Training error                 PL
Non-correctible Fatal       High       DLL protocol error             DL
Non-correctible Fatal       High       RX overflow                    TL
Non-correctible Fatal       High       Flow control protocol error    TL
Non-correctible Fatal       High       Corrupted TL packet            TL

III. PROPOSED IP DESIGN FOR PCIE ERROR DETECTION AND CORRECTION

This section presents a novel IP design for detection and correction of errors occurring in a PCIe bus which links a source point to a destination point. The source point in this paper is a host computer running Linux Ubuntu 20.04 with a x8 Gen 3 PCIe slot. The destination point, also called the target device, is a Xilinx Zynq Ultrascale SoC FPGA evaluation board (KCU105) which benefits from a x8 Gen 3 PCIe core and associated interfacing pinouts. Fig. 2 shows the host computer (top) and targeted FPGA board (bottom) used in this paper with their respective PCIe slot and pinouts. The targeted FPGA board includes a bitstream of a design implemented by Xilinx Vivado 2019.2.

Fig. 3 shows the main block design of the proposed method for on-the-fly error detection and correction for the PCIe bus linking the host computer to the targeted KCU105 FPGA board. The main
block design is comprised of the following IP cores:

- PCIe core: responsible for establishing the PCIe link between the source point (host computer) and the destination (KCU105 FPGA board);
- Software Processor: comprised of a Xilinx MicroBlaze soft processor and corresponding memory modules for data transactions from/to the processor and its inbound and outbound modules (Fig. 3);
- SEM Fault Injection IP: comprised of a user-defined fault injection tool including the SEM IP core from Xilinx Inc. as well as the interrupt controller module, which triggers the interrupt port of the Software Processor to start detecting discrepancies in data packets and then command the partial reconfiguration IP to start system reconfiguration on-the-fly. It is noted that this IP module is optional and is inserted in the design only for error injection and testing of the error detection mechanism. Indeed, PCIe errors could also be generated through the channel itself.
- ZYNQ Processor: the main processing system hardcoded inside the Kintex Ultrascale FPGA, responsible for processing of the whole system and control of its slave modules. The Zynq Processor is implemented on the PS side of the FPGA fabric and includes multi-core ARM Cortex application and real-time microcontroller cores.
- Partial Reconfiguration IP: comprised of a partial reconfiguration controller, a DDR memory module, and a user-defined partial reconfiguration memory controller. This module, as its name indicates, performs partial reconfiguration on the corrupted data packets on-the-fly while both host and destination devices continue to operate without system interruption.

For the sake of space, Fig. 4 shows only an optimized demonstration of the IPs used in this design.

Fig. 2. Host computer (top) and targeted FPGA board (bottom), i.e., Xilinx Kintex Ultrascale KCU105, used in this paper with their respective PCIe slot and pinouts.

Fig. 3. Block design of the proposed technique implemented on the targeted KCU105 FPGA board.

Fig. 4. Optimized demonstration of IP blocks defined for the design.
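
The interplay of the IP cores listed in this section can be illustrated with a small behavioural model in Python. This is only a sketch under stated assumptions, not the RTL or firmware used in the paper: the class names, the idea of keeping a golden reference copy next to the live buffer in the DDR model, and the single-bit fault are all hypothetical stand-ins for the block design in Fig. 3.

```python
from typing import Callable, List


class DdrModel:
    """Assumed model of the DDR module: a live buffer plus a golden reference copy."""

    def __init__(self, words: List[int]) -> None:
        self.golden = list(words)
        self.live = list(words)


class PartialReconfigurationIp:
    """Stand-in for the PR IP: restores only the words that were corrupted."""

    def __init__(self, ddr: DdrModel) -> None:
        self.ddr = ddr

    def recover(self, bad_indices: List[int]) -> None:
        for i in bad_indices:
            self.ddr.live[i] = self.ddr.golden[i]


class SoftwareProcessor:
    """Stand-in for the MicroBlaze: on interrupt, find discrepancies and command the PR IP."""

    def __init__(self, ddr: DdrModel, pr_ip: PartialReconfigurationIp) -> None:
        self.ddr = ddr
        self.pr_ip = pr_ip
        self.log: List[str] = []

    def on_interrupt(self) -> None:
        pairs = zip(self.ddr.golden, self.ddr.live)
        bad = [i for i, (good, seen) in enumerate(pairs) if good != seen]
        self.log.append(f"detected {len(bad)} corrupted word(s)")
        if bad:
            self.pr_ip.recover(bad)
            self.log.append("recovery done")


class SemFaultInjectionIp:
    """Stand-in for the optional fault injector: flips one bit, then raises the interrupt."""

    def __init__(self, ddr: DdrModel, interrupt: Callable[[], None]) -> None:
        self.ddr = ddr
        self.interrupt = interrupt

    def inject(self, index: int, bit: int) -> None:
        self.ddr.live[index] ^= 1 << bit
        self.interrupt()


# Wire the cores together and run one inject/detect/recover round.
ddr = DdrModel([0x11111111, 0x22222222, 0x33333333])
pr_ip = PartialReconfigurationIp(ddr)
cpu = SoftwareProcessor(ddr, pr_ip)
sem = SemFaultInjectionIp(ddr, cpu.on_interrupt)

sem.inject(index=1, bit=7)
print(cpu.log)
print(ddr.live == ddr.golden)
```

Passing the processor's interrupt handler to the injector as a callback mirrors the interrupt wire between the two cores; the host and the PCIe core are deliberately left out, since in the described design they keep running while recovery takes place.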
Fig. 5. Optimized demonstration of IP blocks defined for the design.

IV. EXPERIMENTS AND RESULTS

The design in Fig. 3 was implemented on the Xilinx KCU105 development board connected to the PCIe x8 slot of the host computer (Fig. 2). After restarting the host computer, the KCU105 board is recognized by the host computer via the PCIe bus, as shown in Fig. 5. An Integrated Logic Analyzer (ILA) core was used to capture the corresponding PCIe bus signals for error detection. The SEM IP core, which is an optional IP as described in Section III, is only used for fault injection and quick testing of the fault detection scheme. The established PCIe bus between the host computer and the targeted device does not need this IP. PCIe-related errors are usually generated during data transactions in different layers of the PCIe bus, as described in Table I. The extended version of this paper will extensively discuss the root causes of errors and how the proposed scheme detects and corrects them on-the-fly.

As shown in Fig. 3, the input system clock (sys_diff_clock) is an onboard 100 MHz clock fed to the partial reconfiguration IP, which is also used by the ILA to capture data signals at desired checkpoints. The rest of the system is clocked with a 50 MHz clock (FCLK_CLK0) driven by the ZYNQ processing system. Clocking the entire system with only the single 100 MHz clock has two drawbacks: first, the distribution of the onboard 100 MHz clock signal created clock jitter for some modules, and, second, clocking the rest of the system with a higher frequency does not add any value to the overall performance of the system. The FCLK_CLK0 clock signal is a pure and jitter-free clock. The system reset is also provided through an onboard reset pin (reset).

Design operating mechanism: upon running the design, the fault injection tool (SEM) starts injecting faults (optional module). At the same time, the ILA continuously captures data transmitted to the PCIe bus, and snapshots of the data are stored into the DDR3 memory located inside the partial reconfiguration module (see Fig. 4). The stored data and the status of the RX and TX signals from/to the PCIe are monitored by the ZYNQ processor. Upon detection of a difference between RX and TX data packets, the error detection flag is raised. This flag sends an interrupt signal to the Software Processor module, i.e., the MicroBlaze (pin interrupt from SEM_Fault_Injection_IP to Software_Processor IP in Fig. 3). The Software Processor module then initializes the Partial Reconfiguration Controller core (Fig. 4) to start correcting the affected partial data packet. The recovered data is captured again in the memory and sent to the PCIe bus.

Fig. 6 shows a partial result for error detection in the PCIe bus and how the partial reconfiguration module (PR_recovery signal) corrects the erroneous data due to the injected fault. It is noted that the transmitted data (TX_DATA) always remains identical to the received data (RX_DATA), and the transmitting side is never informed about the error that occurred, because the error was masked and fixed on-the-fly. This type of error was a transaction layer (TL) error. For the sake of space, we could not cover detection of all other PCIe-related errors (Table I), nor present the error flag circuitry; each PCIe error flag has user-defined combinational circuitry. More details will be presented at the time of the conference.

Fig. 6. Partial result of error detection and correction for the PCIe bus. This is a TL-related error.

V. CONCLUSION

This paper presented a novel IP design for a resilient PCIe bus linking a host computer to a targeted Kintex Ultrascale FPGA device. The proposed IP design is capable of detecting and correcting different types of PCIe errors on-the-fly. Future work includes addressing all types of errors extensively for the PCIe bus as well as scheduling a fault injection experiment at the TRIUMF laboratory. These experiments are expected to be presented in a potential journal paper.
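
As a compact recap of the classification in Table I, the rows can be captured in a lookup structure. The Python sketch below is illustrative only: the dictionary keys, the class names, and the handler_for helper are hypothetical, and the handler strings follow the conventional handling described in Section II.B rather than the proposed IP, which aims to avoid the restart for fatal errors.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Severity(Enum):
    LOW = "correctible"
    MEDIUM = "non-correctible non-fatal"
    HIGH = "non-correctible fatal"


@dataclass(frozen=True)
class ErrorInfo:
    severity: Severity
    layer: str  # "TL", "DL" or "PL", as in Table I


# Rows transcribed from Table I; the keys are hypothetical identifiers.
TABLE_I: Dict[str, ErrorInfo] = {
    "rx_error": ErrorInfo(Severity.LOW, "PL"),
    "bad_tl_packet": ErrorInfo(Severity.LOW, "DL"),
    "bad_dll_packet": ErrorInfo(Severity.LOW, "DL"),
    "timeout": ErrorInfo(Severity.LOW, "DL"),
    "corrupted_rx_tl_packet": ErrorInfo(Severity.MEDIUM, "TL"),
    "ecrc_failure": ErrorInfo(Severity.MEDIUM, "TL"),
    "unsupported_request": ErrorInfo(Severity.MEDIUM, "TL"),
    "completion_timeout": ErrorInfo(Severity.MEDIUM, "TL"),
    "completer_abort": ErrorInfo(Severity.MEDIUM, "TL"),
    "unexpected_completion": ErrorInfo(Severity.MEDIUM, "TL"),
    "training_error": ErrorInfo(Severity.HIGH, "PL"),
    "dll_protocol_error": ErrorInfo(Severity.HIGH, "DL"),
    "rx_overflow": ErrorInfo(Severity.HIGH, "TL"),
    "flow_control_protocol_error": ErrorInfo(Severity.HIGH, "TL"),
    "corrupted_tl_packet": ErrorInfo(Severity.HIGH, "TL"),
}

# Conventional handler per severity class, paraphrasing Section II.B.
CONVENTIONAL_HANDLER: Dict[Severity, str] = {
    Severity.LOW: "PCIe hardware",
    Severity.MEDIUM: "device-specific software",
    Severity.HIGH: "system software (restart of target device and link)",
}


def handler_for(error_name: str) -> str:
    """Look up which agent conventionally repairs the named error."""
    return CONVENTIONAL_HANDLER[TABLE_I[error_name].severity]


for name in ("bad_dll_packet", "ecrc_failure", "training_error"):
    info = TABLE_I[name]
    print(f"{name}: layer {info.layer}, {info.severity.value}, handled by {handler_for(name)}")
```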
REFERENCES

[1] Bielich, Luis. "In-System Eye Scan of a PCI Express Link with Vivado IP Integrator and AXI4." Application Note XAPP 1198 (v1.1). Xilinx Inc., 2014.
[2] Bielich, Luis. "7 Series In-System Eye Scan of a PCI Express Link with Vivado IP Integrator and AXI4." (2013).
[3] Reljin, Miloš, Nebojša U. Pjevalica, and Miloš Subotić. "Design and Verification of FPGA High Speed PCIe Real-Time Data Acquisition System."
[4] Lawley, Jason. "PCI Express for UltraScale Architecture-Based Devices." Technical report. Xilinx Inc., 2015.
[5] Luo, Yawen, and Yuhua Chen. "FPGA-Based Acceleration on Additive Manufacturing Defects Inspection." Sensors 21.6 (2021): 2123.
[6] Park, Chan-Ho, et al. "PCIe Bridge Hardware for Gen-Z Memory System." 2021 International Conference on Electronics, Information, and Communication (ICEIC). IEEE.
[7] Kuga, Yohei, et al. "NetTLP: A Development Platform for PCIe devices in Software Interacting with Hardware." 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 2020.
[8] Meng, Entong, and Xiangyuan Bu. "Design and Implementation of High-Speed Transmission Link Based on PCI-E." 2020 Information Communication Technologies Conference (ICTC). IEEE, 2020.
[9] Khelif, Mohamed Amine, et al. "Toward a hardware man-in-the-middle attack on PCIe bus." Microprocessors and Microsystems 77 (2020): 103198.
[10] Ruytenberg, Björn. "Breaking Thunderbolt Protocol Security: Vulnerability Report." (2020).
[11] Tian, Shanquan, et al. "Fingerprinting cloud FPGA infrastructures." Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2020.
[12] Zhu, Jianfeng, et al. "Jintide: Utilizing Low-Cost Reconfigurable External Monitors to Substantially Enhance Hardware Security of Large-Scale CPU Clusters." IEEE Journal of Solid-State Circuits (2021).
[13] Tsounis, Ioannis, et al. "Analyzing the Resilience to SEUs of an Image Data Compression Core in a COTS SRAM FPGA." 2019 NASA/ESA Conference on Adaptive Hardware and Systems (AHS). IEEE, 2019.
[14] Rota, L., et al. "A new DMA PCIe architecture for Gigabyte data transmission." 2014 19th IEEE-NPSS Real Time Conference. IEEE, 2014.
[15] Martinasso, Maxime, et al. "A PCIe congestion-aware performance model for densely populated accelerator servers." SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2016.
[16] Neugebauer, Rolf, et al. "Understanding PCIe performance for end host networking." Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 2018.
[17] Anderson, J., et al. "FELIX: a PCIe based high-throughput approach for interfacing front-end and trigger electronics in the ATLAS Upgrade framework." Journal of Instrumentation 11.12 (2016): C12023.
[18] Tu, William Cheng-Chun, and Tzi-cker Chiueh. "Seamless fail-over for PCIe switched networks." Proceedings of the 11th ACM International Systems and Storage Conference. 2018.
[19] U. P. Singh, "PCIe error logging and handling on a typical SoC," in Truechip Solution, [Online] Available: https://www.design-reuse.com/articles/38374/pcie-error-logging-and-handling-on-a-typical-soc.html. Accessed: April 14, 2021.
