Fault-Resilient PCIe Bus With Real-Time Error Detection and Correction
- Unsupported requests for data transaction,
- Data corruption, i.e., affected packets,
- Host system abort,
- Unexpected transaction completion, i.e., the slave handshakes for a data-receive completion while the transmitter is still transferring the remaining data of the same packet,
- Receiver slave device overflow, i.e., the receiver stack fills before the transaction sequence ends and the slave does not handshake for the transaction completion.

2. Data Link Layer Errors

The Data Link layer (DLL) is the middle layer, responsible for packet error and response handling. The DLL checks for the following errors in the requester, the switch link, and the completer:

- LCRC failure for TL packets
- Sequence number check for TL packets
- LCRC failure for DLL packets
- Time-outs
- DLL protocol error

3. Physical Layer Errors

The Physical layer (PL) is the third layer, responsible for link training and transaction handling at the interface level. The PL checks for the following errors in the requester, the switch link, and the completer:

- Receiver errors, i.e., the receiver reports the receipt of an incomplete or corrupted packet due to an error,
- Link errors, i.e., the receiver reports a corrupted received packet due to a broken or affected link between the transmitter and receiver, or even between the layer links.

B. Severity of PCIe Errors

Depending on their severity and on how they affect the data transaction between a source point and a destination point, PCIe errors can be classified as follows [11-17]:

1. Correctable Errors

Correctable errors are those that affect the performance of the data transaction between a source point and a destination point, for example through reduced bandwidth or increased transmission latency. They do not affect the data packet, and no data is lost. The PCIe hardware remains reliable and functional when correctable errors occur. Correctable errors are handled and fixed by the hardware, with no software intervention required. A bad DLL packet is one example of a correctable error; it is handled by the DL layer itself.

2. Non-correctable Errors

Non-correctable errors are classified as either “fatal” or “non-fatal”. When a fatal error occurs, system software intervention is required to handle the error and rewrite the packet content fully or partially to repair the data. A non-fatal error, by contrast, can be fixed by device-specific software written by the user. In terms of overhead, system software intervention is heavier than user-specified software intervention. The number of clock cycles required to repair a fatal error might be significantly higher than that needed to repair a non-fatal error on the same host and target device. Depending on the performance of the host computer and the target device, repairing a fatal error could take an order of magnitude more clock cycles than fixing a non-fatal error.

Non-correctable fatal errors affect the integrity of the PCIe hardware established between the host (in our design, the host computer) and the target device (in our design, the Zynq Ultrascale FPGA). The PCIe data transmission link becomes unstable, and data is lost in such an event. This is why the system software needs to intervene to handle this type of error. The system software handles and repairs non-correctable fatal errors by restarting both the target device and the PCIe link. Malformed TLP Error [16], Link Training Error [11], DLL Protocol Error [19], Receiver Overflow [17], and Flow Control
IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems 4
Fig. 3. Block design of the proposed technique implemented on the targeted CU105 FPGA board.
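
As an illustration of the severity classes described under "B. Severity of PCIe Errors", the following minimal Python sketch maps an error to its severity and to the party that repairs it. It is not the authors' implementation: the names `Severity`, `HANDLER`, `ERROR_SEVERITY`, and `plan_recovery` are hypothetical, and the table lists only errors named in the text, with the fatal entries assumed to follow the fatal-error discussion.

```python
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "correctable"
    NON_FATAL = "non-fatal"
    FATAL = "fatal"


# Who repairs each severity class, as described in the text.
HANDLER = {
    Severity.CORRECTABLE: "hardware",
    Severity.NON_FATAL: "device-specific software",
    Severity.FATAL: "system software",
}

# Illustrative table (assumption): only errors named in the text are listed,
# and the identifiers themselves are hypothetical.
ERROR_SEVERITY = {
    "bad_dll_packet": Severity.CORRECTABLE,
    "malformed_tlp": Severity.FATAL,
    "link_training_error": Severity.FATAL,
    "dll_protocol_error": Severity.FATAL,
    "receiver_overflow": Severity.FATAL,
}


def plan_recovery(error_name):
    """Return the severity, the handler, and the required recovery actions."""
    severity = ERROR_SEVERITY[error_name]
    return {
        "severity": severity,
        "handler": HANDLER[severity],
        # Correctable errors are fixed by hardware without software help.
        "software_needed": severity is not Severity.CORRECTABLE,
        # Fatal errors are repaired by restarting the target device and link.
        "restart_link_and_device": severity is Severity.FATAL,
    }
```

For example, `plan_recovery("bad_dll_packet")` selects the hardware handler with no software intervention, whereas `plan_recovery("malformed_tlp")` selects system software and requests a restart of the target device and the PCIe link.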