
Evaluation of Xilinx Deep Learning Processing Unit under Neutron Irradiation

Dimitris Agiakatsikas, Nikos Foutris, Aitzan Sari, Vasileios Vlagkoulis, Ioanna Souvatzoglou, Mihalis Psarakis, Mikel Luján, Maria Kastriotou and Carlo Cazzaniga

arXiv:2206.01981v1 [physics.ins-det] 4 Jun 2022
Abstract—This paper studies the dependability of the Xilinx Deep Learning Processing Unit (DPU) under neutron irradiation. It analyses the impact of Single Event Effects (SEEs) on the accuracy of the DPU running the resnet50 model on a Xilinx Ultrascale+ MPSoC.

Index Terms—DPU, Data-center, Radiation-test, Neutrons, Reliability, DNN, AI

I. INTRODUCTION

FIELD Programmable Gate Arrays (FPGAs) have evolved from glue-logic devices to sophisticated heterogeneous computing platforms that pave the way for the new Artificial Intelligence (AI) wave. As a result, datacenter market leaders increasingly integrate FPGA Systems-on-Chip (SoCs) into their infrastructure to target complex AI applications. Companies like Microsoft, Amazon Web Services, and Baidu scale up AI and high-performance applications on hundreds of thousands of Intel and Xilinx FPGA devices [1].

A popular application of AI is Convolutional Neural Networks (CNNs). A growing body of research shows that deploying optimised CNN models on FPGAs achieves higher performance per watt than CPU and GPU solutions [2]. However, modern FPGAs are vulnerable to Single Event Upsets (SEUs) due to their reliance on SRAM memory to store their configuration and application data. SEUs in SRAM FPGA-SoCs are not destructive, but they cause various failure modes, such as Silent Data Corruption (SDC), application crashes, and kernel panics when an OS is used.

Previous works have explored the Architectural Vulnerability Factor (AVF) of custom FPGA CNN designs with fault-injection campaigns and irradiation experiments [3], [4]. These works have targeted relatively simple CNN case studies for edge computing that neither require an Operating System (OS) nor have complex CNN topologies. Simple case studies serve well when focusing on analysing the reliability tradeoffs of various CNN configurations, e.g., the reliability of a CNN under different quantisation and model-compression schemes. Exploring the reliability of large-scale datacenter CNN applications, however, requires a different testing paradigm. The case studies should test additional aspects of a system, such as the operating system's stability while SEUs are occurring in resources of the CPU (e.g., the L1 and L2 caches), in the on-chip Ethernet controller, and in custom logic implemented in Programmable Logic (PL).

To this extent, this work analyses the dependability of the whole computational stack of a commercial CNN inference solution, namely the Xilinx Vitis AI Deep Learning Processing Unit (DPU). In more detail, we implemented the DPU on a Zynq Ultrascale+ XCZU9EG MPSoC, which executed image classification with the resnet50 CNN model. By performing neutron irradiation experiments, we observed that the FPGA-SoC OS did not crash due to errors in the L1 and L2 caches of the CPU, while the DPU application had a very low AVF.

Experiments at the ISIS Neutron and Muon Source were supported by a beamtime allocation RB2000230 from the Science and Technology Facilities Council. This work has been partially supported by the University of Piraeus Research Center and the EU Horizon 2020 EuroEXA 754337 grant.

Dimitris Agiakatsikas (e-mail: agiakatsikas@gmail.com), Aitzan Sari, Vasileios Vlagkoulis, Ioanna Souvatzoglou, and Mihalis Psarakis (e-mail: mpsarak@unipi.gr) are with the Dept. of Informatics, University of Piraeus, Greece. Nikos Foutris and Mikel Luján are with the Dept. of Computer Science, The University of Manchester, UK. Maria Kastriotou and Carlo Cazzaniga are with the ISIS Facility, STFC, Rutherford Appleton Laboratory, Didcot OX11 0QX, UK.

©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This paper has been accepted by the 2021 European Conference on Radiation and Its Effects on Components and Systems (RADECS).

II. BACKGROUND

A. Vitis AI and the Deep Learning Processing Unit (DPU)

Xilinx has introduced a rich ecosystem of tools and Intellectual Property (IP) cores to ease the development of AI applications. In more detail, Xilinx provides the Vitis AI development environment, which encompasses 1) AI frameworks (e.g., TensorFlow), 2) pre-optimised AI models, 3) quantisation and model-compression tools, and 4) the DPU, with all necessary Linux drivers to seamlessly deploy a CNN application on Xilinx devices.

The DPU is a convolutional neural network accelerator IP offered by Xilinx for Zynq-7000 SoC and Zynq Ultrascale+ MPSoC devices. The DPU is implemented in PL and is tightly interconnected via the AXI bus to the SoC processing system (PS), as shown in Fig. 1. The DPU executes special instructions that are generated by the Vitis AI compiler. A typical Vitis AI development flow involves 1) the optimisation and compilation of a CNN model into DPU instructions and 2) the compilation of the software running on the Application Processing Unit (APU), i.e., the CPU.

The APU pre- and post-processes DNN data, controls the DPU, and orchestrates the movement of instructions and data between the DPU, the CPU, and the off-chip DDR memory.
Fig. 1. Deep-learning acceleration with the Xilinx Deep Learning Processing Unit (DPU) on Zynq-7000 SoC and Zynq UltraScale+ MPSoC devices. (Block diagram: the Processing System (PS) with the APU cores and DDR memory controller, the off-chip DDR memory, the AXI interconnect bus, and the DPU in the Programmable Logic (PL) with its instruction scheduler, on-chip BRAM buffers, and computing engines.)

The DPU consists of an instruction scheduler and up to three on-chip BRAM buffers and computing engines. The instruction scheduler fetches and decodes DPU instructions from off-chip memory and controls the on-chip memories and computing engines. The DPU is available in eight configurations, i.e., B512, B800, B1024, B1152, B1600, B2304, B3136, and B4096. Each configuration utilises a different number of computing engines and on-chip memories in order to target different-sized devices and support various DPU functionalities, e.g., ReLU, ReLU6, or Leaky ReLU.

III. EXPERIMENTAL SETUP

A. ChipIr Neutron Beam

ChipIr is an instrument of the ISIS Neutron and Muon Source at the Rutherford Appleton Laboratory (UK), designed to deliver an atmospheric-like fast neutron spectrum for testing radiation effects on electronic components and devices [5], [6]. The ISIS accelerator provides a proton beam of 800 MeV, 40 µA, 10 Hz, impinging on the tungsten target of its Target Station 2, where ChipIr is located. The spallation neutrons produced illuminate a secondary scatterer, which optimises the atmospheric-like spectrum arriving at ChipIr, with an acceleration factor of up to 10^9 for ground-level applications. With a frequency of 10 Hz, the beam pulses consist of two 70 ns wide bunches separated by 360 ns. The beam fluence at the position of the Device Under Test (DUT) was continuously monitored by a silicon diode, while the beam flux of neutrons above 10 MeV during the experimental campaign was 5.6 x 10^6 neutrons/cm^2/s. The beam size was set through the two sets of ChipIr jaws to 7 cm x 7 cm.

Fig. 2. Neutron beam experiment at the ChipIr facility of RAL, UK.

B. Design Under Test (DUT): DPU-B4096

The Vivado DPU targeted reference design (TRD) [7] provided by Vitis AI v1.3.1 was implemented with Vivado 2020.2 for a ZCU102 development board. The ZCU102 board features the Zynq UltraScale+ XCZU9EG MPSoC. The B4096 configuration of the DPU was synthesised with default settings, i.e., with RAM_USAGE_LOW, CHANNEL_AUGMENTATION_ENABLE, DWCV_ENABLE, POOL_AVG_ENABLE, RELU_LEAKYRELU_RELU6, and Softmax.

The design was implemented with Vivado's Performance_ExplorePostRoutePhysOpt run strategy because Vivado's default run strategy resulted in timing violations. Table I shows the resource utilisation and operating frequency of the DPU TRD. Due to the high resource utilisation of the TRD, Vivado reported a relatively high percentage (41.45%) of essential bits: 59,281,993 out of the 143,015,456 total configuration bits were essential. Recall that essential bits are configuration bits that, when corrupted, have the potential to cause functional errors. Two important notes can be made about Table I.

TABLE I
RESOURCE UTILISATION AND OPERATING FREQUENCY OF THE DPU TARGETED REFERENCE DESIGN

Resource    Utilisation   Available   Utilisation %   Frequency
LUT         108,208       274,080     39.48           325 MHz
LUTRAM       11,960       144,000      8.31           325 MHz
FF          203,901       548,160     37.20           325 MHz
BRAM            522           912     57.24           325 MHz
DSP           1,395         2,520     55.36           650 MHz
IO                7           328      2.13           325 MHz
BUFG              6           404      1.49           325 MHz
MMCM              1             4     25.00           325 MHz
PLL               1             8     12.50           325 MHz
APU               1             1    100.00          1200 MHz
DDR ctrl.         1             1    100.00           533 MHz

1st note: All resources in the DPU operate at 325 MHz except for the DSPs, which run at 2 x 325 MHz = 650 MHz. This is because the DPU design applies a double-data-rate technique to the DSP resources. Since DSPs can operate at a much higher frequency than other PL resources, one can perform N times more computation by running the DSPs at N times the frequency of the surrounding logic while multiplexing and de-multiplexing their input and output data, respectively.
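To illustrate the scheme of the 1st note, the following toy Python model (purely illustrative, not the DPU's actual implementation) shows how a single multiply-accumulate unit clocked at twice the fabric frequency can serve two operand streams per slow-clock cycle:

def ddr_dsp_mac(stream_a, stream_b, weights_a, weights_b):
    """Toy model: one 2x-clocked MAC time-multiplexed over two streams.

    Each loop iteration corresponds to one slow (fabric) clock cycle; the
    two accumulations inside it correspond to the two fast (DSP) clock ticks.
    """
    acc_a = acc_b = 0
    for xa, xb, wa, wb in zip(stream_a, stream_b, weights_a, weights_b):
        acc_a += xa * wa  # fast tick 1: operands multiplexed from stream A
        acc_b += xb * wb  # fast tick 2: operands multiplexed from stream B
    return acc_a, acc_b  # results de-multiplexed back to the slow clock domain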
2nd note: The design utilises 319 LUT, 55 LUTRAM, 405 Flip-Flop (FF), 4 BRAM, and 1 DSP resources more than the baseline DPU-TRD design. This is because we included the Xilinx Soft Error Mitigation (SEM) controller in the design to perform fault injection and validate our experimental setup prior to the radiation tests. The clock of the SEM controller was gated off during beamtime to replicate a simple out-of-the-box implementation scenario.

We used Petalinux 2020.2 to generate a Linux OS image for the ZCU102 using the default Board Support Package (BSP) provided by the DPU-TRD, except that 1) the nfs_utils package was additionally added to the RootFS to mount a Network File System (NFS) folder on Linux, and 2) the u-boot bootloader mounted an external SD EXT4 file system instead of an INITRD RAM disk.
The CNN application that ran on the DPU was the resnet50.xmodel, which was also provided by the Vitis AI DPU-TRD. This resnet50 model is neither compressed nor quantised, but it serves well as a baseline application for comparison with the more optimised models that we aim to implement and test in future work.

C. Test procedure

Fig. 3 shows the test setup of the radiation experiment. A laptop in ChipIr's control room orchestrated the test procedure of the DUT. The ZCU102 development board (hosting the DUT), an Ethernet-controlled Power Supply Unit (PSU), and a USB device that remotely reset the ZCU102 (i.e., by electrically shorting the SRTS_B and POR_B buttons of the board) were located in the beam room.

Fig. 3. Test setup at the ChipIr facility of RAL, UK. (The laptop in the control room connects through an Ethernet switch and a USB extender to the ZCU102 board, the 12VDC PSU, and the reset device in the beam room.)

The test of the DUT took place as follows: 1) an Experiment Control Software (ECS) running on the laptop remotely resets the DUT and waits for the DUT to boot, 2) the DUT Linux OS restarts, and 3) after a successful kernel boot, an /etc/init.d/startup.sh script executes the following sub-tasks: 4a) the DUT connects to an NFS folder located on the laptop, 4b) the DUT writes a sync.log file in the shared NFS folder to notify the ECS of a successful boot, 4c) an initial resnet50 classification takes place to warm up the CPU caches, 4d) the sync.log is updated to notify the ECS that the DUT is ready to start image classifications, and 4e) the /etc/init.d/startup.sh script enters an infinite loop where it continuously runs DPU classifications and stores the results in the NFS folder to be checked by the ECS. The result checking (i.e., by the ECS) of each classification iteration is synchronised with the DUT via a mutex stored in the shared sync.log file. The ECS remotely resets the DUT when it detects a boot timeout, a Critical Error (see Sec. IV), or a result timeout. It is worth noting that for each classification iteration, the DUT saves the classification result and the Linux dmesg.log for post-analysis, as shown in the sketch below.
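The following Python sketch mirrors the ECS control loop described above. It is a hypothetical reconstruction rather than the authors' code: the helper callables (reset_dut, poll_sync_log, fetch_result, is_critical_error), the sync.log tokens, and the timeout values are all assumptions.

import time

BOOT_TIMEOUT_S = 120     # assumed values; the paper does not state them
RESULT_TIMEOUT_S = 30

def wait_for(poll_sync_log, token, timeout_s):
    """Poll the shared sync.log on the NFS folder until `token` appears."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if token in poll_sync_log():
            return True
        time.sleep(1.0)
    return False

def run_campaign(reset_dut, poll_sync_log, fetch_result, is_critical_error):
    while True:
        reset_dut()  # short the SRTS_B/POR_B buttons via the USB reset device
        if not wait_for(poll_sync_log, "booted", BOOT_TIMEOUT_S):
            continue                                  # boot timeout: reset the DUT
        wait_for(poll_sync_log, "ready", BOOT_TIMEOUT_S)   # cache warm-up done
        while True:
            if not wait_for(poll_sync_log, "result", RESULT_TIMEOUT_S):
                break                                 # result timeout: reset the DUT
            if is_critical_error(fetch_result()):
                break                                 # Critical Error: reset the DUT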
IV. EXPERIMENTAL RESULTS

In this section, we discuss the impact of neutron radiation effects on the reliability of the DPU accelerator. Given that the DPU comprises a heterogeneous architecture including the ARM SoC and the FPGA fabric, we first present the cross-sections of the memories of the PS part (i.e., the CPU caches) and then discuss how SEUs in the PL configuration memory affect the behaviour of the system. Please note that the neutron radiation experiments took place at ChipIr in May 2021.

Table II presents the cross-sections of the 32KB Level-1 Data (L1-D) Cache, the 32KB Level-1 Instruction (L1-I) Cache, and the translation lookaside buffer (TLB) – a two-level TLB with 512 entries that handles all translation table operations of the CPU. Table III presents the cross-sections of the 1MB Level-2 (L2) Cache and the Snoop Control Unit (SCU). The SCU has duplicate copies of the L1 data-cache tags. It connects the APU cores with the device's accelerator coherency port (ACP) to enable hardware accelerators to issue coherent accesses to the L1 memory space. The upsets in the data and tag arrays of both the L1 and L2 caches have been separately identified. The cross-sections of the tag arrays have been calculated based on the tag sizes of the caches, e.g., a 16-bit tag in the 16-way set-associative 1MB L2 cache. The cross-sections have been calculated for a total fluence of 5.5 x 10^10 neutrons/cm^2 over a more than 3-hour radiation experiment. The results show that the cross-sections of the tag arrays are slightly lower than those of the data arrays. Our cross-section calculations for all caches (i.e., L1 and L2) are very close to those reported in [4].

TABLE II
L1 CACHE CROSS SECTION

             Upsets   Cross Section   Conf. Level 95%
                      (cm^2/bit)      Lower      Upper
L1-D Data    32       2.20E-15        1.50E-15   3.11E-15
L1-D Tag      3       3.47E-16        7.16E-17   1.02E-15
L1-D Total   35       1.51E-15        1.05E-15   2.10E-15
L1-I Data    25       1.72E-15        1.11E-15   2.54E-15
L1-I Tag      4       4.89E-16        1.33E-16   1.25E-15
L1-I Total   29       1.28E-15        8.54E-16   1.83E-15
L1 TLB        9       9.90E-15        4.53E-15   1.88E-14

TABLE III
L2 CACHE CROSS SECTION

             Upsets   Cross Section   Conf. Level 95%
                      (cm^2/bit)      Lower      Upper
L2 Data      293      6.29E-16        5.59E-16   7.06E-16
L2 Tag        20      8.59E-17        5.25E-17   1.33E-16
L2 Total     313      4.48E-16        4.00E-16   5.01E-16
Snoop CU       4      4.63E-16        1.26E-16   1.19E-15
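The per-bit cross-sections in Tables II and III follow from sigma = upsets / (fluence x bits). The snippet below reproduces the L1-D data-array row, assuming the usual Poisson (chi-squared) 95% confidence bounds for radiation-test counts; small differences are due to the rounded fluence quoted in the text.

from scipy.stats import chi2

def cross_section(upsets, fluence, bits):
    """Per-bit cross-section (cm^2/bit) with 95% Poisson confidence bounds."""
    denom = fluence * bits
    lower = chi2.ppf(0.025, 2 * upsets) / 2 if upsets > 0 else 0.0
    upper = chi2.ppf(0.975, 2 * (upsets + 1)) / 2
    return upsets / denom, lower / denom, upper / denom

FLUENCE = 5.5e10            # neutrons/cm^2 (rounded value from the text)
L1D_BITS = 32 * 1024 * 8    # 32 KB data array

print(cross_section(32, FLUENCE, L1D_BITS))
# ~ (2.2e-15, 1.5e-15, 3.1e-15) cm^2/bit, matching the L1-D data row of Table II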
Fig. 4 presents the proportion of detected upsets during cache accesses per CPU core. As shown in the figure, the upsets in the L1 caches are balanced between the four cores, while in the L2 cache, more upsets were observed in Core 3. In future work, we aim to log the utilisation of all MPSoC cores and the running OS processes to better understand the unbalanced distribution of detected upsets per core in L2.

Fig. 4. Cache upsets per CPU core.

All MPSoC caches are protected against SEUs with Error Correction Code (ECC) mechanisms; e.g., the L1-D and L2 caches incorporate ECC with Single Error Correction Double Error Detection (SECDED) capability, while the L1-I cache has parity that supports only SED. All these SED and SECDED mechanisms in the L1 and L2 caches mitigated soft errors in the CPU of the FPGA-SoC, therefore resulting in stable OS execution, with no application crashes or kernel panics during the radiation tests. This was achieved either by having the ECC scheme correct the error or by flushing and reloading the cache during exceptions of the Linux EDAC driver (/sys/devices/system/edac/mc).

Next, we discuss the impact of radiation effects on the DPU classification accuracy. During the 3-hour experiment, the DPU performed 5896 classification runs. Only a very small portion of runs (0.78%) resulted in misclassification. Notice that the errors of the resnet50 classification results are categorised, similarly to [3], as a) Critical Errors, which lead to misclassification, and b) Tolerable Errors, where the errors observed in a result do not affect the classification decision. Table IV presents the number of Critical Errors and Tolerable Errors and their cross-sections.

TABLE IV
DPU CROSS SECTION

                Classification runs   Cross Section   Conf. Level 95%
                #        %            (cm^2)          Lower      Upper
Correct runs    2964     50.27%       -               -          -
Critical (C)      46      0.78%       8.29E-10        6.07E-10   1.11E-09
Tolerable (T)   2886     48.95%       5.20E-08        5.01E-08   5.39E-08
C+T errors      2932     49.73%       5.28E-08        5.09E-08   5.48E-08

Based on the SEU vulnerability of the configuration memory and the BRAM reported in [4] and the programmable resources used by the DPU (i.e., essential configuration bits and BRAMs), we calculate the rate of upsets affecting the DPU execution. These are 0.14 and 0.55 upsets per classification run in the configuration memory and the BRAM contents, respectively. This means that each classification run (which lasts 1.7 seconds) experienced, on average, 0.69 upsets. Notice that since scrubbing was not supported in the experiment, upsets in the configuration memory accumulated until the next reset cycle of the system. Moreover, we estimated that more than one upset accumulated during the 38-second boot and warm-up period of each reset cycle. Thus, for all classification runs, the DPU circuit encountered more than one upset in the PL memories. As a worst-case analysis, we estimate that the AVF of the DPU accelerator is less than 0.78%, assuming that any single upset in the DPU leads to a critical error.
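As a quick consistency check of these figures, the back-of-the-envelope sketch below uses only the rounded numbers quoted in the text and in Table IV:

# Back-of-the-envelope check of the upset-rate and worst-case AVF figures.
runs = 5896                 # classification runs in the 3-hour experiment
critical = 46               # runs with a Critical Error (misclassification)
cfg_upsets_per_run = 0.14   # configuration memory, estimated from [4]
bram_upsets_per_run = 0.55  # BRAM contents, estimated from [4]

print(cfg_upsets_per_run + bram_upsets_per_run)  # 0.69 upsets per 1.7 s run
print(100.0 * critical / runs)                   # 0.78 -> worst-case AVF < 0.78%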
V. CONCLUSION

The neutron radiation experiment demonstrated that the ECC and interleaving schemes integrated into the UltraScale+ MPSoC caches considerably protect the software stack (OS, pre- and post-processing of DNN data, data movement) of the Xilinx Vitis AI DPU from radiation-induced SEUs. It was shown that the most vulnerable part of the DPU is the logic implemented in the FPGA PL. Due to the large amount of PL resources utilised by the DPU, we reasoned that the CNN application was likely to be highly vulnerable to SEUs. However, due to the inherent error resiliency of the neural network, only a small portion of SEUs led to classification errors, resulting in a significantly small AVF. In future work, we aim to perform fault injection experiments on the DPU to obtain a better understanding of its failure mechanisms and to propose efficient SEE mitigation approaches to further reduce its AVF.

REFERENCES

[1] A. M. Keller and M. J. Wirthlin, "Impact of Soft Errors on Large-Scale FPGA Cloud Computing," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). New York, NY, USA: Association for Computing Machinery, 2019, pp. 272–281. [Online]. Available: https://doi.org/10.1145/3289602.3293911
[2] E. Wang, J. J. Davis, R. Zhao, H.-C. Ng, X. Niu, W. Luk, P. Y. K. Cheung, and G. A. Constantinides, "Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going," ACM Comput. Surv., vol. 52, no. 2, May 2019. [Online]. Available: https://doi.org/10.1145/3309551
[3] F. Libano, B. Wilson, J. Anderson, M. Wirthlin, C. Cazzaniga, C. D. Frost, and P. Rech, "Selective Hardening for Neural Networks in FPGAs," IEEE Transactions on Nuclear Science, vol. 66, no. 1, pp. 216–222, 2019. [Online]. Available: https://doi.org/10.1109/TNS.2018.2884460
[4] J. D. Anderson, J. C. Leavitt, and M. J. Wirthlin, "Neutron Radiation Beam Results for the Xilinx UltraScale+ MPSoC," in IEEE Radiation Effects Data Workshop (REDW), 2018, pp. 1–7. [Online]. Available: https://doi.org/10.1109/NSREC.2018.8584297
[5] C. Cazzaniga, M. Bagatin, S. Gerardin, A. Costantino, and C. D. Frost, "First Tests of a New Facility for Device-Level, Board-Level and System-Level Neutron Irradiation of Microelectronics," IEEE Transactions on Emerging Topics in Computing, vol. 9, no. 1, pp. 104–108, 2021. [Online]. Available: https://doi.org/10.1109/TETC.2018.2879027
[6] C. Cazzaniga, R. G. Alía, M. Kastriotou, M. Cecchetto, P. Fernandez-Martinez, and C. D. Frost, "Study of the Deposited Energy Spectra in Silicon by High-Energy Neutron and Mixed Fields," IEEE Transactions on Nuclear Science, vol. 67, no. 1, pp. 175–180, 2020. [Online]. Available: https://doi.org/10.1109/TNS.2019.2944657
[7] "Zynq UltraScale MPSoC DPU TRD V3.3 Vivado 2020.2," Xilinx Inc. [Online]. Available: https://github.com/Xilinx/Vitis-AI/blob/v1.3.1/dsa/DPU-TRD/prj/Vivado/README.md
