
VOLUME 40, NUMBER 1 JANUARY/FEBRUARY 2020

Hot Interconnects 26

www.computer.org/micro
EDITOR-IN-CHIEF
Lizy K. John, University of Texas, Austin

EDITORIAL BOARD
R. Iris Bahar, Brown University
Mauricio Breternitz, University of Lisbon
David Brooks, Harvard University
Bronis de Supinski, Lawrence Livermore National Lab
Shane Greenstein, Harvard Business School
Natalie Enright Jerger, University of Toronto
Hyesoon Kim, Georgia Institute of Technology
John Kim, Korea Advanced Institute of Science and Technology
Hsien-Hsin (Sean) Lee, Taiwan Semiconductor Manufacturing Company
Richard Mateosian
Tulika Mitra, National University of Singapore
Trevor Mudge, University of Michigan, Ann Arbor
Onur Mutlu, ETH Zurich
Vijaykrishnan Narayanan, The Pennsylvania State University
Per Stenstrom, Chalmers University of Technology
Richard H. Stern, George Washington University Law School
Sreenivas Subramoney, Intel Corporation
Carole-Jean Wu, Arizona State University
Lixin Zhang, Chinese Academy of Sciences

ADVISORY BOARD
David H. Albonesi, Erik R. Altman, Pradip Bose, Kemal Ebcioglu, Lieven Eeckhout, Michael Flynn, Ruby B. Lee, Yale Patt, James E. Smith, Marc Tremblay

Subscription change of address: address.change@ieee.org
Missing or damaged copies: help@computer.org

IEEE MICRO STAFF
Journals Production Manager: Joanna Gojlik, j.gojlik@ieee.org
Peer-Review Administrator: micro-ma@computer.org
Publications Portfolio Manager: Kimberly Sperka
Publisher: Robin Baldwin
Senior Advertising Coordinator: Debbie Sims
IEEE Computer Society Executive Director: Melissa Russell

IEEE PUBLISHING OPERATIONS
Senior Director, Publishing Operations: Dawn Melley
Director, Editorial Services: Kevin Lisankie
Director, Production Services: Peter M. Tuohy
Associate Director, Editorial Services: Jeffrey E. Cichocki
Associate Director, Information Conversion and Editorial Support: Neelam Khinvasara
Senior Art Director: Janet Dudar
Senior Manager, Journals Production: Patrick Kempf

CS MAGAZINE OPERATIONS COMMITTEE
Sumi Helal (Chair), Irena Bojanova, Jim X. Chen, Shu-Ching Chen, Gerardo Con Diaz, David Alan Grier, Lizy K. John, Marc Langheinrich, Torsten Möller, David Nicol, Ipek Ozkaya, George Pallis, VS Subrahmanian

CS PUBLICATIONS BOARD
Fabrizio Lombardi (VP for Publications), Alfredo Benso, Cristiana Bolchini, Javier Bruguera, Carl K. Chang, Fred Douglis, Sumi Helal, Shi-Min Hu, Sy-Yen Kuo, Avi Mendelson, Stefano Zanero, Daniel Zeng

COMPUTER SOCIETY OFFICE
IEEE Micro
c/o IEEE Computer Society
10662 Los Vaqueros Circle
Los Alamitos, CA 90720 USA
+1 (714) 821-8380

IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. ©2020 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
January/February 2020
Volume 40 Number 1

SPECIAL ISSUE: HOT INTERCONNECTS 26
Published by the IEEE Computer Society

Guest Editors' Introduction
6  Hot Interconnects 26
   Ryan E. Grant and Khaled Hamidouche

Theme Articles
8  Path2SL: Leveraging InfiniBand Resources to Reduce Head-of-Line Blocking in Fat Trees
   German Maglione-Mathey, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles, and José Duato
15 A Bunch-of-Wires (BoW) Interface for Interchiplet Communication
   Ramin Farjadrad, Mark Kuemerle, and Bapi Vinnakota
25 Toward FPGA-Based HPC: Advancing Interconnect Technologies
   Joshua Lant, Javier Navaridas, Mikel Luján, and John Goodacre
35 Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects
   Ammar Ahmad Awan, Arpan Jain, Ching-Hsiang Chu, Hari Subramoni, and Dhableswar K. Panda
44 High-Quality Fault Resiliency in Fat Trees
   John Gliksberg, Antoine Capra, Alexandre Louvet, Pedro Javier García, and Devan Sohier
50 A High-Throughput Network Processor Architecture for Latency-Critical Applications
   Sourav Roy, Arvind Kaushik, Rajkumar Agrawal, Joseph Gergen, Wim Rouwet, and John Arends

General Interest
57 Warp: A Hardware Platform for Efficient Multimodal Sensing With Adaptive Approximation
   Phillip Stanley-Marbell and Martin Rinard
67 ΔNN: Power-Efficient Neural Network Acceleration Using Differential Weights
   Hoda Mahdiani, Alireza Khadem, Azam Ghanbari, Mehdi Modarressi, Farima Fattahi-Bayat, and Masoud Daneshtalab
75 AutoML for Architecting Efficient and Specialized Neural Networks
   Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Kuan Wang, Tianzhe Wang, Ligeng Zhu, and Song Han
83 In-Hardware Moving Compute to Data Model to Accelerate Thread Synchronization on Large Multicores
   Masab Ahmad, Halit Dogan, José A. Joao, and Omer Khan

Columns
4  From the Editor-in-Chief: Connectivity! Connectivity! Connectivity! May You Be More Connected Than Ever!!
   Lizy Kurian John
94 Micro Economics: The Vital Two Percent
   Shane Greenstein

Image credit: Image licensed by Ingram Publishing
Column: From the Editor-in-Chief

Connectivity! Connectivity! Connectivity!


May You Be More Connected Than Ever!!
Lizy Kurian John
The University of Texas at Austin

BOB METCALFE, the internet pioneer, frequently speaks and writes about connectivity and the transistor-neuron paradox. The human brain contains approximately 100 billion neurons, interconnected with 10 000 or more synaptic connections per neuron. Transistors can switch more than 100 billion times per second, while neuron synapses tick along relatively slowly, at about a thousand times per second. Although the transistor can switch a million times faster than neurons, when billions of transistors are put together in a chip, the chip is nowhere near as intelligent as the human with an equal number of neurons. Metcalfe presented this comparison in a 2007 Forbes article and posited a theory attributing the superiority of the brain to the larger number of synapses or connections each neuron has to other neurons versus the small number of connections a transistor has to other transistors in the chip. Each transistor in a chip typically connects to three or four, or at the most ten, other transistors, compared to 10 000+ connections for a neuron. There are arguments against his theory, for instance that the many neuroglial cells in the brain are ignored in this comparison; nevertheless, it is intuitive to believe that connectivity adds to the intelligence of processing units.

Welcome to the first issue of IEEE Micro of the new year, a Special Issue on Hot Interconnects! This special issue presents a collection of articles on interconnect technologies circa 2019. Six papers are presented with the Hot Interconnects theme. Ryan E. Grant from Sandia National Labs and Khaled Hamidouche of Advanced Micro Devices served as the guest editors for this special issue. A preview of the six articles can be obtained from the comprehensive introductory message written by Grant and Hamidouche. Topics discussed include leveraging InfiniBand to reduce blocking in fat trees, workload characterization on clusters with high-performance interconnects, fault resiliency in fat trees, FPGA-based high-performance computing, a Bunch-of-Wires interface for inter-chiplet communication, and a high-throughput network processor architecture.

The six articles from the Hot Interconnects theme are accompanied by four general articles and a Micro Economics column by Shane Greenstein.

In the first general interest article, Stanley-Marbell and Rinard describe Warp, a hardware platform for multimodal sensing. This platform was designed to support research on approximate computing. Warp incorporates sensors, computation, and circuit-level facilities designed to enable approximate computing research. A wide range of precision and accuracy is supported in the platform.

In the next general interest article, Cai et al. describe an AutoML system for architecting efficient neural networks by autotuning. The authors investigate automatically creating specialized neural network models for different hardware platforms by automatically performing channel pruning, mixed-precision quantizations, etc. Such autotuning significantly shortens the design cycles and reduces the human effort needed to optimize hardware implementations.

In the third general interest article, Ahmad et al. describe a model to accelerate thread synchronization. The article creates a synchronization-centric analysis of parallel programs and shows that the model the authors present is most effective when synchronization overheads are high.

In the last general interest article, Mahdiani et al. present power-efficient neural network acceleration using differential weights. Leveraging approximate value locality of neuron weights and the algorithmic structure of neural networks, a power-efficient architecture is created.

In "The Vital Two Percent," the Micro Economics column article for this issue, Greenstein analyzes the research investment by the U.S. government and private industry. He notes that for decades, the fraction of the GDP that goes into research and development has been around 2%. Investing for potentially unclear gains two or three decades later is difficult when a lot of emphasis is placed on short-term goals, but it may be vital. He tells the story of an immunotherapy cancer cure, developed from decades-old research funded by the National Institutes of Health (NIH), that benefitted his friend. This article will make you think about the importance of investing for the future even when the returns are not immediately clear.

The newly instituted Best Paper Award for IEEE Micro goes to Davies et al. for their article "Loihi: A Neuromorphic Manycore Processor With On-Chip Learning" (IEEE Micro, vol. 38, no. 1, pp. 82-99, Jan./Feb. 2018). This article described the architectures, circuit methodologies, and process technology of the Loihi processor, one of the complex chips fabricated by Intel in the neuromorphic design space. It describes the modeling of spiking neural networks (SNNs) in silicon and is one of the first works to demonstrate the feasibility of SNN learning approaches.

The Brainwave article authored by Chung et al., "Serving DNNs in Real Time at Datacenter Scale With Project Brainwave" (IEEE Micro, vol. 38, no. 2, pp. 8-20, Mar./Apr. 2018), was chosen as the runner-up for the award. This article describes a highly interconnected FPGA cloud infrastructure for accelerating deep neural network models and presents a complete system-level solution for real-time DNN inference in Microsoft datacenters. In particular, this article addresses the challenge of response time/latency for single DNN model evaluation in a system that must fulfill such requests on demand. The software aspect of the system is presented as well as evaluations from production datacenters. Readers who missed these papers when they appeared may want to go and read them now. Congratulations to the Intel and Microsoft teams who won the best paper awards.

I wish to express my gratitude to the selection committee consisting of John Kim, Bronis R. de Supinski, Tulika Mitra, and Carole-Jean Wu, who evaluated the 2018 papers in order to select the winners. I am especially indebted to John Kim for serving as the chair of the committee.

Readers of IEEE Micro will continue to get articles on emerging chip technologies and architectures as we go into the coming year. This Hot Interconnects issue will be followed by a Hot Chips issue in March/April. The subsequent issues will be Top Picks from architecture conferences, and special issues on Agile/Open Source Hardware, Biologically Inspired Computing, Systems for ML and ML for Systems, and Chip Design 2020.

IEEE Micro is interested in submissions on any aspect of chip/system design or architecture. Please consider submitting articles to IEEE Micro and remember, all regular articles will be eligible for the best paper award, which will be given annually for the previous year's articles.

I wish all readers a very happy 2020 both professionally and personally. Enjoy this issue on Hot Interconnects! It is only appropriate that we have an issue dedicated to interconnects just as the world concludes celebrating the 50th anniversary of Charley Kline sending the first digital data transmission from a Teletype terminal at UCLA to Bill Duvall, a scientist at Stanford Research Institute, a breakthrough that occurred in 1969 during the ARPANET program. Wishing you better connections in 2020 in your chips and in your lives!!

Digital Object Identifier 10.1109/MM.2019.2961722
Date of current version 14 January 2020.

Lizy Kurian John is currently a Cullen Trust for Higher Education Endowed Professor with the Electrical and Computer Engineering Department, University of Texas at Austin, Austin, TX, USA. Contact her at ljohn@ece.utexas.edu.
Guest Editors’ Introduction

Hot Interconnects 26
Ryan E. Grant, Sandia National Laboratories
Khaled Hamidouche, AMD Research

WELCOME TO THE IEEE Micro Special Issue on Hot Interconnects. In this issue, you will find the best articles on the hottest, most cutting-edge interconnect designs occurring in industry and academia from this year's IEEE Symposium on High Performance Interconnects (Hot Interconnects). Like our sister conference, Hot Chips, Hot Interconnects is a choice venue for revealing the latest and greatest advances in hardware. This year at Hot Interconnects, we were pleased to welcome talks on new hardware from Cray, Inc., as well as articles on new network topologies and discussions on new local interconnect links such as CXL and evolutions of existing technologies such as NVLink.

A major theme of the submissions at this year's Hot Interconnects was the intersection of AI and networks and alternative architectures. We saw work on topologies and network congestion, characterization of AI/ML workloads, and proposals for new underlying architectures such as FPGAs and chiplets. This is showing a trend toward smart networks and their use in datacenters. Our talks from industry leaders such as Andy Bechtolsheim, CDO of Arista Networks, and panel discussions led to an interesting debate about the future of edge computing in the context of expected networking hardware advances. We expect that these discussions will lead to a number of articles on related subjects at future Hot Interconnects conferences.

Our 2019 conference featured 16 highly rated extended abstracts that went through rigorous peer review, with at least four reviews per submission. These extended abstracts served as the basis for the high-quality talks at this year's Hot Interconnects. The top-rated articles from Hot Interconnects were selected to undergo expansion for inclusion in this Special Issue. A group of the top seven papers were invited to extend their works for IEEE Micro. These then went through additional rounds of peer review followed by a rebuttal and revision phase. The top six of those articles are presented in this issue.

The first article, "Path2SL: Leveraging InfiniBand resources to reduce head-of-line blocking in fat trees" by Maglione-Mathey et al., shows how network queuing schemes can be adapted to utilize virtual lanes in InfiniBand hardware to avoid head-of-line blocking. The authors demonstrate that their proposed approach is more efficient than previously proposed schemes for fat-tree topology networks. This is important for overall network performance on congested datacenter networks using traditional high-performance network topologies.

The article "A bunch-of-wires (BoW) interface for interchiplet communication" by Farjadrad et al. details the design of a low-power, area-efficient chiplet interconnect. The bunch-of-wires approach is meant to combat the significant expense of current inter-chiplet interconnects by combining the ease of development of parallel interfaces with low-cost organic substrates.

In "Toward FPGA-based HPC: Advancing interconnect technologies" by Lant et al., the authors show that embedding custom network and transport layers into FPGA hardware can significantly improve performance over current FPGA accelerator solutions. By decoupling the FPGA from dependence on the host CPU for network communications, significant performance benefits can be seen that are useful in high-demand compute systems.

The article "Communication profiling and characterization of deep-learning workloads on clusters with high-performance interconnects" by Awan et al. examines the impact of distributed deep learning (DL) applications on hybrid CPU+GPU nodes with high-speed interconnects. The authors develop a profiler that provides insight into the performance of DL workloads on modern hybrid systems. The authors provide an evaluation using many different interconnects, GPUs, and CPU host architectures.

The article "High-quality fault resiliency in fat trees" by Gliksberg et al. introduces a new deterministic routing algorithm for parallel generalized fat-tree networks called Dmodc. Dmodc can reroute networks with tens of thousands of nodes in less than one second, letting datacenter networks have more flexibility to respond to changing workloads and individual distributed jobs. The authors show their solution to be robust and efficient.

Finally, the article "A high-throughput network processor architecture for latency-critical applications" by Roy et al. presents a design for an advanced IO processor. This processor has dedicated hardware for low-latency task switching. This hardware is utilized to provide two times better latency for high-priority real-time tasks by preempting work ongoing on network processors. The overhead in terms of area and power consumption for adding this specialized hardware scheduling preemption is only 3% on modern processors.

In addition to the best of Hot Interconnects 26 in this issue, the presentation slides for all of the content of Hot Interconnects 26 are posted at http://www.hoti.org/hoti26/program/. This content includes our invited talks and keynotes. Copies of Hot Interconnects 26 articles can be found through the IEEE Xplore Digital Library at http://ieeexplore.ieee.org. We hope that you will enjoy the best of this year's Hot Interconnects and join us for Hot Interconnects 27 in August 2020 at QCT, San Jose, CA, USA. Participating in person is the only way to access some of the best interactive portions of the conference, including our popular technical panel sessions and talks by industry leaders. In the day preceding the conference, we offer an array of technical tutorials to keep attendees up to date on the latest advances in networking and give them an opportunity to have hands-on learning experiences facilitated by leaders from industry and academia. Further details can be found at http://www.hoti.org.

Digital Object Identifier 10.1109/MM.2019.2959114
Date of current version 14 January 2020.

Ryan E. Grant is currently a Principal Member of Technical Staff with the Center for Computational Research, Sandia National Laboratories, Albuquerque, NM, USA. He is a Senior Member of the IEEE. Contact him at regrant@sandia.gov.

Khaled Hamidouche is currently a Senior Member of Technical Staff with AMD Research, Austin, TX, USA. Contact him at khaled.hamidouche@amd.com.
Theme Article: Hot Interconnects 26

Path2SL: Leveraging InfiniBand Resources to Reduce Head-of-Line Blocking in Fat Trees

German Maglione-Mathey, University of Castilla-La Mancha
Jesus Escudero-Sahuquillo, University of Castilla-La Mancha
Pedro Javier Garcia, University of Castilla-La Mancha
Francisco J. Quiles, University of Castilla-La Mancha
José Duato, Universitat Politècnica de València

Abstract—The number of endnodes in high-performance computing and datacenter systems is constantly increasing. Hence, it is crucial to minimize the impact of network congestion to guarantee a suitable network performance. InfiniBand is a prominent interconnect technology that allows implementing efficient topologies and routing algorithms, as well as queuing schemes that reduce the head-of-line (HoL) blocking effect derived from congestion situations. Here, we explain and evaluate thoroughly a queuing scheme called Path2SL that optimizes the use of the InfiniBand Virtual Lanes to reduce the HoL blocking in fat-tree network topologies.

Digital Object Identifier 10.1109/MM.2019.2949280
Date of publication 25 October 2019; date of current version 14 January 2020.

FAT-TREE TOPOLOGIES1 are a popular and traditional choice to configure the interconnection networks of high-performance computing (HPC) systems, and, more recently, also of datacenters. This is because they feature high-bisection bandwidth and path diversity, low latency, and inherent deadlock-freedom for minimal-path routing. Thanks to these features, they allow the implementation of very efficient routing algorithms such as D-mod-K.2 Overall, fat trees offer an interesting performance/cost ratio, and indeed, fat-tree (or fat-tree-like) networks are present in many supercomputers in the Top500 list.3 On the other hand, the networks of many high-performance systems are built from components based on the InfiniBand specification (IB).4 The IB specification allows us to configure almost any network topology and provides tools to implement suitable and efficient routing algorithms, quality of service, congestion management, etc. IB packets can be separated into different queues (virtual lanes, VLs) based on the packet service level (SL) and on the SL2VL tables. This can be used, for instance, to implement static queuing schemes (SQSs). SQSs reduce the harmful head-of-line (HoL) blocking effect5 by reducing the number of packet flows that share the same VL. This approach is followed by many proposals that can be either topology agnostic (i.e., suitable for any topology) or topology aware (i.e., tailored to specific topologies). Among the latter, several SQSs have been proposed for fat trees, such as vFtree6 and Flow2SL.7 Both SQSs assume the use of D-mod-K routing.

The efficiency of any SQS can be analytically evaluated by means of the theoretical metrics defined by Escudero-Sahuquillo et al.7 According to these metrics, Flow2SL is a more efficient SQS than vFtree in fat trees consisting of more than two stages. In Maglione-Mathey et al.,8 we proposed Path2SL, a SQS that further improves these metrics, outperforming Flow2SL, and that can be configured in any IB-based system. In this article, we explain more thoroughly how Path2SL solves the limitations of other SQSs, and we add more evaluation results, for larger fat trees, that confirm that Path2SL reduces the HoL blocking more efficiently than Flow2SL.

CONGESTION AND HEAD-OF-LINE BLOCKING

When several packet flows simultaneously and persistently contend for access to the same network resource, a congestion situation may occur that can eventually jam several network paths and degrade network performance. When a congestion situation occurs, the flows actually contributing to congestion are called hot flows or culprits, while those not contributing to congestion are called cold flows or victims. If both types of flows share queues, hot flows will slow down cold flows, significantly increasing cold-flow packet latency and reducing network throughput. This is due to the HoL-blocking effect. Figure 1(a) shows a HoL-blocking example, where the hot flow (red) at the head of the s1 input queue is restraining the cold flow (green).

Figure 1. HoL-blocking situation and mitigation. (a) HoL-blocking situation. (b) Reducing the HoL blocking.

STATIC QUEUING SCHEMES FOR FAT TREES

The static queuing scheme (SQS) is one among many approaches to deal directly with HoL blocking. SQSs map flows to queues following a static policy (i.e., regardless of network status), so that the number of flows sharing a given queue is reduced, and so is HoL blocking. An example of HoL-blocking mitigation using a SQS can be seen in Figure 1(b).

A handful of SQSs are tailored to fat trees, such as vFtree6 and Flow2SL.7 In the original Flow2SL article, a set of metrics is defined to measure the quality of a given mapping policy:

1) Load rate (Γ): The number of different flow destinations that are mapped to a specific VL in a port.
2) Balancing degree: It measures how evenly flow destinations are distributed among all the VLs of a port. If the balancing degree equals 1, all the VLs in the port have the same number of assigned flow destinations.
3) Overlapping degree (Φ): It measures the overlapping in the assignment of flow destinations to VLs in a port. If Φ = 0 for all the VLs, any flow destination is assigned to only one VL at that port (i.e., there is no overlapping).
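To make the three metrics concrete, the sketch below computes illustrative versions of them for a single port, given a table that maps each flow destination to the set of VLs it uses at that port. This is our own illustration, not code from the article, and the exact normalizations (e.g., the min/max ratio used for the balancing degree) are simplifications of the formal definitions in Escudero-Sahuquillo et al.7

from collections import defaultdict

def port_metrics(dest_to_vls, n_vls):
    """Illustrative SQS quality metrics for one port.

    dest_to_vls: dict mapping a flow destination to the set of VLs it is
    assigned to at this port (as produced by some queuing scheme).
    """
    # Load rate: number of distinct destinations mapped to each VL.
    load = defaultdict(int)
    for dest, vls in dest_to_vls.items():
        for vl in vls:
            load[vl] += 1
    loads = [load[vl] for vl in range(n_vls)]

    # Balancing degree: 1.0 when every VL carries the same number of
    # destinations, smaller otherwise (simple min/max ratio in this sketch).
    balancing = min(loads) / max(loads) if max(loads) > 0 else 1.0

    # Overlapping degree: fraction of destinations assigned to more than
    # one VL at this port (0 means no overlapping).
    overlapping = (sum(1 for vls in dest_to_vls.values() if len(vls) > 1)
                   / len(dest_to_vls)) if dest_to_vls else 0.0

    return loads, balancing, overlapping

For example, port_metrics({0: {0}, 1: {0}, 2: {1}, 3: {0, 1}}, n_vls=2) reports per-VL loads of [3, 2], a balancing degree of 2/3, and an overlapping degree of 0.25, because one of the four destinations is spread over both VLs.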
Both vFtree and Flow2SL are proposed for IB-based fat trees using D-mod-K2 routing, symmetrically distributing different flow destinations among the available VLs. However, vFtree was designed for two-stage fat trees, so for more stages it is unable to properly isolate flow destinations (i.e., Φ ≠ 0). Specifically, in a fat tree with more than two stages, a given flow destination can be assigned to all the VLs at the ports of the third-stage switches (and beyond), the specific VL depending on the source switch inside the source pod. Therefore, if a flow destination becomes congested, contributing hot flows starting at the different switches of the same pod would occupy all the VLs of the third-stage ports (and beyond), so producing general HoL blocking to any other flow reaching those ports. By contrast, Flow2SL avoids this problem by basing the mapping of flows to VLs on groups of switches instead of individual switches. For a correct configuration, each group must contain one or several complete pods (so consecutive switches), and the specific VL that a flow is mapped to depends on its source and destination groups. If properly configured, Flow2SL assigns any flow destination to a single VL at any port (i.e., Φ = 0), including third-stage (and beyond) ones.

Figure 2. Example of a three-stage fat tree. Switches s18 to s20 are called root switches and interconnect pods. Switches s0 to s8 are called leaf switches and connect the endnodes. Each pod is highlighted in gray.

Figure 3. Different flows starting from the same pod and mapped to different VLs, while their destination is the same leaf switch.

Figure 4(b) and (a) show, respectively, the vFtree and Flow2SL mappings, assuming the fat tree of Figure 2 and three available VLs. Note that both techniques base the flow-to-VL mapping on leaf switches (i.e., those that contain endnodes). As can be seen, vFtree maps to different VLs the flows that start at different switches of the same pod and are addressed to the same destination switch. For instance, flows starting at switches s0 and s1 and addressed to s3 are mapped to VLs 2 and 0, respectively. This situation is shown in Figure 3. On the contrary, Flow2SL maps to the same VL the flows with the same source and destination pods.

Figure 4. Source-destination VL mapping policies for the fat tree of Figure 2 using two and three VLs. (a) Flow2SL 3 VLs. (b) vFtree 3 VLs. (c) Flow2SL 2 VLs. (d) Path2SL 2 VLs.

PATH2SL BASICS AND ADVANTAGES

The main limitation of Flow2SL is that the number of available VLs must be a divisor of the number of fat-tree pods.7 For example, the number of available VLs for the fat tree of Figure 2 must be exactly 3; otherwise, the same destination ends up assigned to different VLs at third-stage switches (i.e., Φ ≠ 0). This wrong mapping is shown in Figure 4(c), where the number of available VLs is 2: for instance, switch s3 has a VL mapping different from s4 and s5, despite all of them being in the same pod. Flow2SL may also produce mapping imbalance (i.e., a balancing degree different from 1) if the number of available VLs is not a divisor of the number of destinations reachable from a particular input port. Moreover, this imbalance increases with the number of stages as the number of possible
destinations decreases when a packet "walks up" the fat tree.

To overcome those limitations, we proposed Path2SL,8 a SQS that allows using any number of VLs without producing destination overlapping. In practice, this number can be any divisor of the number of flow destinations to improve the mapping balance.

Path2SL, like Flow2SL, is suitable for IB-based fat trees with any number of stages, also using D-mod-K routing. However, it does not divide the leaf switches into the same number of source-destination groups as Flow2SL. Instead, Path2SL always arranges switches in a number of source groups equal to the number of pods, while the number of destination groups is equal to the number of available VLs. In this way, Path2SL always obtains a mapping with Φ = 0, i.e., flow destinations are assigned to a single VL at any port, regardless of the number of available VLs. Algorithm 1 shows the complete Path2SL mapping policy. Basically, it assigns a VL (actually an SL, but we assume that all the SL2VL tables are configured so that VL = SL) to a pair of source-destination switches (sw_s and sw_d, respectively), given a number of available VLs (#vls), the total number of leaf switches (#leafs), and the number of pods (#pods). Figure 4(d) shows the resulting source-destination mapping for the fat tree of Figure 2. Note that, at any third-stage port of this fat tree, any destination reachable from that port is assigned to a single VL, despite using only two VLs.

Algorithm 1. Path2SL mapping policy.

  function GET_SL(#pods, #leafs, #vls, sw_s, sw_d)
      ▷ Size of the source groups
      src_grp_size ← ⌊#leafs / #pods⌋
      grp_s ← ⌊sw_s / src_grp_size⌋          ▷ sw_s group
      first_sw ← grp_s × src_grp_size        ▷ first switch of the sw_s group
      ▷ Find the distance between sw_s and sw_d
      if sw_d < first_sw then
          offset_d ← #leafs − first_sw + sw_d
      else
          offset_d ← sw_d − first_sw
      end if
      ▷ Size of the destination groups
      dst_grp_size ← ⌊#leafs / #vls⌋
      ▷ Size of the first destination group
      dst_grp_size0 ← dst_grp_size + (#leafs mod #vls)
      ▷ Find the destination group and SL
      sl ← 0
      if offset_d ≥ dst_grp_size0 then
          offset_d ← offset_d − dst_grp_size0
          grp_offset ← ⌊offset_d / dst_grp_size0⌋
          grp_d ← 1 + grp_offset
          sl ← grp_d mod #vls
      end if
      return sl
  end function
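For readers who prefer running code, the following is a direct Python transcription of Algorithm 1. This is a sketch of ours, not code released by the authors; it assumes zero-based leaf-switch indices as in Figure 2 and uses integer division for the floor operations.

def get_sl(n_pods, n_leafs, n_vls, sw_s, sw_d):
    """Map a (source, destination) leaf-switch pair to an SL (= VL), per Algorithm 1."""
    src_grp_size = n_leafs // n_pods        # size of the source groups (one per pod)
    grp_s = sw_s // src_grp_size            # source group of sw_s
    first_sw = grp_s * src_grp_size         # first switch of that source group
    # Distance from the source group to the destination switch (wrapping around).
    if sw_d < first_sw:
        offset_d = n_leafs - first_sw + sw_d
    else:
        offset_d = sw_d - first_sw
    dst_grp_size = n_leafs // n_vls                      # size of the destination groups
    dst_grp_size0 = dst_grp_size + (n_leafs % n_vls)     # first group absorbs the remainder
    sl = 0
    if offset_d >= dst_grp_size0:
        offset_d -= dst_grp_size0
        grp_d = 1 + offset_d // dst_grp_size0
        sl = grp_d % n_vls
    return sl

For the nine leaf switches and three pods of Figure 2 with two available VLs, get_sl(3, 9, 2, 0, 3) returns 0 while get_sl(3, 9, 2, 0, 7) returns 1: the destinations closest to a source group share one VL and the remaining destinations use the other.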
EVALUATION

We quantify the Path2SL performance by comparing with Flow2SL7 configured with several VLs, and with D-mod-K routing2 using 1 VL as baseline. We have obtained the evaluation results through simulation of 216-node (6 pods) and 1000-node (10 pods) fat trees, and through experiments in a real IB-based cluster (configured as a 45-node fat tree) under several benchmarks.

The simulations were carried out using random uniform and hot-spot synthetic traffic patterns. The hot-spot traffic is generated by a small fraction of the total endnodes, around 2%, equally split among two and five hot-spot destinations for the 216- and 1000-node fat trees, respectively. Figure 5 shows the simulation results, specifically the average network throughput for each load point, normalized against the maximum
theoretical network bandwidth. Each value corresponds to 5 ms of simulated time, plus 1 ms of warm-up time.

Figure 5. Random uniform and hot-spot synthetic traffic in 216- and 1000-node fat trees. (a) 216-node FT, random uniform. (b) 216-node FT, hot-spot. (c) 1000-node FT, random uniform. (d) 1000-node FT, hot-spot.

The use of additional VLs translates into a performance improvement of roughly 8% over ftree, as shown in Figure 5(a) and (c). Even when just a small number of endnodes are generating hot-spot traffic, Flow2SL is not able to mitigate the HoL-blocking effect, actually aggravating this problem due to destination overlapping. On the contrary, Path2SL is able to reduce the HoL blocking, outperforming Flow2SL by 5% up to 12% in the 216-node fat tree [see Figure 5(b)], and by roughly 7% up to 32% in the 1000-node fat tree [see Figure 5(d)]. Note that, for larger networks, or for a larger fraction of endnodes generating hot-spot traffic, it is likely that Path2SL would outperform the other schemes even more significantly.

For the real-system experiments, first we have implemented Path2SL into the IB Subnet Manager (OpenSM),9 on top of the ftree routing engine (i.e., the IB implementation of D-mod-K). Then, we have configured the 45-node fat tree shown in Figure 2 in an IB-based cluster, with five endnodes per switch. Next, we have run a single iteration of the following benchmarks: Netgauge,10 HPCC,11 and Graph500.12 Note that, due to the small size of the cluster, it is not likely to obtain from these experiments performance differences as significant as from the simulation experiments. Indeed, the main utility of the experiments in the real system is to validate our Path2SL implementation. Those benchmarks generate different traffic patterns and different network loads. In particular, Netgauge splits the network into two halves and generates one-to-one communication with a high injection ratio between nodes in different halves, while Graph500 mainly generates an all-to-all traffic pattern with a low injection ratio, due to the input size (Toy). We have tested each scheme in two different scenarios: alone, where each benchmark runs without any interference, and 2flows, where an endnode at switch sw1 is a hot-spot and two hot flows start from two endnodes at switches sw4 and sw7, respectively.

Figure 6. MPI-based real workloads in the IB-based 45-node fat tree. Notice the different scales on the Y axes. (a) Netgauge and Graph500. (b) HPCC (Ordered Ring and Ping Pong). (c) HPCC (Ordered Ring and Ping Pong).

Figure 6 shows the obtained results. Notice that Figure 6(a) and (b) depict throughput, but Figure 6(c) depicts latency. It is worth noting that at a very low injection ratio there are no significant differences among the techniques, due to the lack of significant congestion. In extreme cases (e.g., for Graph500), using more VLs may end up being counterproductive due to the delays derived from the switch VL management, which could be a burden for the overall performance.

In general, Path2SL achieves better results in hot-spot scenarios than both ftree and Flow2SL. Also, if we compare the results between scenarios with and without hot-spot traffic, Path2SL is able to reduce the HoL blocking.
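To make the synthetic hot-spot scenario concrete, the short sketch below (our own illustration, with hypothetical parameter names, not the authors' simulator code) picks roughly 2% of the endnodes as hot-flow sources and splits them evenly across a small set of hot-spot destinations, as described above; the remaining endnodes would keep sending uniform random traffic.

import random

def build_hotspot_pattern(n_endnodes, n_hotspots, hot_fraction=0.02, seed=0):
    """Sketch of the hot-spot scenario: ~2% of endnodes target a few hot destinations."""
    rng = random.Random(seed)
    nodes = list(range(n_endnodes))
    hotspots = rng.sample(nodes, n_hotspots)              # congested destinations
    candidates = [n for n in nodes if n not in hotspots]
    n_hot_senders = max(1, round(hot_fraction * n_endnodes))
    hot_senders = rng.sample(candidates, n_hot_senders)
    # Hot senders are split evenly (round-robin) across the hot-spot destinations.
    hot_map = {src: hotspots[i % n_hotspots] for i, src in enumerate(hot_senders)}
    return hotspots, hot_map

# e.g., the 216-node fat tree with two hot-spot destinations
hotspots, hot_map = build_hotspot_pattern(216, n_hotspots=2)

For the 1000-node fat tree, the same call with n_hotspots=5 reproduces the five-destination split used in the simulations.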
CONCLUSION

In this article, we explain and evaluate thoroughly Path2SL, a static queuing scheme for InfiniBand-based fat trees. Path2SL optimizes the mapping of packet flows to VLs based on quality metrics, efficiently leveraging the available VLs without restrictions on their number or on the fat-tree configuration. Simulations and experiments in a real IB-based cluster confirm that Path2SL outperforms Flow2SL, offering better performance and higher flexibility regarding the number of VLs. Moreover, these improvements are achieved without the need for additional resources, as only an appropriate implementation of the scheme in the OpenSM software is required.

ACKNOWLEDGMENTS

This work was supported by the Spanish MCIU and European Commission (EC) under Project RTI2018-098156-B-C52 (MCIU/FEDER); Spanish MINECO under Project UNCM13-1E-2456; and JCCM under Projects POII10-0289-3724, PEII-2014-028-P, and SBPLY/17/180501/000498. The work of G. Maglione-Mathey was supported by the University of Castilla-La Mancha (UCLM) with a predoctoral contract PREDUCLM16/29. The work of J. Escudero-Sahuquillo was supported by EC and UCLM with a postdoctoral position (resolution 31/07/2014).

REFERENCES

1. C. E. Leiserson, "Fat-trees: Universal networks for hardware-efficient supercomputing," IEEE Trans. Comput., vol. C-34, no. 10, pp. 892–901, Oct. 1985.
2. E. Zahavi, G. Johnson, D. J. Kerbyson, and M. Lang, "Optimized InfiniBand™ fat-tree routing for shift all-to-all communication patterns," Concurrency Comput., Pract. Experience, vol. 22, pp. 217–231, 2010.
3. "Top 500 the list," 2019. [Online]. Available: www.top500.org
4. The InfiniBand® Architecture Specification, Volume 1, Release 1.3, The InfiniBand Trade Association, Mar. 2015. [Online]. Available: http://www.infinibandta.org
5. M. Karol, M. Hluchyj, and S. Morgan, "Input versus output queueing on a space-division packet switch," IEEE Trans. Commun., vol. C-35, no. 12, pp. 1347–1356, Dec. 1987.
6. W. L. Guay, B. Bogdanski, S. Reinemo, O. Lysne, and T. Skeie, "vFtree—A fat-tree routing algorithm using virtual lanes to alleviate congestion," in Proc. IEEE Int. Parallel Distrib. Process. Symp., May 2011, pp. 197–208.
7. J. Escudero-Sahuquillo et al., "A new proposal to deal with congestion in InfiniBand-based fat-trees," J. Parallel Distrib. Comput., vol. 74, pp. 1802–1819, 2014.
8. G. Maglione-Mathey, J. Escudero-Sahuquillo, P. J. Garcia, F. Quiles, and J. Duato, "Path2SL: Optimizing head-of-line blocking reduction in InfiniBand-based fat-tree networks," 2019, accepted for HOTI'19.
9. OpenSM, 2019. [Online]. Available: http://git.openfabrics.org/~halr/opensm.git/
10. T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm, "Netgauge: A network performance measurement framework," in Proc. High Perform. Comput. Commun., vol. 4782, Sep. 2007, pp. 659–671.
11. "The HPCC benchmark," 2019. [Online]. Available: icl.cs.utk.edu/hpcc
12. "The Graph500 benchmark," 2019. [Online]. Available: www.graph500.org

German Maglione-Mathey is currently working toward the PhD degree at the University of Castilla-La Mancha, Spain. He received the MS degree in computer science from the University of Castilla-La Mancha. His research interests include high-performance computing interconnects and data center networks. Contact him at german.maglione@dsi.uclm.es.

Jesus Escudero-Sahuquillo is a postdoctoral researcher at the University of Castilla-La Mancha, Spain. He is also co-organizer of the IEEE International Workshop HiPINEB. His research interests include the design aspects of high-speed interconnects for clusters and datacenters. He has published around 40 articles in peer-reviewed conferences and journals, and participated in several research projects funded by public bodies. He serves as a program committee member, guest editor, and reviewer for several conferences and journals. He received the PhD degree in computer science. Contact him at jesus.escudero@uclm.es.

Pedro Javier Garcia is an associate professor of computer architecture and technology at the University of Castilla-La Mancha (UCLM), Spain. He has published around 60 refereed articles in journals and conferences. He has guided two doctoral theses. He has coordinated three research projects funded by public bodies and four R&D agreements between UCLM and different companies. He has also participated in 35 other research projects. He has organized several international conferences and has also been a guest editor of several journals. His research focuses on high-performance interconnection networks. He received the PhD degree in computer science from UCLM. Contact him at pedrojavier.garcia@uclm.es.

Francisco J. Quiles is currently a full professor of computer architecture and technology, and the chair of the Computing Systems Department, University of Castilla-La Mancha (UCLM), Spain. His research interests include high-performance interconnection networks for multiprocessor systems and clusters. He has published more than 240 articles in international journals and conferences and participated in 70 research projects supported by the NSF, the European Commission, the Spanish Government, and R&D agreements with different companies. Also, he has guided over nine doctoral theses. He received the PhD degree in electronics and computer science from the University of Valencia, Spain. Contact him at francisco.quiles@uclm.es.

Jose Duato is currently a full professor in the Department of Computer Engineering. His current research interests include interconnection networks and multiprocessor architectures. He has published more than 500 refereed articles. He received the PhD degree in electrical engineering from the Technical University of Valencia, Spain. He served as a member of the editorial boards of IEEE TPDS, IEEE TC, and IEEE CAL. He also served as the general chair of ICPP 2001 and HiPEAC 2019, the PC chair of HPCA 2004, and as a member of the Steering Committee, vice-chair, or member of the Program Committee in more than 60 conferences, including the most prestigious ones in his area. He is a member of the Spanish Royal Academy of Sciences. Contact him at jduato@disca.upv.es.
Theme Article: Hot Interconnects 26

A Bunch-of-Wires (BoW) Interface for Interchiplet Communication

Ramin Farjadrad, Marvell Technology Group
Mark Kuemerle, Marvell Technology Group
Bapi Vinnakota, Talumbra Services

Abstract—Multichiplet system-in-package designs have recently received a lot of attention as a mechanism to combat high SoC design costs and to economically manufacture large ASICs. These designs require low-power area-efficient off-die on-package die-to-die communication. Current technologies either extend on-die high-wire count buses using silicon interposers or off-package serial buses. The former approach leads to expensive packaging. The latter leads to complex and high-power designs. We propose a simple bunch-of-wires interface that combines ease of development with low-cost packaging techniques. We develop the interface and show how it can be used in multichiplet systems.

Digital Object Identifier 10.1109/MM.2019.2950352
Date of publication 30 October 2019; date of current version 14 January 2020.

CHIPLET-BASED DESIGNS, based on the integration of multiple die in a single package using system-in-package technologies, have recently received attention as a mechanism to extend Moore's law.1–4 AMD,5 Intel,6 Marvell,7 and Xilinx8 have announced chiplet-based products. SoC development costs in newer process nodes are rising exponentially,9 resulting in limited design starts and innovations. To reduce design costs, designers purchase IC components as third-party IP. Even with IP purchase, analog, photonic, or RF IC developments in new process nodes, like FinFET, consume more time and effort, require more verification, and carry more risk.2–4

Chiplet-based designs can lower development cost and time1 by decoupling development cycles of complex SoCs through heterogeneous integration. In chiplet designs, RF, analog, photonic, logic, and memory can be developed in a process node optimized for that specific function.
Hot Interconnects 26

Chiplets from multiple process nodes are inte- reduces wire count, is easy to design, and can also
grated into a single package to form a product.1–4 A be used with inexpensive package manufacturing
single chiplet can be used in designs across several techniques. The designer can choose to increase
process nodes providing economies of scale. interface implementation complexity in line with
Breaking large chips into multiple chiplets performance requirements. We discuss the trade-
increases product yield and lowers offs in the “BOW Design and
total cost of final product.5,10 Reuse” section.
A single chiplet can be
Chiplet-based designs incur The open-domain specific
used in designs across
higher packaging costs than do architecture (ODSA) is a new work-
several process nodes
monolithic devices. They also group in the Open Compute
providing economies of
require interchiplet links to transport scale. Breaking large Project.16 The ODSA workgroup
data between chiplets. These links chips into multiple aims to reduce accelerator devel-
carry data that would be transported chiplets increases opment costs by creating an open
on-die in a monolithic design. Off-die product yield and interface for interchiplet communi-
data transfers may consume more lowers total cost of final cation.17 This will allow product
energy than on-die data transfers. product. designers to create best-in-class
Individual chiplets will need accelerators by assembling best-
new interfaces for die-to-die communication. in-class chiplets from multiple vendors. In the
Two classes of interfaces have been developed: “System Integration” section, we demonstrate
how the BoW interface can be integrated into the
 Interfaces, such as the Intel advanced inter- ODSA reference accelerator architecture. We
face bus11 and the (application-specific) high- start with a review of connectivity and packaging
bandwidth memory interface (HBM)12 are for chiplets.
derived from highly parallel on-chip buses
and use many slow wires, each operating at
CHIPLET CONNECTIVITY AND
1–2 Gb/s, to transport data between chiplets. PACKAGING
While they offer design simplicity, these inter- Multichip packaging technologies are of two
faces incur higher packaging costs. types:1–4 1) traditional multichip module (MCM)
 Serial interfaces derived from board-level packaging; and 2) newer packaging techniques
SerDes links, e.g., PCI express, use a few serial such as wafer-level fanout (WLFO),18,19 silicon
high-speed wires, operating at 10 s of Gb/s, to bridges,6 and silicon interposers.10,20
transport data.13,14 While suitable for tradi-
tional packages, these interfaces are more Multichip Modules for Regular Bumps
expensive to design and potentially experi- A traditional approach is the MCM packaging,
ence higher latency and incur higher power.15 where the chiplet dice all sit on an organic (e.g.,
FR4) package substrate and are connected using
We propose a new interface, the bunch-of-wires the PCB traces on the package substrate. MCMs
(BoW) that transfers data at up to 4 Gb/s over lim- have been in volume production at low costs for
ited trace length up to 10 mm. The authors have decades. The pad pitch of the chiplets on an MCM
learned (informally) that many companies have substrate typically are 100 mm or higher. Such
similar internal interfaces. We propose a standard Chiplets can be screened for known-good-die
interoperable definition. The basic interface can (KGD) at the wafer level during the production
be enhanced for higher data rates by: 1) using ter- with standard test equipment, a major advantage
minated impedance matched traces to increase in improving the yield of the packaged part, and a
the data rate per trace without trace length limita- major cost saving.
tion; and 2) using bidirectional data transfer to The low pad and trace density of the MCM
again the double data rate. The interface is package substrate can limit the interchip through-
described in detail in the “BoW Interface” section. put. One can use SerDes cores to multiplex lower
The BoW interface can combine the best attrib- rate parallel data into higher speed data pipe over
utes of parallel and serial interfaces. The interface each package trace. Conventional multi-Gbps

IEEE Micro
16
SerDes incur higher power, area, latency, and design complexity to achieve this throughput.

Advanced Packaging for Microbumps

Silicon interposers (e.g., TSMC CoWoS)20 or silicon bridges (e.g., Intel EMIB)6 provide higher density routing between the chiplets than a simple organic substrate. This allows chiplets to use microbumps for IO (with a bump spacing of 50–80 μm) to greatly increase the interchiplet bandwidth density. In the interposer solution, the chiplet dice are assembled on top of a large silicon chip that acts like the package substrate with high-density routing between the chiplets. In the case of an embedded silicon bridge, a small slice of silicon is embedded in the organic package substrate to provide the same high-density routing as an interposer, but with a smaller size and thus lower cost.6 Silicon-based interconnect solutions are much more expensive than traditional MCM packaging.

WLFO,18,19 a more recent but relatively simpler packaging technology, uses a redistribution layer to connect fine-pitch pads for dense interchiplet connectivity. The redistribution layer is also used to fan out to regular-size bumps for lower density connectivity on a regular laminate. One challenge with fine-pitch interconnect is that screening for KGD before packaging will require denser probe cards or newer test techniques. However, adequate coverage has been reported for HBM volume production.1,21

Interconnect Requirements

Based on a survey of current technologies,15 the requirements for interchiplet interconnect are as follows:

1. Throughput efficiency: 0.1 Tbps/mm to 1 Tbps/mm.
2. Energy efficiency < 0.5–1.0 pJ/bit.
3. Small silicon area/port for dense integration. To be pad limited, not silicon limited, for pitch < 120 μm.
4. Trace length range: 1–50 mm for arrangement flexibility and heat dissipation.
5. Total latency < 5–10 ns.
6. Minimal complex circuitry to enable easy and fast porting into a wide range of process nodes.
7. Single supply compatible with logic Vdd in existing SoCs/ASICs in popular process nodes.
8. Minimal technology licensing requirements.

BOW INTERFACE

The BoW specification is a simple, open, and interoperable interchiplet interface technology that meets the requirements listed above. The content of this section expands on Farjad and Vinnakota22 and is complementary to the content of Kuemerle et al.23

BoW Interface

The BoW interface uses the simplest form of CMOS IOs. A BoW implementation is expected to be easy to port to multiple process nodes. At the transmitter, a CMOS inverter is used to send full levels of Gnd and Vdd, for 0 or 1 logic values respectively, to generate single-ended NRZ (non-return-to-zero) signaling. At the receiver, a CMOS latch is used to latch in the received signal at the source-synchronous clock edge. The latching of the received data can be done on both edges of the clock (DDR) and, as a result, the clock runs at half the data baud rate.

Because no line termination is used in the BoW Base, the product of the signaling baud rate (Gbaud) and the trace roundtrip delay (ns) is relatively fixed; it is a function of the signal baud and slew rate. In most CMOS IOs, this "Gbaud × ns" product is 0.20–0.40, depending on the signal slew rate. For example, for a BoW Base with 2-GBd signaling, the practical trace delay will be 0.10–0.2 ns (0.20/2–0.40/2 GBd). This roundtrip timing delay is equivalent to 10–20 mm over a typical FR4 substrate. If we can limit the die-to-die distance to 10 mm, the BoW Base data rate can grow to 4 GBd at a clock rate of 2-GHz DDR. At very short trace lengths, about 1 mm, the BoW Base data rate can go as high as 8 GBd, but more complex clock-data alignment circuitry may be required.
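To make the reach/rate tradeoff concrete, the short sketch below turns the 0.20–0.40 Gbaud·ns product into an estimated unterminated trace reach. This is our own arithmetic, not part of the BoW specification; the millimeters-per-nanosecond conversion is inferred from the article's 2-GBd example (0.10–0.20 ns of round-trip delay corresponding to roughly 10–20 mm of FR4 trace).

def bow_base_reach_mm(baud_gbd, product_lo=0.20, product_hi=0.40, mm_per_ns=100.0):
    """Estimate the unterminated (BoW Base) trace reach allowed at a given baud rate.

    product_lo/product_hi: the "Gbaud x ns" product range quoted for CMOS IOs.
    mm_per_ns: FR4 trace length per ns of round-trip delay, inferred from the
               article's 2-GBd worked example (an assumption of this sketch).
    """
    delay_lo = product_lo / baud_gbd   # allowed round-trip delay, ns
    delay_hi = product_hi / baud_gbd
    return delay_lo * mm_per_ns, delay_hi * mm_per_ns

print(bow_base_reach_mm(2.0))   # (10.0, 20.0) mm at 2 GBd, matching the worked example above
print(bow_base_reach_mm(4.0))   # (5.0, 10.0) mm at 4 GBd

The 4-GBd case shows why the text pairs that rate with a die-to-die distance limited to about 10 mm: only the upper end of the product range reaches that far.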
Hot Interconnects 26

Figure 1. BoW interface operating modes.

Enhanced Modes
The BoW interface can optionally be enhanced for better performance. Any enhanced mode is required to be bump compatible with the basic interface. We envision two enhancements suitable for regular-bump IO. When used with regular bumps, the circuitry per bump may be larger than the area of the pad for a microbump. Figure 1 captures the relationship between the basic and enhanced modes.

BoW Terminated (BoW TD). By designing the traces to have a fixed characteristic impedance (typically 50 Ω) and terminating them, we can suppress most of the signal reflections, thus removing the constraint imposed by the fixed baud rate (GBd) and trace timing delay (ns) product. As a result, a terminated link can push the data rate higher for longer trace lengths. This enhancement is called BoW TD.

BoW Bidirectional (BoW BiDi). Most physical interface instances offer symmetric bandwidth in both directions. BoW BiDi leverages this by using simultaneous bidirectional signaling. Data are transmitted in a physical channel in both directions simultaneously; every port is both an input and an output port. A hybrid block, placed between the pad and the BoW Tx/Rx ports, creates a BoW BiDi port that separates the receive and transmit signals. BoW BiDi provides a maximum aggregate (i.e., receive + transmit) throughput twice that of BoW TD over one set of wires. BoW BiDi can provide up to 32 Gb/s per trace with a DDR clock of 8 GHz.

BOW DESIGN AND USE
BoW can offer a graceful tradeoff of packaging versus circuit design, as shown in Figure 1. BiDi mode adds IP design complexity, but doubles the data rates. We show that BoW Basic with microbumps and BoW BiDi with regular bumps offer very similar performance at similar total costs. Our discussion on BoW design and use focuses on these two modes.

Bump Maps
A BoW slice has two clock ports per 16 data ports. The BoW specification itself does not specify a bump map. Figure 2 shows a bump map for a BoW slice. We expect a bump map to be suitable for all BoW modes. Multiple slices can be stacked vertically or linearly to achieve higher throughput per mm at the die edge. All other modes of BoW are expected to be compatible with BoW Basic.

Figure 2. BoW bump map.
Figure 3. MCM versus 2.5D packaging costs.

Packaging
BoW can be used with both traditional laminate and advanced packaging technologies. Figure 3 compares the relative costs of: 1) organic laminate with 6-2-6 substrate; 2) a simpler substrate where WLFO is used for dense interconnect; and 3) a silicon interposer, for an example package with four chiplets. The models use confidential manufacturing data available to the authors. These models are used to estimate the performance and package costs of combinations of BoW modes and packaging technology.
Figure 4 examines the tradeoffs of the BoW technology with various types of packaging. Bandwidth is calculated for 5-, 16-, and 32-Gb/s operation using the BoW standard footprint at 130- and 55-μm pitch. Package costs, estimated using proprietary cost models available to the authors based upon expected build-up requirements, are shown relative to each other. The cost models assigned more cost to more complex laminate wire counts, assuming they lead to more layers. For example, using the WLFO technology adds wafer-level processing cost, but reduces substrate layer counts, reducing overall cost in some cases.

Expected Performance
We believe reasonable estimates of performance can be extrapolated from interfaces with similar attributes that have been implemented and lab tested.
BoW basic mode is a simplified version of many DDR interfaces in production today; for example, DDR, LPDDR, and GDDR memory interfaces operate with very similar clocking and clock/data relationships at high baud rates (e.g., 16 GBd) from module to module. BoW has the advantage of lower skew signal routing on substrate and no discontinuities caused by additional package and board components, along with being implemented wholly in technologies better suited for IO and clocking than DRAM.
AQlink, a die-to-die interface by Aquantia, uses both terminated and bidirectional transmission lines as proposed in BoW BiDi. AQlink was implemented on 14-nm silicon and the measured performance data serves as a reference point15 for BoW BiDi. Based on the simulation and silicon measurements on AQlink, BoW BiDi can comfortably operate at 16 GBd over a trace length of 50 mm. The trace length limitation is caused by high-frequency signal attenuation, which can be addressed by equalizer circuits. Because the maximum BoW baud rate is significantly lower than other solutions (e.g., XSR at 56 GBd), it can use relatively simple and low-power equalizers.

Figure 4. Packaging tradeoff with BoW.
Table 1. BoW parameters.
BoW Mode: Basic | BiDi
Data rate/bump: 5 Gb/s | 32 Gb/s
Bump type: microbump | regular bump
Minimum pad pitch: 55 μm | 100 μm
Single supply voltage: 0.7–0.9 V (±5%) | 0.7–0.9 V (±5%)
Power efficiency: 0.4–0.7 pJ/bit | <0.70 pJ/bit
Substrate: Organic/WLFO/Silicon | Organic
Max trace length: 10 mm | 50 mm
Max throughput/chip edge: 1.9 Tbps/mm (55-μm pad pitch, interposer) | 1.76 Tbps/mm (130-μm pad pitch, organic)
Power & area/Tbps (14 nm, 0.7 V): <600 mW, 1.01 mm² | <600 mW, 0.73 mm²
BER (no FEC): <1E–15 | <1E–15
Latency (no FEC): <3 ns | <3 ns
ESD/CDM requirement: 250 V/50 V | 250 V/50 V
Silicon proof point: multiple | GF 14 nm

Table 1 captures the parameters for BoW Basic with microbumps and BoW BiDi with regular bumps.
BoW BiDi can provide over 1.76-Tbps/mm bidirectional throughput per chip edge with a standard bump pitch of 130 μm. The same IP core provides a maximum throughput of 0.88 Tbps/mm with BoW TD, and 0.22 Tbps/mm with BoW Base.

Technology Discussion
Relative to parallel interfaces, BoW can achieve high bandwidth and power efficiency without requiring expensive silicon interposers or bridges. The different modes of BoW allow implementers to trade off IP complexity and package complexity options to find the right fit for their chiplet. Specifically, each version of BoW technology has the effect of reducing the bump density requirements, simplifying test support. This is evident when we compare four rows of BoW on organic with four or eight rows of BoW using fine-pitch interconnect with WLFO technology, where fanout can provide a similar cost/bandwidth solution to four-row implementations of BoW TD.
Relative to SerDes interfaces, BoW achieves bandwidth efficiency without the complexity and latency of multilevel PAM that needs FEC. By being easier to design, we believe BoW is more easily ported across multiple process nodes, a key requirement for heterogeneous integration.
PAM4 signaling typically leads to undesired error rates (e.g., >1E–9) that mandate the use of forward error correction (FEC).13,14 FEC not only increases the link power, but also increases the link latency. Based on silicon results using a similar bidirectional interface (i.e., AQlink), BoW BiDi, using NRZ signaling, can operate at a BER <1E–15, acceptable in most use cases. If better error rates are required, FEC codes can be used. A proposed Reed–Solomon24 FEC code, RS(34,32,8), can correct one error within its RS frame of 272 bits (34 × 8 bits). In this case, an input BER = 1E–15 is equal to an input frame error rate (FER) of 2.72E–13. The proposed RS frame remains uncorrected after RS decoding if there are two or more random errors across the frame. The probability of such an event is (2.72E–13)², or FER = 7.4E–26, which is equal to BER = 2.7E–28. Such a low BER is acceptable for all practical purposes, but the FEC incurs power and latency overhead.
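The error-rate argument above can be checked numerically. The following sketch simply restates the article's own first-order approximation; it is not a complete Reed–Solomon analysis.

# Numerical restatement of the RS(34,32,8) argument above. The formulas mirror
# the text's first-order approximation, not an exact FEC calculation.
frame_bits = 34 * 8             # 272-bit RS frame
ber_in = 1e-15                  # raw link BER quoted above for BoW BiDi
fer_in = frame_bits * ber_in    # ~2.72e-13: chance a frame sees an error
fer_out = fer_in ** 2           # two or more errors defeat the single-error code
ber_out = fer_out / frame_bits  # ~2.7e-28 post-FEC bit error rate
print(f"input FER ~ {fer_in:.2e}, output FER ~ {fer_out:.2e}, output BER ~ {ber_out:.2e}")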

In summary, BoW is area, power, and bandwidth efficient, offers a graceful tradeoff of design versus packaging costs, and combines the best attributes of parallel and serial interfaces.

SYSTEM INTEGRATION
Multi-chiplet products are usually motivated by one of the two following requirements:25
• Board-to-Chiplets: A need to reduce the footprint, power, and cost of a board product.
• Die-to-Chiplets: A large and/or complex design that needs to be partitioned to reduce manufacturing and/or design costs.
Multichiplet products require both physical connectivity and logical data transactions between the chiplets in a package.
Figure 5. Open domain-specific architecture stack and PIPE adapter.

Domain-specific architectures have recently received renewed attention.26 The ODSA aims to create a chiplet marketplace to enable domain-specific architectures to be created by integrating best-in-class chiplets from multiple vendors. The marketplace will be enabled by developing an open interface so chiplets from multiple vendors can interoperate easily. The ODSA stack aims to support open physical and logical data transactions between the chiplets in a package. Figure 5 shows the interchiplet networking stack under development.
We demonstrate the use of the BoW interface for an example design for the first case, board-to-chiplets. In this approach, the PCIe transactions between chiplets are executed over the interchiplet BoW PHY, rather than the long-range PCIe PHY.

Pipe Interface Adapter
The PHY interface for PCI Express (PIPE)27 is an open standard interface defined between the PHY physical coding sublayer and the media access layer (MAC). The PIPE interface serves as an abstraction layer between the PHY implementation and higher layers of the interface.
If an interchiplet PHY supports the PIPE interface through an adapter, the MAC and transaction layers of the PCIe protocol can be run over the interchiplet link. (The adapter will match the bit-width and data rates of the PIPE interface to the bit-width and data rates of the PHY.) This implies two chiplets connected by PHYs with PIPE adapters can use the PCIe protocol for data transactions, as used in board-to-chiplet designs, as well as any protocols that use PCIe for data transport.
The use of a PIPE adapter for interchiplet links was first proposed by Kandou Corporation for its USR SerDes.28 More recently, Intel announced support for a low-power mode for interchiplet (and intrapackage) PCIe links on PCIe PHYs.27
Figure 5 shows the functionality required by an adapter. The adapter maps the interface data to a 16-bit BoW Turbo slice. The PIPE specification defines two types of interfaces, parallel and SerDes. Figure 5 shows the clock rates for a PIPE adapter that maps PCIe Gen 4 lanes to a 16-bit Turbo BoW slice through a 40-bit SerDes interface. At the data rates shown, a single BoW Basic/Turbo slice can transport 4/32 PCIe Gen 4 lanes. With this adapter, a design can potentially use a commercial controller29 to execute PCIe transactions over a BoW interface.
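As a rough consistency check of the lane mapping described above, the sketch below compares slice bandwidth against lane bandwidth, assuming 16 GT/s per PCIe Gen 4 lane and, for the "Turbo" slice, the 32-Gb/s per-wire rate listed for BiDi in Table 1 (an assumption, since a Turbo rate is not tabulated).

# Sanity check of "a single BoW Basic/Turbo slice can transport 4/32 Gen 4 lanes".
GEN4_LANE_GBPS = 16   # assumed per-lane rate for PCIe Gen 4
SLICE_WIDTH = 16      # data wires per BoW slice

for mode, per_wire_gbps, lanes in (("Basic", 5, 4), ("Turbo", 32, 32)):
    slice_gbps = SLICE_WIDTH * per_wire_gbps
    need_gbps = lanes * GEN4_LANE_GBPS
    print(f"{mode}: slice {slice_gbps} Gb/s vs {lanes} lanes needing {need_gbps} Gb/s")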

System-Level Impact
The ODSA is building a proof-of-concept (PoC) multichiplet product with chiplets from multiple vendors. The first prototype will use die produced to be standalone products as chiplets. A block diagram of the reference design used in the PoC is shown in Figure 6. The PCIe Gen 3 protocol is the logical interface between the components, with the exception of the network I/O. The reference design targets SmartNIC and network storage use cases.

Figure 6. ODSA PoC prototype.

The second-generation implementation of the ODSA reference architecture will use a low-power interchiplet PHY. Figure 6 shows the estimated internal bandwidth requirements for the reference design. The 1:1 ratio for bandwidth in links between significant components and 1:2 ratio for bandwidth to memory is consistent with designs for two use cases:
• In the Google TPU accelerator, the host-accelerator bandwidth is 14 GB/s. Correspondingly, the link between the unified buffer and the host interface is 10 GB/s, and the memory bandwidth is 30 GB/s.30
• In networking applications, a minimum-size 40-Byte packet (which occupies 64 Bytes of wire bandwidth) will result in accessing a 20-Byte 5-tuple from memory in IPv4 networks.

The BoW interfaces can be used to support a board-to-chiplets use case. A BoW interface with a PIPE adapter can support PCIe Gen 4 for interchiplet communication. This change will require new die that implement the BoW interface, but is transparent to the application software. Table 2 estimates the total power savings from using the BoW interface for PCIe Gen 4 transactions instead of a traditional PCIe PHY31 in the ODSA reference architecture.
• Arrows in green are interchiplet PCIe links, rounded up to 512 Gb/s to estimate power costs.
• Arrows in gray are links to memory, not included in the estimate, though they can also be BoW interfaces.
• Arrows in yellow are off-package interfaces.

Table 2. BoW system benefits.
Interface parameters
PHY: PCIe | BoW Basic | BoW TD | BoW BiDi
Trans. protocol: PCIe4 | PCIe4 | PCIe4 | PCIe4
pJ/bit: 7.5 (ref. 31) | 0.6 (est) | 0.7 (est) | 0.6 (est)
Cost of 512-Gb/s interface (2 × 16 PCIe Gen4 lanes, Tx = Rx = 512 Gb/s)
Power: 3.84 W | 0.3 W | 0.35 W | 0.3 W
Bump count: 416 | 104 | 54
Area (sq. mm): 3.9 | 5.1 | 1.8 | 0.9
Impact on PoC (3 × 512-Gb/s interfaces)
Total power: 11.5 W | 0.9 W | 1.05 W | 0.9 W

CONCLUSION
We proposed a new open BoW interface for inter-chiplet communication. The basic interface is derived from the HBM specification. The Terminated and Bidirectional modes increase the speed of the basic interface by 4× and 8×. BoW combines the process portability of parallel interfaces with the easy packaging attributes of serial interfaces. The definition offers a tradeoff: BoW allows designers to start with a basic interface and add either more complex IP development and/or more complex packaging technology to maximize edge and substrate data bandwidth. Our next step with this interface is to build a test chip, potentially with a PIPE adapter and a commercial PCIe controller.

ACKNOWLEDGMENT
The authors would like to thank B. Bahador from Eliyanpro, Inc. This work was performed while Bapi Vinnakota was with Netronome.

REFERENCES
1. IEEE Electronics Packaging Society Heterogeneous Integration Roadmap, 2019 Edition. [Online]. Available: https://eps.ieee.org/technology/heterogeneous-integration-roadmap/2019-edition.html
2. B. Bayraktaroglu, "Heterogeneous integration technology," AFRL-RY-WP-TR-2017-0168, Aug. 2017.
3. A. Olofsson et al., "Enabling high-performance heterogeneous integration via interface standards," in Proc. IP Reuse, Modular Des. Int. Symp. Microelectron., vol. 2018, pp. 000246–000251, 2018.
4. J. H. Lau, "Recent advances and trends in heterogeneous integrations," J. Microelectron. Electron. Packag., vol. 16, pp. 45–77, 2019.
5. L. T. Su et al., "Multi-chip technologies to unleash computing performance gains over the next decade," in Proc. IEEE Int. Electron Devices Meeting, Dec. 2017, pp. 1.1.1–1.1.8.
6. R. Viswanath et al., "Heterogeneous SoC integration with EMIB," in Proc. IEEE Elect. Design Adv. Packag. Syst. Symp., Dec. 2018.
7. Marvell Corporation, "MoChi architecture," Tech. Rep. [Online]. Available: http://www.marvell.com/architecture/mochi/
8. G. Singh et al., "Xilinx 16 nm datacenter device family with in-package HBM and CCIX interconnect," HotChips, 2017.
9. B. Bailey, "The impact of Moore's law ending," Oct. 18, 2018. [Online]. Available: https://semiengineering.com/the-impact-of-moores-law-ending/
10. D. Stow et al., "Cost-effective design of scalable high-performance systems using active and passive interposers," in Proc. 36th IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2017, pp. 728–735.
11. Intel AIB Bus, 2018. [Online]. Available: https://github.com/intel/aib-phy-hardware
12. JEDEC, "High bandwidth memory (HBM) DRAM," JESD235B, 2018.
13. Kandou Bus, 2018. [Online]. Available: https://kandou.com/technology/
14. OIF, Common Electrical Interface—112G—XSR, 2018.
15. G. Taylor, R. Farjad, and B. Vinnakota, "High capacity on-package physical link considerations," Hot Interconnect, 2019.
16. Open Domain-Specific Architecture, 2019. [Online]. Available: https://www.opencompute.org/wiki/Server/ODSA
17. ODSA Workgroup, "The open domain-specific architecture: A chiplet-based open architecture." [Online]. Available: https://www.netronome.com/m/documents/WP_ODSA_Open_Accelerator_Architecture.pdf
18. J. A. Lim et al., "Fan-out wafer level eWLB technology as an advanced system-in-package solution," in Proc. IWLP, San Jose, CA, USA, Oct. 2017.
19. K.-L. Suk et al., "Low cost Si-less RDL interposer package for high performance computing applications," in Proc. IEEE 68th Electron. Compon. Technol. Conf., 2018, pp. 64–69.
20. Y.-L. Chuang et al., "Unified methodology for heterogeneous integration with CoWoS technology," in Proc. IEEE 63rd Electron. Compon. Technol. Conf., May 2013, pp. 852–859.
21. D. U. Lee et al., "A 1.2V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV," in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, 2014, pp. 432–433.
22. R. Farjad and B. Vinnakota, "A bunch of wires (BoW) interface for inter-chiplet communication," in Proc. Hot Interconnect, 2019.
23. M. Kuemerle, R. Farjad, and B. Vinnakota, "Bunch of Wires Interface Proposal, v0.7," 2019. [Online]. Available: https://www.opencompute.org/wiki/Server/ODSA
24. Reed Solomon Error Correction Intro. [Online]. Available: https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
25. R. Nagisetty, "Intel ODSA Workshop," Jun. 2019. [Online]. Available: https://146a55aca6f00848c565-a7635525d40ac1c70300198708936b4e.ssl.cf1.rackcdn.com/images/be20ea9409cc558936fa2623c5222792e8118c69.pdf
26. J. L. Hennessy and D. A. Patterson, "A new golden age for computer architecture," Commun. ACM, vol. 62, no. 2, pp. 48–60, Feb. 2019.
27. Intel, "PHY interface for the PCI express, SATA, USB 3.1, DisplayPort and converged IO architectures, Version 5.2," 2019. [Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/phy-interface-pci-express-sata-usb30-architectures-3-1.pdf
28. Kandou Corp, "Kandou bus document: 16-lane femtoserdes PIPE Adapter," 2019.
29. PCIe PIPE 4.4.1: Enabler for PCIe Gen4, 2018. [Online]. Available: https://blogs.synopsys.com/vip-central/2018/01/17/pcie-pipe-4-4-1-enabler-for-pcie-gen4
30. "An in-depth look at Google's First Tensor Processing Unit," 2017. [Online]. Available: https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
31. S. Li et al., "A power and area efficient 2.5-16 Gbps gen 4 PCIe PHY in 10nm FinFET CMOS," in Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2018, pp. 5–8.
Ramin Farjadrad is currently a CTO & VP of Networking/PHYs at Marvell Semiconductor, in charge of developing multi-GHz connectivity solutions for autonomous vehicles, hyperscale data centers, and enterprise networks, and the Chairman of the technical committee at the Networking for Autonomous Vehicles (NAV) Alliance. He proposed signaling schemes adopted as IEEE Standards for 10-Gbps Automotive Ethernet and Multi-Gbps Enterprise Ethernet. He has domain expertise in high-speed communication circuits and systems, signal processing/coding, and optimized mixed-mode architectures. He is the author of 100+ granted/pending U.S. patents. He received the MSc/PhD degrees in electrical engineering and computer science from Stanford University. Contact him at farjad@gmail.com.

Mark Kuemerle is a fellow of Integrated Systems with Avera Semiconductor. He has broad experience in a wide range of large complex infrastructure ASICs, significant IP and expertise in power efficiency methodologies, complex interconnect and packaging, and integrated memory structures. He has been the PHY/Interface workstream leader in the ODSA and a member of and contributor to various IEEE and other industry publications; his previous industry group leadership includes the HMC Consortium protocol committee. He received the Master of Science degree from Case Western Reserve University. Contact him at mark.kuemerle@globalfoundries.com.

Bapi Vinnakota is currently a consultant with Talumbra Services. While at Netronome, he started and is the leader for the Open Domain-Specific Architecture, a chiplet-based open architecture project in the Open Compute Project. After a Ph.D. at Princeton, he taught at the University of Minnesota, where he received an NSF CAREER Award and three IBM Faculty Development Awards. He joined Intel through an acquisition, was an architect of a VoIP flow processor, worked in networking technology, and incubated a networking SaaS product. At Netronome, he created and ran open-nfp.org, a service for research in networking. He is the corresponding author of this article. Contact him at bapi.vinnakota@gmail.com.
Theme Article: Hot Interconnects 26

Toward FPGA-Based HPC: Advancing Interconnect Technologies

Joshua Lant, Javier Navaridas, Mikel Luján, and John Goodacre
The University of Manchester

Abstract—HPC architects are currently facing myriad challenges from ever tighter power
constraints and changing workload characteristics. In this article, we discuss the current
state of FPGAs within HPC systems. Recent technological advances show that they are
well placed for penetration into the HPC market. However, there are still a number of
research problems to overcome; we address the requirements for system architectures
and interconnects to enable their proper exploitation, highlighting the necessity of
allowing FPGAs to act as full-fledged peers within a distributed system rather than
attached to the CPU. We argue that this model requires a reliable, connectionless,
hardware-offloaded transport supporting a global memory space. Our results show how
our fully fledged hardware implementation gives latency improvements of up to 25%
versus a software-based transport, and demonstrates that our solution can outperform the
state of the art in HPC workloads such as matrix–matrix multiplication achieving a 10%
higher computing throughput.

CURRENT TRENDS IN HPC
IN RECENT YEARS, there have been two great changes that have affected the way architects must think about future technology systems within high performance computing (HPC). The first and most obvious of these is the breakdown of Dennard scaling around 2004. Since this time, there has been an explosion in the scale-out of HPC systems architectures, and consequently, a dramatic rise in their power consumption.

Digital Object Identifier 10.1109/MM.2019.2950655
Date of publication 31 October 2019; date of current version 14 January 2020.
The second great change is that we have now entered the Fourth Paradigm of scientific discovery. Modern applications are moving from computational science into big-data analytics. The rapid growth of data-intensive workloads has instigated a convergence between data center and HPC technology, and techniques such as hyperconverged storage are increasingly used to bring data closer to compute resources. The price of computation has been falling dramatically for decades, to the point that the energy required for on-chip data movement is now significantly higher than floating-point operations.1 As such, reducing data movement is essential for reducing power consumption.
In order to deal with these changes, accelerated computing has become commonplace within HPC, such that more than 25% of the TOP500 list now employ GPU acceleration. GPU manufacturers have catered to the HPC market by providing increasingly powerful architectures, with ever higher floating-point performance and memory bandwidth. Unfortunately, these massive developments in performance have come at the cost of much higher power consumption. While high-end GPUs are demonstrated to be more efficient than CPUs, this gap is narrowed by appropriate optimizations,2 and thus, their performance scalability must be questioned in relation to their current configuration within HPC architectures.
The main issue with these more traditional forms of architecture is the fact that many still view the GPU accelerator simply as a PCIe bus-attached coprocessor. While they may allow limited inter-GPU communication between a subset of the accelerators using protocols such as NVLink, such limitations can cause additional data copying and decrease locality.

FPGAs for HPC
Given that reducing data movement is the key factor for improving energy efficiency,1 field-programmable gate arrays (FPGAs) have become a promising candidate for increasing the energy efficiency of heterogeneous HPC systems. The main rationale for FPGA technology is its ability to use novel algorithms and custom memory layouts to optimize the number of memory accesses. Furthermore, since FPGAs do not require a stored program, they alleviate the effects of the von Neumann bottleneck with reduced memory bandwidth requirements. Utilizing the increasing capacity of on-chip memory and using multiple FPGAs is a key method to enable larger datasets to be stored closer to compute, reducing data movement and increasing memory bandwidth significantly. Since memory bandwidth is the limiting factor for a number of HPC workloads, reducing off-chip DRAM accesses and storing data closer to compute is desirable for increased performance as well as reduced power consumption.
FPGA vendors are also now pushing for penetration into the HPC/data center arena and are slowly addressing the performance gap between GPUs and FPGAs. We are now starting to see advanced memory systems (e.g., HBM) integrated on the same package. Furthermore, the floating-point performance of FPGAs is increasing, with hardened floating-point blocks and increased DSP capability within the architecture. Such advances are key to enabling the FPGA to be fully exploited in the HPC domain.
The next step toward FPGA-based HPC computing—and the one we focus on within this article—is the ability to scale-out FPGA resources by enabling distributed FPGA computing. We, and many others within the community, argue that for this to be an efficient process, the FPGAs need to break free from CPU control and become full-fledged, network-capable computing units; an objective that we address with a custom interconnect.

Programming Models
Given the complete ubiquity of MPI, any new HPC system produced in at least the medium term is certain to use MPI as one of its
communication paradigms. While MPI is highly efficient for the communicating sequential processes (CSPs) model for parallelism, modern architectural features are not easily represented or exploited in MPI, and shared memory cannot be utilized within MPI ranks on the same node. The hybrid use of MPI and partitioned global address space (PGAS) languages is gaining traction as it overcomes some of the main limitations of MPI, providing naturally simpler one-sided operation semantics, reduced memory footprint, and solving issues with overlapping communication and computation. The main hurdle toward a wider adoption of PGAS is the lack of direct hardware support for its communication primitives. Our custom interconnect supports these primitives by enabling write operations directly into a global memory space.
With respect to the programming of the FPGA hardware itself, the use of traditional frameworks and tools such as OpenCL poses problems as it offers only a host/device model of programming. The code is split into two portions: the host code, which runs on the CPU, and the device/kernel code, which runs on the accelerator. An API allows the host code to use the hardware kernel, loading data into the hardware for execution and then transferring the data back out to the host once computation has been completed. We argue that this model entrenches the main limitation for exploiting distributed FPGAs: dependence on the CPU to orchestrate data movement. The communication primitives and interfacing of our network interface (NI) (see the "Our Solution" section) support data transfer and synchronization directly between accelerator blocks on different FPGAs. This enables more suitable programming models to be explored, which facilitate dataflow processing not only at the intranode level but at a much larger scale.

CHANGING WORKLOADS
While the memory bandwidth of GPUs is incredibly high and is important for many HPC applications, there exist numerous applications that are more sensitive to memory latency, messaging rate, and overlap. These have been overlooked in the development of GPU architectures. The Berkeley taxonomy of HPC applications3 shows that over half of the application types are limited by memory latency, instead of memory throughput. Such workloads are well suited for FPGAs as custom memory layouts can be exploited.
Workloads exhibiting irregular memory access patterns combined with high levels of arithmetic computation already provide for more efficient FPGA-based implementations over CPU or GPU solutions.4 Workloads such as stencil codes are suited to the FPGA due to the high volume of on-chip memory, reducing DRAM accesses. Other computations such as sparse matrix–matrix/matrix–vector multiplication are highly suitable for FPGAs, as they feature a relatively low FLOP count per memory access, meaning that on-chip storage for values is preferable in this situation.
The main limitation with FPGAs is currently in their floating-point capabilities. Typical number-crunching algorithms (dense matrix–matrix/matrix–vector multiplication, FFTs, N-body simulations, etc.) obtain the best performance on GPUs. However, if we consider energy-to-solution rather than raw performance, FPGAs offer a clear advantage with much higher FLOPs/watt.

EVOLVING FPGA ARCHITECTURES
Recent works4,5 argue that FPGAs should evolve beyond a simple coprocessor solution toward integrating hard-core CPUs with coherent access to the FPGA fabric. Such architectures enable the effective acceleration of workloads with more complex memory handling, e.g., graph traversal and branch-and-bound problems. Indeed, Weisz et al.5 examine pointer chasing on modern FPGA substrates with direct, shared-memory access between tightly coupled accelerator and CPU, concluding that memory latency is the main bottleneck for these workloads, and that interleaving memory access over several concurrent traversals can alleviate this problem.
Figure 1. Possible system architectures and FPGA configurations. The shaded regions represent the limits of the addressability from a given node. (a) FPGA attached as a bus-based coprocessor (e.g., PCIe). (b) FPGA attached via system-bus, able to access local memory directly but not the network. (c) Disaggregated FPGA, able to act as a full network peer but unable to address remote memory directly. (d) Bump-in-the-wire architecture. (e) Global address space in which the FPGA can act as a full network peer and access remote and local memory directly.

Figure 1 illustrates different levels of memory coupling and addressability of FPGA-based systems. Figure 1(a) shows the traditional bus-based coprocessor model, in which the FPGA acts as a slave attached to the CPU, which must direct memory copying between main memory and accelerator. Figure 1(b) depicts a tighter memory coupling between CPU and FPGA, where the FPGA can directly access main memory, but relies on the CPU for communications.
Extending beyond this tight memory coupling is the idea, as shown in Figure 1(c), that the FPGA should become an independent, stand-alone compute element within the system, accessing the network as a full peer and having access to the main memory system. This disaggregates the FPGA resources from the CPU, so they can be scaled independently and multiple FPGAs can communicate with one another over the network, enabling the pooling of FPGA resources.
Having the FPGA tightly coupled with the network is vital to allow simple and efficient algorithm mapping onto distributed FPGAs. We argue that a custom NI is required to allow for this without CPU involvement or heavy limitations on the network (as is seen in TCP offload, for example). With suitable work distribution in this system configuration, FPGAs may eventually be able to compete with GPUs even on dense linear-algebra problems where the GPU excels.4
There has been previous effort toward this goal: upgrading the FPGA to a more central role, eliminating the CPU entirely from computation (or at least forcing all network traffic through the FPGA). Forcing all traffic bound for the network through the accelerator is known as a bump-in-the-wire architecture [see Figure 1(d)]. Implementations are effective for enabling pools of FPGA resources to be used for scaled acceleration, and for near data processing, where the FPGA may take control of the network, storage capability, local memory, or caches. The most sophisticated of these bump-in-the-wire architectures is Microsoft Catapult2. While it offers a powerful platform, it does not support NUMA-like operations, which is integral for enhancing the performance of workloads with irregular memory access patterns.
Consequently, we propose the solution in Figure 1(e), where FPGAs can operate independently within a global shared-memory space. Our NI permits scaling-out FPGA resources, by allowing the FPGA to read and write directly into the remote memory of other FPGAs or CPU resources. We encapsulate the system-bus protocol for use over the network, so that any traffic arriving from the NI is dealt with as a local memory access, no matter the origin of the transaction.

INTERCONNECTING INDEPENDENT FPGAs
As we have now established the benefits of advancing toward stand-alone FPGA computing
units, we move to discuss the interconnection substrate required for such architectures. At first glance, TCP/IP-based solutions seem like the ideal candidate given their ubiquitous presence in commodity cluster computing. Unfortunately, TCP-based reliability suffers severe performance degradation when implemented in software, and hardware-offloaded techniques are nonscalable due to the complexity of the TCP protocol and the dedicated per-connection send/receive buffering requirements.
Infiniband is another obvious alternative, but there are a number of issues that make Infiniband unsuitable for FPGA-based HPC. The most significant of these issues is the sheer complexity of the Verbs API specification, which makes hardware offloading incredibly difficult. A clear example can be seen in previous attempts to implement the Verbs API on a GPU.6 The authors found that the large overheads involved in work request generation are not compensated by the savings in context switching, concluding that CPUs are better suited for this task. They suggested that GPU architectures will need to be adapted for future systems, including an on-board processor to handle Infiniband interactions, the alternative being to seek different RDMA-capable networking hardware. The same holds true for FPGAs. The only FPGA implementations of Infiniband cores are typically very limited in their feature set, reliability, and number of concurrent connections.
If we look at custom interconnect solutions for distributed FPGA computing, there exist many proposals, but they are typically limited in (at least) one of the following ways, which our solution overcomes:
• They provide only simple point-to-point connections between FPGAs, limiting available topologies and scalability of solutions to those typically situated within a single rack or chassis.7
• They require CPU intervention to issue transactions to the network. This causes additional latency/bandwidth overheads, with additional buffering, system calls, or control information required.
• They do not provide tight coupling to system memory, reducing the ability of the system to take advantage of fine-grained parallelism over distributed FPGAs or proper coordination between CPU and FPGA, inhibiting certain workloads.4
• They do not guarantee reliable transfer, so are not amenable for production-level systems.
• Their hardware-offloaded transport layers require per-connection state information/buffering to be stored within the hardware, limiting scalability or degrading performance by excessive connection setup/teardown.

OUR SOLUTION
Taking into consideration all of these limitations, we have developed a custom interconnect solution that supports reliable communication among independent and distributed FPGAs. The first step was to design a custom NI with hardware primitives to support a hybrid MPI+PGAS programming model. We provide support for MPI via an RDMA mechanism and PGAS-like communications via a separate interface to the NI to read/write into remote memory using transparent load/store operations from the CPU/FPGA. We encapsulate and extend the system-bus protocol to work over the network. In doing this, there is no change from the user perspective in accessing remote memory versus local memory (other than the NUMA effects on latency).
Another important aspect of our implementation is that we created a novel transport layer that lies within the FPGA fabric. This way, the CPU is not involved in network communication among FPGA resources. Our solution provides two entirely separate methods for reliable transfer: one for RDMA transfers and another one for shared-memory operations. This is justified by the drastic reduction of latency for small memory operations.
As an example, a user-space application sending a 16-B transfer to the BRAM of a remote FPGA (1-hop distance) will take 1.1 μs using our shared-memory engine, but 1.49 μs using our RDMA engine. The benefit of performing smaller operations using our dedicated shared-memory path is clear: reducing the latency by over 25%. These savings come
from the shared-memory path using a transparent write instruction into remote memory, whereas the RDMA engine requires an extra memory copy.
Yet another benefit of our implementation is that our transport layer allows for both reliable and connectionless communications to guarantee message delivery. Typically, reliable transport layers are connection based, requiring large amounts of persistent state information when scaled. Our solution only maintains transient information regarding outstanding operations which have left the sender, making it far more amenable to a hardware implementation. We are able to do this because we separate the transport layers, having one for shared-memory operations in which retransmission information can safely be held within the NI. We then provide an entirely separate transport for reliable RDMA transfer, one that is more scalable as the data is held in its source memory location, rather than being copied into retransmission buffers within FPGA fabric.

Figure 2. Full IO and on-chip network stack within the NI.

Figure 2 shows the full network stack implemented within the FPGA, with segregated datapaths for shared-memory and RDMA operations. Our solution allows the accelerator logic to issue commands to the NI in exactly the same way as the CPU. This is enabled by the use of memory-mapped AXI interfaces through the whole stack. We effectively translate AXI transactions into a custom packet format, whilst providing additional information so that it can be used effectively over a wider, full-system scale network, serialized and transmitted over high-speed transceivers. The AXI transactions are then rebuilt remotely at the receiving NI. This means that both the accelerator and CPU can access the network completely independently from each other, and simple reading/writing to remote memory lends itself to established FPGA programming models. In the following section, we show how the ability of the accelerator to write directly to the network reduces the control and datapath complexity of network transfers between distributed FPGA resources significantly.
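To illustrate what such a transparent remote store could look like from user space, the sketch below maps a hypothetical NI window and writes 16 bytes into it with ordinary stores. The window address, size, and use of /dev/mem are assumptions made purely for illustration; a real deployment would obtain the mapping from a device driver rather than hard-coding it.

# Illustrative only: a user-space store into remote memory exposed through a
# memory-mapped NI window. WINDOW_BASE is a hypothetical, page-aligned address.
import mmap, os, struct

WINDOW_BASE = 0xA0000000   # hypothetical physical address of the NI window
WINDOW_SIZE = 4096

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
window = mmap.mmap(fd, WINDOW_SIZE, mmap.MAP_SHARED,
                   mmap.PROT_READ | mmap.PROT_WRITE, offset=WINDOW_BASE)

# A 16-byte payload written with ordinary stores; conceptually, the NI turns the
# resulting bus write into a network packet and rebuilds it at the remote NI.
window[0:16] = struct.pack("<4I", 1, 2, 3, 4)
window.close()
os.close(fd)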

REDUCED CONTROL AND DATAPATH COMPLEXITY
In order to illustrate the benefits of our approach, Figure 3 shows the critical path for data and control communications for dataflow-style processing initiated through various styles of transport layer by a CPU, for processing in a local (in F1) and then remote (in F2) FPGA.

Figure 3. Flow of data and control when using distributed FPGA resources using different transport layers. (a) Using TCP to utilize distributed FPGA resources. (b) Using a software-based transport mechanism with our custom network. (c) Using our hardware-offloaded transport layer.
Figure 3(a) shows a traditional TCP stack, which requires multiple extra copies of the data between buffers and intervention from the CPU, creating significant additional latency.
Figure 3(b) shows how a software-based implementation of our transport layer would work. In this instance, we are able to submit work directly to the remote accelerator, reducing the latency induced at F2. However, additional control and off-chip DRAM access is needed in F1 because the CPU would still be required to coordinate between the two accelerators.
Finally, Figure 3(c) shows control and dataflow when using our fully implemented hardware-offloaded transport solution. In this instance, once the accelerator at F1 has completed its work, it issues an RDMA operation directly to the NI, and then writes shared-memory operations to the F2 accelerator's work buffer, informing it that there is new work to be performed. Once this is completed, the F2 accelerator notifies its local CPU that the work has been done and it has new data to process. As is shown, this solution is far more amenable to data-flow-type processing, allowing for simpler pipelining through the distributed FPGA resources.

Experimental Setup
In order to demonstrate the benefits of our implementation, we move now to recreate on the FPGA the control and dataflow examples described in Figure 3(b) and (c), transmitting different sizes of data through the data paths. We ignore the TCP setup as the performance is known to be degraded, and a setup enabling distributed inter-FPGA communication would be difficult. The hardware transport layer is fully implemented within the FPGA fabric, but the software reliability is not implemented. In this instance, we simply pass control information and data around the system and over the network as would be required with a complete implementation. The main aim here is to show the effects of the reduced control and datapath complexity, as opposed to the actual effects of the specific implementation.
Our experimental hardware platform consists of a ZCU102 development board where two full instances of the networking stack (two of everything shown in Figure 2 other than the CPU and MAC/PHY) are implemented within the board and connected with each other via a 10G SFP cable. All hardware within the FPGA fabric is clocked at 156.25 MHz, and the ARM-based CPU is clocked at 1 GHz. One instance of the entire networking stack requires 17.1% (46 915) of the available LUTs on the FPGA, giving reasonable remaining area to dedicate to accelerator processing elements.
The setup using a single CPU solution as opposed to two separate boards is done in order to obtain more accurate timing measurements at the application level. A user-space application takes a system time stamp before submitting blocks of data for processing in the accelerator block in the F1 stack (see Figure 3), which performs some work and sends the result forward (through the network) to the accelerator in the F2 stack. Upon reception of the data, F2 performs some work and then writes the result in local memory and notifies the CPU, which will take another system time stamp to determine the overall run time at the application level.
For simplicity, we keep the data block size constant through the whole data flow of the experiment and use dummy accelerator blocks that do not perform any genuine function, but merely add a computation latency. In this way, the results can be generalized and applied to many acceleration functions. Blocks of data are transferred using DMA engines to the accelerator, as is the typical case in FPGA-based implementations of HPC operations such as matrix–vector or matrix–matrix multiplication. We adjust the block size of the data, and adjust the computation latency of the accelerator block in relation to the pure communication time for a given solution (block size and data path). A computation/communication ratio of zero denotes instantaneous processing time: data (at block-level granularity) is simply written into the accelerator block and back out. A computation/communication ratio of R denotes that the system spends R times as long computing inside the accelerator as the communication path spends moving data and control information around the system. For instance, with R = 1, each accelerator's computation latency will be the same time as the whole system spends communicating.
Figure 4. Top: Latency to perform a single operation. Achievable throughput for (middle) small block sizes and (bottom) larger block sizes.

Results
Figure 4 shows the latency and throughput for the control and data flow through our fully implemented hardware-based transport and the emulated software transport, as shown in Figure 3(b) and (c), in terms of purely remote-memory-bound data processing per processing element (not to be confused with the communication throughput over the links or the raw computing throughput in flops). We see that latency is improved (up to a 29% reduction) for small and medium block sizes, which could have dramatic effects on tightly coupled applications with many small messages, especially if they have irregular access patterns, such as workloads involving pointer-chasing, with list, tree, or graph traversal, for example; workloads which are of increasing interest within the FPGA community.5
The latency difference in the hardware- and software-based transport comes from the additional control path complexity seen between Figure 3(b) and (c), with the former requiring CPU orchestration for inter-FPGA data transfer. It is also worth noting that in a full implementation of the software transport, additional overheads would be incurred, further increasing the latency we see here for the software solution.
If we focus on throughput, we see a gain of 8.6% over the software transport solution. However, there appears to be a saturation point toward the upper limit of the block sizes, suggesting that further increasing the maximum block size for a single accelerator module will not translate into higher performance. Note, however, that we support multiple processing elements within the same FPGA to exploit spatial parallelism.
Using the raw throughput results, we can estimate the achievable memory-bound flops of such a solution, using methods similar to those used by Strenski8 and Williams et al.9 Let us assume a 1-KB block size for transfer, feeding an accelerator block 128 double-precision floats to perform an 8×8 matrix–matrix multiplication. This gives us 1024 flops per block (512 multiply-adds). According to Vivado HLS tools, a simple matrix–matrix multiply will have a latency of 288 cycles (or a 0.20 computation/communication ratio in our experiments above and a throughput of 1.1 Gb/s). Thus, we can make approximately 134 277 block transfers per second (1.1 × 10^9 / 8192 bits per block transfer). Using this, we see that we can extract approximately 137 MFlops per IP block (1024 Flops/block transfer). According to the HLS synthesis output, each block requires 11 722 LUTs, with the implementation being LUT bound on the ZCU102 device.
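The estimate above can be reproduced with a few lines of arithmetic; all inputs are the figures quoted in the text, and the nine-block configuration is the one discussed in the next paragraph. Small differences from the quoted 137 MFlops and 1.233 GFlops come only from rounding.

# Paper-and-pencil reproduction of the memory-bound flops estimate above.
block_bits = 1024 * 8        # 1-KB block per transfer
link_gbps = 1.1              # per-element throughput from the experiments (Gb/s)
flops_per_block = 1024       # 8x8 double-precision matrix-matrix multiply

transfers_per_s = link_gbps * 1e9 / block_bits               # ~134,277 transfers/s
mflops_per_block = transfers_per_s * flops_per_block / 1e6   # ~137 MFlops per IP block

blocks = 9                   # enough to saturate one 10G link (see next paragraph)
print(f"{transfers_per_s:,.0f} transfers/s, {mflops_per_block:.0f} MFlops/block, "
      f"{blocks * mflops_per_block / 1000:.2f} GFlops with {blocks} blocks")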

With the resources used by our network stack, we could theoretically fit 19 IP blocks on a single FPGA. However, if we allow for 9 blocks, which are enough to saturate a single 10G link, we could obtain 1.233 GFlops (double precision) per FPGA (9 blocks × 137 MFlops). These results are completely bound by the network, rather than the off-chip memory bandwidth. Compared with the paper by Correa and David,7 where a theoretical peak of 8.9 GFlops over 8 FPGAs (1.11 GFlops per FPGA) was reported for network-bound processing, we can claim that our communication solution is more effective, particularly since they are limited to a basic ring topology, severely limiting scalability.

CONCLUDING REMARKS
There are many in the HPC community who doubt the viability of reconfigurable computing within this domain. It is certainly true that there are many impediments to the uptake of FPGA technology and many research questions still to be answered, such as standard HPC library support, portability, and reliability at scale. However, we have demonstrated that there is much scope for the use of FPGAs within HPC.
While we have only dealt with a small piece of the puzzle within our work—the requirements of interconnect technologies to enable the efficient exploitation of FPGAs—we believe an inflexion point is being reached. We see that many other important areas are reaching sufficient maturity to push the use of FPGAs into mainstream HPC. HLS tools have progressed enormously, as has the architecture of FPGAs themselves (enhanced floating-point capability, hardened transceiver technology, etc.). With tighter power constraints and burgeoning data-intensive workloads, we see that demand is also growing. It seems that as the pressures on Moore's Law become greater, so will the pressure to seek alternative technologies.

ACKNOWLEDGMENTS
This work was supported by the European Union's Horizon 2020 research and innovation program under Grant 671553 (ExaNeSt) and Grant 754337 (EuroEXA). The work of M. Luján was supported by an Arm/RAEng Research Chair award, and M. Luján is a Royal Society Wolfson Fellow.

REFERENCES
1. A. Tate et al., "Programming abstractions for data locality," in Proc. PADAL Workshop 2014, Swiss National Supercomputing Center, Apr. 28–29, 2014, p. 7.
2. V. W. Lee et al., "Debunking the 100x GPU versus CPU myth: An evaluation of throughput computing on CPU and GPU," ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 451–460, 2010.
3. K. Asanovic et al., "The landscape of parallel computing research: A view from Berkeley," Dept. Elect. Eng. Comput. Sci., Univ. California, Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2006-183, 2006.
4. F. A. Escobar, X. Chang, and C. Valderrama, "Suitability analysis of FPGAs for heterogeneous platforms in HPC," IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 2, pp. 600–612, Feb. 2016.
5. G. Weisz, J. Melber, Y. Wang, K. Fleming, E. Nurvitadhi, and J. C. Hoe, "A study of pointer-chasing performance on shared-memory processor—FPGA systems," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 264–273.
6. L. Oden, H. Fröning, and F.-J. Pfreundt, "Infiniband-verbs on GPU: A case study of controlling an InfiniBand network device from the GPU," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2014, pp. 976–983.
7. R. S. Correa and J. P. David, "Ultra-low latency communication channels for FPGA-based HPC cluster," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 63, pp. 41–55, 2018.
8. D. Strenski, "FPGA floating point performance—A pencil and paper evaluation," HPC Wire, 2007.
9. J. Williams, C. Massie, A. D. George, J. Richardson, K. Gosrani, and H. Lam, "Characterization of fixed and reconfigurable multi-core devices for application acceleration," ACM Trans. Reconfigurable Technol. Syst., vol. 3, no. 4, 2010, Art. no. 19.
Joshua Lant is currently a Research Associate with the APT Group, The University of Manchester. His primary interests include high-performance computing systems, interconnection networks, and FPGA-based design/microarchitecture. He received the M.Eng. degree in music technology systems from the University of York in 2015, and the Ph.D. degree in computer science from The University of Manchester in 2019. Contact him at joshua.lant@manchester.ac.uk.

Javier Navaridas is a Senior Research Fellow with the Advanced Processors Technologies group of the University of Manchester. He received the M.Eng. degree in computer engineering in 2005 and the Ph.D. degree in computer engineering (awarded with an Extraordinary Doctorate Award - top 5% theses) in 2009, both from the University of the Basque Country, Spain. Afterwards he joined the APT group with a prestigious Newton International Fellowship (7% acceptance rate), where he later became a Lecturer and theme leader on Computer Architecture. Researchwise, he has a long list of publications (60+) on different aspects of interconnects, parallel and distributed systems, computer architecture, performance evaluation, and characterization of applications' behaviour. Javier led the workpackage on interconnects of the ExaNeSt project, which aimed to design and prototype the interconnection infrastructure for future exascale-capable computing systems. Contact him at javier.navaridas@manchester.ac.uk.

Mikel Luján received the PhD degree in computer science from the University of Manchester. He is the Arm/RAEng Research Chair in Computer Systems and is a Royal Society Wolfson Fellow in the Department of Computer Science, University of Manchester. His research interests include energy efficient computing, managed runtime environments and virtualization, dynamic binary instrumentation tools, and application-specific systems and optimizations. Contact him at mikel.lujan@manchester.ac.uk.

John Goodacre is currently a Professor with the APT Group and the Director of Systems and Technology, ARM. He is an expert in big data and high-performance computing technologies (Euroserver, RethinkBIG, HIPEAC 2013 Roadmap, EuroEXA) and one of the main drivers in the path to European Exascale. Contact him at john.goodacre@manchester.ac.uk.
Theme Article: Hot Interconnects 26

Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects

Ammar Ahmad Awan, Arpan Jain, Ching-Hsiang Chu, Hari Subramoni, and Dhabaleswar K. Panda
The Ohio State University

Abstract—Heterogeneous high-performance computing systems with GPUs are equipped


with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink.
However, little exists in the literature that captures the performance impact of these
interconnects on distributed deep learning (DL). In this article, we choose Horovod, a
distributed training middleware, to analyze and profile various DNN training workloads using
TensorFlow and PyTorch in addition to standard MPI microbenchmarks. We use a wide
variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and
various interconnects to analyze the following metrics: 1) message-size with Horovod’s
tensor-fusion; 2) message-size without tensor-fusion; 3) number of MPI/NCCL calls; and 4)
time taken by each MPI/NCCL call. We observed extreme performance variations for
non-power-of-two message sizes on different platforms. To address this, we design a
message-padding scheme for Horovod, illustrate significantly smoother allreduce latency
profiles, and report cases where we observed improvement for end-to-end training.

Digital Object Identifier 10.1109/MM.2019.2949986
Date of publication 30 October 2019; date of current version 14 January 2020.
HIGH-PERFORMANCE COMPUTING (HPC) clusters are increasingly getting equipped with fast interconnects like Mellanox InfiniBand and Intel Omni-Path to offer low-latency and high-bandwidth communication across multiple compute nodes. These internode network interfaces are complementary to on-node interconnects like PCIe. On the other hand, we are witnessing an increase in the adoption of heterogeneous HPC systems that are equipped with fast GPUs like the NVIDIA Volta V100, which have led to the rise of new and proprietary on-node interconnects like NVIDIA's NVLink. The primary goal of these interconnects is to increase the speed of data movement and keep GPUs' computation units busy, thereby avoiding stalls and getting the best performance. This is further propelled by emerging application areas like distributed deep learning (DL) that require intra- and internode communication for gradient aggregation during deep neural network (DNN) training. These applications require high-bandwidth and low-latency communication to scale efficiently.1
To this end, several studies have been published that deal with the performance of different interconnects using microbenchmarks and scientific applications.2–5 However, little exists in the literature that addresses the performance impact of modern interconnects on emerging application areas like distributed DL. High-level performance studies only focus on end-to-end performance6 but do not offer insights into low-level performance characteristics. In an early version of this article,7 we also identified a major problem for communication libraries like NCCL and MPI, i.e., non-power-of-two message sizes being used for gradient communication that lead to performance instabilities. Furthermore, existing studies are using a black-box approach, which only provides a limited and isolated view of performance. Thus, it is challenging to correlate microbenchmark performance (e.g., osu_allreduce from the OSU Micro-Benchmarks (OMB)) with end-to-end performance (e.g., training a TensorFlow or PyTorch model).
To address these problems, we focus on the following broad challenge: How to effectively profile the performance of DL workloads on various high-performance interconnects and correlate low-level performance statistics to high-level application-specific metrics.
Along these lines, we investigate the following concrete challenges in this article.
• How do we correlate low-level performance statistics such as the latency for Allreduce with high-level metrics such as images per second for TensorFlow and PyTorch?
• Can we design an application-level profiling infrastructure that allows us to obtain such characteristics for mainstream communication middleware like MPI and NCCL for CPU- and GPU-based platforms?
• How do we improve performance of Allreduce for non-power-of-two message sizes that are being used due to techniques such as Horovod's tensor fusion?

Contributions
To the best of our knowledge, this is the first study that addresses performance analysis of a relatively new application area, i.e., distributed DL on HPC systems, with an emphasis on the impact of various high-performance interconnects. We make the following key contributions in this article.
• We propose and design application-level profiling infrastructure called hvprof that can capture characteristics of DL workloads for various underlying interconnects and communication libraries.
• We analyze low-level performance characteristics of models like ResNet-50/101 and Inception-v3/4 on CPU- and GPU-based HPC systems.
• We identify inconsistent performance when non-power-of-two message sizes are used and propose a padding scheme to improve the performance for such cases (see the sketch after this list).
• We provide end-to-end application performance trends for TensorFlow and PyTorch and correlate them with low-level statistics to guide designers of MPI/NCCL libraries as well as DL frameworks.

CHALLENGES IN PROFILING DL WORKLOADS ON HPC SYSTEMS
We highlight the key challenges that we have addressed in this article.

Profiling DL Workloads on Various Interconnects
Current general-purpose profiling tools such as MPI Profiling (mpiP) and the NVIDIA Profiler (nvprof) do not support profiling communication of new libraries like NCCL. We note that nvprof profiles NCCL calls but considers them as computation kernels, which does not help understand the communication characteristics of a ncclAllreduce operation. mpiP offers profiling data on MPI operations like point-to-point and collective operations, whereas nvprof profiles CUDA APIs, kernel activities, and some rudimentary information about MPI operations via its annotation feature, i.e., --annotate-mpi, by utilizing the NVIDIA Tools Extensions (NVTX). Thus, there is a need for a more white-box approach to profiling different activities at the training-middleware (e.g., Horovod) level. The biggest challenge in profiling GPU-based training with existing tools is that they rely on an approach called interception, which uses LD_PRELOAD for MPI calls. This is not supported by frameworks like TensorFlow that use low-level memory-management infrastructure, and such runs thus cannot be profiled using mpiP and nvprof --annotate-mpi. We observed various failures, including errors like CUDA_INVALID_CONTEXT, for such cases.

Profile Workloads in a DL-Framework-Agnostic Fashion
Horovod, a distributed training middleware that supports distributed training for TensorFlow, PyTorch, MXNet, and Keras, is gaining considerable attention from the ML as well as HPC communities. This is evident from the emerging distribution-strategy APIs being developed by the TensorFlow and PyTorch teams, both of which are very similar to Horovod's APIs. The main challenges are as follows. 1) How do we profile different kinds of communication calls used by Horovod? 2) Do we need to separate the statistics gathered due to a DL framework's call to the allreduce operation from what is required by Horovod's multithreaded communication and synchronization engine? 3) How do we differentiate between different MPI and NCCL calls?

Correlate End-to-End Application-Level Performance and Low-Level Profiling Data
The broad goal of our profiling and performance analysis study is to identify bottlenecks and improve performance for an emerging and important class of workloads. Our analysis needs to be helpful for two different communities: 1) DL software developers and 2) communication runtime designers. Both should be able to gain meaningful data so they can improve performance at their respective levels. For instance, images/second is a representative metric for DL developers to understand the performance of their DNN architecture but of little use for an MPI library designer. On the other hand, the latency of an MPI_Allreduce operation can be used by an MPI library designer to optimize collective communication algorithms and design schemes but may not be very helpful for a DNN designer. The profiling analysis thus needs to cater to both communities with diverse requirements.

PROPOSED DESIGN
Profiling and Characterization
hvprof is our proposed infrastructure built for Horovod that enables users to perform in-depth profiling of various workloads for several popular toolkits like TensorFlow, PyTorch, MXNet, and Keras. It provides multiple types of metrics that are valuable for both ML/DL developers and designers of an MPI library. Because we implement hvprof at the Horovod level, we do not need interception/LD_PRELOAD. We maintain various profiling variables using the Horovod Engine's global state. The following concrete statistics are gathered through function calls to MPI and NCCL primitives: 1) message size; 2) message count; 3) latency of a single communication operation; and 4) aggregate latency. These statistics are then compared to the end-to-end performance (images/second) reported by application codes like tf_cnn_benchmarks. An overview of hvprof and where it is positioned in the overall hierarchy is provided in Figure 1. We broadly divide operations into two types: 1) MPI/NCCL operations used by the DL framework, and 2) MPI operations used by the Horovod communication engine and response cache.
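To make the kind of bookkeeping described above concrete, the following is a minimal sketch of hvprof-style statistics gathering written with mpi4py for readability; the actual hvprof implementation lives inside Horovod's C++ engine, and the profiled_allreduce wrapper and stats table below are illustrative names of our own, not part of hvprof.

    import time
    from collections import defaultdict

    import numpy as np
    from mpi4py import MPI

    # Per-operation counters: call count, bytes moved, per-call and aggregate latency.
    stats = defaultdict(lambda: {"count": 0, "bytes": 0, "total_s": 0.0, "latencies": []})

    def profiled_allreduce(sendbuf, comm=MPI.COMM_WORLD, op=MPI.SUM, tag="MPI_Allreduce"):
        """Allreduce wrapper that records message size, count, and latency per call."""
        recvbuf = np.empty_like(sendbuf)
        start = time.perf_counter()
        comm.Allreduce(sendbuf, recvbuf, op=op)
        elapsed = time.perf_counter() - start

        entry = stats[tag]
        entry["count"] += 1
        entry["bytes"] += sendbuf.nbytes
        entry["total_s"] += elapsed
        entry["latencies"].append(elapsed)
        return recvbuf

    if __name__ == "__main__":
        grad = np.random.rand(1 << 20).astype(np.float32)  # a 4-MiB "gradient" tensor
        profiled_allreduce(grad)
        for name, e in stats.items():
            print(name, e["count"], "calls,", e["bytes"], "bytes,", e["total_s"], "s total")

Run under mpirun, such a wrapper yields exactly the four kinds of statistics listed above, which can then be lined up against the images/second reported by the training script.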


Figure 1. Overview of the proposed profiling infrastructure.

Proposed Padding Design
Based on the characterization results presented earlier by Awan et al.,7 we observed that message sizes that are not a power of two (e.g., 321) cause inconsistent performance behavior on all interconnects. To address this, we propose a simple padding scheme that finds the next power-of-two size for a given tensor and appends it with additional dummy bytes. Thus, Horovod's fusion buffer becomes an exact power-of-two-sized buffer. We call MPI_Allreduce/ncclAllreduce on this new padded fusion buffer instead of the original buffer. The Allreduce latency profiles presented in the "Impact of Padding Design" section show that this simple padding scheme can significantly reduce the performance instabilities. The main reason for poor non-power-of-two performance is that most MPI libraries implement collective communication algorithms optimized for power-of-two message sizes.

PERFORMANCE ANALYSIS
We provide a brief overview of the evaluation platforms and metrics followed by comprehensive sections on 1) end-to-end application performance; 2) low-level profiling analysis; and 3) the impact of the proposed padding design.

Evaluation Platform
We utilized two well-known HPC clusters: 1) the Skylake partition of TACC Stampede2, equipped with the Intel Omni-Path interconnect, and 2) Ohio Supercomputing Center (OSC) Pitzer, equipped with Mellanox InfiniBand. For GPU-based training, we evaluated and profiled DL workloads on multiple Volta V100 GPUs connected using NVLink and IBM X-Bus within a node and InfiniBand across nodes.
We used the following communication libraries: Intel MPI (2018), MVAPICH2 (2.3.1), MVAPICH2-GDR (2.3.1), NCCL2 (2.4), Spectrum MPI (10.3), and Open MPI (4.0.1) with UCX (1.6.0).

Metrics
The main metric for end-to-end application performance is images/second (img/sec) for synchronous distributed training using Horovod for TensorFlow and PyTorch models. The low-level metric we have used is latency (time of each call) for the Allreduce operation. Message count and size have already been discussed in the earlier version of this article7 and have been omitted here.

Figure 2. TensorFlow: Multinode training (FE and FD) on Stampede2 and Pitzer.

End-to-End Performance
CPU Platforms: An overview of end-to-end performance for ResNet-101 and Inception-v4 on
Pitzer (InfiniBand) and Stampede2 (Omni-Path) is shown in Figure 2. We use the Intel-optimized version of TensorFlow (v1.13) on all Intel CPU platforms to take advantage of MKL optimizations. For Pitzer, enabling tensor fusion (FE) results in better performance for both ResNet-101 and Inception-v4. On the other hand, for Stampede2, we observed that disabling fusion (FD) gives slightly better (up to 5%) performance for a higher number of nodes. The behavior of ResNet-101 and Inception-v4 is slightly different when tensor fusion is enabled or disabled on Stampede2.

Figure 3. End-to-end performance comparison for TensorFlow (TF) and PyTorch (PT) on Stampede2: FE, FD, and fusion enabled with padding (FE-Padding).

PyTorch Versus TensorFlow With Padding: We now present the experiments performed to evaluate PyTorch as well as the new padding designs in hvprof. Figure 3 shows how TensorFlow and PyTorch perform under different settings like fusion enabled (FE), fusion disabled (FD), etc., and how padding can impact the end-to-end performance for both cases. The data is presented only for the Stampede2 cluster.

Figure 4. Allreduce latency for Inception-v4 training on Stampede2 (Omni-Path) and Pitzer (InfiniBand) using Intel MPI and MVAPICH2 with tensor fusion enabled and disabled.

GPU Platforms: GPU-based end-to-end performance behavior was evaluated with two different configurations:
1) GPUs connected with NVLink only (G1);
2) GPUs connected with NVLink (same socket), X-Bus (across sockets), and InfiniBand (across nodes) (G2).

For G1, we trained ResNet-50 using different communication libraries. The NVLink column of Table 1 shows the end-to-end numbers. The lower-level allreduce latency for all these is discussed further in the "Low-Level Profiling Analysis" section. We see that MVAPICH2 and NCCL2 are much better compared to the others. Spectrum MPI is the slowest of all the communication libraries. We note that this high-level img/sec trend matches the low-level allreduce statistics collected using hvprof, as shown in Figure 5.

Table 1. End-to-end performance on GPUs (only the best cases are shown for all configurations).

Comm. Library | NVLink (3 GPUs) | NVLink-XBus-InfiniBand (12 GPUs)
MV2-GDR       | 1129            | 4274
NCCL          | 1129            | 4388
Spectrum MPI  | 885             | 2772
Open MPI      | 1013            | 1975

For G2, we trained ResNet-50 on two nodes in a 6-GPU/node configuration where three GPUs are on the same socket and are only connected using NVLinks, whereas the other three GPUs within the node are across sockets and communication takes place using IBM's X-Bus interface between two POWER9 CPUs. Traffic across nodes goes over a socket-direct InfiniBand EDR HCA. We report the default numbers as well as numbers with the proposed padding design. The trends for G2 do not match G1. For example, under G2, we observed 4313 img/s with padding for MVAPICH2-GDR whereas only 4274 img/s without padding. On the other hand, NCCL2 offered 4362 img/s with padding, which is lower than 4388 img/s without padding. Spectrum MPI seems to perform better than Open MPI for this case and gives 2772 img/s without padding whereas only 2522 img/s with padding. Open MPI performed the worst and gives 1975 img/s without padding and only 1835 img/s with the padding designs.

Figure 5. Allreduce latency for ResNet-50 training on three NVLink-connected V100 GPUs (without padding).

Low-Level Profiling Analysis
We discuss several low-level profiles in this section. We profiled ResNet-101 and Inception-v4 training on Stampede2 and Pitzer using Intel MPI and MVAPICH2. We broadly divide the results into two combinations: 1) tensor fusion enabled (FE); and 2) tensor fusion disabled (FD). Both are shown in Figure 4.
Horovod's tensor fusion design causes the use of multiple non-power-of-two message sizes, which results in performance variations (spikes) for both Intel MPI and MVAPICH2 on both clusters. On the other hand, disabling fusion restricts the message sizes to vary significantly less compared to fusion, which results in fewer spikes for both MPI libraries and on both clusters.
Figure 5 provides a three-GPU profiling analysis for ResNet-50 training on Volta V100 GPUs. We have two main cases: 1) GPUs connected with NVLink (on POWER9); and 2) GPUs connected with PCIe (on Intel). Broadly, we observed higher latency and more spikes for Open MPI and Spectrum MPI compared to MVAPICH2-GDR and NCCL.

Figure 6. Allreduce latency for ResNet-50 training on two nodes in a 6-GPU/node configuration (with padding).

Impact of Padding Design
To further understand the behavior of GPU-based training, we performed the ResNet-50 training experiment on two nodes with 6 GPUs/node and evaluated the impact of the padding designs. As shown in Figure 6, we observe new trends for all four communication libraries. MVAPICH2-GDR and NCCL2 offer much smoother latency curves compared to Spectrum MPI and Open MPI. Latency for large messages for both Spectrum MPI and Open MPI is significantly worse compared to MVAPICH2-GDR and NCCL2.
On the CPU side, we performed new experiments on eight nodes for both Pitzer and Stampede2 to study the impact of the padding designs. Figure 7 shows the impact of padding on TensorFlow as well as PyTorch. We note that PyTorch is much slower compared to TensorFlow, as reported earlier in the "End-to-End Performance" section. We can see from Figure 7 that the allreduce latency curves are not as inconsistent for PyTorch as they are for TensorFlow. This result can be confusing if the end-to-end number is not considered, and one could wrongly conclude that TensorFlow's behavior is inconsistent.
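As a concrete illustration of the scheme from the "Proposed Padding Design" section, the sketch below pads a fused gradient buffer with dummy zero elements up to the next power-of-two length before the allreduce and strips them afterward. It is a simplified stand-in for the actual fusion-buffer change inside Horovod; the helper names are ours, and the use of zeros as padding assumes a sum reduction.

    import numpy as np
    from mpi4py import MPI

    def next_power_of_two(n: int) -> int:
        """Smallest power of two >= n (n >= 1)."""
        return 1 << (n - 1).bit_length()

    def padded_allreduce(fused, comm=MPI.COMM_WORLD):
        """Allreduce a fused buffer padded with zeros to a power-of-two length.

        The zeros do not change the sum-reduced values of the real elements; the
        cost is the extra bytes moved over the interconnect, which is the tradeoff
        noted in the conclusion of this article.
        """
        n = fused.size
        padded_n = next_power_of_two(n)
        sendbuf = np.zeros(padded_n, dtype=fused.dtype)
        sendbuf[:n] = fused
        recvbuf = np.empty_like(sendbuf)
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
        return recvbuf[:n]  # drop the padding before handing gradients back

    if __name__ == "__main__":
        fused_buffer = np.random.rand(321).astype(np.float32)  # non-power-of-two size
        reduced = padded_allreduce(fused_buffer)
        print(reduced.shape)  # (321,)

For the 321-element example above, the allreduce is actually issued on 512 elements, which is why the smoother latency curves come at the price of additional traffic.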

RELATED WORK
Most related studies have focused on large-scale evaluation and characterization of DL workloads on GPUs.4,8 Some papers have focused on CPU-based platforms as well.9,10 Large-scale training experiments for Intel CPUs are discussed by Codreanu et al.11 The best-practices guide12 published by Intel provides data on multinode training with Intel-optimized TensorFlow. However, all these papers only provide end-to-end performance behavior for DL workloads. To the best of our knowledge, this article is the first study that provides not only end-to-end performance but also offers in-depth data about low-level metrics like the number of and time taken by allreduce calls used for synchronization and gradient communication. Furthermore, none of the published works explore the performance impact of interconnects like InfiniBand and Omni-Path on distributed training.

Figure 7. PyTorch allreduce profiler for eight nodes, 48 ppn, and ResNet-50.

CONCLUSION
DL workloads are becoming increasingly important for HPC clusters equipped with cutting-edge CPUs and GPUs as well as interconnects like InfiniBand and Omni-Path. In this article, we highlighted challenges in evaluating and profiling DL workloads and how to overcome these through the proposed profiling infrastructure called hvprof. We offer performance characterization for various combinations of DNNs like ResNet(s) and Inception variants on CPU and GPU systems with interconnects like InfiniBand, Omni-Path, X-Bus, and NVLink. We highlight that MPI as well as NCCL perform well for power-of-two message sizes but provide inconsistent performance for non-power-of-two message sizes. We also attempt to mitigate this by designing and evaluating a padding design for Horovod. Even with the padding designs, we believe that there is an opportunity to redesign Horovod and/or communication runtimes to further optimize performance for these non-power-of-two message sizes. In the future, we plan to investigate new designs that avoid the main problem with the current padding design, i.e., adding more data due to padding and thereby increasing the traffic over the interconnects.

ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation (NSF) under Grants #CNS-1513120, #ACI-1450440, #CCF-1565414, and #ACI-1664137.

REFERENCES
1. A. Sergeev and M. Del Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018. [Online]. Available: http://arxiv.org/abs/1802.05799
2. S. A. Mojumder et al., "Profiling DNN workloads on a Volta-based DGX-1 system," in Proc. IEEE Int. Symp. Workload Characterization, Sep. 2018, pp. 122-133.
3. C. Pearson, I.-H. Chung, Z. Sura, W.-M. Hwu, and J. Xiong, "NUMA-aware data-transfer measurements for POWER/NVLink multi-GPU systems," in Proc. Int. Conf. High Perform. Comput., 2018, pp. 448-454.
4. A. A. Awan, H. Subramoni, and D. K. Panda, "An in-depth performance characterization of CPU- and GPU-based DNN training on modern architectures," in Proc. Mach. Learn. HPC Environ., 2017, pp. 8:1-8:8.


5. K. S. Khorassani, C.-H. Chu, H. Subramoni, and D. K. Panda, "Performance evaluation of MPI libraries on GPU-enabled OpenPOWER architectures: Early experiences," in Proc. Int. Conf. High Perform. Comput., 2018.
6. Baidu Research, "Benchmarking deep learning operations on different hardware," 2016. Accessed: Oct. 31, 2019. [Online]. Available: https://github.com/baidu-research/DeepBench
7. A. A. Awan, A. Jain, C.-H. Chu, H. Subramoni, and D. K. Panda, "Communication profiling and characterization of deep learning workloads on clusters with high-performance interconnects," in Proc. 26th Symp. High-Perform. Interconnects, 2019.
8. A. A. Awan, J. Bédorf, C.-H. Chu, H. Subramoni, and D. K. Panda, "Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation," in Proc. 19th Annu. IEEE/ACM Int. Symp. Cluster, Cloud, Grid Comput., 2019, pp. 498-507.
9. S. Shi, Q. Wang, P. Xu, and X. Chu, "Benchmarking state-of-the-art deep learning software tools," in Proc. 7th Int. Conf. Cloud Comput. Big Data, Nov. 2016, pp. 99-104.
10. A. Viebke and S. Pllana, "The potential of the Intel (R) Xeon Phi for supervised deep learning," in Proc. IEEE 17th Int. Conf. High Perform. Comput. Commun., 7th Int. Symp. Cyberspace Safety Secur., 12th Int. Conf. Embedded Softw. Syst., Aug. 2015, pp. 758-765.
11. V. Codreanu, D. Podareanu, and V. Saletore, "Large minibatch training on supercomputers with improved accuracy and reduced time to train," in Proc. IEEE/ACM Mach. Learn. HPC Environ., Nov. 2018, pp. 67-76.
12. A. Bhandare et al., "Best practices for scaling deep learning training and inference with TensorFlow on Intel Xeon processor-based HPC infrastructures," Connectivity Group & AI Products Group, Data Center Group Customer Solutions Technical Enabling, Intel Corporation, Tech. Rep., Jan. 2019.

Ammar Ahmad Awan is currently working toward the PhD degree at The Ohio State University, Columbus, OH, USA. He received the BS and MS degrees in computer science and engineering from the National University of Science and Technology, Islamabad, Pakistan, and Kyung Hee University, Seoul, South Korea, respectively. His current research focus lies at the intersection of high-performance computing libraries and deep learning frameworks. He previously worked on a Java-based message passing interface (MPI) and nested parallelism with OpenMP and MPI for scientific applications. He has authored or coauthored 20 papers in conferences and journals related to these research areas. He actively contributes to various projects like MVAPICH2-GDR (high-performance MPI for GPU clusters), OMB (OSU Micro-Benchmarks), and HiDL (high-performance deep learning). He is the lead author of the OSU-Caffe framework (part of the HiDL project) that allows efficient distributed training of deep neural networks. Contact him at awan.10@osu.edu.

Arpan Jain is currently working toward the PhD degree at The Ohio State University, Columbus, OH, USA, working with Prof. D. K. Panda. He received the BS and MS degrees in information and communication technology from ABV-Indian Institute of Information Technology and Management Gwalior, Gwalior, India. He is currently a Graduate Research Assistant with the Network-Based Computing Laboratory, Columbus, OH, USA. His current research interests lie at the intersection of deep learning and high-performance computing. He is working on parallelization and distribution strategies for large-scale deep neural network (DNN) training. He previously worked on speech analysis, time series modeling, hyperparameter optimization using nature-inspired algorithms for DNN, and object recognition. He has authored or coauthored nine papers in conferences and journals related to these areas. He actively contributes to projects like HiDL (high-performance deep learning) and MVAPICH2-GDR (high-performance MPI for GPU clusters). Contact him at jain.575@osu.edu.

Ching-Hsiang Chu is currently working toward the PhD degree in computer science and engineering at The Ohio State University, Columbus, OH, USA. He received the BS and MS degrees in computer science and information engineering from the National Changhua University of Education, Changhua City, Taiwan, in 2010, and the National Central University, Taoyuan City, Taiwan, in 2012, respectively. His research interests include high-performance computing, GPU communication, and wireless networks. He is a student member of IEEE and ACM. Contact him at chu.368@osu.edu.

Hari Subramoni has been a Research Scientist with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA, since September 2015. He received the PhD degree in computer science from The Ohio State University, Columbus, OH, USA, in 2013. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has authored or coauthored more than 50 papers in international journals and conferences related to these research areas. He is doing research and working on the design and development of the MVAPICH2, MVAPICH2-GDR, and MVAPICH2-X software packages. He is a member of IEEE. Contact him at subramoni.1@osu.edu.

Dhabaleswar K. (DK) Panda is currently a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University, Columbus, OH, USA. He has authored or coauthored more than 450 papers in major journals and international conferences. The MVAPICH2 (high-performance MPI over InfiniBand, iWARP, and RoCE) open-source software package, developed by his research group (http://mvapich.cse.ohio-state.edu), is currently being used by more than 3025 organizations worldwide (in 89 countries). This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade (including the current #3). More than 600,000 downloads of this software have taken place from the project's website alone. He is an IEEE Fellow and a member of ACM. Contact him at panda.2@osu.edu.

Theme Article: Hot Interconnects 26

High-Quality Fault Resiliency in Fat Trees
John Gliksberg
Versailles Saint-Quentin-en-Yvelines University, Atos, Castilla-La Mancha University

Antoine Capra
Atos

Alexandre Louvet
Atos

Pedro Javier García
Castilla-La Mancha University

Devan Sohier
Versailles Saint-Quentin-en-Yvelines University

Abstract—Coupling regular topologies with optimized routing algorithms is key in pushing


the performance of interconnection networks of supercomputers. In this article, we present
Dmodc, a fast deterministic routing algorithm for parallel generalized fat trees (PGFTs),
which minimizes congestion risk even under massive network degradation caused by
equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula
by relying on a fast preprocessing phase. This allows complete rerouting of networks with
tens of thousands of nodes in less than a second. In turn, this greatly helps centralized fabric
management react to faults with high-quality routing tables and has no impact on running
applications in current and future very large scale high-performance computing clusters.

Digital Object Identifier 10.1109/MM.2019.2949978
Date of publication 30 October 2019; date of current version 14 January 2020.

A MAJORITY OF current leading network topologies for high-performance computing (HPC) clusters are fat-tree variants. (The five most powerful clusters of the June 2019 Top500 list1 had fat-tree topologies.) These networks have some form of static routing tables computed by a centralized routing engine and uploaded to all switches. It is sufficient for fat-tree-specific routing algorithms to be minimal to guarantee deadlock-free routing, and the regular nature of their target topology class should simplify load-balancing strategies. Parallel generalized fat trees (PGFTs)2 describe all regular fat trees for which there is at most one downward switch-path from any switch to any node (as shown in Figure 1). In this article, we refer to fat trees as PGFTs. The oblivious routing algorithm for nondegraded PGFTs (Dmodk,2 see the "DMODK" section) uses this property and their connection logic to provide load balance through an arithmetic rule.
Due to the sheer amount of equipment in current and future supercomputers, hardware failures are to be expected3 (especially in optical links4 typically used in higher levels of fat trees) and should not hinder running applications as far
as possible. The fabric manager can react to equipment failures that do not break graph connectivity by uploading updated routing tables. In order to do this, it requires a fault-resilient routing algorithm capable of rapid rerouting. The challenge is to provide these characteristics while maintaining a high-quality static load balance.
Some of the research regarding oblivious fault-resilient routing focuses on techniques that explicitly target degradations to regular fat trees.5,6 There are several re-routing strategies for these techniques. OpenSM's UPDN7 and Ftree8 routing engines can also be applied from scratch to a degraded fat tree. PQFT5 is similar, though it requires a complete list of faults. The combination of Dmodk + Ftrnd_diff9 available in BXI FM10 is applied in an offline/online manner (with an iterative list of network changes and an up-to-date view of the network), the goal being fast reaction to faults with minimal routing changes. Fabriscale11 also provides fast centralized re-routing of fat trees by precomputing alternative routes. A short summary of the limits of the existing approaches is provided in a previous conference paper.12
The approach that we propose to meet that challenge is to apply the closed-form arithmetic formula of Dmodk while relaxing the topological constraint. For that purpose, we compute shortest paths explicitly rather than relying on an addressing scheme, and we balance load according to locally propagated information rather than relying on level-wide constants. These two goals are addressed together during preprocessing and will be the focus of this article, whose main contribution is the detailed algorithm description in the "Dmodc Description" section.

Figure 1. PGFT(3; 2.2.3; 1.2.2; 1.2.1) with leaf switches shown in grey.

DMODK
The Dmodk routing algorithm and corresponding PGFT topology are described in detail in the report by Zahavi.2 The algorithm relies on a criterion (not shown here) to determine whether a destination d must be routed within the down ports and, if so, which one. Otherwise, the following arithmetic formula defines the up port (with index p) to select:

$$p = \left\lfloor d \Big/ \prod_{k=1}^{l} w_k \right\rfloor \bmod (w_{l+1}\, p_{l+1}).$$

The level-wide constants (or arities) $w_l$ and $p_l$, respectively, denote the numbers of uplinks and interlinks of all switches at level $l$. With this formula, each destination's routes are coalesced as early as possible, and routes to different destinations are spread out as much as possible, thus minimizing collisions between independent traffic. These closed-form steps rely on a given organization of addresses of switches and indexing of their ports. Dmodk is a very-low-complexity and perfectly parallel routing algorithm for PGFTs, but it is not applicable to degraded PGFTs or irregular fat trees.

DMODC DESCRIPTION
The idea behind the fault-resilient algorithm that we propose is to rely on local information while using the same closed-form arithmetic formula as Dmodk. The c in Dmodc refers to the neighboring switches explicitly determined to be closer to the destination, among which paths are chosen. The aim is fast centralized computation of routing tables for degraded PGFTs, providing optimal or well-balanced deterministic routes even under heavy fabric degradation. The algorithm begins with a preprocessing phase (that can be multithreaded) followed by a parallel computation phase. Links are assumed to be bidirectional; notations used in the expressions hereafter are defined in Table 1.

Basic Preprocessing
For ranking, levels and link directions are determined according to leaf switches being equivalent to the lowest level. Groups of ports linked to the same switch are prepared and sorted by globally unique identifier (GUID) to help with same-destination route coalescing.
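For reference, the closed-form Dmodk rule given in the DMODK section above can be transcribed directly into code. The sketch below is our own illustration under the assumption that the level-wide arities are supplied as plain Python lists and that the destination identifier d is an integer; it is not the BXI or OpenSM implementation, and the example arity values are arbitrary.

    from math import prod

    def dmodk_up_port(d: int, l: int, w: list, p: list) -> int:
        """Up-port index chosen by Dmodk at a level-l switch for destination d.

        Implements p = floor(d / (w_1 * ... * w_l)) mod (w_{l+1} * p_{l+1}),
        where the 0-based lists w and p hold the level-(k+1) arities.
        """
        divider = prod(w[:l])          # product of uplink arities of the l lower levels
        return (d // divider) % (w[l] * p[l])

    # Illustrative arities for a 3-level fat tree; destination 20 seen from level 1.
    print(dmodk_up_port(20, 1, [2, 2, 3], [1, 2, 2]))

The divider term is exactly the quantity that Dmodc later reconstructs from local information (the "Divider" subsection) instead of taking it from level-wide constants.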
help with same-destination route coalescing.


Table 1. Notations used in expressions.

S: Set of switches
L: Set of leaf switches (L ⊆ S)
N: Set of nodes
E: Set of edges
n̄: (Only) leaf switch connected to node n (n̄ ∈ L)
↓, ↑: Denote down and up links, respectively, according to rank
G_s: Ordered list of port groups of switch s
V_g: Switch connected to port group g
#: Cardinality

Dmodc-specific notations:
c_{s,l}: Cost of switch s to leaf switch l
P_s: Divider of switch s

Cost
We define the cost c_{s,l} of a switch s to a leaf switch l to be the minimum number of hops between each other under up-down restrictions according to rank, as defined in Procedure 1 and illustrated in Figure 2. This later allows us to determine valid paths by exploring neighboring switches and comparing costs. That exploration could be done here to prepare sets of output ports, but it is better to leave it for later since each set is only used once (see the "Routes Computation" section). Other all-pairs shortest-paths methods could be substituted here.

Procedure 1. Compute costs and dividers
for all s ∈ S do
    for all l ∈ L do
        c_{s,l} ← ∞
    P_s ← 1
for all l ∈ L do
    c_{l,l} ← 0
for all s ∈ S sorted in ascending rank order do
    p ← P_s · #{s′ ↑ s}
    for all s′ ↑ s do
        for all l ∈ L | c_{s,l} + 1 < c_{s′,l} do
            c_{s′,l} ← c_{s,l} + 1
    for all s′ ↑ s | P_{s′} < p do
        P_{s′} ← p
for all s ∉ L sorted in descending rank order do
    for all s′ ↓ s do
        for all l ∈ L | c_{s,l} + 1 < c_{s′,l} do
            c_{s′,l} ← c_{s,l} + 1

Thanks to the up-down restriction, the complexity of this procedure is in O(#E · #L). This restriction is only for efficiency; it does not enforce deadlock freedom. Some fat-tree-like topologies would result in up-down-up-down paths (if such shortcuts appear in neighboring switches), since path selection does not distinguish up and down neighbors. Avoiding this requires a slightly different method: an extra integer must be stored, similar to cost but only for downpaths. More details can be found in the "Routes Computation" section.
In our partially parallel implementation, each worker thread obtains a block of switches to propagate with one barrier per level upward, then downward.
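A compact way to read Procedure 1 is the following Python transcription. It is our own sketch, not the authors' C99 implementation, and it assumes the topology is provided as per-switch neighbor lists up[s] and down[s] together with a rank per switch; leaves holds the leaf switches.

    import math
    from collections import defaultdict

    def compute_costs_and_dividers(switches, leaves, up, down, rank):
        """Cost and divider propagation following Procedure 1 (up pass, then down pass).

        switches: iterable of switch ids; leaves: leaf switches (subset of switches);
        up[s] / down[s]: lists of neighbors above / below s; rank[s]: level of s
        (leaves lowest). Returns (cost, divider), where cost[s][l] is the up-down hop
        count from switch s to leaf switch l.
        """
        cost = {s: defaultdict(lambda: math.inf) for s in switches}
        divider = {s: 1 for s in switches}
        for l in leaves:
            cost[l][l] = 0

        # Upward pass: propagate costs to upswitches and grow dividers by upward arity.
        for s in sorted(switches, key=lambda x: rank[x]):
            p = divider[s] * len(up[s])
            for s2 in up[s]:
                for l in leaves:
                    if cost[s][l] + 1 < cost[s2][l]:
                        cost[s2][l] = cost[s][l] + 1
            for s2 in up[s]:
                if divider[s2] < p:
                    divider[s2] = p

        # Downward pass: propagate costs back down so every switch learns all leaf costs.
        for s in sorted((x for x in switches if x not in leaves),
                        key=lambda x: rank[x], reverse=True):
            for s2 in down[s]:
                for l in leaves:
                    if cost[s][l] + 1 < cost[s2][l]:
                        cost[s2][l] = cost[s][l] + 1

        return cost, divider

The two nested passes make the O(#E · #L) bound stated above visible: every edge is visited once per direction, and each visit touches at most #L cost entries.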

Figure 2. Example sequence of cost propagation steps in a degraded part of a network. Costs to the bottom-right switch are shown in switches. At each propagation step, the updated costs are in grey. Note that in steps 3-5, some propagations are interrupted due to the c_{s,l} + 1 < c_{s′,l} condition in the procedure. They could have been achieved with a simple c_{s′,l} = ∞ condition instead; however, this would have also interrupted the propagation of 2 in step 5. As a result, the long path on the left would not have been avoided. For PGFTs (degraded or not), such cases are actually impossible and the simple condition would suffice; but it would not guarantee shortest up-down paths in fat-tree-like topologies.

Divider
Dmodc is based on the same arithmetic formula as Dmodk. Prior to the modulo operation, it begins with an integer division by the product of #{s′ ∈ S | s′ ↑ s} (the upward arity of s) of switches at each lower level. This value represents the number of consecutive destinations to route through the same port. It is multiplied when going up levels to mirror the number of consecutive choices by switches below before each switch is chosen again. To reflect the actual state of the network (in which switches of the same
level may have different arities), only local information must be considered; in turn, this operation is based on the products of up-to-date counts of upswitches (switches connected above), as defined in Procedure 1. Each downpath corresponds to a potential divider value, and we choose to keep only the maximum (as illustrated in Figure 3). The underlying motivation is to generate the same values as in the nondegraded PGFT, as long as the topological subgroup is not systematically degraded. The complexity of this part of the procedure is in O(#E).

Figure 3. Example sequence of divider propagation steps in a degraded part of a network. Dividers are shown in switches. At each propagation step, the updated dividers are in gray. Note that in step 2, the first upswitch is not updated because p = 2 · 2 ≤ 6. Even though there are multiple degradations in the considered case, all top switches end up with the divider that they would have had in the complete network.

Routes Computation
The deterministic output port p_{s,d} and alternative output ports P_{s,d} of every switch s for every destination d ∈ N (not directly linked to s) are selected with a closed-form formula based on the results previously determined. First, port groups leading closer to d are selected in

$$C_{s,d} \leftarrow \{\, g \in G_s \mid c_{V_g,d} < c_{s,d} \,\} \qquad (1)$$

(without taking ranking into account), setting corresponding alternative output ports in:

$$P_{s,d} \leftarrow \{\, p \in g \mid g \in C_{s,d} \,\}. \qquad (2)$$

Selected port groups C are stored in an array (ordered by GUID of their remote switch), also represented by C: individual groups are accessed with indices i ∈ [0, #C_{s,d}[ using the C_{s,d}[i] notation. From this, the output port group is chosen in

$$g_{s,d} \leftarrow C_{s,d}\!\left[ \left\lfloor \frac{d}{P_s} \right\rfloor \bmod \#C_{s,d} \right] \qquad (3)$$

and the port within that group in:

$$p_{s,d} \leftarrow g_{s,d}\!\left[ \left\lfloor \frac{d}{P_s \cdot \#C_{s,d}} \right\rfloor \bmod \#g_{s,d} \right]. \qquad (4)$$

Routes are computed in a loop over leaves so that C_{s,d} is determined only once for all nodes d connected to the same leaf switch (with P_{s,d} also unchanging for those destinations). Figure 4 illustrates assignments (1)-(4).
The cost variant for up-down restriction described in the "Cost" section requires (1) to compare c values for upswitches and the downpath cost value for downswitches.

Figure 4. Example route computation with s in grey, P_s = 4, and d = 20. Costs to 20 are shown in switches. Indices are ordered from left to right. The top-right group is chosen as g_{s,20} because ⌊20/4⌋ mod 2 = 1, and the right port in g_{s,20} is chosen as p_{s,20} because ⌊20/(4·2)⌋ mod 3 = 2.

RESULTS
The algorithm was implemented in the fabric management software for Atos's Bull eXascale Interconnect (BXI). The same code has been used for validation, simulation, and in production.

Validity
Routing is valid for degraded PGFTs if and only if the cost of every leaf switch to every other leaf switch is finite: this reflects every node pair having an up-down path. Our implementation includes a pass through all leaf switch pairs to verify this condition. The up-down path
restriction is sufficient to guarantee deadlock-freedom within degraded PGFTs.6

Runtime
Our C99 implementation had computation of cost, divider, and routes spread over POSIX threads fetching work with a switch-level granularity. Figure 5 reports the complete algorithm execution time alongside OpenSM (version 3.3.21) routing times (measured by adding timers in the source code) running on the same machine. For clusters ranging up to many tens of thousands of nodes, Dmodc provides fast enough rerouting for a centralized fabric manager to react to faults before applications are interrupted.

Figure 5. Algorithm runtime on a 2.50-GHz Intel Xeon E5-2680 v3 for real-life fat trees of varying sizes (in log-log scale; lower is better).

Quality
The routing algorithm was tested for quality by generating randomly degraded networks, computing corresponding routing tables, and then determining maximum congestion risk for multiple communication patterns. This study is available in a previously published extended abstract.12 The results are comparable or better than the other available algorithms across the studied range of degradations.

CONCLUSION
The simulation results in the "Results" section show that Dmodc provides high-quality centralized fault-resilient routing for PGFTs at a fraction of the runtime of existing algorithms, without relying on partial re-routing. Dmodc is also applicable to fat-tree-like topologies (as shown in Figure 2) but with a lower quality load balancing. As defined here, no effort has been made to minimize the size of updates to be uploaded to switches throughout the fabric. This algorithm is implemented inside BXI FM10 and has been successfully deployed to an 8490-node PGFT production network in which it helps provide fault-resiliency even when faced with thousands of simultaneous changes.

ACKNOWLEDGMENTS
This research has been undertaken under a cooperation between CEA and Atos, with the goal of codesigning extreme computing solutions. This work was supported in part by a grant of Programme des Investissements d'Avenir; and in part jointly by the Spanish Ministry of Science, Innovation and Universities under the project RTI2018-098156-B-C52 and by JCCM under project SBPLY/17/180501/000498. BXI development was also part of ELCI, the French FSN (Fond pour la Société Numérique) cooperative project that associates academic and industrial partners to design and provide software components for new generations of HPC datacenters.

REFERENCES
1. E. Strohmaier, J. Dongarra, H. Simon, M. Meuer, and H. Meuer, "Top500 1993-2018," 2019. [Online]. Available: https://www.top500.org
2. E. Zahavi, "D-Mod-K routing providing non-blocking traffic for shift permutations on real life fat trees," Irwin and Joan Jacobs Center for Communication and Information Technology, Technion, Tech. Rep. 776, 2010.
3. J. Domke, T. Hoefler, and S. Matsuoka, "Fail-in-place network design: Interaction between topology, routing algorithm and failures," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2014, pp. 597-608.
4. T. Connors, T. Groves, T. Quan, and S. Hemmert, "Simulation framework for studying optical cable failures in dragonfly topologies," in Proc. Int. Parallel Distrib. Process. Symp., 2019, pp. 859-864.
5. E. Zahavi, I. Keslassy, and A. Kolodny, "Quasi fat trees for HPC clouds and their fault-resilient closed-form routing," in Proc. 22nd Annu. Symp. High-Perform. Interconnects, 2014, pp. 41-48.
6. J.-N. Quintin and P. Vignéras, "Transitively deadlock-free routing algorithms," in Proc. 2nd Int. Workshop High-Perform. Interconnection Netw. Exascale Big-Data Era, 2016, pp. 16-24.
7. OpenSM, "Current OpenSM routing §UPDN routing algorithm," 2007. [Online]. Available: https://github.com/linux-rdma/opensm
8. E. Zahavi, G. Johnson, D. J. Kerbyson, and M. Lang, "Optimized InfiniBand fat-tree routing for shift all-to-all communication patterns," Concurrency and Computation: Practice and Experience. Hoboken, NJ, USA: Wiley, 2010.
9. P. Vignéras and J.-N. Quintin, "The BXI routing architecture for exascale supercomputer," J. Supercomput., vol. 72, no. 12, pp. 4418-4437, 2016.
10. "Bull exascale interconnect," 2019. [Online]. Available: https://atos.net/en/products/high-performance-computing-hpc/bxi-bull-exascale-interconnect
11. J. C. Villanueva, T. Skeie, and S.-A. Reinemo, "Routing and fault-tolerance capabilities of the Fabriscale FM compared to OpenSM," Fabriscale, Tech. Rep., Jul. 2015. [Online]. Available: http://fabriscale.com/wp-content/uploads/2015/07/whitepaper_isc2015.pdf
12. J. Gliksberg, A. C. Capra, A. Louvet, P. J. García, and D. Sohier, "High-quality fault-resiliency in fat-tree networks (extended abstract)," in Proc. 26th Symp. High-Perform. Interconnects, 2019.

John Gliksberg is currently a PhD student with Versailles Saint-Quentin-en-Yvelines University, Versailles, France, in an international cotutelle with Castilla-La Mancha University (UCLM), Spain, and in an industrial thesis contract with Atos, France. His research interests include routing algorithms for HPC interconnects. He has developed several routing algorithms used in BXI-based clusters. He received the BSc degree in mathematics and physics and the MSc degree in computer science for HPC. Contact him at john.gliksberg@uvsq.fr.

Antoine Capra is currently a Lead Technical Developer in two BXI projects with Atos, Paris, France. His thesis work focused on virtualization in an HPC context. His research interests include routing algorithms for HPC interconnects and optimizations to the MPI software layer. He has contributed to several routing algorithms used in BXI-based clusters. He received the PhD degree in computer science. Contact him at antoine.capra@atos.net.

Alexandre Louvet is currently a Lead Technical Developer in multiple BXI projects with Atos, Paris, France. His research interests include storage, algorithmic, and low-level optimizations for HPC. He has re-architected the BXI fabric management software, developed several routing algorithms used in BXI-based clusters, and improved numerous aspects of BXI's stack. Contact him at alexandre.louvet@atos.net.

Pedro Javier García is currently an Associate Professor of Computer Architecture and Technology with Castilla-La Mancha University (UCLM), Albacete, Spain. His research focuses on high-performance interconnection networks. He has authored or coauthored around 60 refereed papers in journals and conferences. He has guided two doctoral theses. He has coordinated three research projects funded by the Spanish and Castilla-La Mancha Governments. He has coordinated four R&D agreements between UCLM and different companies. He received the PhD degree in computer science. Contact him at pedrojavier.garcia@uclm.es.

Devan Sohier is currently a Full Professor with Versailles Saint-Quentin-en-Yvelines University (UVSQ), Versailles, France, within the LI-PaRAD lab. He has previously been an Associate Professor with the University of Reims and UVSQ. His primary research interest is distributed algorithms (in particular probabilistic distributed algorithms in stochastic environments), with applications to HPC. He has advised three PhD students, and is currently advising two others. He received the PhD degree in computer science. Contact him at devan.sohier@uvsq.fr.

Theme Article: Hot Interconnects 26

A High-Throughput Network Processor Architecture for Latency-Critical Applications
Sourav Roy, Arvind Kaushik,
Rajkumar Agrawal, Joseph Gergen,
Wim Rouwet, and John Arends
NXP Semiconductors

Abstract—This article presents the recent advancements on the Advanced IO Processor


(AIOP), a network processor architecture designed by NXP Semiconductors. The AIOP is a
multicore accelerated computing architecture where each core is equipped with
dedicated hardware for rapid task switching on every hardware accelerator call. A
hardware preemption controller snoops on the accelerator completions and sends task
preemption requests to the cores, thus reducing the latency of real-time tasks. A
technique of priority thresholding is used to avoid latency uncertainty on lower priority
tasks and head-of-line blocking. In this way, the AIOP handles the conflicting requirements
of high throughput and low latency for next-generation wireless applications such as WiFi
and 5G. In the presence of frequent preemptions, the throughput reduces by only 3% on the AIOP,
compared to 25% on a similar network processor. Moreover, the absolute throughput and
latency numbers are 2X better. The area and power overhead of adding hardware task-
scheduling and preemption is only about 3%.

Digital Object Identifier 10.1109/MM.2019.2958896
Date of publication 10 December 2019; date of current version 14 January 2020.

AN EFFICIENT WAY to achieve higher performance at lower costs on enterprise access points is to integrate wireless (WiFi 802.11 n/ac/ad/ax and 5G) and wired networking technologies on a single system-on-a-chip. Moreover, programmable architectures, such as software-defined radio, are attractive as they protect the investment against ever-changing specifications of the new technology
standards. Currently, two very different architectures are used for wired and wireless functions due to the inherent difference in their protocols. In WiFi, the short interframe spacing time is only 16 μs for 802.11 n/ac/ax. The 5G standard not only supports very high throughput, but also has the concept of a self-contained slot, where the downlink (DL) data (PDSCH, or Physical Downlink Shared Channel) and the corresponding uplink acknowledgment (UL ACK) all need to complete within one timing slot of 125 μs. This, in turn, drives a medium access control layer to physical layer (MAC-PHY) turnaround time of tens of microseconds. As a result, this real-time processing is typically handled largely in dedicated hardware. On the other hand, 802.3 network packet processing is often handled using programmable processors, since there are no such hard real-time deadlines and the throughput requirements are higher. Due to the conflicting requirements, combining the two domains in a common architecture is a challenge. The main problem is that as the packet throughput is increased, the average number of pending tasks in the network processor cluster increases sharply. The increased congestion in turn increases the latency of the individual tasks by several factors. This is shown in Figure 1(a) using measurements on a multicore network processor (NPU) cluster with a packet forwarding workload. Hence, latency and throughput are often conflicting requirements for the system architect. To achieve real-time response, preemption is the commonly employed technique, where a high-priority task starts execution by interrupting a lower priority task. However, the maximum throughput reduces substantially with frequent preemptions, which are required in wireless protocols. To illustrate this phenomenon, we use the same packet forwarding workload but attach a real-time requirement to one-half of the tasks. Real-time tasks preempt the non-real-time tasks when they are ready to execute. An optimized light-weight context-switching routine saves and restores the critical set of registers. The subsequent reduction in maximum throughput achieved (MAX) is shown in Figure 1(b), beyond which the NPU cluster starts dropping incoming packets. In this article, we show how the recent advancements made on the Advanced IO Processor (AIOP), an NPU architecture designed by NXP Semiconductors, mitigate the conflicting requirements of high throughput and low latency in a unified programmable environment.

Figure 1. Latency versus throughput.

AIOP ARCHITECTURE
The AIOP, as shown in Figure 2, consists of several optimized PowerPC processors1 as well as specialized hardware accelerators targeting common packet-processing operations. Accelerators include high-bandwidth direct memory access (DMA) engines as well as special-purpose data-transmission engines (DTEs) for WiFi and wireless traffic, table lookup units (TLUs), and security (SEC) engines for encryption and decryption. The DTE interacts with signal processing and forward error correction (FEC) engines and can perform special functions like parsing and (de)aggregation of packets. The software model of the AIOP is like any other processor, with the difference that when an accelerator call occurs, the processor switches to a new task. When the accelerator job completes, the original task is ready to execute again. This hides the long latency of the accelerator execution to facilitate high throughput of task completions, as the processor is busy executing
other tasks when the original task is waiting for the accelerator to complete its job. However, this model is not suitable when real-time responses are required, which in turn requires fast preemption support on the platform. To support preemption, the AIOP employs a hardware preemption controller that interrupts low-priority tasks with high-priority real-time tasks. The preemption controller snoops on the completion signal from all the accelerators and records it. It then selects the task with the highest priority to preempt a specific core. Preemption follows the programming order, such that a task that is preempted by a higher priority task resumes execution when the latter relinquishes execution. Note that the AIOP can easily handle frequent preemptions arriving every few nanoseconds. An example task flow is shown in Figure 3. Once the non-real-time task NRT-1 does an accelerator call, the context switches to the next task NRT-2. When the soft real-time task SRT-3 is ready to execute, it preempts NRT-2. Sometime later it is itself preempted by the highest priority hard real-time task HRT-4. Task HRT-4, after it executes the time-critical portion, reduces its priority to become a soft real-time task. Finally, when it terminates (or does an accelerator call), the previously preempted SRT-3 resumes execution. When SRT-3 does the next accelerator call, the scheduler chooses the previously preempted task NRT-2 from the preemption return stack (PRS). In this way, the tasks execute on the core, switching context on an accelerator call, task termination, or preemption.

Figure 2. AIOP block diagram.

Figure 3. Example software task flow.

CORE TASK SCHEDULER
Each multitasking core can support up to 16 active tasks in the core task scheduler (CTS) hardware. There is a global task scheduler at the AIOP level, which allocates a new task to an empty task slot on a core. The next-task scheduling algorithm in the CTS considers the priority, age, and state of the task. The scheduling is performed as per the following priority.

1) If the priority of the task signaled by the preemption controller is higher than the priority of the currently executing task, the CTS performs a context switch to the former task. The executing task is pushed to the PRS, so that execution can return to the preempted task.
2) If the PRS holds a task with priority higher than the currently executing task, the CTS will switch to the task on top of the PRS.
3) If none of the above is true, the CTS employs an optimized algorithm based on the age and priority to select a task out of the ready-to-execute list. Once the currently executing task performs an accelerator call, the context switches to the CTS-selected task.

The CTS is designed to facilitate very fast context switching in hardware. The block diagram of the CTS is shown in Figure 4. An internal SRAM stores the contexts of all the tasks. On receiving a task switch signal, multiple registers are read every core cycle using a wide bus to the background register set. Then, the contents of the working register set and the background register set are swapped. The whole operation takes very little time, typically five cycles. This CTS architecture does not worsen
the critical timing path in the processor, since the working register set is not impacted by additional logic. Moreover, the background register set allows the processor to continue executing the old task until the new context is ready, thus minimizing the performance overhead of the context switch. It is to be noted that it is also possible to dynamically change the priority of a task in software, by writing to a task priority register. An accelerator can also modify the task priority in the completion result bus. The preemption controller records the updated task priority once it snoops the accelerator completion bus. In this way, a task can also relinquish execution to other higher priority tasks, once its hard real-time operations are completed.

Figure 4. Core task scheduler.

THROUGHPUT AND LATENCY CHARACTERISTICS
In this section, we discuss the throughput and latency characteristics of the AIOP. The numbers shown in the figures are normalized. The throughput characteristics, as shown in Figure 5(a), are measured on silicon using a synthetic workload where we modify the number of accelerator calls and the execution latency of an accelerator. The timer accelerator is used for this purpose in the workload. Its execution latency can be precisely modified by programming a different expiry duration. In the core-limited region, the accelerator execution latency is hidden by the multitasking cores. As we increase the average execution latency or the density of the accelerator calls (N = number of accelerator calls per 100 instructions), the throughput starts to decrease after a threshold. Based on these throughput characteristics, the system architect can determine the number of concurrent tasks or cores required to hide an estimated accelerator execution latency of an application.
To obtain the latency characteristics, the above workload is modified to consist of N (e.g., 4) equal groups of tasks, each with a different priority level. Figure 5(b) shows the latency histogram for the different priorities, with P1 being the highest priority task and P4 being the lowest priority task. The mean and variance of the execution latencies of the various groups of tasks increase with decreasing priority, as is required.

Figure 5. Throughput and latency characteristics.

Task Queueing and Priority Thresholding
Though the maximum latency of high-priority tasks is well within acceptable limits, that of the lowest priority task is high. This is because the tasks on a core on the AIOP can be envisioned as a preemptive priority queue, with high-priority tasks preempting lower priority tasks. Hence, the average queue length of lower priority tasks is larger than that of the higher priority tasks. It can also lead to head-of-line blocking, affecting the next higher priority task trying to enter the AIOP. To prevent such a scenario, we introduce the concept of priority thresholding. Once the number of
pending tasks in a priority group crosses a configurable threshold, the scheduling hardware gives highest priority to this group at the next task-switching opportunity. This scheduling strategy is nonpreemptive and continues until the new task can enter the AIOP. Figure 6(a) shows the concept of priority thresholding with the thresholds for the four groups set as 8, 4, 2, and 2. Figure 6(b) shows the latency histogram for the lowest priority tasks (P4) with the threshold set to different values. In this case, as the threshold of P4 is reduced, the maximum latency of P4 tasks is reduced with little impact on P1 tasks.

Figure 6. Priority thresholding.

RELATED WORK
There has been considerable work on optimizing context switches on CPUs, by delaying the preemption until fast context-switch points in the program where the number of live registers is low.2,3 However, this can still delay the context switch by on the order of microseconds. Moreover, these techniques are invisible to the application programmer and are workload specific. Hence, they cannot be used in a general-purpose network processor, which executes different types of code. There are other techniques like flushing,4 which drops the execution of the current task by detecting and exploiting idempotent execution, as in a GPU. However, the condition for idempotent execution is typically not true for networking workloads, which have atomic operations as well as shared data structures across multiple tasks. Finally, NPUs with dedicated packet-processing paths5,6 have used hardware packet schedulers for higher throughput and traffic management. Papaefstathiou et al.6 mention the usage of two sets of registers in the RISC engine to allow concurrent register access by the core and external logic. However, these are not similar to the more generic and high-performance task schedulers used in the AIOP that consider both throughput and latency requirements.

Figure 7. NPU versus AIOP comparison.

RESULTS AND CONCLUSION
Finally, we show a comparison of a 16-core AIOP architecture with respect to a state-of-the-art NPU with 16 cores. For a fair comparison, we use a cycle-accurate simulation framework, where the system configurations of the AIOP and the NPU can be easily matched with each other. The individual cores, except the data memory subsystem
denote the case where the tasks have no priority
(hence no preemption); and the case where they
are divided into real-time (high priority) and non-
real-time (low priority) groups. The absolute
throughput and latency numbers achieved on the
AIOP are 2X better than the NPU. The relative
reduction in maximum throughput in presence of
preemption between the two architectures is the
primary distinguishing factor. On the AIOP, the
throughput has reduced by only 3%, in compari-
son to 25% on the state-of-the-art network proces-
sor. The area and power breakdown of the AIOP
is shown in Figure 8. Power is measured using
power simulations on the timing annotated gate-
level netlist with a section of the packet forward-
ing workload with preemptions. Also, this does
not include the PHY subsystem. Still the area and
power overhead of the schedulers and preemp-
tion logic is just about 3%. As per our knowledge,
presently, AIOP is the only architecture that
Figure 8. Power and area breakdown. achieves low latency on network processors with
minimal impact on throughput, by employing
novel hardware task scheduling and preemption
and CTS, are the same. Acceleration engines are schemes. The first version of the architecture has
similar, though the communication between the been fabricated in 28-nm technology (LA1575,7
core and the accelerator is different on the AIOP. Figure 9). The next revision of the architecture in
Similarly, the Real Time Operating System (RTOS) 16FF technology node is under development.
and software stack except the scheduling, task
switching, and accelerator drivers are also the
same. The maximum sustainable throughput and ACKNOWLEDGMENTS
the corresponding latency numbers as measured The authors would like to thank several of their
on the NPU and AIOP on a packet forwarding past and present colleagues at NXP who contrib-
application are shown in Figure 7. The numbers uted to the development of the AIOP. Special thanks
to B. Moyer, Z. Kingsbury, Q. Pho, M. Schinzler,
S. Mishra, P. Singh, N. Jain, and H. Owens.

& REFERENCES
1. S. Roy, X. Lu, E. Gieske, P. Yang, and J. Holt,
“Asymmetric scaling on network packet processors in
the dark silicon era,” in Proc. ACM/IEEE Symp. Archit.
Netw. Commun. Syst., 2013, pp. 157–167.
2. J. S. Snyder, D. B. Whalley, and T. P. Baker, “Fast
context switches: Compiler and architectural support
for preemptive scheduling,” Microprocessors
Microsyst., vol. 19, pp. 35–42, 1995.
3. X. Zhou and P. Petrov, “Rapid and low-cost context-
switch through embedded processor customization
for real-time and control applications,” in Proc. 43rd
Figure 9. LA1575 chip floorplan. ACM/IEEE Design Autom. Conf., 2006, pp. 352–357.

January/February 2020
55
Hot Interconnects 26

4. J. J. K. Park, Y. Park, and S. Mahlke, “Chimera: Arvind Kaushik is currently a Senior Principal
Engineer with NXP Semiconductors, Noida, India,
Collaborative preemption for multitasking on a shared
with expertise in WLAN and 4 G/5G MAC/Baseband
GPU,” in Proc. 20th Int. Conf. Archit. Support Program.
PHY accelerators design and verification. Contact
Lang. Oper. Syst., 2015, pp. 593–606.
him at arvind.kaushik@nxp.com.
5. J. R. Allen et al., “IBM PowerNP network processor
hardware, software and applications,” IBM J. Res. Rajkumar Agrawal is currently a Senior Engineer
Develop., vol. 47, pp. 177–193, Mar.–May 2003. with the IP design Group, NXP Semiconductors, Noida,
6. I. Papaefstathiou et al., “PRO3: A hybrid NPU India. Contact him at raj.agrawal@nxp.com.
architecture,” IEEE Micro, vol. 24, no. 5, pp. 20–33,
Joseph Gergen is currently a Senior Architect with
Sep./Oct. 2004.
the Digital Networking Group, NXP Semicond-
7. “QorIQ layerscape LA1575 programmable wireless
uctors, Austin, TX, USA. He is the Lead Architect of
platform,” 2017. Online. Available: https://www.nxp.
AIOP. Contact him at joseph.gergen@nxp.com.
com/products/processors-and-microcontrollers/arm-
based-processors-and-mcus/qoriq-layerscape-arm- Wim Rouwet is currently a Systems Architect with
processors/qoriq-layerscape-la1575-programmable- the Digital Networking Group, NXP Semiconductors,
wireless-platform:LA1575. Austin, TX, USA, with expertise in wireless systems.
He is the Architect of LA1575. Contact him at
wim.rouwet@nxp.com.

Sourav Roy is currently a Technical Director with the John Arends is currently a Fellow with NXP Semi-
Digital Networking Group, NXP Semiconductors, Noida, conductors, Austin, TX, USA. He manages the sys-
India. He is the Technical Lead of the AIOP design. tems architecture team in the Digital Networking
Contact him at sourav.roy@nxp.com. Group. Contact him at john.arends@nxp.com.

IEEE Micro
56
General Interest

Warp: A Hardware
Platform for Efficient
Multimodal Sensing With
Adaptive Approximation
Phillip Stanley-Marbell Martin Rinard
University of Cambridge Massachusetts Institute of Technology

Abstract—In this article, we present Warp, the first open hardware platform designed
explicitly to support research in approximate computing. Warp incorporates 21 sensors,
computation, and circuit-level facilities designed explicitly to enable approximate
computing research, in a 3.6 cm  3.3 cm  0.5 cm device. Warp supports a wide range
of precision and accuracy versus power and performance tradeoffs.

& SENSOR INTEGRATED CIRCUITS (ICs) are critical of the power dissipation in many energy-con-
components of many hardware platforms, from strained platforms. Such energy-constrained plat-
augmented reality and wearable health monitors forms are a promising next frontier for application
to drones. Sensors convert physical signals, of techniques from approximate computing.15
such as temperature, vibration, rotation, etc., into The power dissipated by sensors depends on
electrical signals which are then digitized and their electrical configuration (e.g., supply voltage)
used in computations. Because sensor circuits are as well as on their software configuration (e.g.,
often constrained by the physics of the phenom- number of bits per sample for sensors with digital
ena they are designed to measure, sensors often interfaces). These configuration parameters also
do not benefit from the scaling of semiconductor affect the precision and accuracy of samples pro-
technology that has enabled dramatic reduction duced by sensors. System designers can capitalize
in power dissipation of digital logic. As a result, on this observation to trade energy efficiency and
sensors today constitute an important component performance for precision and accuracy. These
tradeoffs have been investigated, primarily for
computation as opposed to sensors, by several
Digital Object Identifier 10.1109/MM.2019.2951004
research efforts in the last decade.1,3,5,7–10,12,16–19
Date of current version 14 January 2020.

January/February 2020 Published by the IEEE Computer Society 0272-1732 ß 2020 IEEE
57
General Interest

Warp’s design provides facilities for


trading sensor precision for energy
usage, trading sensor accuracy for
energy usage and performance, and
trading sensor access reliability for
energy and performance
Although Warp contains a photo-
voltaic subsystem for charging and a
supercapacitor array for charge stor-
age, Warp is neither targeted at
energy-scavenged systems nor at
intermittent computing systems:
When fully charged, Warp’s superca-
pacitors can power the processor for
over an hour. The hardware facilities
for approximation, which we imple-
ment in Warp, are therefore comple-
mentary to research on intermittent
computing.4

WARP: AN APPROXIMATE
Figure 1. Warp hardware platform contains ICs that together provide 21 sensors COMPUTING PLATFORM
across eight sensing modalities. Table 1 details the sensors further. Because We designed Warp to provide a
some sensors have multiple subdimensions (e.g., three-axis accelerometers), greater range of energy versus cor-
Warp provides a total of 35 different sensor signal channels. Warp combines rectness tradeoffs than is available
this diversity with circuit support to enable approximate computing tradeoffs using commercial off-the-shelf
between precision, accuracy, performance, and energy-efficiency. hardware. We named the platform
“Warp” because it provides flexibil-
Despite this interest in efficiency versus precision ity for warping sensor values for
and accuracy tradeoffs, no common open hard- the benefit of efficiency. Warp achieves flexibility
ware platforms for research evaluation of approxi- by integrating sensors that have a broad range of
mate computing systems exist today. hardware-implemented precisions and accura-
This article introduces Warp, an open hard- cies. Table 1 lists the sensors, their operating
ware platform for evaluating hardware and soft- voltage ranges, and their output precision, accu-
ware techniques that trade precision, accuracy, racy, and noise characteristics.
and reliability for improved efficiency in energy- The sensors in Warp cover eight sensing
constrained systems. We have made the hardware modalities: 1) temperature; 2) acceleration in
designs and our basic firmware available on three axes; 3) angular velocity in three axes;
GitHub.11 Other researchers can use the hardware 4) magnetic flux density in three axes (often used
designs to easily recreate the Warp hardware using as a digital compass); 5) humidity; 6) pressure (for
the manufacturing instructions we provide. measuring, e.g., atmospheric pressure or eleva-
Because we provide the complete hardware and tion); 7) infrared radiation; and 8) color (a red–
firmware design source, researchers can also green–blue-clear sensor with filters for 615-, 525-,
extend Warp as they see fit. Warp fills an unmet and 465-nm light). For each of the first six modali-
need for research evaluation hardware and the ties, Warp contains at least two different state-of-
measurements from platforms such as Warp could the-art sensor ICs from different manufacturers,
serve as valuable error models for research on each of which represents a different point in the
algorithmic, programming language, and system tradeoff space between precision, accuracy,
software techniques for approximate computing. power dissipation, and performance. For example,
Figure 1 shows the system components of Warp. for atmospheric pressure, Warp contains an

IEEE Micro
58
Table 1. Operating voltage ranges, precision, accuracy, and noise properties of the sensors in Warp. Many sensor ICs
include temperature sensors, hence the abundance. The BMX055 officially operates from 2.4 to 3.6 V; Warp allows
software to operate it down to 1.8 V.

Supply voltage Accuracy range Interface precision


Sensor
range (V) (noise measure) range (bits/sample)
pffiffiffiffiffiffi
MMA8451Q accelerometer 1.95–3.6 99–126 mg= Hz 8 or 14
pffiffiffiffiffiffi
BMX055 accelerometer 2.4–3.6 150 mg= Hz 8 or 12
pffiffiffiffiffiffi
ADXL362 accelerometer 1.6–3.5 175–550 mg= Hz 4, 8, or 12
pffiffiffiffiffiffi
L3GD20H gyroscope 2.2–3.6 0.011  =s= Hz 8 or 16
pffiffiffiffiffiffi
BMX055 gyroscope 2.4–3.6 0.014  =s= Hz 8 or 16

MAG3110 magnetometer 1.95–3.6 0.25–0.4 mT 8 or 16

BMX055 magnetometer 2.4–3.6 0.3–1.4 mT 8 or 13 (x-, y-), 15 (z-)

SI7021 hygrometer 1.9–3.6 2% accuracy 8, 10, 11, or 12

0.025–0.2% precision

HDC1000 hygrometer 3.0–5.0 4% accuracy 14

0.1% precision

LPS25H barometer 1.7–3.6 0.01–0.03 hPa 8, 16, or 24

BMP180 barometer 1.6–3.6 0.03–0.06 hPa 8, 16, or 19

HDC1000 thermometer 3.0–5.0 0.2  C 14

SI7021 thermometer 1.9–3.6 0.3  C 11, 12, 13, or 14



ADXL362 thermometer 1.6–3.5 0.5 C 4 or 12

TMP006B thermometer 2.2 1  C 8 or 14



BMP180 thermometer 1.6–3.6 1 C 8 or 16

MAG3110 thermometer 1.95–3.6  1 8

L3GD20H thermometer 2.2–3.6  1 8



LPS25H thermometer 1.7–3.6 2 C 8 or 16

BMX055 thermometer 2.4–3.6 2  C 8

TCS3772 photometer 2.7–3.3 14%–35% irradiance responsivity 8 or 16 per R/G/B/clear

LPS25H IC, which can provide up to 24-b precision shows a simplified schematic of the system,
per sample, and a BMP180 IC, which is limited to highlighting hardware support for flexible sen-
19-b precision per sample. These two ICs also sor precision, flexible sensor accuracy, and flexi-
have different power dissipation and noise prop- ble sensor reliability, all designed to support
erties, providing software with a tradeoff between research in approximate computing.
power dissipation, accuracy, and precision.
Warp uses this diverse set of sensors to allow Comparing Warp to Related Platforms From the
approximate computing researchers to explore Domains of Sensor Networks and Intermittent
precision and accuracy versus energy efficiency Computing
tradeoffs. Warp complements this intersensor Today, despite a growing body of research on
flexibility with new hardware facilities for sensor techniques to trade errors for efficiency (approxi-
accuracy and sensor communication reliability mate computing), there is no hardware platform
versus energy efficiency tradeoffs. Figure 2 that allows researchers to explore the many

January/February 2020
59
General Interest

efficiency versus precision


and accuracy tradeoff (see
Table 1), Warp allows its
users to evaluate techniques
that trade precision and accu-
racy for efficiency. For exam-
ple, for acceleration, Warp
provides hardware support
for sampling at 4-, 8-, 12-, or
14-b precision, and to do so
with a range of measurement
noise, by selecting amongst
Figure 2. Processor controls the sensor operation voltage using one dynamically three different accelerometer
programmable voltage regulator paired with a second manufacture-time-selectable implementations that have
voltage regulator to trade sensor accuracy for power dissipation. Software controls different energy efficiencies.
sensor precision by configuration commands for each sensor as well as by choosing
between sensors for a given physical signal. The processor controls I/O reliability Sensor Accuracy Tradeoff
versus power dissipation tradeoffs using the programmable I/O pull-up switch. All of Facilities in Warp
this hardware is integrated into Warp. In addition to achieving
accuracy versus energy effi-
techniques proposed to trade correctness for ciency tradeoffs by allowing software to choose
performance and power. Warp is the first hard- between sensors (see Table 1, third column),
ware platform we know of that is explicitly Warp implements the Lax16 sensor hardware
designed to support approximate computing approximation technique using two miniature
research. Warp however exists in the context of voltage regulators, each occupying less than 7
existing research on low-power hardware plat- mm2 in circuit board area.
forms, including prior work, such as Sunflower,14 One of the two voltage regulators is software-
Flicker,6 WISP,13 and contemporary work, such as controllable to set the supply voltage of the sys-
Capybara.4 These prior and contemporary plat- tem’s sensors to one of eight voltage levels:
forms largely address the needs of researchers in either 1.8 to 2.5 V, or 2.6 to 3.3 V, in steps of
wireless sensor networks, energy scavenging, 0.1 V. The choice between these two voltage
and intermittent computing. Warp addresses the ranges, which are implemented by two different
needs of researchers in approximate computing. regulators with identical printed circuit board
Warp might nevertheless be a useful platform in footprint, is fixed at the point at which the board
these related research areas: With 21 integrated is assembled. The second voltage regulator,
sensors in its 3.6 cm  3.3 cm area, Warp is a third which is also fixed at manufacture time, can
the area of Capybara while containing more than have an output voltage of one of 1.05, 1.1, 1.2,
twice the number of sensors. Warp is smaller 1.225, 1.26, 1.5, 1.6, 1.8, 1.86, 1.95, or 2.1 V. The
than all the aforementioned platforms except outputs of these two regulators are fed into a
Sunflower (but Sunflower contains only four sen- software-controlled analog switch, allowing soft-
sors). By making our complete design files and ware to dynamically select between the two volt-
firmware publicly available,11 our intention for age regulators (programmable output and fixed
Warp is to provide a foundation on which output) at runtime. Figure 2 shows a simplified
researchers in approximate computing can build schematic of the software-controlled sensor
more sophisticated systems. power supply, which is part of Warp’s hardware
support for approximate computing.
Sensor Precision Tradeoff Facilities in Warp Warp’s sensor supply voltage changes have a
By including multiple hardware implementa- typical hardware latency of 315 ms due to the
tions of sensors for the same sensing modality, output voltage switching latency of the voltage
each of which achieves a different energy regulators and the switching time of the analog

IEEE Micro
60
switch. This low latency makes it feasible to dissipation can reduce supply voltage droop. As
implement sensor energy efficiency versus accu- a result, being able to trade performance for
racy tradeoffs by voltage control at fine temporal power can make the difference between a system
granularities. that works and one which does not, even when it
leads to larger overall energy usage. We then
Sensor I/O Reliability Tradeoff Facilities in Warp demonstrate the tradeoffs between power dissi-
Warp implements a hardware facility to allow pation and sensor accuracy that Warp’s pro-
software control of the pull-up resistors that are grammable sensor supply voltage enables.
mandatory for the I2C serial com-
munication standard used by most Warp’s sensor supply Performance Versus Power
sensor ICs. Disabling board-level I/ voltage changes have Tradeoff Results
O pull-ups leaves the I2C signals a typical hardware We use a Keysight B2962A
with only the microcontroller’s on- latency of 315 ms due source-measure unit for power
to the output voltage measurements. The B2962A pro-
chip pull-ups. This removes the
switching latency of the vides current sourcing precision
main source of power dissipation
voltage regulators and of 10 fA, voltage sourcing preci-
for open-drain interfaces, such as the switching time of
I2C, but reduces the reliability of sion of 100 nV, current measure-
the analog switch. This
communication. For example, for ment precision of 10 nA, and
low latency makes it
an I2C interface operating at an I/O voltage measurement precision
feasible to implement
supply voltage of 2.5 V, the average of 200 mV. These current and
sensor energy effi-
power dissipated in the typical 4.7 ciency versus accu- voltage measurement specifica-
kV pull-up resistor is 1.3 mW, more racy tradeoffs by tions enable us to measure power
than the power dissipation of most voltage control at fine dissipation to a resolution of bet-
sensors in Warp. temporal granularities. ter than 1 mW.
Figure 3(a) shows a represen-
Implementation Miniaturization tative example of how the power
We optimized the implementation of Warp for dissipation for accessing one of the sensors in
size, to achieve a form factor of 3.6 cm  3.3 cm  Warp (the BMX055 gyroscope) varies with I/O
0.5 cm that is small enough for use in user studies speed. For the BMX055 gyroscope in Warp,
(e.g., as a wearable platform). To achieve this power dissipation increases by over 0.2 mW as
level of integration, we implemented Warp using the speed at which the sensor is accessed is
a ten-layer printed circuit board process with a increased from 1 to 64 kb/s. Even though power
board thickness of 62 mils (1.6 mm). Fully popu- dissipation increases with I/O speed, Figure 3(b)
lated with components, the Warp prototype is shows that the energy per bit for I/O decreases
only 5 mm thin. Researchers using our open with I/O speed.
hardware design as a starting point can choose Figure 3(c) and (d) shows similar trends in I/O
to populate the system with a subset of the sen- power and energy per bit for seven of Warp’s sen-
sor ICs listed in Table 1 (reducing costs signifi- sors and show how power dissipation varies by
cantly) and with a choice of different voltage 0.2 – 0.3 mW as a function of I/O speed. The mag-
regulators (both fixed and software-controlled). nitude of this change in I/O power dissipation
is greater than the power dissipation of many of
the sensors in the platform, motivating the need
EVALUATION for precise and approximate techniques for
We highlight Warp’s facilities to trade sensor improving I/O power efficiency.16,17,20
access speed for average power dissipation for
seven of the sensors in Warp below. Such trade- Sensor Accuracy Versus Voltage Tradeoff
offs are valuable for systems that are power- Results
limited. Because energy stores such as coin We evaluate the tradeoff between accuracy of
cell batteries as well as supercapacitors have sensor data and supply voltage by operating the
nonnegligible internal resistance, lower power three different accelerometers, the two different

January/February 2020
61
General Interest

variance to provide a visual


indicator of the distance of
the measured variation
from a Gaussian distribu-
tion. We also test for nor-
mality numerically: The
null hypothesis that the
data is distributed accord-
ing to the Gaussian with the
same mean and variance as
the sample is not rejected
at the 5% level based on the
Cramer–von Mises test. The
set of samples has a kurto-
sis of 2.4, compared to a
kurtosis of 3 for a Gaussian.
Noise distributions and
Figure 3. Warp enables tradeoffs between I/O power dissipation, energy per bit, error models often play a
and I/O data transfer speeds. role in techniques for
approximate computing. In
the absence of quantitative
gyroscopes in Warp, and the two different magne- measurements, such as those in Figure 4(b),
tometers, over a range of supply voltages. For researchers today have no choice but to make
each of the three axes of these seven sensors (21 assumptions about noise distributions. Typical
signal dimensions in total), we operate the sensors assumptions include uniform distributions in
at one of eight supply voltages uniformly spaced space and normal (Gaussian) distributions over
between 1.8 and 2.5 V, a total of 168 measurement repeated measurements.
configurations. We use the Warp platform’s on- Figure 4(c) shows the distributions of the y-
board programmable voltage regulator subsystem axis acceleration sensor values obtained from
to control these sensor supply voltages. the ADXL362 accelerometer in a fixed orienta-
We use the highest voltage as our refer- tion, as a function of sensor supply voltage. The
ence for sensor output correctness. In each of distributions in Figure 4(c) show significantly
the 168 measurement configurations, we com- greater separation than those in Figure 4(a) and
pare the average of 100 sensor signal measure- are distinctly non-Gaussian, as the overlay of the
ments at each of the eight supply voltage Gaussian with the same mean and variance in
settings to an average of 100 sensor measure- Figure 4(d) shows. The null hypothesis that the
ments when the sensor is operating under data are Gaussian with the same mean and vari-
identical conditions but at the nominal supply ance as the sample is rejected at the 5% level
voltage of 2.5 V. based on the Cramer–von Mises test. The causes
Figure 4 shows examples of the distributions for the observed distributions may range from
of values from two of the 21 signal dimensions the underlying mechanism for transduction of
we studied. Figure 4(a) shows the distributions the physical signal into a measurement, to noise
of the z-axis magnetic flux density values introduced in the digitization process, such as
returned by the BMX055 magnetometer, in a quantization noise. In practice, a given sensor
fixed orientation, as we change the supply volt- might be optimized to have the lowest noise at a
age of the sensor from 1.8 to 2.5 V. Figure 4(b) particular voltage or temperature. Platforms like
shows one of these eight distributions (sensor Warp with multiple sensors for the same modal-
values measured at 2.5 V) in isolation. We over- ity allow researchers to study tradeoffs that may
lay a histogram of random variates drawn from a exist between performance, power, and accu-
Gaussian distribution with the same mean and racy, across sensor-specific peculiarities.

IEEE Micro
62
Figures 5–7 show that in
these measurements, the accel-
erometers and magnetometers
in Warp provide a useable
tradeoff between supply volt-
age (and hence power dissipa-
tion) and accuracy with respect
to the output at a reference
operating voltage (2.5 V in our
measurements). The benefit
from going from 2.5 V supply
down to 1.8 V supply is an
11.8% reduction in dynamic
power dissipation.
The gyroscope data in the
measurements provide a less
distinct trend in improving
accuracy from higher supply
voltage operation. We attri-
bute this observation to the
higher variance in the output
of the gyros. In our measure- Figure 4. Distributions of sensor noise differ across sensor modalities and across IC
ments, both the BMX055 and implementations and vary with supply voltage. Directions for further research include
the L3GD20H gyroscopes have evaluating noise under temperature-controlled conditions, improved isolation of
high coefficients of variation effects of the measurement environment, and evaluation of noise under different
of over 115%, indicating that values of the measurand. (a) Distributions of z-axis magnetic flux density for BMX055
the value of the standard in Warp operating at supply voltages from 1.8V to 2.5V, (b) 100 measurements of
deviation across the 100 sam- z-axis magnetic flux density for BMX055 in Warp at 2.2V. Passes normality test
ples in each measurement set (Gaussian overlaid), (c) Distributions of y-axis acceleration for ADXL362 in Warp
was even larger than the operating at supply voltages from 1.8V to, (d) 100 measurements of y-axis
value of the mean. acceleration for ADXL362 in Warp at 2.2V. Fails normality test (Gaussian overlaid).

CONCLUSION
Data from embedded sensing systems form sensor-driven systems, energy is severely con-
the foundation for applications ranging from strained and techniques to improve energy effi-
wearable health monitors to infrastructure moni- ciency or to trade energy efficiency for some
toring and augmented reality. In many of these other system metric are valuable. Platforms such

Figure 5. Acceleration inaccuracy (difference in value versus value when supply voltage is at the nominal 2.5
V). The nine data series in the plots are acceleration readings in each of the three axes (x, y, and z) of the three
accelerometers in Warp. The BMX055 officially only operates down to 2.4 V.

January/February 2020
63
General Interest

Figure 6. Magnetic flux density inaccuracy (difference versus value when supply voltage is at the nominal
2.5 V). The six data series in the plots are magnetic flux density readings in each of the three axes (x, y, and z)
of the two magnetometers in Warp. The BMX055 officially only operates down to 2.4 V.

Figure 7. Angular rotation rate inaccuracy (difference in value versus value when supply voltage is at the
nominal 2.5 V). The six data series in the plots are angular rate readings in each of the three axes (x, y, and z)
of the two gyroscopes in Warp. The BMX055 officially only operates down to 2.4 V.

as Warp provide a foundation for research to provide new possibilities for calibrating
into employing techniques from approximate techniques developed across the system stack
computing in low-power with measurements from real hard-
embedded systems. Warp By making the hard- ware systems.
complements existing ware design and firm-
research platforms targeted ware for Warp publicly
at precise execution on available,11 our goal is ACKNOWLEDGMENT
RF- scavenged energy or 2
to provide a foundation This work was supported in part by
intermittent computing. 4 for new experimenta- an Alan Turing Institute Award TU/B/
Warp enables approxi- tion in approximate 000096 under EPSRC Grant EP/N510129/
mate computing research by computing research 1, in part by the Royal Society Grant
integrating 21 sensors that and to provide new RG170136, and in part by the EPSRC
possibilities for cali- Grant EP/P001246/1 and Grant EP/
reside in a large range of pre-
brating techniques R022534/1. The authors would like to
cision, accuracy, and power
developed across the thank the anonymous reviewers for
dissipation tradeoff points, system stack with
and it augments this with encouraging them to make the links
measurements from
custom hardware in the form with intermittent computing clearer.
real hardware systems.
of programmable I/O pull-
ups and dynamically recon-
figurable sensor supply voltages to enable &
REFERENCES
additional efficiency versus accuracy tradeoffs.
By making the hardware design and firm- 1. J. Bornholt, T. Mytkowicz, and K. S. McKinley,
11
ware for Warp publicly available, our goal is “Uncertain<T>: A first-order type for uncertain data,”

to provide a foundation for new experimenta- in Proc. 19th Int. Conf. Archit. Support Program. Lang.

tion in approximate computing research and Oper. Syst., 2014, pp. 51–66.

IEEE Micro
64
2. M. Buettner et al., “RFID sensor networks with the Intel 13. J. R. Smith, A. P. Sample, P. S. Powledge, S. Roy, and
WISP,” in Proc. 6th ACM Conf. Embedded Netw. A. Mamishev, “A wirelessly-powered platform for
Sensor Syst., 2008, pp. 393–394. sensing and computation,” in Proc. Int. Conf.
3. M. Carbin, S. Misailovic, and M. C. Rinard, “Verifying Ubiquitous , 2006, pp. 495–506.
quantitative reliability for programs that execute on 14. P. Stanley-Marbell and D. Marculescu, “An 0.91.2",
unreliable hardware,” in Proc. ACM SIGPLAN Int. low power, energy- harvesting system with custom
Conf. Object Oriented Program. Syst. Lang. Appl., multi-channel communication interface,” in Proc.
2013, pp. 33–52. Design Autom. Test Eur., 2007, pp. 15–20.
4. A. Colin, E. Ruppel, and B. Lucia, “A reconfigurable 15. P. Stanley-Marbell and M. Rinard, “Approximating
energy storage architecture for energy-harvesting outside the processor,” in Proc. Workshop Approx.
devices,” in Proc. 23rd Int. Conf. Archit. Comput. Across Syst. Stack, 2015, pp. 1–3.
Support Program. Lang. Oper. Syst., 2018, 16. P. Stanley-Marbell and M. Rinard, “Lax: Driver
pp. 767–781. interfaces for approximate sensor device access,” in
5. H. Esmaeilzadeh, A. Sampson, L. Ceze, and Proc. 15th Workshop Hot Topics Oper. Syst., Kartause
D. Burger, “Architecture support for disciplined Ittingen, Switzerland, May 2015, pp. 1–6.
approximate programming,” in Proc. 17th Int. Conf. 17. P. Stanley-Marbell and M. Rinard, “Reducing serial I/O
Archit. Support Program. Lang. Oper. Syst., 2012, pp. power in error-tolerant applications by efficient lossy
301–312. encoding,” in Proc. 53rd Annu. Design Autom. Conf.,
6. J. Hester and J. Sorber, “Flicker: Rapid prototyping for 2016, pp. 62:1–62:6.
the batteryless internet-of-things,” in Proc. 15th ACM 18. P. Stanley-Marbell and D. Marculescu, “A
Conf. Embedded Netw. Sensor Syst., 2017, programming model and language implementation for
pp. 19:1–19:13. concurrent failure-prone hardware,” in Proc. 2nd
7. H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, Workshop Program. Models Ubiquitous Parallelism,
A. Agarwal, and M. Rinard, “Dynamic knobs for 2006, pp. 1–5.
responsive power-aware computing,” in Proc. 16th Int. 19. P. Stanley-Marbell, V. Estellers, and M. Rinard,
Conf. Archit. Support Program. Lang. Oper. Syst., “Crayon: Saving power through shape and color
2011, pp. 199–212. approximation on next-generation displays,” in Proc.
8. Y. Kim, S. Behroozi, V. Raghunathan, and 11th Eur. Conf. Comput. Syst., New York, NY, USA,
A. Raghunathan, “Axserbus: A quality-configurable 2016, pp. 11:1—11:17.
approximate serial bus for energy-efficient sensing,” in 20. P. Stanley-Marbell and M. Rinard, “Efficiency limits for
Proc. IEEE/ACM Int. Symp. Low Power Electron. value-deviation-bounded approximate
Design, 2017, pp. 1–6. communication,” IEEE Embedded Syst. Lett., vol. 7,
9. S. Lee, L. K. John, and A. Gerstlauer, “High-level no. 4, pp. 109–112, Dec. 2015.
synthesis of approximate hardware under joint
precision and voltage scaling,” in Proc. Design Phillip Stanley-Marbell is currently an Assistant
Autom., Test Eur. Conf. Exhib., 2017, pp. 187–192. Professor with the Department of Engineering, Univer-
10. A. Lingamneni, K. K. Muntimadugu, C. Enz, R. M. Karp, sity of Cambridge. His research focus is on exploiting
K. V. Palem, and C. Piguet, “Algorithmic methodologies an understanding of properties of the physical world
for ultra-efficient inexact architectures for sustaining and the physiology of human perception to make com-
technology scaling,” in Proc. 9th Conf. Comput. puting systems more efficient. Prior to joining the Uni-
Frontiers, 2012, pp. 3–12. versity of Cambridge, he was a Researcher with
11. P. Stanley-Marbell, Warp hardware designs and
Massachusetts Institute of Technology, from 2014 to
2017. He held positions with Bell-Labs Research
baseline firmware, 2018. [Online]. Available: https://
(1995, 1996), Lucent Technologies and Philips (1999),
github.com/physical-computation/Warp
and NEC Research Labs (2005). He was a Postdoc
12. A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam,
with TU Eindhoven until 2008, was a permanent
L. Ceze, and D. Grossman, “EnerJ: Approximate Research Staff Member with IBM Research—Zurich
data types for safe and general low-power from 2008 to 2012, and then an Engineer with Apple
computation,” in Proc. 32nd ACM SIGPLAN until 2014. He received the PhD degree from CMU in
Conf. Program. Lang. Design Implementation, 2011, 2007. He is a Senior Member of the IEEE. Contact him
pp. 164–174. at phillip.stanley-marbell@eng.cam.ac.uk.

January/February 2020
65
General Interest

Martin Rinard is currently a Professor with the enable applications to survive otherwise fatal errors
Department of Electrical Engineering and Com- and security attacks and techniques that tradeoff
puter Science, Massachusetts Institute of Technol- accuracy of end-to-end results in return for
ogy (MIT) and a member of the Computer Science increased performance and resilience. He received
and Artificial Intelligence Laboratory, MIT. His the PhD degree in computer science from
research interests include programming lan- Stanford University. He is an ACM Fellow and has
guages, computer security, program analysis, pro- received many awards including an Alfred P. Sloan
gram verification, software engineering, and Research Fellowship and Distinguished and Best
distributed and parallel computing. Prominent Paper Awards from a variety of publication venues.
results have included automatic techniques that Contact him at rinard@csail.mit.edu.

IEEE Micro
66
General Interest

DNN: Power-Efficient
Neural Network
Acceleration Using
Differential Weights
Hoda Mahdiani Mehdi Modarressi
University of Tehran University of Tehran
Alireza Khadem Farima Fattahi-Bayat
University of Tehran University of Tehran
Azam Ghanbari Masoud Daneshtalab
University of Tehran €lardalen University
Ma

Abstract—The enormous and ever-increasing complexity of state-of-the-art neural


networks has impeded the deployment of deep learning on resource-limited embedded
and mobile devices. To reduce the complexity of neural networks, this article presents
DNN, a power-efficient architecture that leverages a combination of the approximate
value locality of neuron weights and algorithmic structure of neural networks. DNN keeps
each weight as its difference (D) to the nearest smaller weight: each weight reuses the
calculations of the smaller weight, followed by a calculation on the D value to make up the
difference. We also round up/down the D to the closest power of two numbers to further
reduce complexity. The experimental results show that DNN boosts the average
performance by 14%–37% and reduces the average power consumption by 17%–49% over
some state-of-the-art neural network designs.

& MULTIPLICATION OPERATION IS the most com- has long been the target of many optimization/
plex part of neural information processing, so it acceleration methods.1–6
Among them, quantization of floating-point
numbers to the narrowest fixed point format
Digital Object Identifier 10.1109/MM.2019.2948345 that still preserves the required accuracy is one
Date of publication 21 October 2019; date of current version of the most basic optimization methods that
14 January 2020. enable further architectural optimization and

January/February 2020 Published by the IEEE Computer Society 0272-1732 ß 2019 IEEE
67
General Interest

innovations. Thanks to the error-tolerant nature accuracy and complexity. Second, we present the
of deep neural network (DNNs), they promise DNN architecture, which realizes such a differen-
sublinear accuracy degradation as the bit-width tial processing scheme for both multilayer per-
is decreased. ceptron (MLP) and convolutional neural network
In this article, we show how quantization can (CNN) models. The experimental results show
be followed by a delta computation scheme to that DNN outperforms state-of-the-art designs in
considerably reduce the complexity of DNN com- terms of power-saving and performance.
putation. Our design is motivated by the fact
that each element of the DNN input data (and DIFFERENTIAL COMPUTATION FOR
intermediate feature maps) is multiplied by tens NEURAL NETWORKS
to thousands of filter/neuron weights. To exploit
D NN for MLPs
this algorithmic structure for DNN power/perfor-
Although MLPs are typically less complicated
mance efficiency, we present DNN, a novel archi-
than CNNs, some applications may still need large
tecture for DNN acceleration that works with
MLPs that are very challenging to run.9 Based on
differential weights.
the special computation pattern of MLPs, DNN
DNN applies an input-stationary execution
proposes to rearrange the weights in an input-
scheme, in which every input element, once
major order: in an n-layer MLP with Ni neurons in
fetched, is multiplied by all neurons or filter
layer i, the weights of each layer i are arranged as
weights required to make all of its contribution
Ni1 vectors (Ni1 = the number of inputs to layer i
to the output. In this design, the weights are
= the number of neurons in layer i1), in such a
grouped and processed in an input-major order,
way that the xth vector contains the xth weight of
i.e., based on the input data by which they
all (Ni ) neurons of that layer, which are all multi-
should be multiplied. We then sort the weights
plied to the xth input of layer i. The weights of
of each group in ascending order and calculate
each vector are reordered in the ascending order
the difference of each individual weight with the
and stored in a delta-coded format, in which we
preceding one (denoted by D). The multiplica-
keep the difference (D) of each weight with its pre-
tion result of each weight to an input is then cal-
ceding weight, instead of the weight itself.
culated as the multiplication result of the
Each input value is multiplied by its corre-
preceding weight to that input (which is already
sponding weight vector, starting from the small-
available) plus the multiplication result of the D.
est weight at the head of the vector. The
We further round the D to the closest power
multiplication of each subsequent weight wi is
of two numbers to reduce the multiplication on
then calculated as
D to a simple shift operation. Since the D value is
very likely to be very smaller than the weights Pi ¼ wi  x ¼ wi1  x þ Di;i1  x:
themselves, rounding the D value will introduce
considerably lower output error than the low- The first term (wi1  x) is the most recent calcu-
precision DNNs that constrain the absolute lated product (Pi1 ) and can be reused for Pi .
value of the weights to be (or be rounded to) The second term multiplies the D value by x.
power of two numbers.1,7 This way, all previous calculations contribute to
As the DNN structure and weights are avail- the current calculation, effectively simplifying it
able at design time, weight reordering, delta cal- without sacrificing accuracy.
culation, and rounding up/down are done once We further round the D up/down to the clos-
prior to the DNN deployment. Thus, our method est power of two numbers and keep the expo-
has no runtime overhead of a prior work that nent value instead of the D value itself. Thus, the
applies differential coding on input data.8 multiplication operation is reduced to a single
This article makes the following contribu- shift-add operation, with the stored delta values
tions. We first show that reorganizing the DNN specify the number of bit positions by which the
computation pattern into an input-stationary input shifts.
order and using rounded differential weights In this design, a small piece of metadata is
makes the state-of-the-art tradeoff between kept with each weight to identify its original

IEEE Micro
68
position in the DNN so that it can be matched
with the right neuron at runtime.

D NN for CNNs
A typical CNN layer applies a set of filters on a
WH-element input feature map to produce an
output feature map with roughly the same size.
Each layer often has multiple independent Figure 1. Average D value distribution of AlexNet.
channels.
Ignoring the boundary cells, each cell of weights. The bars display the distribution of the
the input feature map should be multiplied to D value of this 11 616-element weight vector after
all filter weights of its channel to contribute sorting, averaged across all three channels and
to different output elements. Assuming K fil- then, across all five convolutional layers.As
ters of size R×S per channel, DNN converts Figure 1 shows, the majority of the values are zero
the filters of each channel to a 1-D vector and and one: Even for 16-b quantization, the percent-
concatenates all to form a weight vector of age of small D values is still high. This promises to
size K×S×R. In the same way as MLPs, the ele- reduce the multiplication operation to the simpler
ments of this vector are sorted and encoded shift-add operation (or even eliminating multipli-
in the D format, i.e., the difference to the pre- cation when D=0) with negligible rounding error.
vious element rounded to the closest power
of two numbers.
To enforce differential computing, DNN again D NN ARCHITECTURE
processes input data in an input-stationary MLPs on DNN
order, in which input feature map cells are An array of shift-add and accumulation units
fetched one by one (or more precisely, row by connected by a multiplexer based crossbar com-
row, as will be discussed shortly) and run prise the main processing components of a DNN
through all S×R×K multiplication operations. core (see Figure 2).The core has three memory
elements to store the input data, neuron
Benefits of DNN weights, and control information. The first table,
DNN reduces the rounding error significantly differential weight table (DWT), keeps the
as D values are more likely to find very close weights in an input-major order, i.e., each row
power of two numbers. The delta value is accommodates the weights that are multiplied
expected to decrease when the DNN size to the same input data. A TK ×TN table can
increases. With n-bit quantization, there are 2n accommodate TN neurons, each with TK weights.
unique numbers, so when the number of weights Each row i of DW is allocated to the ith weight of
increases, the pairwise distance between two all TN neurons: these weights in each row are
consecutive weights decreases, on average. Ulti- sorted in ascending order and DWT(i,j) keeps
mately, if there are more than 2n weights, the the difference between the jth and j1th weight
delta value will be very small, with some weights in row i. As the D value is rounded to the closest
are equal due to the pigeonhole principle. As a power of two numbers, what the table keeps is
quantitative evaluation, Figure 1 demonstrates an exponent value that specifies the number of
the distribution of the average D values of the bit positions by which the input shifts. Weights
convolution layer weights of the AlexNet CNN are stored and sorted by the absolute value to
(trained for 1000-class ImageNet). The weights reduce the average D values.
are quantized to 8, 12, and 16 b. Input buffer (IB) keeps the input data of the
The D value is calculated for all weights that core. It stores the TK input elements received
work on the same input. For example, the first con- from the previous layer: IB(i) is the output of the
volutional layer of AlexNet has 3 input channels, ith neuron of the previous layer and should be
each with 96 filters of size 1111, so each input multiplied to the ith weight of all neurons, which
feature map element is multiplied by 11616 are stored in the ith row of DWT.

January/February 2020
69
General Interest

which is added by the multiplication result of


the delta value to generate the new result.
At the same time, the indexes of the weights
for cycle C are read from the Control Table (CT) to
pass the results to the right accumulation unit.
For example, assume DWT(2,1) belongs to
neuron 5. This means that among the second
weight of all neurons (which are stored in the
second row of DWT and should be multiplied to
the same input), the second weight of neuron 5
has the smallest value. Once the calculations of
the first column of all rows are completed in
cycle 1, the result generated by DWT(2,1) should
be sent to the accumulator in row 5, where the
partial results of neuron 5 are accumulated. The
contents of CT specify the select line of the
crossbar multiplexers: by setting the CT(5,1) = 2,
the input of the accumulator in row 5 at cycle 1
is connected to the second row of DWT. As TK
partial results are generated at each cycle, one
column of CT is read per cycle to send the
results to the right accumulators.
As the DNN weights are available before the
inference phase, CT is filled offline.
It is likely to have two or more weights from
the same neuron at the same column. If in the
previous example the third weight of neuron 5 is
also the smallest among the third weight of all
neurons, the second and third weights of neuron
5 are processed at cycle 1, but the results
cannot pass through a single multiplexer simu-
ltaneously. DNN resolves this conflict by offline
examination of all columns of DWT and inserting
a bubble in place of all but one of the weights in
each column that come from the same neuron.
This way, only one weight is processed and sent
to the accumulation unit, while the computation
of the conflicting weights is delayed to avoid the
Figure 2. (a) Architecture of a DNN core. (b) CNN data structures conflict. An enable bit in CT elements disables
(above and below the dotted line) and their mapping on DNN the corresponding DWT row when a bubble
(between the dotted lines). should be inserted. Due to the bubbles, CT may
need more columns than DWT.
In each cycle, all IB cells are multiplied to These processing steps are carried out in
their corresponding DWT rows in parallel. In TN cycles for all TK rows in parallel. TN and Tk
cycle C, the multiplication of DWT(i,C) to IB(i) is are the maximum number of neurons and the
calculated as maximum number of weights per neuron, res-
pectively, that a core can keep. Although DNN
Pi = REG(i) + (IB(i) < DWT(i,C)).
can come with any Tk andTN; it is obvious that
The REG register [see Figure 2(a)] keeps the the maximum resource utilization is achieved
multiplication result of the preceding weight when Tk =TN .

IEEE Micro
70
Larger DNNs are partitioned into multiple input that is multiplied to the weight, is used to
parts and are serialized on a single core or run in calculate the output coordination (the output
parallel on multiple cores. channel and the cell in the output channel) of the
result. The output coordination can be calculated
CNNs on DNN by simple formula5,10 (the Calc_Output_Row()
For CNNs, input cells are kept stationary to function in the pseudocode). This address calcu-
maximize convolutional reuse. Figure 2(b) shows lation procedure is carried out in parallel to the
how the K filters of size S×R of each input chan- shift-add operation. Afterward, the target output
nel are converted to a 1-D weight vector of size row is read from the output buffer to accumulate
K×S×R, with the weights stored in the D format the new output of the shift-add units. It then will
(exponent value of the rounded difference to the be written back to the buffer.
preceding weight). Each input feature map cell is
multiplied (by a single shift-add operation) to Index Compression
the cells of the weight vector of its channel one In the D format, the original 8-b length of the
at a time in K×S×R consecutive cycles to com- filter weights (in an 8-b quantized network) is
plete its contributions to all output channels. encoded in 4 b (1-b sign + 3-b exponent). How-
To exploit the inherent parallelism of CNNs, ever, every filter weight needs an 8-b index to
all cells of a complete row of the input feature keep filter coordinate and output channel num-
map are fetched and multiplied by the same ber. For large CNN filters, the number of filters
weight in parallel at each cycle. When an input that are merged is reduced, to keep the meta-
row is multiplied by a weight, the partial-sums of data size constant. Consequently, the weights
one row of the output feature map are generated are now encoded in 12 b.
that is accumulated in the corresponding row of To keep the bit-width close to the original 8 b,
the output buffer. Due to this regular processing we again use a differential encoding, but this
pattern, unlike MLPs, the CNN design does not time for the indexes. This coding is motivated by
need a crossbar between the multiplier and the huge amount of repetitive weights that are
accumulator arrays. shown by the zero D values in Figure 1. When
To integrate both MLPs and CNNs into a sin- merging and sorting the filter weights into a sin-
gle DNN core, the 1-D array that keeps input data gle weight vector, the repetitive weights form a
for MLPs [see IB in Figure 2(a)] is used to store subarray of subsequent cells of the weight vec-
weights when running CNNs and the 2-D array tor. The order of these weights is not important
that keeps weights (DWT) is used to keep input as they all are represented by a zero D value and
matrix when running CNNs. reuse the result generated for the first weight of
The following nested loop outlines the main the subarray. Thus, we sort the weights of each
processing steps of CNNs on DNN: subarray in the ascending order of their index.
In each subarray, the first weight gets an 8-b
FOR h= 1 to H//for_all_inpu_rows
index that shows its absolute location value,
FOR f= 1 to K×S×R//for_all_filter_weights
while the indexes of the subsequent weights in
Out_channel = Index[f].filter_no;
the subarray are set as the difference to the
Out_row= Calc_Output_Row(h,Index
index of their preceding weight. We use 3 b to
[f].coordinate);
code the difference value. If the relative index of
FOR w= 1 to W in PARALLEL//
a weight exceeds 8 (needs more than 3 b), its
for_all_cells_of_row
index is stored in absolute format, with the sub-
Out[Out_channel,Out_row,w]=
sequent cells of the subarray are indexed rela-
Weight[f]*In[h,w];
tive to it. To specify whether an index is
The Index data structure in the pseudocode is absolute or relative, one bit is added to the index
a piece of metadata that keeps the filter number that shows its mode. This way, the index size is
to which the weight belongs and the coordination reduced to 4 b for most weights, keeping the
of the weight in the filter (relative to the central weight size roughly the same as the baseline
cell of the filter). The index, together with the network.

January/February 2020
71
General Interest

Table 1. Benchmarks. benchmarks, weights and intermediate data are


quantized to 8-b numbers.
Benchmark Data Set Topology To measure power consumption and critical
AlexNet ImageNet CNN (5 filter + 3 FC) path, we implemented the considered archi-
VggNet-16 ImageNet CNN (13 filter + 3 FC) tectures in Very High Speed Integrated Circuit
Hardware Description Language (VHDL). Synthe-
Object class IRIS MLP (16:32:3)
sis is carried out by a commercial synthesis tool
MLP using the NanoGate 15 nm library. The power
Image classification CIFAR
(2048:1024:512:10)
consumption of the ON-chip and OFF-chip memory
FFT Mibench MLP (8:256:8) elements is calculated by CACTI and DRAMSim,
Online heart UCI heart respectively.
MLP (255:255:125)
monitoring disease We compare DNN with three state-of-the-art
DNN accelerators, UCNN,1 CORN,2 and PRA,4 in
terms of power consumption, latency, and
Our evaluation results show that this com- accuracy. The results are also compared to a
pression method reduces the average size of low-precision accelerator with the power of
metadata to 5.4 b for AlexNet and 4.8 b for two weights. This accelerator uses the baseline
VGGNet. DNN architecture without multiplexers and
The same compression scheme is used adders.
for MLPs. Bit-pragmatic (PRA) accelerator eliminates
zero product terms in a multiplication by on-the-
Implementation Considerations fly conversion of input feature map elements into
To reduce the required on-chip memory an explicit list of powers of two. It utilizes a serial
demand, we divide the input feature map into multiplier to only calculate the nonzero product
multiple Tk TN partitions and process the parti- terms, hence reducing the nn multiplication to
tions serially. With this size, the weight vector multiple (n) shift-add operations. With this bit-
should be of length Tk . Thus, the weights should serial design, n serial multiplications run in paral-
be fetched and streamed to the accelerator in lel on a conventional nn multiplier.
(K×S×R)/TK steps for each input partition. This In addition to the baseline PRA4 (PRA-ifm), we
also limits the coordinates of the generated par- use another variation of PRA, referred to as PRA-
tial products to a sub-volume of the entire out- weight, in which the weights are converted to an
put feature map space of size K×Tk TN . explicit list of powers of two. PRA-ifm benefits
Figure 2(b) shows this petitioning scheme, from the more zero bits in input feature maps,
with the data structures are shown above and but at the price of online format conversion.
below the dotted lines and the accelerator archi- PRA-weight eliminates this overhead by offline
tecture between the dotted lines. format conversion of weights.
There are several previous works that incre-
ase the performance/power of neural networks
EXPERIMENTAL RESULTS through computation reuse.1,2,12 Among them,
Methodology we selected CORN2 and UCNN1 for the compari-
To evaluate DNN, four widely used MLPs (all son purpose.
taken from the UCI machine learning reposi- Since CORN is designed for MLPs and PRA
tory11) and two CNNs are used as benchmarks and UCNN are designed for CNNs, we compare
(see Table 1). The training and quantization DNN with PRA and UCNN under the CNN
phases of the DNNs are carried out by Tensor- benchmarks and with CORN under the MLP
Flow. We developed a custom-built tool to con- benchmarks.
vert the DNN weights to the D format and a In DNN, input feature maps are 8-b and
cycle-accurate simulator to calculate the execu- weights are kept in the differential format. The D
tion time of DNN and the other accelerators values in an 8-b quantized network are encoded
selected for the comparison purpose. In all in 4 b (1-b sign + 3-b exponent).

IEEE Micro
72
MLP Configuration
For MLPs, the DNN core has a 128 × 128 DWT with 4-b cells to keep differential weights. There is also one 128-cell IB with 8-b cells. The index data size is 4 b per weight: a 3-b encoded crossbar output number plus a 1-b enable signal. For MLPs, we need an array of multiplexers in front of the accumulators.
CNN Configuration
For CNNs, we use the same amount of memory space and logic as MLPs. The 128 × 128 table keeps 128 rows of 64 8-b input feature maps. The 128-cell buffer keeps 256 4-b differential weights for CNNs. The index data size is 4 b, as described earlier.
Hardware Implementation Results
The results show that the critical path delays of DNN, PRA, the pipelined version of CORN, and UCNN are 0.64, 0.59, 0.74, and 0.89 ns in 15 nm, respectively. As all latencies are less than 1 ns, we evaluate the power consumption of all accelerators at 1 GHz.
Figure 3. Accuracy under (a) CNN workloads and (b) MLP workloads, and normalized power and performance comparison under (c) CNN workloads and (d) MLP workloads.

Accuracy. Figure 3(a) and (b) compares the accuracy results of the considered DNN accelerators. Thanks to the small D values, as reflected in Figure 1, DNN can preserve accuracy within 3% of the accuracy of PRA, UCNN, and CORN (which offer the same accuracy as the original quantized DNN, as they do not apply further approximation). The results confirm our initial evaluation that, due to the small delta values, the rounding error, and hence the accuracy loss, is minimized.
Power and latency. The power and latency results under the CNN and MLP benchmarks are depicted in Figure 3(c) and (d). For a more comprehensive comparison, the results are normalized to the results of PRA-weight. As a reference, the latencies of AlexNet and VGGNet on DNN are 0.67 s (0.67B cycles on a 1-GHz core) and 33.7 s on a single core, respectively. The average power consumption of DNN under the CNN and MLP benchmarks is 0.013 and 0.022 W, respectively.

DNN outperforms CORN and UCNN considerably, as it generalizes computation reuse: repetitive weights appear as zero delta values in our design, effectively bypassing the related computation. DNN also exhibits better power and performance profiles than PRA. The reason is that it replaces the multiplication with single shift-add operations; the average number of shift-add operations per multiplication is 2.1 for PRA-ifm and 3.0 for PRA-weight. DNN also does not incur the overhead of the on-the-fly input conversion of PRA-ifm.

The crossbar in DNN for the MLP benchmarks consumes 9% of the total power (with computation and memory access consuming 57% and 34%, respectively), but this overhead is overcompensated by the larger reduction in the multiplication power.

The accelerator with power-of-two weights achieves 13% less power consumption than DNN, but at the price of up to 23% accuracy loss.
CONCLUSION
In this article, we showed the effectiveness of differential computation in reducing the complexity of DNNs. To realize differential computing, we presented DNN, a novel accelerator architecture that exploits the value locality of quantized DNN weights to replace the costly multiplication operation by a single shift-add operation. The experimental results show that, compared to the state-of-the-art designs, DNN consumes 17%–49% less average power and has 14%–37% less latency under a set of CNN benchmarks. For the MLP benchmarks, average power consumption and latency are reduced by 45% and 32%, respectively.

ACKNOWLEDGMENTS
This work was supported by the Initiation Grant Program of the Swedish Foundation for International Cooperation in Research and Higher Education (STINT).

REFERENCES
1. K. Hegde et al., "UCNN: Exploiting computational reuse in deep neural networks via weight repetition," in Proc. 45th Int. Symp. Comput. Archit., 2018, pp. 674–687.
2. A. Yasoubi, R. Hojabr, and M. Modarressi, "Power-efficient accelerator design for neural networks using computation reuse," IEEE Comput. Archit. Lett., vol. 16, no. 1, pp. 72–75, Jan.–Jun. 2017.
3. Z. Du, A. Lingamneni, Y. Chen, K. V. Palem, O. Temam, and C. Wu, "Leveraging the error resilience of neural networks for designing highly energy efficient accelerators," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 34, no. 8, pp. 1223–1235, Aug. 2015.
4. J. Albericio, "Bit-pragmatic deep neural network computing," in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchit., 2017, pp. 382–394.
5. A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 27–40.
6. V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, "SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks," in Proc. 45th Annu. Int. Symp. Comput. Archit., 2018, pp. 662–673.
7. A. Zhou et al., "Incremental quantization: Towards lossless CNNs with low-precision weights," 2017, arXiv preprint arXiv:1702.03044.
8. M. Mahmoud, K. Siu, and A. Moshovos, "Diffy: A déjà vu-free differential deep neural network accelerator," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 134–147.
9. N. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12.
10. Y. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 367–379.
11. D. Dua and C. Graff, "UCI machine learning repository," 2019. [Online]. Available: http://archive.ics.uci.edu/ml
12. F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2220–2233, Aug. 2017.

Hoda Mahdiani is currently working toward the PhD degree in computer engineering at the Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran. Contact her at hoda.mahdiani@ut.ac.ir.

Alireza Khadem is currently working toward the PhD degree at the University of Michigan, Ann Arbor, MI, USA. He received the BSc degree in computer engineering from the University of Tehran, Tehran, Iran. Contact him at arkhadem@umich.edu.

Azam Ghanbari is currently working toward the MSc degree in computer engineering at the Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran. Contact her at azam.ghanbari@ut.ac.ir.

Mehdi Modarressi is currently an Assistant Professor with the Department of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran. Contact him at modarressi@ut.ac.ir.

Farima Fatahi-Bayat received the BSc degree in computer engineering from the University of Tehran, Tehran, Iran. Contact her at farima.fatahi@ut.ac.ir.

Masoud Daneshtalab is currently an Associate Professor with Mälardalen University, Västerås, Sweden, and serves on the Euromicro Board of Directors. Contact him at masoud.daneshtalab@mdh.se.
AutoML for Architecting Efficient and Specialized Neural Networks

Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Kuan Wang, Tianzhe Wang, Ligeng Zhu, and Song Han
Massachusetts Institute of Technology
Abstract—Efficient deep learning inference requires algorithm and hardware codesign to enable specialization: we usually need to change the algorithm to reduce memory footprint and improve energy efficiency. However, the extra degree of freedom from the neural architecture design makes the design space much larger: it is not only about designing the hardware architecture but also codesigning the neural architecture to fit the hardware architecture. It is difficult for human engineers to exhaust the design space by heuristics. We propose design automation techniques for architecting efficient neural networks given a target hardware platform. We investigate automatically designing specialized and fast models, auto channel pruning, and auto mixed-precision quantization. We demonstrate that such learning-based, automated design achieves superior performance and efficiency compared to rule-based human design. Moreover, we shorten the design cycle by 200× compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms.
ALGORITHM AND HARDWARE codesign plays an important role in efficient deep learning computing. Unlike treating the neural network as a black box, there is plenty of room in the neural architecture design that can improve the inference efficiency of deep learning, given fixed target hardware. The benefit usually comes from memory saving and locality improvement. For example, model compression techniques1 including pruning and quantization can drastically reduce the memory footprint and save energy consumption. Another example is small model design. MobileNet2 has only 4.8 MB of model size, which can fit on-chip SRAM and improve the locality.

Digital Object Identifier 10.1109/MM.2019.2953153
Date of publication 12 November 2019; date of current version 14 January 2020.
Figure 1. Design automation for (a) auto model specialization; (b) auto channel pruning; and (c) auto mixed-precision quantization.

However, efficient model design and compression have a large design space. Many different neural network architectures can lead to similar accuracy, but drastically different hardware efficiency. This is difficult to exhaust by rule-based heuristics, since there is a shortage of deep learning and hardware experts to hand-tune each model to make it run fast. It is demanding to systematically study how to design an efficient neural architecture given a specific hardware architecture. We propose hardware-centric AutoML techniques that can automatically design neural networks that are hardware efficient for inference.3–5 Such joint optimization is systematic and can transfer well between tasks. It requires less engineering effort while designing better neural networks at a low cost.

We explore three aspects of neural network design automation (Figure 1): auto design of specialized models, auto channel pruning, and auto mixed-precision quantization. Each aspect is summarized as follows.

There is plenty of specialized hardware for neural networks, but little research has been done for specialized neural network architecture for a given hardware architecture (the reverse specialization). There are several advantages for a specialized model: it can fully utilize the parallelism of the target hardware [e.g., fitting the channel size with the processing element (PE) array size]. Besides, a specialized model can fully utilize the on-chip buffer and improve locality and reuse. Specialization can also match the model's computation intensity with the hardware's roofline model. However, designing a specialized neural network architecture used to be difficult. First, there are limited heuristics. Second, the computation cost used to be prohibitive: even searching a model on the CIFAR-10 data set takes 10^4 GPU hours.6,7 Third, it requires directly taking hardware feedback into the optimization process, since low FLOPs do not directly translate to high hardware efficiency.7 We cut the search cost by two orders of magnitude (actually more than that, since we directly search on ImageNet), which enables us to design specialized models on the target task and target hardware. On the mobile phone, our searched model3 runs 1.8× faster than the best human-designed model.8

After designing a specialized model, compression and pruning is an effective technique to further reduce the memory footprint.1,9 Conventional model compression techniques rely on handcrafted heuristics and rule-based policies that require domain experts to explore the large design space. We propose an automated design flow that leverages reinforcement learning to give the best model compression policy. This learning-based compression policy outperforms the conventional rule-based compression policy by having a higher compression ratio, better preserving the accuracy, and freeing human labor. We applied this automated, push-the-button compression pipeline to MobileNet and achieved 1.81× speedup of measured inference latency on an Android phone and 1.43× speedup on the Titan XP GPU, with only 0.1% loss of ImageNet top-1 accuracy.

The last step is automatic mixed-precision quantization. Emergent DNN hardware accelerators begin to support flexible bitwidth (1–8 b), which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space, trading off among accuracy, latency, energy, and model size. The conventional quantization algorithm ignores the different hardware architectures and quantizes all the layers in a uniform way. We introduce the automated design flow of model quantization, and we take the hardware accelerator's feedback in the design loop. Our framework can specialize the quantization policy for different hardware architectures. It can effectively reduce the latency by 1.4×–1.95× and the energy consumption by 1.9× with negligible loss of accuracy compared with the fixed-bitwidth (8-b) quantization.
AUTOMATED MODEL SPECIALIZATION
In order to fully utilize the hardware resource, we propose to search a specialized convolutional neural network (CNN) architecture for the given hardware. The model is compact and runs fast. We start with a large design space [Figure 1(a)] that includes many candidate operators (e.g., 3 × 3 Conv, 5 × 5 Conv, etc.) to learn which is the best one by gradient descent, rather than hand-picking with rule-based heuristics. Instead of just learning the weight parameters, we jointly learn the architecture parameters [shown in red in Figure 1(a)]. The architecture parameter is the probability of choosing each operator. The search space for each block i consists of many choices:

- ConvOp: mobile inverted bottleneck conv8 with various kernel sizes ({3 × 3, 5 × 5, 7 × 7} in our experiments) and expansion ratios ({3, 6} in our experiments);
- ZeroOp: if ZeroOp is chosen at the ith block, the block is skipped.

Therefore, in our experiments, the total number of possible architectures is

[(3 \times 2) + 1]^{N} = 7^{N},

where the (3 × 2) term counts the ConvOp choices, the 1 counts ZeroOp, and N is the number of blocks (21 in our experiments).

Given the vast design space, it is infeasible to rely on domain experts to manually design the CNN model for each hardware platform. So we need to employ automatic architecture design techniques.

However, early reinforcement learning-based neural architecture search (NAS) methods6,7 are very expensive to run (e.g., 10^4 GPU hours) since they need to iteratively sample an architecture, train it from scratch, and update the meta-controller. It typically requires tens of thousands of networks to be trained to find a good neural network architecture.

We adopt a different approach to improve the efficiency of model specialization.3 We first build a super network that comprises all candidate architectures. Concretely, it has a similar structure to a CNN model in the design space, except that each specific operation is replaced with a mixed operation that has n parallel operators, and we introduce an architecture parameter a_i for each operator to learn which operators are redundant and thereby can be pruned. Finally, only one operator remains within each block, and we retrain this model from scratch on the target task for deployment.

In the forward step, to save GPU memory, we allow only one candidate operator to actively reside in the GPU memory. This is achieved by hard-thresholding the probability of each candidate operator to either 0 or 1. As such, the output of a mixed operation is given as

x_{l} = \sum_{i} g_{i} o_{i}(x_{l-1}),   (1)

where g_i is sampled according to the multinomial distribution derived from the architecture parameters, i.e., p_i = \mathrm{softmax}(a)_i = \exp(a_i) / \sum_{i} \exp(a_i).

In the backward step, we update the weight parameters of the active operators using standard gradient descent. Since the architecture parameters are not directly involved in the computational graph (1), we use the gradient w.r.t. the binary gates to update the corresponding architecture parameters:

\frac{\partial L}{\partial a_{i}} = \sum_{j} \frac{\partial L}{\partial p_{j}} \frac{\partial p_{j}}{\partial a_{i}} \approx \sum_{j} \frac{\partial L}{\partial g_{j}} \frac{\partial p_{j}}{\partial a_{i}}.
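To make the forward step concrete, the sketch below (our own NumPy illustration, not the authors' released code) builds a toy mixed operation: it samples one binary gate from the softmax over architecture parameters so that only one candidate operator runs, which is the memory-saving behavior described above. The candidate operators are placeholders.

```python
# Minimal NumPy sketch of a binary-gated mixed operation (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy candidate operators for one block (stand-ins for ConvOp/ZeroOp).
candidate_ops = [
    lambda x: 2.0 * x,               # stands in for a 3x3 conv
    lambda x: 0.5 * x + 1.0,         # stands in for a 5x5 conv
    lambda x: np.zeros_like(x),      # ZeroOp: skip the block
]

alpha = np.zeros(len(candidate_ops)) # architecture parameters a_i

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op_forward(x, alpha):
    """Sample binary gates g from softmax(alpha); only the active operator
    is evaluated, matching Eq. (1): x_l = sum_i g_i * o_i(x_{l-1})."""
    p = softmax(alpha)
    active = rng.choice(len(candidate_ops), p=p)   # hard-threshold to 0/1
    g = np.zeros_like(p)
    g[active] = 1.0
    return candidate_ops[active](x), g, p

x = np.ones(4)
out, gates, probs = mixed_op_forward(x, alpha)
# In the backward step, dL/da_i would be approximated by
# sum_j (dL/dg_j) * (dp_j/da_i), i.e., the gradient w.r.t. the binary gates.
print(out, gates, probs)
```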
In order to specialize the model for hardware, we need to take the latency running on the hardware as a design reward. However, measuring the inference latency on-device has the following problems: 1) being slow due to the measurement
time; 2) high variance due to different battery conditions and thermal throttling; and 3) latency is nondifferentiable and cannot be directly optimized. To address these, we present our latency prediction model and hardware-aware loss.

Figure 2. Left: ImageNet accuracy (%) and GPU latency (Tesla V100); our specialized model outperforms previous work. Right: the AI-designed specialized model consistently outperforms the human-designed MobileNetV2 under various latency settings.

To build the latency model, we precompute the latency of all possible operators in the architecture space. We query the lookup table during the searching process. The overall latency of the ith block is the weighted sum of the latencies of its operators. Then, we combine the latency and the training loss (e.g., cross-entropy loss) using the following formula:

L = L_{CE} \cdot \alpha \cdot \left[\log\left(\frac{E[\mathrm{LAT}]}{\mathrm{LAT}_{\mathrm{ref}}}\right)\right]^{\beta},   (2)

where L is the loss function, L_CE is the cross-entropy loss, α and β are hyperparameters controlling the accuracy–latency tradeoff, and LAT_ref is the target latency. Note that our formulation not only provides a fast estimation of the searched model but also makes the search process fully differentiable.
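As an illustration of the lookup-table latency model and the loss in (2), here is a small sketch under our own assumptions (made-up per-operator latencies, α and β values, and a single block); it is not the authors' implementation.

```python
# Sketch: expected block latency from a precomputed lookup table, combined
# with cross-entropy loss as in Eq. (2). All numbers are made up.
import math

# Precomputed per-operator latency (ms) for one block on the target device.
latency_lut = {"conv3x3_e3": 1.8, "conv5x5_e3": 2.6, "conv7x7_e6": 4.1, "zero": 0.0}

def expected_latency(probs):
    """E[LAT] = sum_i p_i * LUT[op_i]; differentiable in the probabilities."""
    return sum(p * latency_lut[op] for op, p in probs.items())

def hardware_aware_loss(ce_loss, probs, lat_ref, alpha=0.2, beta=0.6):
    """L = L_CE * alpha * (log(E[LAT] / LAT_ref))**beta, as in Eq. (2).
    (This toy example assumes E[LAT] >= LAT_ref, so the log is nonnegative.)"""
    e_lat = expected_latency(probs)
    return ce_loss * alpha * math.log(e_lat / lat_ref) ** beta

# Example: architecture probabilities for one block (sum to 1).
probs = {"conv3x3_e3": 0.4, "conv5x5_e3": 0.3, "conv7x7_e6": 0.2, "zero": 0.1}
print(expected_latency(probs))                       # weighted-sum latency
print(hardware_aware_loss(2.0, probs, lat_ref=1.0))  # latency-penalized loss
```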
We demonstrate the effectiveness of our model specialization on the ImageNet data set with a CPU (Xeon E5-2640 v4), a GPU (Tesla V100), and a mobile phone (Google Pixel 1). We first search for a specialized CNN model for the mobile phone [Figure 2(b)]. Compared to MobileNet-V2 (the state-of-the-art human-engineered architecture), our model improves the top-1 accuracy by 2.6% while maintaining a similar latency. Under the same level of top-1 accuracy (around 74.6%), MobileNet-V2 has 143-ms latency while ours has only 78 ms (1.83× faster). Compared with the state-of-the-art auto-designed model, MnasNet,10 our model can achieve 0.6% higher top-1 accuracy with slightly lower mobile latency. More remarkably, we reduced the search cost by 200×, from 40 000 GPU hours to only 200 GPU hours.

Figure 2(a) reports the speedup on GPU. Our method achieved superior performance compared to both human-designed and automatically searched architectures. Compared to general-purpose models, our specialized model improves the top-1 accuracy by 1.1%–3.1% while being 1.2×–7.5× faster. It is essential to learn specialized neural networks to cater to different hardware.

Our automated design flow designed CNN architectures that were long dismissed as being too inefficient, but in fact, they are very efficient. For instance, engineers have essentially stopped using 7 × 7 filters, because they are computationally more expensive than multiple, smaller filters (one 7 × 7 layer has the same receptive field as three 3 × 3 layers, but bears 49 weights rather than 27). However, our AI-designed model found that using a 7 × 7 filter is very efficient on GPUs. That is because GPUs have high parallelization, and invoking a large kernel call is more efficient than invoking multiple small kernel calls. This design choice goes against previous human thinking. The larger the search space, the more unknown things you can find. You do not know if something will be better than the past human experience. Let the automated design tool figure it out (kicking neural network design automation into high gear).*

*Available at: http://news.mit.edu/2019/convolutional-neural-network-automation-0321

AUTOMATED CHANNEL PRUNING
Pruning is widely used in model compression and acceleration. It is very important to find the optimal sparsity for each layer during pruning. Pruning too much will hurt accuracy; pruning not enough will not achieve a high compression ratio. This used to be manually determined in previous studies.1 Our goal is to automatically find out the effective sparsity for each layer.
Table 1. AMC speeds up MobileNet. On the Google Pixel 1 CPU, AMC achieves 1.95× measured speedup with batch size one, while saving the memory by 34%. On the NVIDIA Titan XP GPU, AMC achieves 1.53× speedup with a batch size of 50.

                    Million  Top-1  Top-5  GPU latency / speed         Android latency / speed      Memory
                    MAC      Acc.   Acc.
100% MobileNet      569      70.6%  89.5%  0.46 ms / 2191 fps          123.3 ms / 8.1 fps           20.1 MB
75% MobileNet       325      68.4%  88.2%  0.34 ms / 2944 fps          72.3 ms / 13.8 fps           14.8 MB
AMC (50% FLOPs)     285      70.5%  89.3%  0.32 ms / 3127 fps (1.43×)  68.3 ms / 14.6 fps (1.81×)   14.3 MB
AMC (50% latency)   272      70.2%  89.2%  0.30 ms / 3350 fps (1.53×)  63.3 ms / 16.0 fps (1.95×)   13.2 MB

We train a reinforcement learning agent to predict the best sparsity for a given hardware resource.4 We evaluate the accuracy and FLOPs after pruning. Then, we update the agent by encouraging smaller, faster, and more accurate models.

Our automatic model compression (AMC) leverages reinforcement learning to efficiently search the pruning ratio [Figure 1(b)]. The reinforcement learning agent receives an embedding state s_t of layer L_t from the environment and then outputs a sparsity ratio as action a_t. The layer is compressed with a_t (rounded to the nearest feasible fraction). We calculate the average magnitude of the weight tensor for each input channel and remove the input channels with the least magnitude. Then, the agent moves to the next layer L_{t+1}, and receives state s_{t+1}. After finishing the final layer L_T, the reward accuracy is evaluated on the validation set and returned to the agent.
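The sketch below illustrates the per-layer compression step described above: given a sparsity action a_t, rank input channels by average weight magnitude and drop the least important ones. It is our own NumPy illustration of that step (not the AMC code); the embedding state and the reward loop are omitted.

```python
# Sketch of AMC's per-layer action: keep the input channels with the largest
# average weight magnitude, dropping a fraction given by the sparsity action.
import numpy as np

def prune_input_channels(weight, sparsity):
    """weight: conv tensor of shape (out_ch, in_ch, kH, kW).
    sparsity: fraction of input channels to remove (the agent's action a_t)."""
    in_ch = weight.shape[1]
    n_keep = max(1, int(round(in_ch * (1.0 - sparsity))))
    # Average magnitude per input channel across output channels and kernel.
    importance = np.abs(weight).mean(axis=(0, 2, 3))
    keep = np.sort(np.argsort(importance)[::-1][:n_keep])
    return weight[:, keep, :, :], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16, 3, 3))
pruned, kept_idx = prune_input_channels(w, sparsity=0.5)
print(pruned.shape)   # (8, 8, 3, 3): half of the input channels removed
```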
With our framework, we are able to push the expert-tuned limit of fine-grained model pruning. For ResNet-50 on ImageNet, we can push the compression ratio from 3.4× to 5× without loss of accuracy. With further investigation, we find that AMC automatically learns to prune 3 × 3 convolution kernels harder than 1 × 1 kernels, which is similar to human heuristics, since the latter are less redundant.

We applied AMC to MobileNet2 and observed significant speedup both on the GPU and on the mobile phone (Table 1). Since MobileNet is already very compact, it is convincing to compress this model. For a 0.5× FLOPs setting, we achieve 1.81× speedup on a Google Pixel 1 phone. For a 0.5× latency setting, we accurately achieve 1.95× speedup, which is very close to the actual 2× target, showing that AMC can directly optimize inference time and achieve an accurate speedup ratio. On GPUs, we also achieve up to 1.5× speedup, which is less than on the mobile phone but still significant on an already very compact model. The smaller speedup is because a GPU has a higher degree of parallelism than a mobile phone.

AUTOMATED MIXED-PRECISION QUANTIZATION
Conventional quantization methods quantize each layer of the model to the same precision. Mixed-precision quantization is more flexible but suffers from a huge design space that is difficult to explore. Meanwhile, the quantization solution optimized on one hardware platform might not be optimal on another, which raises the demand for specialized policies for different hardware architectures and further increases the design space. Assuming that the bitwidth is between 1 and 8 for both weights and activations, each layer has 8^2 choices. If we have M different neural network models, each with N layers, on H different hardware platforms, there are in total O(H × M × 8^{2N}) possible solutions. Rather than using rule-based heuristics, we propose an automated design flow to quantize different layers with mixed precision. Our hardware-aware automated quantization (HAQ)5 models the quantization task as a reinforcement learning problem. We use the actor-critic model to give the quantization policy (#bits per layer) [Figure 1(c)], and use FPGAs to provide the energy and latency cost, because FPGAs support mixed precision well. The goal is not only high accuracy but also low energy and low latency.

An intuitive reward can be FLOPs or the model size. However, these proxy signals are indirect. They do not translate to latency or energy
improvement. Cache locality, the number of kernel calls, and memory bandwidth all matter. Instead, we use direct latency and energy feedback from the hardware simulator. Such feedback enables our RL agent to learn the hardware characteristics of different layers: e.g., vanilla convolution has more data reuse and locality, while depthwise convolution has less reuse and worse locality, which makes it memory bounded.

Table 2. Latency-constrained quantization on the edge and cloud accelerators on ImageNet.

            Bitwidths   Edge accelerator        Cloud accelerator
                        Top-1    Latency        Top-1    Latency
PACT        4 b         62.44    45.45 ms       61.39    52.15 ms
Ours        flexible    67.40    45.51 ms       66.99    52.12 ms
PACT        5 b         67.00    57.75 ms       68.84    66.94 ms
Ours        flexible    70.58    57.70 ms       70.90    66.92 ms
PACT        6 b         70.46    70.67 ms       71.25    82.49 ms
Ours        flexible    71.20    70.35 ms       71.89    82.34 ms
Original    8 b         70.82    96.20 ms       71.81    115.84 ms

Figure 3. HAQ improves the roofline performance of pointwise layers in MobileNet-V1.

In real-world applications, we have limited resource budgets (i.e., latency, energy, and model size). We would like to find the quantization policy with the best performance given the resource constraint. We encourage our agent to meet the computation budget by limiting the action space. After our RL agent gives actions {a_k} to all layers, we measure the amount of resources that will be used by the quantized model. The feedback is directly obtained from the hardware simulator. If the current policy exceeds our resource budget (on latency, energy, or model size), we sequentially decrease the bitwidth of each layer until the constraint is finally satisfied.
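The following sketch illustrates that budget-enforcement step under our own assumptions (a toy latency model that scales linearly with bitwidth and a placeholder layer list); it is not the HAQ implementation.

```python
# Sketch: enforce a latency budget by sequentially decreasing per-layer
# bitwidths until the estimated cost fits, as described above.

def estimated_latency(bitwidths, base_ms):
    # Toy cost model: assume each layer's latency scales linearly with its
    # bitwidth relative to 8-b execution (a stand-in for the simulator).
    return sum(b / 8.0 * ms for b, ms in zip(bitwidths, base_ms))

def enforce_budget(bitwidths, base_ms, budget_ms, min_bits=1):
    bits = list(bitwidths)
    layer = 0
    while estimated_latency(bits, base_ms) > budget_ms:
        if bits[layer] > min_bits:
            bits[layer] -= 1          # lower one layer at a time, round-robin
        layer = (layer + 1) % len(bits)
        if all(b == min_bits for b in bits):
            break                     # budget unreachable even at 1 b
    return bits

agent_policy = [8, 6, 8, 4]           # bits proposed by the RL agent
base_ms = [10.0, 20.0, 15.0, 5.0]     # per-layer latency at 8 b (made up)
print(enforce_budget(agent_policy, base_ms, budget_ms=30.0))
```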
We applied HAQ to three different hardware architectures to show the importance of a specialized quantization policy. Inference on edge devices and cloud servers can be quite different, since 1) the batch size on the cloud servers is larger, and 2) the edge devices are usually limited to low computation resources and memory bandwidth. We use the embedded FPGA Xilinx Zynq-7020 as our edge device and the server FPGA Xilinx VU9P as our cloud device to compare the specialized quantization policies.

Compared to fixed 8-b quantization (PACT11), our automated quantization consistently achieved better accuracy under the same latency (see Table 2). With similar accuracy, HAQ can reduce the latency by 1.4×–1.95× compared with the baseline.

Our agent gave specialized quantization policies for the edge and cloud accelerators. The policy is quite different on different hardware. For the activations, the depthwise convolution layers are assigned much lower bitwidth than the pointwise layers on the edge device; however, on the cloud device, the bitwidths of these two types of layers are similar to each other. For the weights, the bitwidths of these types of layers are nearly the same on the edge; however, on the cloud, the depthwise convolution layers are assigned much higher bitwidth than the pointwise convolution layers.

We interpret the quantization policy's difference between edge and cloud by the roofline model. Operation intensity is defined as operations (MACs in neural networks) per DRAM byte accessed. A lower operation intensity indicates that the model suffers more from memory access. On the edge accelerator, which has much less memory bandwidth, our RL agent allocates fewer activation bits to the depthwise convolutions since the activations dominate the memory access. On the cloud accelerator, which has more memory bandwidth, our agent allocates more bits to the depthwise convolutions and fewer bits to the pointwise convolutions to prevent them from being computation bounded. Figure 3 shows the roofline model before and after HAQ. HAQ gives a more reasonable policy to allocate the bits for each layer and pushes all the points toward the upper-right corner, which is more efficient.
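As a concrete reading of the roofline argument, the sketch below computes operation intensity (MACs per DRAM byte) for a layer under mixed-precision operands; the layer shape and the assumption that every weight and activation crosses DRAM exactly once are ours, not the article's.

```python
# Sketch: operation intensity = MACs / DRAM bytes, showing how fewer bits
# raise the intensity of a memory-bound (e.g., depthwise) layer.

def conv_macs(out_h, out_w, out_c, in_c_per_group, k):
    return out_h * out_w * out_c * in_c_per_group * k * k

def dram_bytes(n_weights, n_activations, w_bits, a_bits):
    # Assume each weight and activation crosses DRAM once (simplification).
    return (n_weights * w_bits + n_activations * a_bits) / 8.0

# Depthwise 3x3 layer over a 112x112x32 feature map (toy MobileNet-like shape).
macs = conv_macs(112, 112, 32, 1, 3)
weights = 32 * 3 * 3
acts = 2 * 112 * 112 * 32            # input + output feature maps

for a_bits in (8, 4):                # fewer activation bits -> higher intensity
    oi = macs / dram_bytes(weights, acts, w_bits=8, a_bits=a_bits)
    print(f"{a_bits}-b activations: {oi:.1f} MACs/byte")
```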
Figure 4. Combining all of the techniques together further improves the accuracy–latency tradeoff by a significant margin on BitFusion.

Finally, we integrate all of these techniques together, including auto model specialization, auto channel pruning, and auto mixed-precision quantization. Figure 4 shows the results on the BitFusion accelerator, where we can observe significant improvements over the baselines, including ProxylessNAS (with 8-b quantization), ProxylessNAS + AMC (with 8-b quantization), MobileNetV2 (with 4-b/6-b quantization), and MobileNetV2 + HAQ (mixed-precision quantization with different target latencies).

CONCLUSION
In this article, we presented design automation techniques for efficient deep learning computing. There is plenty of room at the algorithm level to improve hardware efficiency, but the large design space makes it difficult to exhaust by humans, and conventional AutoML techniques are not hardware efficient for either search or inference. We covered three aspects of design automation: specialized model design, compression and pruning, and mixed-precision quantization. Such learning-based design automation outperformed rule-based heuristics. Our framework reveals that the optimal design policies on different hardware architectures are drastically different; therefore, specialization is important. We interpreted those design policies and believe the insights will inspire the future software and hardware codesign for efficient deep learning computing.

REFERENCES
1. S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proc. Int. Conf. Learn. Representations, 2016.
2. A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861
3. H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct neural architecture search on target task and hardware," in Proc. Int. Conf. Learn. Representations, 2019.
4. Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 784–800.
5. K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, "HAQ: Hardware-aware automated quantization with mixed precision," in Proc. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8612–8620.
6. B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in Proc. Int. Conf. Learn. Representations, 2017.
7. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8697–8710.
8. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
9. J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, "Scalpel: Customizing DNN pruning to the underlying hardware parallelism," ACM SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 548–560, 2017.
10. M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," in Proc. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2820–2828.
11. J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: Parameterized clipping activation for quantized neural networks," 2018, arXiv:1805.06085. [Online]. Available: https://arxiv.org/abs/1805.06085
Han Cai is currently working toward the Ph.D. degree with the Electrical Engineering and Computer Sciences Department, Massachusetts Institute of Technology, Cambridge, MA, USA, under the supervision of Prof. Song Han. He received the M.Eng. degree in computer science from Shanghai Jiao Tong University, Shanghai, China. His research mainly focuses on efficient deep learning and AutoML. Contact him at hancai@mit.edu.

Ji Lin is currently working toward the Ph.D. degree with the Electrical Engineering and Computer Sciences Department, Massachusetts Institute of Technology, Cambridge, MA, USA, under the supervision of Prof. Song Han. He received the B.Eng. degree in electronic engineering from Tsinghua University, Beijing, China. His research mainly focuses on efficient deep learning and its applications. Contact him at jilin@mit.edu.

Yujun Lin is currently working toward the Ph.D. degree with the Electrical Engineering and Computer Sciences Department, Massachusetts Institute of Technology, Cambridge, MA, USA, under the supervision of Prof. Song Han. He received the B.Eng. degree in electronic engineering from Tsinghua University, Beijing, China. His research mainly focuses on efficient deep learning acceleration and machine learning-assisted hardware optimization. Contact him at yujunlin@mit.edu.

Zhijian Liu is currently working toward the Ph.D. degree with the Electrical Engineering and Computer Sciences Department, Massachusetts Institute of Technology, Cambridge, MA, USA, under the supervision of Prof. Song Han. He received the B.Eng. degree in computer science from Shanghai Jiao Tong University, Shanghai, China. His research mainly focuses on efficient and hardware-friendly machine learning and its applications in vision and language. Contact him at zhijian@mit.edu.

Kuan Wang is currently a Research Assistant with HAN Lab, Massachusetts Institute of Technology, Cambridge, MA, USA, under the supervision of Prof. Song Han. He is an undergraduate student with Tsinghua University, Beijing, China. His research interests focus on the intersection of computer vision, deep learning, and efficient hardware architecture. He is a student member of the IEEE. Contact him at kuanwang@mit.edu.

Tianzhe Wang is currently a Research Assistant with HAN Lab, Massachusetts Institute of Technology, Cambridge, MA, USA, under the supervision of Prof. Song Han. His research focuses on efficient and automated deep learning and their applications in vision, speech, and robotics. He is an undergraduate student of the ACM Class in Zhiyuan College (Honored Program), Shanghai Jiao Tong University, Shanghai, China. Contact him at usedtobe@mit.edu.

Ligeng Zhu is currently a Research Assistant with HAN Lab, Massachusetts Institute of Technology, Cambridge, MA, USA, under the supervision of Prof. Song Han. His research focuses on efficient machine learning and computer vision. He received the B.Sc. and B.Eng. degrees in computer science from Simon Fraser University, Burnaby, BC, Canada, and Zhejiang University, Hangzhou, China, respectively. Contact him at ligeng@mit.edu.

Song Han is currently an Assistant Professor with the Electrical Engineering and Computer Sciences Department, Massachusetts Institute of Technology, Cambridge, MA, USA. His research focuses on efficient deep learning computing at the intersection between machine learning and computer architecture. He proposed "Deep Compression" and the efficient hardware implementation "EIE Accelerator" that impacted the industry. He received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA. His work received the best paper award at ICLR'16 and FPGA'17. He is listed by MIT Technology Review's 35 Innovators Under 35. Contact him at songhan@mit.edu.
In-Hardware Moving Compute to Data Model to Accelerate Thread Synchronization on Large Multicores

Masab Ahmad, University of Connecticut
José A. Joao, Arm Research
Halit Dogan, University of Connecticut
Omer Khan, University of Connecticut

Abstract—In this article, the moving computation to data model (MC2D) is proposed to accelerate thread synchronization by pinning shared data to dedicated cores and utilizing in-hardware core-to-core messaging to communicate critical code execution. The MC2D model optimizes shared data locality by eliminating unnecessary data movement, and alleviates contended synchronization using nonblocking communication between threads. This article evaluates task-parallel algorithms under their synchronization-centric classification to demonstrate that the effectiveness of the MC2D model to exploit performance correlates with the number and frequency of synchronizations. The evaluation on the Tilera TILE-Gx72 multicore shows that the MC2D model delivers the highest performance scaling gains for ordered and unordered algorithms that expose significant synchronizations due to task and data level dependencies. The MC2D model is also shown to deliver on-par performance with the traditional atomic operations based model for highly data-parallel algorithms from the unordered category.

Digital Object Identifier 10.1109/MM.2019.2955079
Date of publication 22 November 2019; date of current version 14 January 2020.
IN THIS ARTICLE, the moving compute to data (MC2D) model is proposed to accelerate thread synchronizations in large cache coherent multicores.1–3 The MC2D model pins shared data to dedicated core(s), and utilizes in-hardware core-to-core messaging to invoke critical code sections at those cores. Consequently, data locality is optimized by preventing unnecessary shared data movement between cores. The objective of this article is to study fundamental questions regarding practical adoption of the MC2D synchronization model. What characteristics of a parallelized algorithm are best suited for the MC2D model? Does the MC2D model port efficiently to highly data parallel algorithms? These aspects are evaluated for the MC2D model on a real multicore machine, Tilera's Tile-Gx72, that enables both hardware cache coherence and in-hardware core-to-core messaging. Moreover, the MC2D model is compared against spin-lock and atomic instructions based synchronization models for a representative set of task-parallel algorithms.

Task parallelism is a popular strategy for multicore processors to exploit fine-grain parallelism. This article creates a synchronization-centric characterization of task-parallel algorithms to guide the hypothesis that the MC2D model works best when an algorithm exhibits high synchronizations. Work efficiency is a fundamental metric used to evaluate the efficacy of an algorithm. However, exploiting task-parallelism while maximizing work efficiency is a hard problem, since it requires ordered task execution. Frameworks such as kinetic dependence graphs (KDG)4 enforce ordered task-parallel execution by introducing significant synchronizations to globally order tasks among threads. The Galois5 framework creates relax-ordered task-parallel operators that introduce locally ordered task processing per core. Although this approach reduces the need for global task synchronizations, it introduces races in task-level data dependencies that add redundant computations in the algorithms for convergence. Both ordered and relax-ordered algorithms expose significant synchronizations that open opportunities for the MC2D model to improve performance scalability.

Many task-parallel algorithms also exist that do not enforce task ordering, and hence avoid work inefficiencies, in their task-parallel implementations. This unordered task execution category, however, may still implement synchronizations due to data dependencies between tasks. Depending on the number and frequency of these synchronizations, the MC2D model seamlessly adapts and delivers competitive performance. In highly parallel form, the only synchronizations present in some algorithms are enforced when all threads observe barrier synchronization to transition from one phase to the next. Even for completely data-parallel algorithms, the MC2D model is shown to be competitive with the traditional spin-lock and atomic operations based synchronization model.

Various task-parallel algorithms from the domains of graph processing, machine learning, database, and data analysis are characterized as ordered, relax-ordered, unordered with task-level data dependencies, and unordered with thread-level ordering. The evaluation on Tilera's Tile-Gx72 multicore shows that the MC2D model performs best under high synchronizations while it matches the performance of the state-of-the-art atomic model for highly data-parallel algorithms.

THREAD SYNCHRONIZATION MODELS
Thread synchronization under traditional shared memory is done using an atomic memory operation in hardware by locking a cache line in the private cache of the requesting core (near atomic), or as a remote read–modify–write operation at the home last-level cache location for the cache line (far atomic). As the core count increases, the atomic instructions suffer from the cost of expensive data sharing as cache lines ping-pong between the communicating threads. When applicable, the atomic instructions can be directly utilized to implement synchronization. However, they are limited to specific operations and data sizes, thus limiting their applicability to a wide range of critical section implementations. In this case, the atomic operation is utilized to build the widely applicable spin-lock-based synchronization model that protects an arbitrary critical code section. In addition to cache line ping-pong, the spin-lock implementation also suffers from instruction retries under contended thread synchronizations.
The MC2D model removes atomic operations from each critical code section. Consequently, the critical code sections are serialized on a dedicated service core. The shared data structures associated with the critical sections are pinned and updated by the dedicated service core. The actual application threads (executing on worker cores) send in-hardware critical section execution requests to the service core. The service core receives the request messages one at a time, and executes the critical sections sequentially to maintain atomicity of operations. Depending on the algorithmic requirements, the worker thread may wait for a reply message from the service thread (blocking communication), or just continue to perform other work as soon as it sends the message (nonblocking communication). The serialization of critical code sections at a single service core suppresses the exploitable parallelism in such computations. Therefore, multiple cores are assigned as service cores by dividing nonoverlapping shared data among them. In this case, the worker threads send their critical section requests to the corresponding service thread based on the mapping logic of the shared data. The MC2D model exploits shared data locality, and eliminates both retries and ping-pong of shared data by pinning it at a dedicated service core. However, the computations in the worker and service cores must be load balanced for near-optimal performance.
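As a language-neutral illustration of this offloading pattern (not the Tilera implementation, and not using its TMC messaging API), the sketch below models worker threads that send critical-section requests to a service thread that owns the shared data, instead of taking a lock; Python queues stand in for the in-hardware message channels.

```python
# Conceptual sketch of MC2D-style offloading: a service thread owns the
# shared counter and applies critical sections sent by worker threads.
import queue
import threading

requests = queue.Queue()          # worker -> service "messages"
shared_counter = 0                # pinned to the service thread only
STOP = object()

def service_thread():
    # Executes critical sections one at a time, preserving atomicity
    # without any locks or cache-line ping-pong on the shared data.
    global shared_counter
    while True:
        msg = requests.get()
        if msg is STOP:
            break
        amount, reply = msg
        shared_counter += amount
        if reply is not None:     # blocking request: send a reply message
            reply.put(shared_counter)

def worker_thread(n_updates):
    for _ in range(n_updates):
        requests.put((1, None))   # nonblocking: fire-and-forget update

svc = threading.Thread(target=service_thread)
workers = [threading.Thread(target=worker_thread, args=(1000,)) for _ in range(4)]
svc.start()
for w in workers: w.start()
for w in workers: w.join()
requests.put(STOP)
svc.join()
print(shared_counter)             # 4000: all updates applied exactly once
```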
The MC2D model is shown to improve performance scalability over the traditional atomic and spin-lock synchronization models, as core counts increase.2,3 However, several questions regarding the practical adoption of the MC2D model remain unanswered. What characteristics of a parallelized algorithm are best suited for the MC2D model? Does the MC2D model port efficiently to a broad category of parallel algorithms? Based on the widely popular task-parallel execution model, this article presents a detailed algorithm-centric instrumentation and characterization of the MC2D model.

Figure 1. Workload ordering and dependence categorization for synchronization inference.

ALGORITHM-CENTRIC CLASSIFICATION OF THREAD SYNCHRONIZATIONS
Task parallelism simplifies parallel programming^I and has gained popularity due to its integration in modern concurrency frameworks.4,5 The programmer specifies tasks to expose parallelism, where each task is a unit of computation that executes in parallel with other tasks. However, system aspects, such as load balancing and thread synchronizations, are managed by library (or framework) constructs. This machine-independent execution model allows task-parallel algorithms to scale at large core counts. All task-parallel algorithms operate under some sort of a task ordering execution model. A task's execution order may depend on another task, or a group of tasks may require all threads to synchronize for their interdependencies to propagate. Moreover, tasks may modify shared data with read and/or write dependencies. All these inter and intra task dependencies lead to thread synchronizations. In the most parallel form, all tasks execute with no inter or intra dependencies. To contextualize the relevance of increasing synchronization for task-parallel algorithms, Figure 1 presents an algorithm-centric classification.

^I https://software.intel.com/tbb

Ordered Tasks category strictly enforces an execution order of tasks among the cores. This strategy operates with the work efficiency of the sequential algorithm counterpart. However, extracting parallelism among tasks while enforcing a global task execution order is a hard problem. KDG4 is a task-parallel execution framework that supports strictly ordered tasks. It implements a parallel queue (e.g., a priority queue) per core, and enforces globally ordered push and
extract of tasks using an order list that is visible to all cores. Before dequeuing a task, a safe-source test checks whether that task has another dependent task in another core. If so, the dequeue operation waits until the inter task dependencies resolve. Otherwise, independent tasks are allowed to concurrently execute in their respective cores. Note that during a task's execution, its shared data updates may also require thread synchronizations. The KDG framework utilizes the above strategy to enable task-parallelism while ensuring work-efficient algorithmic execution. An alternative strategy adopted for ordered algorithms is to trade off work-efficiency for parallelism. For example, the Galois framework5 implements per-core priority queues while ignoring the global intertask execution order. This relax-ordered category mitigates global thread synchronizations by not performing the safe-source test. However, the shared data values are monotonically (and synchronously) updated to ensure task data dependencies are enforced. This leads to multiple iterations for the algorithms to converge, which increases the redundant work performed by the cores. Adopting the ordered and relax-ordered algorithms maximizes the need for synchronizations among threads.

Unordered Tasks category encompasses task-parallel algorithms that enforce no local or global ordering among tasks. Consequently, their work-efficient implementations result in good performance scalability on parallel machines. However, these algorithms may still implement synchronizations due to shared data dependencies among tasks. On the other hand, when no such shared data dependencies exist, most (if not all) algorithms require multiple phases of task-parallel computations. These phases execute independent tasks in parallel, but their data outputs must synchronously propagate from one layer to another. This thread-level ordering is enforced using primitives, such as barrier synchronization. Thread synchronizations in unordered algorithms increase as the data dependencies between tasks, or the frequency of barrier synchronization, increase.

Figure 1 shows the classification of ordered through unordered task-parallel algorithms, and the impact of thread synchronizations on their performance scalability. The MC2D model is hypothesized to work best as algorithms experience increasing thread synchronizations. The following section describes how various categories of task-parallel algorithms can be ported to the MC2D model.

TASK PARALLELISM UNDER THE MC2D MODEL
The programming model of a task-parallel shared-memory application is not changed for the MC2D model. The only difference is that critical section requests are moved to a separate routine that is processed by the service threads. The critical section code in each (worker) thread is replaced with a request message to invoke execution by the corresponding service thread. The barrier synchronization is handled in a similar manner, but instead of a dedicated service thread, one of the worker threads handles the barrier. This process is automated by identifying all synchronization points in the code. Similar to RCL,6 refactoring tools can be easily utilized to automatically transform existing applications. However, this work performs manual transformations.

Figure 2 presents an abstract framework construct (similar to KDG) outlining data structures and code that exhibit thread synchronizations for ordered task-parallel algorithms. The pseudocode on the left represents a generic code that a thread executes under the atomic synchronization model. Task orderings are maintained using a per-core taskQueue and a global orderList. A thread first peeks into the taskQueue, and invokes the safe-source test to check (using shared data reads) if any other core has the same task with a different priority. The task is allowed to proceed to execution only when it is either globally independent, or has the highest global priority order. The task is first removed from the local taskQueue, and synchronously removed from the global orderList. During execution, and depending on the algorithm, all data dependencies among the tasks being executed in all cores are resolved using atomic critical code sections. Moreover, new task(s) are produced and pushed into the task queues and order lists. Again, the global orderList is updated synchronously. Under a relax-ordered algorithm, the safe-source test is not performed, and thus the orderList is also not implemented. However, the taskQueue is maintained per core to
enforce locally ordered execution of tasks. These algorithms resolve the inter task dependencies in a monotonic manner to converge to their final solution. This is done by re-executing certain tasks when their data dependencies have not converged, thus increasing redundant work.

Figure 2. Generic framework construct outlining data structures and pseudocode requiring synchronizations for ordered task-parallel algorithms.

Although the pseudocode is shown for ordered algorithms, it is easily portable to unordered algorithms. Both taskQueue and orderList can be replaced with a simple per-core data structure, such as an array, to schedule tasks for execution. The synchronizations in unordered algorithms only arise due to certain data dependencies among tasks, which are implemented using atomic critical code sections. Finally, an unordered algorithm may not even implement synchronization among tasks, and only synchronize threads from one phase of concurrent tasks to another phase. This results in barrier synchronization after all tasks within a given layer complete and propagate outputs to the next layer.

Each synchronization point discussed in the context of the abstract constructs for ordered, relax-ordered, and unordered algorithms is instrumented for conversion to the MC2D model. Figure 2 shows an ordered algorithm's synchronization points as arrows from the atomic to the MC2D implementation. The safe-source test is an optional conversion point since it only reads shared data values, which can be done using traditional load instructions or the MC2D model. In the MC2D case, data locality is optimal since shared data is pinned on the service core(s), and in-hardware send and receive messages (using blocking communication) are utilized. The nonoverlapping regions of the orderList are pinned among the service core(s) based on a heuristic2 that utilizes the profiled percentage of shared work to determine the right ratio of worker and service cores in the processor. The objective of the heuristic is to optimize load balancing of the work done among all cores to maximize parallelism. All orderList update requests from each worker core are offloaded to the corresponding service core using (nonblocking) in-hardware messages. The MC2D model avoids expensive data movements for shared data, and thus exploits data locality at the service cores. A similar strategy is used for all shared data structures for each child task being processed by a parent task. This is shown as offloading the critical code section(s) from worker to service cores using in-hardware send and receive messages.

For relax-ordered and unordered algorithms, the safe-source test and the global task ordering (i.e., the orderList) are removed. However, the synchronization points are expected to be limited to one for the critical code section(s) within the task computations, and another at the completion of all tasks within a layer of computations (not shown in the figure). In summary, the MC2D model pins shared data at the service core(s), and exploits data locality to accelerate
METHODOLOGY
Tilera's Tile-Gx72 processor is used to evaluate various thread synchronization models against the MC2D model. The processor consists of 72 cores executing at 1 GHz, and includes a double data rate (DDR) main memory with 16-GB capacity. Each core implements a two-level cache hierarchy, where the level-two shared cache is physically distributed among the cores and interconnected using a two-dimensional mesh network. Directory-based hardware cache coherence enables data accesses between cores. The machine also enables core-to-core explicit messaging using its user-defined network (UDN). Four in-hardware UDN queues are integrated into each core to send and receive messages through a high-level API library, Tilera Multicore Components (TMC).2 All benchmarks are compiled with a modified version of GCC 4.4.7. The evaluation is performed using up to 64 cores.

Thread Synchronizations in Tile-Gx72
The MC2D model is implemented using the in-hardware messaging support. Each synchronization point in a shared-memory application is ported as outlined in the section "Task Parallelism Under the MC2D Model." All communication that does not use explicit messages is carried out under traditional hardware cache coherence load/store accesses. The following traditional synchronization models are also utilized for comparison with the MC2D model.

Spin-Lock and Atomic Models: Tilera offers various atomic operations for efficient thread synchronization on shared data, including cmpexch, fetchadd, fetchaddgez, and exch. Compare-and-exchange (cmpexch) is utilized to build the widely applicable spin-lock synchronization model that can protect any arbitrary critical code section. When applicable, an atomic operation is used directly to implement the atomic model.

MC2D_shmem Model: MC2D_shmem is a shared-memory-only version of the MC2D model, which uses a shared software buffer per thread to enable messaging between worker and service cores. Although MC2D_shmem benefits from improved locality for shared data, it suffers from bouncing of the shared buffer between worker and service threads, which limits performance scaling.3
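For concreteness, the spin-lock and atomic models described above can be approximated with GCC's __sync built-ins in place of Tilera's native cmpexch/fetchadd intrinsics; this is an illustrative sketch rather than the benchmark code.

    /* Sketch of the two baseline models, written with portable GCC
     * __sync built-ins as stand-ins for Tilera's cmpexch/fetchadd
     * intrinsics. */

    /* Spin-lock model: a compare-and-exchange based lock protects an
     * arbitrary critical code section. */
    static volatile int lock_word = 0;

    static void spin_lock(void)
    {
        /* retry until the lock word flips from 0 to 1 */
        while (__sync_val_compare_and_swap(&lock_word, 0, 1) != 0)
            ;                        /* spin; each retry may ping-pong the line */
    }

    static void spin_unlock(void)
    {
        __sync_lock_release(&lock_word);   /* store 0 with release semantics */
    }

    static void update_spin(long *shared, long delta)
    {
        spin_lock();
        *shared += delta;                  /* critical code section */
        spin_unlock();
    }

    /* Atomic model: when the critical section reduces to a single update,
     * one atomic operation (e.g., fetch-and-add) replaces the lock. */
    static void update_atomic(long *shared, long delta)
    {
        __sync_fetch_and_add(shared, delta);
    }

The spin-lock variant protects any critical section, whereas the atomic variant applies only when the critical section maps onto a single hardware-supported operation.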

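MC2D_shmem can be pictured as a per-worker mailbox in shared memory that a service thread polls, roughly as sketched below (an illustrative approximation, not the authors' implementation). Because the worker and the service thread both read and write the same buffer, its cache line bounces between their cores, which is the scaling limiter noted above.

    /* Illustrative MC2D_shmem-style mailbox: one per worker thread,
     * polled by a service thread. Not the authors' implementation. */
    typedef struct {
        volatile int  full;    /* 0 = empty, 1 = request pending         */
        volatile long key;     /* shared data element to update          */
        volatile long value;   /* operand, then reused for the reply     */
    } mailbox_t;

    /* Worker: post a request and wait for the service thread's reply. */
    long shmem_update(mailbox_t *box, long key, long value)
    {
        box->key = key;
        box->value = value;
        __sync_synchronize();          /* publish payload before the flag */
        box->full = 1;
        while (box->full)              /* wait; this line ping-pongs      */
            ;
        return box->value;             /* reply left in place by service  */
    }

    /* Service thread: poll the mailboxes of its workers. */
    void shmem_service(mailbox_t *boxes, int nworkers, long *shared_data)
    {
        for (;;) {
            for (int i = 0; i < nworkers; i++) {
                if (boxes[i].full) {
                    shared_data[boxes[i].key] += boxes[i].value;  /* critical section */
                    boxes[i].value = shared_data[boxes[i].key];   /* reply            */
                    __sync_synchronize();
                    boxes[i].full = 0;                            /* release the slot */
                }
            }
        }
    }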
Benchmarks
Various task-parallel algorithms from the diverse application domains of graph processing, machine learning, databases, and data analysis are analyzed to show the applicability and portability of the MC2D model. Several graph algorithms, namely KCORE, SSSP, A*, BFS, MST, and COLOR, are considered as representative ordered and relax-ordered algorithms. These algorithms are ported from state-of-the-art ordered and relax-ordered parallelism works.4,5,7 Their task-parallel implementations follow the outline in Figure 2.

Several unordered algorithms are also considered. The triangle counting (TC)8 graph algorithm, the YCSB database workload,9 and SGD machine learning10 are representative unordered algorithms that process tasks with thread synchronizations for task-level data dependencies. For example, YCSB processes transaction requests in parallel, but uses synchronized timestamp ordering to keep track of write accesses by using per-row write history tables. At commit for a request within a transaction, YCSB synchronously checks whether the reads of the current transaction overlap with other concurrent writes. If there are overlapping writes, the transaction is aborted; otherwise, the changes in the transaction are applied to the database.

Several highly parallel unordered algorithms are also considered that implement thread-level ordering across layers of task-parallel computations. These include the PAGERANK, COMMUNITY, and CONN. COMP.8 graph algorithms, and the deep neural networks SQUEEZENET11 and GTRSB.12 For example, SQUEEZENET implements multiple neural computations per layer, where each layer processes its tasks in parallel across cores. Barrier synchronizations are implemented to propagate output neural values from layer to layer.

For graph algorithms, three real-world graphs are used to explore input diversity. These are CAL from DIMACS,II LiveJournal from the Network Repository,III and CAGE14 from the SuiteSparse Matrix Collection.IV From CAL to CAGE14, the graph size and density increase while the diameter decreases. For GTRSB and SQUEEZENET, an image from the ImageNet RepositoryV is processed for inference. In SGD, the real-simVI input is used, which evaluates 20 958 features. YCSB implements access to database entries using a Zipfian distribution. It includes a parameter called theta to control the contention level. Setting theta to 0.6 means that 10% of the database is accessed by 40% of all transactions. The theta value is varied from 0.6 to 0.9 in increments of 0.05, and the average completion time across these theta values is used for the performance comparison.

II. http://users.diag.uniroma1.it/challenge9/download.shtml
III. http://networkrepository.com/livejournal.php
IV. https://sparse.tamu.edu/vanHeukelum/cage14
V. http://www.image-net.org/
VI. https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/

Evaluation Metrics
All evaluated algorithms are implemented using the spin-lock, atomic, and both software-only and in-hardware MC2D models, using the capabilities of the Tile-Gx72 processor. All models utilize the Pthreads library to spawn threads. Completion time is used as the evaluation metric, where all algorithms are run to completion, and only the parallel region is measured for performance analysis. The completion time of the worst-case thread is broken down into nonsynchronization and synchronization components. For the spin-lock and atomic models, synchronization is measured as the time spent in the atomic operation, as well as the time spent in the critical code section. The remaining thread-local computation accounts for the nonsynchronization time. For the MC2D model, however, the synchronization time accounts for the time spent in completing each send and receive message. The tmc_udn_send_n() routine is used to send request messages to the service threads, where the message is placed into a core-specific hardware queue, and then the send instruction completes. This nonblocking nature of send messages allows the MC2D model to offload critical-section work, and the worker thread can overlap computation with synchronization. However, the tmc_udn0_receive() routine implements receive messages in a blocking manner. Hence, the time taken by the critical code section and the message traversal is accounted when the receive completes. From a worker thread's perspective, the time taken to complete all critical-section work is implicitly accounted via receive messages.
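The accounting described above can be sketched as follows, reusing the hypothetical mc2d_send()/mc2d_receive() wrappers from the earlier sketch (on the Tile-Gx72 these would map onto tmc_udn_send_n() and tmc_udn0_receive()); the cycle-counter routine and the per-thread bookkeeping are likewise placeholders, not the instrumented benchmark code.

    /* Approximate sketch of the completion-time breakdown for one worker. */
    extern unsigned long cycles(void);          /* placeholder cycle counter  */
    extern void do_local_computation(void);     /* independent task work      */

    static unsigned long sync_time, nonsync_time;   /* per-thread in practice */

    void timed_worker_update(int my_core, int service_core, long key, long value)
    {
        unsigned long t0 = cycles();

        mc2d_msg_t req = { my_core, 1, key, value };
        mc2d_send(service_core, &req);          /* nonblocking send            */
        unsigned long t1 = cycles();
        sync_time += t1 - t0;                   /* only the send itself        */

        do_local_computation();                 /* overlaps the service core's */
        unsigned long t2 = cycles();            /* critical-section work       */
        nonsync_time += t2 - t1;

        (void)mc2d_receive();                   /* blocking receive            */
        unsigned long t3 = cycles();
        sync_time += t3 - t2;                   /* critical section + traversal */
    }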

EVALUATION
The MC2D model is anticipated to mitigate the synchronization bottleneck as the number of cores per chip increases. Therefore, the spin, atomic, MC2D_shmem, and in-hardware MC2D models are evaluated at 8, 16, 32, and 64 threads by pinning a single thread per core. The average speedup is measured for all benchmarks over a sequential implementation optimized for single-thread performance. Figure 3 shows the average speedup for each thread synchronization model as the core count increases. The in-hardware MC2D model demonstrates superior performance scaling. The atomic model keeps up with the MC2D model at low core counts (less than 32), but the performance gap rapidly increases to 33% at 64 cores. The spin and MC2D_shmem models both show diminishing performance scaling since they both suffer from increasing overheads of cache line ping-pongs and instruction retries. MC2D_shmem is slightly better than spin due to its locality optimizations for shared data, but the shared buffer used to communicate between worker and service cores still ping-pongs. The in-hardware MC2D model delivers near-optimal access to all shared data pinned at the service cores, and even surpasses the atomic model when the on-chip network becomes a bottleneck at high core counts. Therefore, the remaining evaluation focuses on the comparison between the MC2D and atomic models.

Figure 3. Average speedup of spin, atomic, MC2D_shmem, and MC2D as compared to sequential, at different core counts.

Figure 4. Normalized completion time breakdown results of MC2D over the atomic model.

Figure 4 shows a per-benchmark and per-input evaluation of MC2D against the atomic model, where the y-axis shows the completion time normalized to the atomic model. Furthermore, the completion time is broken down into nonsynchronization and synchronization components. Consequently, the performance gain is calculated as the percentage decrease in completion time for the MC2D model relative to the atomic model.

The ordered benchmarks are classified into ordered and relax-ordered implementations based on their performance under the atomic model. Ordered benchmarks with the MC2D model consistently deliver a 30%–65% decrease in completion time compared to the atomic model. The nonsynchronization time improves because the MC2D model offloads the critical code sections to service cores, thus reducing the code executed on the worker cores. The synchronization time is observed to be a significant component of the completion time (more than 50% on average) for the ordered benchmarks. For example, the safe-source test needs to wait on synchronization to process the next task in each thread, which increases its significance in terms of accelerating synchronization. It improves because the MC2D model takes advantage of core-level shared data locality and avoids unnecessary cache line ping-pongs between cores. Moreover, the nonblocking nature of the MC2D model allows it to overlap computation with communication. Therefore, significant improvements are observed in synchronization times for the MC2D model over the atomic model.

The relax-ordered benchmarks observe smaller benefits from the MC2D model. Relaxed task ordering reduces the synchronizations needed to order tasks globally across cores. However, more work is done in each core to converge these algorithms to their solutions, which results in a higher nonsynchronization component for these benchmarks. The MC2D model still improves performance by accelerating the synchronizations that resolve shared data dependencies among tasks, as well as barrier synchronizations across algorithmic iterations. SSSP has relatively large critical code sections compared to the BFS and COLOR benchmarks; therefore, SSSP observes higher performance benefits with the MC2D model.

Figure 4 also shows the evaluation for unordered benchmarks, which are separated into two synchronization categories. The unordered benchmarks with task-level data dependencies exhibit significant synchronizations that must be handled within a task's execution. TC consists of tasks that are dominated by synchronization work. Hence, as graph density increases from CAL to CAGE, the stress on synchronizations also increases since each parent task synchronously updates an increasing number of child/new tasks. Therefore, most cores are assigned as service cores in TC. The shared data locality at the service cores also increases with graph density due to increasing locality in edges. CAL does not observe benefits in either component because the graph is sparse (on average 1.2 child/new tasks per parent) and exhibits random edge connectivity.
Therefore, under the MC2D model, worker cores tend not to have enough computation to overlap communication. Moreover, the synchronization updates to random edges do not offer shared data locality benefits under the MC2D model. On the other hand, CAGE is a dense graph and exposes data locality on edges. The worker cores now have sufficient computation to overlap communication, and the service cores demonstrate improved shared data locality. Therefore, the MC2D model improves the synchronization component significantly, but this comes at the cost of increased nonsynchronization time due to reduced parallelism (i.e., fewer worker cores). The nonsynchronization time of the atomic model is better than that of the MC2D model because it has more cores available to perform the thread-local computations.

In YCSB, the critical code sections for each task are much larger; hence, the importance of accelerating synchronizations grows as thread contention increases with the theta values. The reported YCSB result is an average over theta values from 0.6 to 0.9. SGD also improves with the MC2D model, where evaluated outputs that require atomic writes on the minimization function are pinned to service cores.

The unordered benchmarks with thread-level ordering generally perform a significant amount of thread-parallel work. These benchmarks use barrier synchronization as all threads propagate their shared values from one layer to the next layer of computation. Therefore, the synchronization costs for these workloads are low, as depicted by the completion time breakdown, and hence these benchmarks show little benefit from accelerating synchronization. However, the two machine learning benchmarks, GTRSB and SQNET, show performance improvement. The GTRSB benchmark performs much less work between barrier synchronizations as compared to SQNET. Therefore, it shows more gains from accelerating barrier synchronization using the MC2D model as compared to the atomic model.

On average, the MC2D model outperforms the atomic model by 33%. When the unordered benchmarks with thread-level ordering are discounted, this average decrease in completion time increases to 48%.

CONCLUSION
This article evaluates the applicability of the MC2D model to accelerate synchronizations in task-parallel algorithms. The evaluation shows that improving shared data locality enables the MC2D model to deliver an average of 33% performance gain over the atomic model. These benefits directly correlate with the number and frequency of synchronizations observed in ordered algorithms, as well as in unordered algorithms with task-level data dependencies. This work also shows that the MC2D model unlocks parallelism for highly parallel algorithms from the unordered category, and delivers on-par performance with the atomic model.

ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation under Grant CNS-1718481. This research was also partially supported by the Semiconductor Research Corporation (SRC) and NXP Semiconductors. The authors wish to thank C. Hughes of Intel and B. Kahne of NXP for their continued support and feedback.

REFERENCES
1. H. Dogan, F. Hijaz, M. Ahmad, B. Kahne, P. Wilson, and O. Khan, "Accelerating graph and machine learning workloads using a shared memory multicore architecture with auxiliary support for in-hardware explicit messaging," in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2017, pp. 254–264.
2. H. Dogan, M. Ahmad, J. Joao, and O. Khan, "Accelerating synchronization in graph analytics using moving compute to data model on Tilera TILE-Gx72," in Proc. IEEE 36th Int. Conf. Comput. Design, 2018, pp. 496–505.
3. H. Dogan, M. Ahmad, B. Kahne, and O. Khan, "Accelerating synchronization using moving compute to data model at 1,000-core multicore scale," ACM Trans. Archit. Code Optim., vol. 16, no. 1, Feb. 2019, Art. no. 4.
4. M. A. Hassaan, D. D. Nguyen, and K. K. Pingali, "Kinetic dependence graphs," in Proc. 20th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2015, pp. 457–471.
5. D. Nguyen, A. Lenharth, and K. Pingali, "A lightweight infrastructure for graph analytics," in Proc. 24th ACM Symp. Oper. Syst. Principles, 2013, pp. 456–471.
6. J.-P. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, "Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications," in Proc. USENIX Annu. Tech. Conf., 2012, pp. 6–6.
7. L. Dhulipala, G. Blelloch, and J. Shun, "Julienne: A framework for parallel graph algorithms using work-efficient bucketing," in Proc. 29th ACM Symp. Parallelism Algorithms Archit., 2017, pp. 293–304.
8. M. Ahmad, F. Hijaz, Q. Shi, and O. Khan, "CRONO: A benchmark suite for multithreaded graph algorithms executing on futuristic multicores," in Proc. IEEE Int. Symp. Workload Characterization, 2015, pp. 44–55.
9. X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker, "Staring into the abyss: An evaluation of concurrency control with one thousand cores," Proc. VLDB Endowment, vol. 8, pp. 209–220, Nov. 2014.
10. M. A. Zinkevich, M. Weimer, A. Smola, and L. Li, "Parallelized stochastic gradient descent," in Proc. 23rd Int. Conf. Neural Inf. Process. Syst., vol. 2, 2010, pp. 2595–2603.
11. F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1 MB model size," 2016, arXiv:1602.07360.
12. P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in Proc. Int. Joint Conf. Neural Netw., 2011, pp. 2809–2813.

Masab Ahmad is currently a Senior Silicon Design Engineer with AMD Research. He received the Ph.D. degree in computer engineering from the University of Connecticut, Storrs, CT, USA. His research interests include parallel computing, computer architecture, and workload characterization. He is a member of IEEE. Contact him at masab.ahmad@uconn.edu.

Halit Dogan is currently a Software Architect with Intel, Santa Clara, CA, USA. He received the Ph.D. degree in computer engineering from the University of Connecticut, Storrs, CT, USA. His research interests include improving intra- and inter-node communication in high-performance computing. Contact him at halitdoganeem@gmail.com.

José A. Joao is currently a Staff Research Engineer with the Architecture Group, Arm Research, Austin, TX, USA. He received the Ph.D. and M.S. degrees in computer engineering from the University of Texas at Austin, Austin, TX, USA, where he was supervised by Professor Yale Patt. He also holds an Electronics Engineering degree from Universidad Nacional de la Patagonia San Juan Bosco, Argentina, where he was an Assistant Professor. His current research interests are high-performance, energy-efficient, scalable system architectures for HPC and server workloads. He is a member of the IEEE and the ACM. Contact him at jjoao@ieee.org.

Omer Khan is currently an Associate Professor with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT, USA. He received the Ph.D. degree in electrical and computer engineering from the University of Massachusetts Amherst, Amherst, MA, USA. Prior to joining UConn, he was a Postdoctoral Research Scientist with the Massachusetts Institute of Technology, Cambridge, MA, USA. His research interests include developing cross-layer methods to improve the performance, scalability, and security of multicore processor architectures. He is a member of IEEE and the ACM. Contact him at khan@uconn.edu.

Column: Micro Economics

The Vital Two Percent


Shane Greenstein
Harvard Business School

Digital Object Identifier 10.1109/MM.2019.2958726
Date of current version 14 January 2020.

EVERY ADVANCED ECONOMY struggles with paying to stretch the knowledge frontier. It is no surprise why. Research and development (R&D) costs money today and yields the benefit tomorrow, sometimes decades later. The benefits also disperse widely, making it difficult to trace a direct chain between expenditure and result, or to calculate any direct return on investment. Despite the challenges, all modern economies devote some fraction of GDP to stretching the frontier with R&D. In the United States, that fraction has hovered around 2%–2.5% of GDP for a few decades, with the Federal government covering about one fifth (more than $115 billion in 2017).

Is this too much or too little? Let's consider the economic debate.

HEADWINDS
As an initial intuitive hypothesis, most economists think market economies generate too little R&D. That is because firms face brutal math when planning R&D. Borrowing $100 at 5% requires an additional $5 flow in perpetuity to pay back the loan. In light of the risk and the challenges of capturing revenue from R&D, most firms want more than $5, and they want it sooner, not in perpetuity. Hence, they do not initiate an R&D project without something like an expected 15% return on investment.

In short, even if society might benefit over time from innovation, and even if the risks are mild, no firm will authorize investing in R&D without that measurable and significant justification. That intuition faces some important exceptions, however, and those are worth the trouble. Let's unpack it. Economists believe three headwinds reduce research and development: selfish accounting, disorganized accumulation, and misallocation of risk.

Selfish Accounting
Why would stockholders authorize a CFO to pay for research whose benefits accrue to anybody else? Others in society may benefit, but that is not the CFO's responsibility. These simple tendencies apply widely, and they suggest society would benefit from private R&D in many situations in which private firms hesitate to pay for it.

That said, some countervailing forces push in the opposite direction. CFOs alone do not determine the precise amount and direction of R&D. For example, retention of talented employees can be higher in organizations that do more R&D, especially if this makes the workplace more interesting for employees. That alone can justify more expense.

Disorganized Accumulation
Rarely is an invention born at a single moment in time, and rarely does a new technology emerge in a refined form from a lone inventor.

The much more common path involves the accumulation of many insights, assembled and disassembled over time, resulting in a new configuration of technology as views change about where value lies and what feasibly scales. All that evolving and accumulating takes time and effort, and it usually draws on the insights of many people.

Doing all that accumulated activity within one firm's boundaries raises one set of challenges, and doing it across firm boundaries raises another. In either case, firms must manage accumulated technologies over long time horizons. Many factors stand in the way of efficient management, from human error to bureaucratic bungling.

There are tendencies that lead to accumulation nonetheless. Many appear to be noneconomic in character. Unbridled ambition makes up for some mismanagement, as egoistic entrepreneurs take on crazy amounts of risk or propose projects with impossible scope. Looking at the last few decades, for example, there seems to be no shortage of entrepreneurs willing to attempt a big product, such as Tesla, to build a major software platform, such as Facebook, or to develop an industry-wide standard, such as Wi-Fi. In short, though accumulation may be difficult, some do try.

Risk
Plenty of evidence suggests entrepreneurs in poor developing countries do not take on enough risk, and it is entirely understandable why. The prospects of starvation and physical discomfort direct resources toward present consumption and prevent saving for the future. Some similar incentives discourage R&D among the less privileged citizens of rich societies and/or small firms, who lack collateral with which to borrow funds.

What about medium to large firms? They take on too little risk when risk overlaps with the two just-mentioned headwinds. Firms will hesitate to take on risk if others benefit too much, or if the costs of the risks hit management directly while the gains elude visible benefits, or if managers face a high risk of losing control over the accumulation of the technology.

Once again, there are countervailing tendencies. After all, entrepreneurs take on many technical risks, and VCs sponsor them, diversifying those risks across many projects. Large electronics and software companies also maintain risky and regular research programs, diversifying those across potential markets. Far from perfect, to be sure, but these tendencies deter the worst-case scenario.

NEGLECT AND GOVERNMENT
What is the bottom line? If there is too little R&D, it happens in a few specific areas of modern economies.

• Risky science with wide consequence. Market-oriented firms typically stay away from risky and broad-based science that benefits many over the long term. There are exceptions, but they are rare, as when a firm finds a way to gain a return on a big innovation. Big pharmaceutical companies do so with patents.
• Risky coordinating technologies with public benefits. Profit-oriented firms typically stay away from helping society develop a standard or focal technology for accumulating inventions from broad sources. The exceptions, again, are rare. These arise if a firm can find a direct way to profit from such activity. But if one firm does manage to do so (as, for example, Qualcomm did in CDMA), it faces enormous resistance from the rest of the market.
• Risky technologies with defensive purposes. Private firms do not typically do a great job of managing risks affiliated with military weapons or safety, whose benefits are diffused and widespread. Again, it is no accident that federal agencies spend considerable time and effort on topics like spectrum allocation for emergency response, air transportation, or weapon development. Sometimes that works against the tendencies of private firms.

The foregoing discussion of headwinds has laid out the most common explanation for why governments in modern economies intervene with federal subsidies. It also explains the goal of much policy, namely, to widen the set of circumstances where private firms might see a payoff from their R&D efforts.

What do the data say? The U.S. government is particularly active in supporting science-based research with long payoffs, especially at the National Science Foundation (approximately $8 billion a year) and the National Institutes of Health (NIH) ($31 billion). Is that a lot or a little? Consider this: the largest private R&D budgets for U.S. companies in 2017 were at Amazon, at approximately $22 billion, Alphabet/Google at $16 billion, Microsoft at $14 billion, and Intel at $13 billion. VC funding in the United States reached approximately $130 billion, and the accountants label only a small fraction of it as private R&D. In other words, private R&D exceeds government R&D by a lot.

How about government efforts for defensive purposes? Namely, do governments stretch the frontier for their own purposes? Yes, government does this, because private industry does not otherwise take up the effort, especially in the military and NASA. We benefit from the internet today due to such government efforts four decades ago. A similar spillover shapes autonomous vehicle development today.

It still happens now. For example, the U.S. military has used drones extensively in the most recent conflicts in Afghanistan, Iraq, and Syria. Technically complex weapons and operations, their use has led to, and will continue to support, many new technical improvements affiliated with communications at great distances, sensors and perception for real-time mobile robots, planning and anticipation for software-enabled systems, and control of airborne vehicles.

In other words, one might think of the government military as a leading technical user, taking on the risk as the first user to pay the costs of pushing out the frontier. The technologies develop for a mission, with a specific use case in mind. Secondary users (in the prior example, private users of drones) come along later and take advantage of the improvements developed under government budgets, tailoring the technology to their own needs.

It would be great if the secondary use shaped the first use. Does it? It depends on whether military planners consider the overlapping private problems as part of the government efforts. Unfortunately, they typically do not, for the simple reason that government decision makers are as shortsighted as private decision makers. Government accounting rarely helps, turning government decision makers into selfish CFOs. In other words, government agencies develop frontier technology in just as myopic a way as private organizations. Developed for their own purposes, government users do not take on undue risk, look beyond their own needs, or take on R&D if they cannot measure the gains.

THE VITAL TWO PERCENT
A few years ago, a doctor told a friend of mine that he had stage four melanoma. His back against the wall, my friend gained access to one of the frontier cures that puts the human immune system on overdrive. It worked. He went from a death sentence into remission.

Taking the long view, how did that happen? Simply stated, the NIH has devoted billions of dollars to research on cancer for decades. More recently, a few pharmaceutical companies harnessed that frontier knowledge into a specific immunotherapy. My friend benefited.

That experience frames the hard question: If somebody you know will get cancer three decades from now, do you want to spend money today on finding a cure? Or do you prefer to give it back to everyone as lower taxes, or spend the dollars on some other urgent need today?

Shane Greenstein is a professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.