Distributed Graph Neural Network Training: A Survey
CCS Concepts: · General and reference → Surveys and overviews; · Computing methodologies → Distributed
computing methodologies; Machine learning; Neural networks; · Mathematics of computing → Graph algorithms.
Additional Key Words and Phrases: Surveys and overviews, Distributed GNN training, Graph data management, Communication optimization, Distributed GNN systems
Authors’ addresses: Yingxia Shao, shaoyx@bupt.edu.cn, Beijing University of Posts and Telecommunications, Beijing, China; Hongzheng
Li, Ethan_Lee@bupt.edu.cn, Beijing University of Posts and Telecommunications, Beijing, China; Xizhi Gu, guxizhi@bupt.edu.cn, Beijing
University of Posts and Telecommunications, Beijing, China; Hongbo Yin, yinhbo@bupt.edu.cn, Beijing University of Posts and Telecommu-
nications, Beijing, China; Yawen Li, lywbupt@126.com, Beijing University of Posts and Telecommunications, Beijing, China; Xupeng Miao,
xupeng@cmu.edu, Carnegie Mellon University, Pittsburgh, USA; Wentao Zhang, wentao.zhang@mila.quebec, Mila - Québec AI Institute,
HEC Montréal, Montreal, Canada; Bin Cui, bin.cui@pku.edu.cn, Peking University, Beijing, China; Lei Chen, leichen@cse.ust.hk, The Hong
Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 0360-0300/2024/2-ART
https://doi.org/10.1145/3648358
1 INTRODUCTION
GNNs are powerful tools for handling problems modeled by graphs and have been widely adopted in various applications, including social networks (e.g., social spammer detection [102, 132], social network analysis [107]), bio-informatics (e.g., protein interface prediction [40], disease-gene association [97]), drug discovery [12, 79], traffic forecasting [67], health care [3, 25], recommendation [37, 55, 61, 131, 135], natural language processing [156, 168], and others [31, 99, 155, 163, 166]. By integrating the information of the graph structure into deep learning models, GNNs can achieve significantly better results than traditional machine learning and data mining methods.
A GNN model generally contains multiple graph convolutional layers, where each vertex aggregates the latest states of its neighbors, updates its own state, and applies a neural network to the updated state. Taking the traditional graph convolutional network (GCN) as an example, in each layer a vertex uses a sum function to aggregate the neighbor states and its own state, then applies a single-layer MLP to transform the new state. Such procedures are repeated $L$ times if the number of layers is $L$. The vertex states generated in the $L$-th layer are used by downstream tasks, such as node classification and link prediction. In the past years, many research works have made remarkable progress in the design of graph neural network models. Prominent models include GCN [128], GraphSAGE [52], GAT [112], GIN [138], and many other application-specific GNN models [150, 152, 167]. To date, there are tens of surveys reviewing GNN models [134, 136, 157, 170]. On the other hand, to efficiently develop different GNN models, many GNN-oriented frameworks have been proposed based on various deep learning libraries [9, 17, 39, 50, 81, 123]. Many new optimizations have been proposed to speed up GNN training, including GNN computation kernels [43, 58, 59, 95, 109, 151], efficient programming models [57, 133, 137], and full utilization of new hardware [23, 48, 146, 171]. However, these frameworks and optimizations mainly focus on training GNNs on a single machine and pay little attention to the scalability of input graphs.
Nowadays, large-scale graph neural networks [45, 68, 83] have become a hot topic because of the prevalence of massive graph data. It is common to have graphs with billions of vertices and trillions of edges, like the social networks in Sina Weibo, WeChat, Twitter, and Meta. However, most existing GNN models are only tested on small graph datasets, and it is impossible or inefficient for them to process large graph datasets [56], because GNN models are complex and require massive computation resources when handling large graphs. One line of work achieves large-scale GNNs by designing scalable GNN models, using simplification [41, 53, 130], quantization [5, 38, 60, 84, 106, 119, 120, 126, 161], sampling [24, 145, 147], and distillation [30, 141, 153] to design efficient models. Another line of work adopts distributed computing for GNN training, a.k.a. distributed GNN training. When handling large graphs, the limited memory and computing resources of a single device (e.g., a GPU) become the bottleneck of large-scale GNN training, and distributed computing provides more computing resources (e.g., multi-GPUs, CPU clusters, etc.) to improve training efficiency. However, previous systems target distributed graph processing [85, 87] and distributed deep learning [2, 78] separately: the graph processing systems do not consider the acceleration of neural network operations, while the deep learning systems lack the ability to process graph data. Therefore, many efforts have been made in designing efficient distributed GNN training frameworks and systems [65, 116, 117, 164, 173].
In this survey, we focus on the works and specific techniques proposed for distributed GNN training. Distributed GNN training divides the whole workload of model training among a set of workers, and all the workers process the workload in parallel. However, due to the data dependency in GNNs, it is non-trivial to apply existing distributed machine learning methods [113, 122] to GNNs, and many new techniques for optimizing the distributed GNN training pipeline have been proposed. Although there are many surveys [134, 157, 170] about GNNs, to the best of our knowledge, little effort has been made to systematically review the techniques for distributed GNN training. Recently, Besta et al. [10] only reviewed the parallel computing paradigms of GNNs, Abadal [1] surveyed GNN computing from algorithms to hardware accelerators, and Vatter et al. [111] provided a comprehensive overview of the evolution of distributed systems for scalable GNN training.
To clearly organize the techniques for distributed GNN training, we introduce a general distributed GNN training pipeline which consists of three stages: data partition, batch generation, and GNN model training. These stages involve GNN-specific execution logic that includes graph processing and graph aggregation. In the context of this general distributed GNN training pipeline, we discuss three main challenges of distributed GNN training which are caused by the data dependency in graph data and require new techniques specifically designed for distributed GNN training. To help readers better understand the various optimization techniques that address the above challenges, we introduce a new taxonomy that classifies the techniques into four orthogonal categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. This taxonomy not only covers the optimization techniques used in both mini-batch distributed GNN training and full-graph distributed GNN training, but also covers the techniques from graph processing to model execution. We carefully review the existing techniques in each category, followed by describing 28 representative distributed GNN systems and frameworks from industry and academia. Finally, we briefly discuss future directions of distributed GNN training.
The contributions of this survey are as follows:
• This is the first survey focusing on the optimization techniques for efficient distributed GNN training, and it helps researchers quickly understand the landscape of distributed GNN training.
• We introduce a new taxonomy of distributed GNN training techniques by considering the life-cycle of end-to-end distributed GNN training. At a high level, the new taxonomy consists of four orthogonal categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol.
• We provide a detailed and comprehensive technical summary of each category in our new taxonomy.
• We review 28 representative distributed GNN training systems and frameworks from industry and academia.
• We discuss future directions of distributed GNN training.
Survey organization. In Section 2, we introduce the background of GNNs and discuss the differences between GNN training and traditional neural network training. In Section 3, we present the distributed GNN training pipeline, highlight the specific challenges faced by distributed GNN training, and introduce our taxonomy of the techniques used in distributed GNN training. On the basis of the taxonomy, we discuss the techniques of distributed GNN training in detail, including data partition (Section 4), batch generation (Section 5), execution model (Section 6), and communication protocol (Section 7). In Section 8, we discuss various existing distributed GNN training systems. In the end, we unveil some promising future directions for distributed GNN training and conclude.
among partitions, and the number of replications of a vertex $v$ is called the replication factor. When the replication factor of a vertex is larger than 1, the vertex is called a boundary vertex, and the other vertices in $G_i$ are inner vertices.
Graph Neural Networks (GNNs). Given a graph $G$ with adjacency matrix $A$ and feature matrix $X$, where each row is the initial feature vector $x_v$ of a vertex $v$ in the graph $G$, the $\ell$-th layer of a GNN updates the vertex features by aggregating the features from the corresponding neighborhoods, and the process can be formalized in matrix view as below
$$H^{\ell} = \sigma(\tilde{A} H^{\ell-1} W^{\ell-1}), \qquad (1)$$
where $H^{\ell}$ is the hidden embedding matrix with $H^0 = X$, $W^{\ell-1}$ is the model weights, $\tilde{A}$ is a normalized $A$, and $\sigma$ is a non-linear function, e.g., ReLU, Sigmoid, etc. Eq. 1 is the global view of GNN computation. The local view of GNN computation is the computation of a single vertex. Given a vertex $v$, the local computation of the $\ell$-th layer in the message passing schema [47] can be formalized as below
$$h_v^{\ell} = \sigma\big(W^{\ell-1} \cdot \mathrm{AGG}(\{h_u^{\ell-1} \mid u \in N(v) \cup \{v\}\})\big),$$
where $h_v^{\ell}$ is the hidden embedding of vertex $v$ in layer $\ell$, $N(v)$ is the neighbor set of $v$, and AGG is an aggregation function (e.g., sum or mean).
Fig. 1. An example of GNN execution. The difference between GNN and DNN is highlighted in blue color.
A backward propagation is then performed to compute the gradient of each parameter. Similar to the forward computation, the gradients on each vertex are sent along the edges to its neighbors during backward propagation, incurring an additional scatter process compared to DNN training. The aggregation and scatter operations among data samples lead to a much more complex computation process compared to DNN training.
Distributed DNN Training. Distributed DNN training [71, 73] is a solution to large-scale DNNs. Parallelism and synchronization are two key components. 1) Parallelism. Data parallelism is a prevalent training paradigm in distributed DNN training. In data parallelism, the model is replicated across multiple devices or machines, with each replica processing a different subset of the training data. Computed gradients are exchanged and averaged to synchronize the model parameters, often utilizing communication operations like all-reduce [21]. To handle large models that exceed the capacity of a single GPU, model parallelism is adopted, where each device processes a distinct part of the model during forward and backward propagation. 2) Synchronization. Distributed training can be categorized into synchronous and asynchronous training. In synchronous training, all workers complete a forward and backward pass before model updates occur, which ensures that all workers use the same model parameters for computation. In asynchronous training, workers update the model parameters independently, so as to remove the global synchronization point between mini-batches. Based on the idea of asynchronous training, pipeline parallelism has been proposed [63, 92] to perform flexible mini-batch training and improve resource utilization, thereby reducing the overall training time.
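To make the gradient synchronization of data parallelism concrete, the sketch below averages gradients across replicas with an all-reduce after each local backward pass. It assumes PyTorch with an initialized torch.distributed process group; the function names (`average_gradients`, `train_step`) are illustrative and not taken from any particular system.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module):
    """Synchronous data parallelism: average gradients across all workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum the gradients from all replicas, then divide to obtain the average.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size

def train_step(model, loss_fn, optimizer, batch_x, batch_y):
    optimizer.zero_grad()
    loss = loss_fn(model(batch_x), batch_y)
    loss.backward()              # local backward pass on this replica's mini-batch
    average_gradients(model)     # synchronization point: all-reduce over gradients
    optimizer.step()             # every replica applies the same averaged update
    return loss.item()
```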
Because of the large models in DNNs, most efforts are devoted to managing the storage and synchronization of model parameters in distributed DNN training. Since GNN models typically have shallow network structures, the management of model parameters is trivial. However, the input graphs of GNNs can be extremely large. The specific data dependency introduced by the graph data structure significantly impacts the computation process of distributed GNN training, resulting in a substantial communication overhead that differs from DNN training, which becomes the main consideration and introduces new challenges (see Section 3.2).
Fig. 2. The abstraction of the distributed GNN training pipeline. Key stages are noted with the section number where they will be further discussed. Stages marked with a star ⋆ are optional for different training methods: batch generation is only involved in mini-batch training, and the communication protocol, which is used to communicate hidden embeddings and gradients, is involved in full-graph training.
considered. Finally, for the parameter update stage, the existing techniques in classical distributed machine learning can be directly applied to distributed GNN training. In conclusion, the distributed GNN model training stage is more complicated than traditional DNN training and needs careful design for both the execution model and the communication protocol.
optimize the same stage in the distributed GNN training pipeline together, and help readers fully understand the existing solutions for the different stages in distributed GNN training.
According to previous empirical studies, due to the data dependency, the bottleneck of distributed GNN training generally arises in the stages of data partition, batch generation, and GNN model training, as shown in the pipeline. Furthermore, various training strategies (e.g., mini-batch training, full-graph training) bring in different workload patterns and result in different optimization techniques used in the batch generation stage and the model training stage. For example, the computation graph generation in the batch generation stage is important for mini-batch training, while the communication protocol is important for full-graph training. In consequence, our new taxonomy classifies the techniques specifically designed for distributed GNN training into four categories (i.e., GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol), as shown in Figure 4. In the following, we give an overview of each category.
GNN data partition. In this category, we review the data partition techniques for distributed GNN training. The goal of data partition is to balance the workload and minimize the communication cost of a GNN workload. Since GNN training is an instance of distributed graph computing, many traditional graph partition methods can be directly used. However, they are not optimal for distributed GNN training because of the new characteristics of GNN workloads. Researchers have devoted much effort to designing GNN-friendly cost models which guide traditional graph partition methods. In addition, the graph and the features are two typical types of data in GNNs, and both of them are partitioned. Some works decouple the features from the graph structure and partition them independently. In Section 4, we elaborate on the existing GNN data partition techniques.
GNN batch generation. In this category, we review the techniques of GNN batch generation for mini-batch distributed GNN training. The method of mini-batch generation affects both the training efficiency and the model accuracy. Graph sampling is a popular approach to generate a mini-batch for large-scale GNN training. However, standard graph sampling techniques do not consider the factors of distributed environments, and each sampler on a worker will frequently access data from other workers, incurring massive communication. Recently, several new GNN batch generation methods optimized for distributed settings have been introduced. We further classify them into distributed sampling mini-batch generation and partition-based mini-batch generation. In addition, caching has been extensively studied to reduce communication during GNN batch generation. In Section 5, we elaborate on the existing GNN batch generation techniques.
GNN execution model. In this category, we review the execution models for both mini-batch and full-graph GNN training. In mini-batch GNN training, sampling and feature extraction are two main operations that dominate the total training time. To improve efficiency, different execution models are proposed to schedule
the training stage of a mini-batch with the sampling and feature extraction stages, in order to fully utilize computing resources. In full-graph GNN training, the neighbors' states for each vertex are aggregated in the forward computation and the gradients are scattered back to the neighbors in the backward computation, leading to massive communication of remote vertices. Due to the data dependency and irregular computation pattern, traditional machine learning parallel models (e.g., data parallel, model parallel, etc.) are not optimal for graph aggregation, especially when feature vectors are high-dimensional. We introduce the full-graph execution model from both the matrix view and the graph view. From the matrix view, we classify the execution of distributed SpMM into computation-only, communication-computation, and communication-computation-reduction execution models.
[Fig. 5: overview of GNN data partition techniques, covering the optimization objective, graph partition, and feature partition (row-wise, column-wise, and 2D).]
In this section, we review the existing techniques of GNN data partition in distributed GNN training. Figure 5 gives an overview of the techniques. Considering that graphs and features are two typical types of data in GNNs, we classify partition methods into graph partition and feature partition. The optimization objectives are workload balance and communication and computation minimization, which aim at addressing challenges #1 and #3. In addition, the cost model is another critical component that captures the characteristics of GNN workloads. With a well-designed cost model, the workload on each partition can be accurately estimated, and a graph partition strategy can better address the challenges in GNNs by adopting a more accurate cost model. In the following, we first present various cost models, which are the basis of graph partition. Then we discuss graph partition and feature partition, respectively.
In traditional graph processing, the number of vertices is used to estimate the computation cost and the number of cross-edges to estimate the communication cost [70]. However, this simple method is not appropriate for GNN tasks, because their cost is influenced not only by the number of vertices and cross-edges, but also by the dimension of features, the number of layers, and the distribution of training vertices. Researchers have proposed several GNN-specific cost models, including the heuristics model, the learning-based model, and the operator-based model.
Heuristics model selects several graph metrics to estimate the cost via simple user-defined functions. Several heuristics models for GNN workloads have been introduced in the context of streaming graph partition [104, 139], which assigns vertices or blocks to partitions one by one. Such models define an affinity score for each vertex or block, and the score helps the vertex or block select a proper partition.
Assume a GNN task is denoted by $GNN(L, G, V_{train}, V_{valid}, V_{test})$, where $L$ is the number of layers, $G$ is the graph, and $V_{train}$, $V_{valid}$, and $V_{test}$ are the vertex sets for training, validation, and test. The graph $G$ is partitioned into $k$ subgraphs. In the context of streaming graph partition, for each assignment of a vertex or block, let $P_i$ ($1 \le i \le k$) be the set of vertices that have already been assigned to partition $i$, and $V_{train}^i$, $V_{valid}^i$, and $V_{test}^i$ be the corresponding vertex sets belonging to partition $P_i$.
Lin et al. [80] define an affinity score vector with $k$ dimensions for each training vertex $v_t \in V_{train}$, in which each score represents the affinity of the vertex to a partition. Let $N_L(v_t)$ be the $L$-hop in-neighbor set of the training vertex $v_t$; the score of $v_t$ with respect to partition $P_i$ is defined as below,
$$score_{t,i} = |V_{train}^i \cap N_L(v_t)| \cdot \frac{TV_{avg} - |V_{train}^i|}{|P_i|}, \qquad (3)$$
where $TV_{avg} = \frac{|V_{train}|}{k}$. This score function implicitly balances the number of training vertices among partitions.
Liu et al. [82] define a similar affinity score with respect to a block (or subgraph) $B$, and the formal definition is
$$|V_{train}^i \cap N_L(B)| \cdot \Big(1 - \frac{|P_i|}{cap}\Big) \cdot \Big(1 - \frac{|V_{train}^i|}{cap_{train}}\Big), \qquad (4)$$
where $cap = \frac{|V|}{k}$ and $cap_{train}$ is the corresponding capacity for training vertices. Zheng et al. [162] define the affinity score of a block $B$ by considering all the training, validation, and test vertices; the formula is
$$\frac{crossEdges(P_i, B)}{|P_i|} \cdot \Big(1 - \alpha \frac{|V_{train}^i|}{|V_{train}|} - \beta \frac{|V_{valid}^i|}{|V_{valid}|} - \gamma \frac{|V_{test}^i|}{|V_{test}|}\Big), \qquad (5)$$
where $crossEdges(P_i, B)$ is the number of cross-edges between $B$ and $P_i$, and $\alpha$, $\beta$, $\gamma$ are hyper-parameters manually set by users.
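As a minimal illustration of streaming partitioning with a heuristic affinity score, the following sketch assigns training vertices one by one to the partition with the highest score in the spirit of Eq. (3). The helper names (`assign_train_vertices`, `l_hop_in_neighbors`) and the exact balance term are illustrative assumptions, not code from the cited systems.

```python
def assign_train_vertices(train_vertices, k, l_hop_in_neighbors, avg_train):
    """Streaming assignment: each training vertex goes to the partition with the
    highest affinity score (overlap with its L-hop in-neighborhood, weighted by
    a balance term, roughly following Eq. (3))."""
    partitions = [set() for _ in range(k)]     # P_i: all vertices assigned so far
    train_parts = [set() for _ in range(k)]    # V_train^i: training vertices per partition

    for v in train_vertices:
        neigh = set(l_hop_in_neighbors(v))     # L-hop in-neighbor set of v
        scores = []
        for i in range(k):
            overlap = len(train_parts[i] & neigh)
            # "+ 1" avoids division by zero for an empty partition.
            balance = (avg_train - len(train_parts[i])) / (len(partitions[i]) + 1)
            scores.append(overlap * balance)
        best = max(range(k), key=lambda i: scores[i])
        train_parts[best].add(v)
        partitions[best].update(neigh | {v})   # replicate the L-hop neighborhood
    return train_parts, partitions
```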
Learning-based model takes advantage of machine learning techniques to model the complex cost of GNN workloads. The basic idea is to manually extract features via feature engineering and apply classical machine learning to train the cost model. The learning-based model is able to estimate the cost using not only the static graph structural information but also runtime statistics of GNN workloads, thus achieving a more accurate estimation than the heuristic models. Jia et al. [65] introduce a linear regression model for GNN computation cost estimation. The model estimates the computation cost of a single GNN layer $l$ with regard to any input graph $G = (V, E)$. For each vertex in the graph, they select five features (listed in Table 1), including three graph-structural features and two runtime features. The estimation model is formalized as below,
$$t(v, l) = \sum_i w_i(l)\, x_i(v), \qquad (6)$$
$$T(G, l) = \sum_{v \in V} t(v, l) = \sum_{v \in V} \sum_i w_i(l)\, x_i(v) = \sum_i w_i(l) \sum_{v \in V} x_i(v) = \sum_i w_i(l)\, X_i(G), \qquad (7)$$
where $w_i(l)$ is a trainable parameter for layer $l$, $x_i(v)$ is the $i$-th feature of $v$, and $X_i(G)$ sums up the $i$-th feature of all vertices in $G$.
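A linear cost model of this form can be calibrated with ordinary least squares once per-vertex features and measured layer runtimes are collected. The numpy sketch below is a minimal, hypothetical fitting routine; the variable names and data layout are assumptions, not the implementation of [65].

```python
import numpy as np

def fit_layer_cost_model(graph_feature_sums, measured_times):
    """Fit w(l) in T(G, l) = sum_i w_i(l) * X_i(G) by least squares.

    graph_feature_sums: array of shape (num_graphs, num_features); each row is
                        X(G), the per-vertex feature vectors summed over G.
    measured_times:     array of shape (num_graphs,), measured runtimes of layer l.
    """
    X = np.asarray(graph_feature_sums, dtype=float)
    y = np.asarray(measured_times, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # trainable weights w_i(l)
    return w

def predict_layer_cost(w, vertex_features):
    """Estimate T(G, l) for a new (sub)graph given its per-vertex features."""
    X_G = np.asarray(vertex_features, dtype=float).sum(axis=0)   # X_i(G)
    return float(w @ X_G)
```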
Wang et al. [121] use a polynomial function to estimate the computation cost of a vertex over a set of manually selected features. The formal definition is
$$c = \sum_{i=0}^{t} n_i d_i, \qquad (8)$$
where $t$ is the number of neighbor types (e.g., a metapath [121]) which is defined by the GNN model, $n_i$ is the number of neighbors of the $i$-th type, and $d_i$ is the total feature dimension of the $i$-th type of neighbor instance (i.e., if a neighbor instance of the $i$-th type has $m$ vertices and each vertex has feature dimension $f$, then $d_i = m \times f$). Refer to the original work [121] for detailed examples of the function. The total computation cost of a subgraph is the sum of the estimated costs of the vertices in the subgraph.
Operator-based model enumerates the operators in a GNN workload and estimates the total computation cost by summing the cost of each operator. Zhao et al. [158] divide the computation of a GNN workload into forward computation and backward computation. The forward computation of a GNN layer is divided into aggregation, linear transformation, and activation function, while the backward computation of a GNN layer is divided into gradient computation towards the loss function, embedding gradient computation, and gradient multiplications. As a result, the costs of computing the embedding $h_v^l$ of vertex $v$ in layer $l$ in forward and backward propagation are estimated by $c_f(v, l)$ and $c_b(v, l)$, respectively.
$$c_f(v, l) = \alpha |N_v| d_{l-1} + \beta d_l d_{l-1} + \gamma d_l, \qquad (9)$$
$$c_b(v, l) = \begin{cases} (\gamma + \delta) d_l + (2\beta + \delta) d_l d_{l-1}, & l = L \\ \alpha |N_v| d_l + (\beta + \delta) d_l d_{l-1} + \gamma d_l, & 0 < l < L \end{cases} \qquad (10)$$
where $d_l$ is the dimension of hidden embeddings in the $l$-th GNN layer, $|N_v|$ is the number of neighbors of vertex $v$, and $\alpha$, $\beta$, $\gamma$, and $\delta$ are constant factors which can be learned by measuring the running time in practice. Finally, the computation cost of a mini-batch $B$ is computed by summing up the computation costs of all involved vertices over the $L$ layers as below,
$$c(B) = \sum_{l=1}^{L} \sum_{v \in \bigcup_{u \in B} N_u^l} \big(c_f(v, l) + c_b(v, l)\big), \qquad (11)$$
where $N_u^l$ represents the vertices in graph $G$ that are $l$ hops away from vertex $u$.
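The reconstructed Eqs. (9)-(11) translate directly into a small estimator. The sketch below assumes calibrated constants (`a`, `b`, `c`, `e` standing in for α, β, γ, δ), a hidden-dimension list `d` indexed from the input layer (so `d[0]` is the input dimension), and a hypothetical `receptive_field` helper; it illustrates the operator-based model rather than reproducing the code of [158].

```python
def forward_cost(num_neighbors, d, l, a, b, c):
    # Eq. (9): aggregation + linear transformation + activation.
    return a * num_neighbors * d[l - 1] + b * d[l] * d[l - 1] + c * d[l]

def backward_cost(num_neighbors, d, l, L, a, b, c, e):
    # Eq. (10): last layer vs. intermediate layers.
    if l == L:
        return (c + e) * d[l] + (2 * b + e) * d[l] * d[l - 1]
    return a * num_neighbors * d[l] + (b + e) * d[l] * d[l - 1] + c * d[l]

def minibatch_cost(batch, receptive_field, degree, d, L, a, b, c, e):
    """Eq. (11): sum per-vertex costs over the union of l-hop neighborhoods.

    receptive_field(u, l) returns the vertices l hops away from u;
    degree(v) returns |N_v| for vertex v."""
    total = 0.0
    for l in range(1, L + 1):
        involved = set()
        for u in batch:
            involved |= set(receptive_field(u, l))
        for v in involved:
            total += forward_cost(degree(v), d, l, a, b, c)
            total += backward_cost(degree(v), d, l, L, a, b, c, e)
    return total
```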
cost. Based on the vertex-cut partition, Hoang et al. [54] leverage the 2D Cartesian vertex cut to improve scalability.
Besides the cache strategy, another approach is to design new distributed sampling techniques that are communication-efficient and maintain the model accuracy. A basic idea of communication-efficient sampling is to prioritize sampling local vertices, yet this introduces bias into the generated mini-batch. Inspired by the linear weighted sampling methods [20, 174], Jiang et al. [66] propose a skewed linear weighted sampling for neighbor selection. Concretely, the skewed sampling scales the sampling weights of local vertices by a factor $\lambda > 1$, and Jiang et al. theoretically prove that the training can achieve the same convergence rate as linear weighted sampling by properly selecting the value of $\lambda$. Cai et al. propose a technique called collective sampling primitive (CSP) [15]. CSP reduces communication costs by pushing the sampling tasks to remote workers instead of pulling the entire neighbor lists to the local worker. This is beneficial because the full neighbor list of a center vertex is often much larger than the sampled results. With CSP, sampling tasks for remote vertices are conducted on remote workers, and only the results are sent back, reducing communication overhead.
In addition to the above GNN-specific distributed sampling techniques, other general distributed graph sampling methods on CPU clusters or GPU clusters can also help, such as C-SAW [93], KnightKing [143], and Skywalker [124].
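The effect of CSP can be illustrated with a toy, single-process simulation in which the sampling request is pushed to the worker that owns a vertex, so only the sampled neighbor IDs would cross the network instead of the full neighbor list. The Worker class, ownership map, and uniform sampling below are mock-ups for illustration, not the API of [15].

```python
import random

class Worker:
    """Mock worker holding the adjacency lists of the vertices it owns."""
    def __init__(self, adj):
        self.adj = adj                       # vertex -> list of neighbors (local partition)

    def sample_neighbors(self, v, fanout):
        """Collective sampling idea: sample on the owner, ship only the result."""
        neighbors = self.adj[v]
        k = min(fanout, len(neighbors))
        return random.sample(neighbors, k)

def sample_frontier(frontier, owner_of, workers, fanout):
    """For each frontier vertex, push the sampling task to its owner worker."""
    sampled = {}
    for v in frontier:
        owner = workers[owner_of[v]]
        sampled[v] = owner.sample_neighbors(v, fanout)   # only `fanout` IDs would be transferred
    return sampled

# Usage: two mock "workers"; vertices 0-1 live on worker 0, vertices 2-3 on worker 1.
workers = [Worker({0: [2, 3], 1: [2]}), Worker({2: [0, 1, 3], 3: [0]})]
owner_of = {0: 0, 1: 0, 2: 1, 3: 1}
print(sample_frontier([0, 2], owner_of, workers, fanout=2))
```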
First, we focus on vertex computation. From the view of graph processing, the execution can be described with the programming model SAGA-NN [86], which is inspired by the traditional GAS model in graph processing. From the view of matrix multiplication, the core of GNN execution can be modeled as SpMM (Sparse Matrix Multiplication). We also make a comparison in Section 6.2.3. Second, we focus on the update mode, which determines whether the vertex embeddings and model parameters used for computation are updated in time or with delay. We categorize the full-graph execution model into synchronous execution models (Section 6.2.4) and asynchronous execution models (Section 6.2.5). Note that the above two perspectives (i.e., vertex computation and update mode) are orthogonal, and a system simultaneously adopts one execution model from each of the two perspectives.
6.2.1 Graph View. From the view of graph processing, we use the most well-known programming model, SAGA-NN [86], for the following discussion. SAGA-NN divides the forward computation of a single GNN layer into four operators: Scatter (SC), ApplyEdge (AE), Gather (GA), and ApplyVertex (AV). SC and GA are two graph operations, in which vertex features are scattered along the edges and gathered to the target vertices, respectively. AE and AV may contain neural network (NN) operations, which operate directly on the edge features or the aggregated features of the target vertices, respectively.
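To make the four operators concrete, the following PyTorch sketch expresses one forward GNN layer as Scatter, ApplyEdge, Gather, and ApplyVertex over a COO edge list, assuming a sum aggregation, an optional dimension-preserving edge MLP, and a single linear ApplyVertex; it illustrates the programming model rather than reproducing the SAGA-NN implementation.

```python
import torch

def saga_layer(h, edge_src, edge_dst, weight, edge_mlp=None):
    """One GNN layer in the SAGA-NN style: Scatter, ApplyEdge, Gather, ApplyVertex.

    h:        [num_vertices, d_in] vertex features
    edge_src, edge_dst: [num_edges] COO edge list (messages flow src -> dst)
    weight:   [d_in, d_out] parameters of ApplyVertex
    """
    # Scatter (SC): send each source vertex's feature along its out-edges.
    msgs = h[edge_src]

    # ApplyEdge (AE): optional edge-wise NN transformation of the messages
    # (assumed to preserve the feature dimension in this sketch).
    if edge_mlp is not None:
        msgs = edge_mlp(msgs)

    # Gather (GA): sum incoming messages at each destination vertex.
    agg = torch.zeros_like(h).index_add_(0, edge_dst, msgs)

    # ApplyVertex (AV): vertex-wise NN on the aggregated features.
    return torch.relu(agg @ weight)
```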
According to the different computation paradigms of the graph operators (i.e., SC and GA), we divide computation graph execution models into one-shot execution and chunk-based execution.
[Fig. 8: one-shot execution vs. chunk-based execution of neighborhood aggregation, where Agg. denotes a commutative and associative aggregation operator and Non-L. a non-linear transformation.]
We call each sub-neighborhood, along with the corresponding vertex features or embeddings, a chunk. The chunks can be processed sequentially (sequential chunk-based execution) or in parallel (parallel chunk-based execution), as illustrated in the lower part of Figure 8.
Under the sequential chunk-based execution, the partial aggregations are conducted sequentially and accumulated into the final aggregation result one after another. NeuGraph [86] uses a 2D partitioning method to generate several edge chunks, so the neighborhood of a vertex is partitioned accordingly into several sub-neighborhoods. It assigns each worker the aggregation job of certain vertices and feeds their edge chunks sequentially to compute the final result. SAR [91] uses an edge-cut partitioning method to create chunks, retrieves the chunks of a vertex from remote workers sequentially, and computes the partial aggregations locally. The sequential chunk-based execution effectively addresses the OOM problem, since each worker only needs to handle the storage and computation of one chunk at a time.
Under the parallel chunk-based execution, the partial aggregations of different chunks are computed in parallel. After all chunks finish partial aggregation, communication is invoked to transfer the results, and the final aggregation result is computed at once. Since the communication volume incurred by transferring the partial aggregation results is much less than transferring the complete chunks, the network communication overhead can be reduced. DeepGalois [54] straightforwardly adopts this execution model. DistGNN [89] orchestrates the parallel chunk-based execution model with the asynchronous execution model to transfer the partial aggregations with staleness. FlexGraph [121] further overlaps the communication of remote partial aggregations with the computation of the local partial aggregation to improve efficiency.
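A minimal sketch of the chunk-based idea: each chunk of edges yields a partial aggregation, and the final result is the accumulation of the partial results. The single-process loop below stands in for the per-worker computation and the transfer of partial results; function names are illustrative.

```python
import torch

def partial_aggregate(h, edge_src, edge_dst, num_vertices):
    """Partial aggregation of one chunk (a subset of edges) for all target vertices."""
    out = torch.zeros(num_vertices, h.size(1), dtype=h.dtype)
    return out.index_add_(0, edge_dst, h[edge_src])

def chunk_based_aggregate(h, edge_chunks, num_vertices):
    """Sequential chunk-based execution: accumulate partial aggregations one by one.

    edge_chunks: list of (edge_src, edge_dst) tensors, e.g. one chunk per worker.
    In a distributed setting, each chunk would be processed on its owning worker
    and only the partial result (num_vertices x d) would be transferred."""
    final = torch.zeros(num_vertices, h.size(1), dtype=h.dtype)
    for edge_src, edge_dst in edge_chunks:
        final += partial_aggregate(h, edge_src, edge_dst, num_vertices)
    return final
```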
6.2.2 Matrix View. As described in Section 2, the matrix formulation of a GNN model is given by $H^{\ell} = \sigma(\tilde{A} H^{\ell-1} W^{\ell-1})$, which involves SpMM since the matrix $A$ is sparse. In order to perform computations for the GNN model, three matrices (i.e., $A$, $H$, and $W$) need to be stored either locally or in a distributed manner. For GNNs on large graphs, it is impractical for a single worker (e.g., a GPU processor) to store all three matrices simultaneously. At least one matrix must be partitioned and distributed across different workers. The execution of distributed SpMM in GNNs can be divided into three stages: communication, computation, and reduction. The computation stage is the core of matrix multiplication, which is performed locally on each worker. Prior to the computation stage, workers may require specific blocks of a matrix from other workers, necessitating a communication stage. Typically, the communication stage is accomplished through a broadcasting mechanism. After the computation stage, a worker may only possess partial results of the final matrix block. In such cases, a reduction stage is necessary to gather the remaining partial results from other workers. Workers retrieve and sum up these partial results to obtain the final matrix.
According to the existence of the above three stages, we categorize distributed SpMM into three execution models: computation-only, communication-computation, and communication-computation-reduction. The existence of the communication and reduction stages is determined by two factors: ➀ the partition strategy: how the matrices are partitioned and stored; ➁ the stationary strategy: which matrix is kept as the stationary matrix (i.e., no communication is incurred to move data in this matrix). The partition strategy determines whether a matrix is replicated on multiple workers or partitioned into blocks and distributed to different workers. The partition strategy also dictates the matrix partitioning mechanism, such as 1D, 2D, etc. The stationary strategy determines the choice of the stationary matrix. During the execution of distributed SpMM, each partitioned block of the stationary matrix is pinned on the corresponding worker, and no communication is required for the stationary matrix. We summarize how these factors impact the execution model in Table 2, which will be introduced in detail in the following.
As discussed in Section 2, the weight parameter matrix $W$ is relatively small in GNN models. As a result, $W$ is fully replicated across all workers, adhering to the basic data parallelism principle. The subsequent discussion primarily focuses on the matrices $A$, $H$, and their product $C = AH$.
Computation-only Execution Model. In this model, both the communication and reduction stages are eliminated by adopting a specific partition strategy. One of $A$ or $H$ is fully replicated on each worker, while the other matrix is partitioned properly among the workers to achieve a communication-free paradigm [72]. For instance, matrix $A$ is replicated across all workers, and matrix $H$ is partitioned into column blocks. Each worker holds a column block, which contains multiple columns of $H$. In such a case, no communication is required prior to local computation. After local computation, each worker holds the corresponding column block of the final matrix $C = AH$, thus no reduction is required either. In this case, only the computation stage is performed in the SpMM operation. However, this execution model lacks strong scalability if both matrices $A$ and $H$ exceed the memory capacity of an individual worker, since at least one of the matrices needs to be fully replicated across all workers.
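For contrast with the communication-bearing models below, the computation-only pattern is easy to simulate: A is replicated, H is split into column blocks, and each worker's local product is already its column block of C, so neither communication nor reduction is needed. The list of blocks is a single-process stand-in for the workers (a sketch, not a distributed implementation).

```python
import numpy as np

def spmm_computation_only(A, H_col_blocks):
    """Communication-free SpMM: A replicated on every worker, H partitioned
    column-wise, so C = A @ H is produced column block by column block
    with no communication or reduction stage."""
    return [A @ H_j for H_j in H_col_blocks]

# Usage: a 3-vertex graph with 4-dimensional features, split over 2 "workers".
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = np.arange(12, dtype=float).reshape(3, 4)
C_blocks = spmm_computation_only(A, [H[:, :2], H[:, 2:]])
assert np.allclose(np.hstack(C_blocks), A @ H)
```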
Communication-computation Execution Model. For the circumstances in which both matrix $A$ and matrix $H$ have to be partitioned and stored on the workers in a distributed manner, the communication-computation execution model is introduced. In this execution model, the workers need to share the matrix partitions they hold with each other, necessitating a communication stage prior to local computation. This communication can be performed either in a broadcast fashion or a point-to-point (P2P) fashion, as detailed in Section 7. Furthermore, since the reduction stage is not executed, the $C$-Stationary strategy is adopted in the communication-computation execution model and no communication is required to obtain the result matrix $C$.
The third column in Table 2 shows that when the $C$-Stationary strategy is adopted, many partition strategies (1D, 1.5D, and 2D) can be applied in the communication-computation execution model. For the $C$-Stationary 1D partitioning, each worker stores a row block of the matrices $A$, $H$, and $C$. During the communication stage, each worker broadcasts its row block of $H$ to all other workers. Subsequently, local computation is performed to compute the block rows of $C$. In this paradigm, matrix $A$ is also stationary, making this 1D $C$-Stationary SpMM also $A$-Stationary. It is worth noting that the 1D partitioning can also be performed in a column-wise manner [46]. In such cases, the 1D $C$-Stationary is also $H$-Stationary. In short, under 1D partitioning, both $A$-Stationary and $H$-Stationary are equivalent to $C$-Stationary. Therefore, the 1D partitioning follows a communication-computation execution model. However, 1D $C$-Stationary faces scalability challenges, as a worker needs to broadcast its partition to all remote workers, resulting in communication costs that grow linearly with the number of workers [110]. To address this, optimizations such as P2P communication and non-blocking techniques [29, 94] can be employed to accelerate the communication stage, which will be discussed in Section 7.
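The communication-computation pattern under 1D C-Stationary partitioning can be simulated in a few lines of numpy: each worker owns a row block of A, H, and C, the row blocks of H are "broadcast" (here simply assembled), and each worker multiplies its local rows of A with the full H, producing its row block of C without any reduction. The worker list is a single-process stand-in for real broadcast communication.

```python
import numpy as np

def spmm_1d_c_stationary(A_row_blocks, H_row_blocks):
    """Simulate 1D C-Stationary SpMM: C = A @ H with row-wise partitioning.

    A_row_blocks[i]: rows of the (sparse) adjacency matrix owned by worker i.
    H_row_blocks[i]: rows of the dense feature matrix owned by worker i.
    """
    # Communication stage: every worker broadcasts its row block of H,
    # so each worker can assemble the full H locally.
    H_full = np.vstack(H_row_blocks)

    # Computation stage: each worker multiplies its local row block of A
    # with the assembled H; the result is its row block of C (no reduction needed).
    C_row_blocks = [A_i @ H_full for A_i in A_row_blocks]
    return C_row_blocks

# Usage: 2 "workers", a 4-vertex graph, 3-dimensional features.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
H = np.arange(12, dtype=float).reshape(4, 3)
C_blocks = spmm_1d_c_stationary([A[:2], A[2:]], [H[:2], H[2:]])
assert np.allclose(np.vstack(C_blocks), A @ H)
```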
For $C$-Stationary 1.5D partitioning, either matrix $A$ or matrix $H$ is partitioned in a 2D manner, while the other matrix is partitioned in a 1D manner. To partition a matrix in a 2D manner, each processor holds a row-column block of the complete matrix, comprising the elements of the matrix that satisfy both the assigned column IDs and row IDs of the processor. Under 1.5D partitioning, although matrix $C$ is set to be stationary, either matrix $A$ or matrix $H$ needs to be broadcast to all processors, leading to scalability challenges similar to 1D partitioning.
Another approach is to leverage $C$-Stationary 2D partitioning, where both matrix $A$ and matrix $H$ are partitioned in a 2D manner. For a row-column block of matrix $C$, the processor holding this block also holds the corresponding row-column blocks of matrix $A$ and matrix $H$. It only needs to receive the blocks with the same row ID of matrix $A$ and the blocks with the same column ID of matrix $H$. In this way, the total communication overhead is reduced.
Communication-computation-reduction Execution Model. As described above, adopting the $C$-Stationary strategy is the key to eliminating the reduction stage. Therefore, in this model, the $C$-Stationary strategy is not adopted. In other words, different from the communication-computation execution model where the $C$-Stationary strategy is adopted, other stationary strategies, including $A$-Stationary, $H$-Stationary, and Non-Stationary, are considered. As discussed previously, under both replicated and 1D partitioning, all stationary strategies are equivalent to $C$-Stationary, and no reduction stage is required. For $A$-Stationary and $H$-Stationary under 1.5D and 2D partitioning, the results obtained after the local computation stage are still partial. Each worker performs a reduction operation to sum up these remote partial results with its local partial result. The 3D partitioning can be
[Fig. 9: execution pipelines of distributed full-graph GNN training across workers over iterations $t$ and $t+1$, showing the forward operators SC-AE-GA-AV and their backward counterparts with communication and synchronization at GA and ∇GA, for (a) the synchronous and (b) the asynchronous execution model.]
The four operators and their backward counterparts form an execution pipeline over the graph data, following the order SC-AE-GA-AV-∇AV-∇GA-∇AE-∇SC. Among these eight stages, two of them involve the communication of the states of boundary vertices, namely GA and ∇GA. In GA, the features of neighbor vertices are aggregated to the target vertices, so the features of boundary vertices must be transferred. In ∇GA, the gradients of the boundary vertices must be sent back to the workers they belong to. Therefore, GA and ∇GA are two synchronization points in the synchronous execution model. At these points, the execution flow is blocked until all communication finishes. Systems like NeuGraph [86], CAGNET [110], FlexGraph [121], and DistGNN [89] apply this execution model. To reduce the communication cost and improve the training efficiency, several communication protocols have been proposed, and we review them in Section 7.1. We point out that these two synchronization points exist irrespective of the view of the execution model. From the matrix view, GA corresponds to the $AH$ SpMM operation, and ∇GA corresponds to the $AG$ SpMM operation, where $G$ represents the gradient matrix in backward propagation.
6.2.5 Asynchronous execution model. The asynchronous execution model allows the computation to start with historical states and avoids the expensive synchronization cost. According to the different types of states, we classify the asynchronous execution model into type I asynchronization and type II asynchronization.
Type I asynchronization. As mentioned above, two synchronization points exist in the execution pipeline of the SAGA-NN model, and similar synchronization points exist in other programming models as well. Removing such synchronization points of computation graph computation in the execution pipeline introduces type I asynchronization [89, 94, 117]. In type I asynchronization (Figure 9(b)), workers do not wait for the hidden embeddings (hidden features after the first GNN layer) of boundary vertices to arrive. Instead, they use the historical hidden embeddings of the boundary vertices from previous epochs, which were cached or received earlier, to perform the aggregation of the target vertices. Similarly, historical gradients are used in the ∇GA stage.
Type II asynchronization. During GNN training, another synchronization point occurs when the weight parameters need to be updated. This is identical to traditional deep learning models, where many efforts have been made to remove such a synchronization point [92, 98]. Gandhi et al. [46] further adopt this idea for GNN models, which forms type II asynchronization. Under such a protocol, mini-batches are executed in parallel and form a mini-batch pipeline.
Note that type I and type II asynchronization can be adopted individually or jointly. The detailed communication protocols for the GNN-specific type I asynchronous execution model are described in Section 7.2.
providing the opportunity to overlap the communication and computation of different chunks. Here we call the aggregation of a vertex's partial neighborhood the partial aggregation. SAR [91] aggregates the chunks in a predefined order. Each worker in SAR first computes the partial aggregation of the local neighborhood of the target vertex, then fetches the remote neighborhoods in the predefined order, computes the partial aggregation of each remote neighborhood, and accumulates the results locally one by one. Consequently, the aggregation of the complete neighborhood in SAR is divided into different stages, and in each stage a partial aggregation is performed. The final aggregation is computed when all partial aggregations complete. SAR applies the pipeline-based communication protocol to reduce memory consumption, since it does not need to store all the features of the neighborhood and compute the aggregation at once. ParallelGCN adopts a similar idea without pre-defining the aggregation order: the remote chunks of the neighborhood are received in a random order, and it starts aggregating the received vertices immediately. Furthermore, to reduce the overhead of network communication caused by transferring the partial neighborhood, DeepGalois [54] performs a remote partial aggregation before communication. Therefore, only the partial aggregation results need to be transferred. The cd-0 communication strategy in DistGNN is similar, which simply performs remote partial aggregation, fetches it to the local worker, and computes the final aggregation result. To further improve training efficiency, FlexGraph [121] overlaps the computation and communication. Each worker first issues a request to partially aggregate the neighborhood at the remote worker and then computes the local partial aggregation while waiting for the remote partial aggregations to complete the transfer. After receiving the remote partial aggregations, the worker directly aggregates them with the local partial aggregation. Therefore, the local partial aggregation is overlapped with the communication of the remote partial aggregations.
7.1.4 Communication via Shared Memory. To train large GNNs at scale, it is also feasible to leverage the CPU memory to retrieve the required information instead of GPU-GPU communication. The complete graph and feature embeddings are stored in shared memory (i.e., CPU memory), and the device memory of each GPU is treated as a cache. In ROC [65], the authors assume that all the GNN data fits in the CPU memory, and they repetitively store the whole graph structure and features in the CPU DRAM of each worker. GPU-based workers retrieve the vertex features and hidden embeddings from the local CPU shared memory. For larger graphs or scenarios in which all the GNN data cannot fit into the CPU DRAM of a single worker, distributed shared memory is preferred. DistDGL [164] partitions the graph and stores the partitions with a specially designed KVStore. During training, if the data is co-located with the worker, DistDGL accesses it via local CPU memory; otherwise, it issues RPC requests to retrieve the information from remote CPU memory. In NeuGraph [86], the graph structure data and hidden embeddings are also stored in shared memory. To support large-scale graphs, the input graph is partitioned into $P \times P$ edge chunks by the 2D partitioning method, and the feature matrix (as well as the hidden embedding matrix) is partitioned into $P$ vertex chunks by the row-wise partition method. As described in Section 6.2.1, it follows a chunk-based execution model. To compute the partial aggregation of a chunk, each GPU fetches the corresponding edge chunk as well as the vertex chunk from the CPU memory. To speed up the vertex chunk transfer among GPUs, a chain-based streaming scheduling method is applied, which considers the communication topology to avoid bandwidth contention.
Table 3. Summary of asynchronous communication protocols. The staleness bound $s$ is set by users.
7.2.1 Staleness Models for Asynchronous GNN Communication Protocols. Generally, bringing in asynchronization means that information with staleness is used in training. Specifically, in type I asynchronization, the GA or ∇GA stage is performed without completely gathering the latest states of the neighborhood, and historical vertex embeddings or gradients are used in aggregation. Different staleness models are introduced to maintain the staleness of historical information, and each of them should ensure that the staleness of the aggregated information is bounded so that GNN training converges. In the following, we review three popular staleness models.
Epoch-Fixed Staleness. One straightforward method is to aggregate historical information with a fixed staleness [46, 89, 117]. Let $e$ be the current epoch in training, and $\tilde{e}$ be the epoch of the historical information used in aggregation. In the epoch-fixed staleness model, $|\tilde{e} - e| = s_e$, where $s_e$ is a hyper-parameter set by users. In this way, the staleness is bounded explicitly by $s_e$.
Epoch-Adaptive Staleness. Another variation of the above basic model is to use an epoch-adaptive staleness [18, 108]. During the aggregation of a vertex $v$, let $\tilde{e}_{uv}$ be the epoch of the historical information of any vertex $u \in N(v)$ (i.e., any neighbor vertex of $v$) used in aggregation. In the epoch-adaptive staleness model, $|\tilde{e}_{uv} - e| \le s_e$ holds for all $u \in N(v)$. This means that in different epochs, the staleness of the historical information used for aggregation may be different, and within one epoch, the staleness of the aggregated neighbors of a vertex may also differ. Generally, once $e - \tilde{e}$ reaches $s_e$, the latest embeddings or gradients should be broadcast under decentralized training or pushed to the historical embedding server under centralized training. If the above condition does not hold when aggregation is performed, the aggregation is blocked until historical information within the staleness bound (i.e., $s_e$) can be retrieved.
Variation-Based Staleness. Third, the staleness can also be measured by the variation of embeddings or gradients. In other words, embeddings or gradients are only aggregated when they have significantly changed. Specifically, let $H_w^{(l)}$ be the embeddings of layer $l$ held by worker $w$, and $\tilde{H}_w^{(l)}$ be the historical embeddings last shared with other workers. In the variation-based staleness model, $\|H_w^{(l)} - \tilde{H}_w^{(l)}\| \le s_v$, where $s_v$ is the maximum difference bound set by users, so that the latest embeddings or gradients are broadcast if the historical version available to other workers is too stale. In this way, the staleness is bounded with the help of $s_v$.
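A minimal sketch of how an epoch-bounded historical embedding cache might look: a worker serves a cached remote embedding as long as its staleness is within the bound $s_e$, and otherwise signals that a refresh (or blocking) is needed. The class and method names are illustrative assumptions, not the interface of any cited system.

```python
class HistoricalEmbeddingCache:
    """Caches remote boundary-vertex embeddings with a bounded epoch staleness."""

    def __init__(self, staleness_bound):
        self.staleness_bound = staleness_bound      # s_e, set by the user
        self.cache = {}                             # vertex -> (embedding, epoch)

    def put(self, vertex, embedding, epoch):
        """Record the embedding received (or pushed) for a vertex at a given epoch."""
        self.cache[vertex] = (embedding, epoch)

    def get(self, vertex, current_epoch):
        """Return a cached embedding only if its staleness is within the bound."""
        if vertex not in self.cache:
            return None                             # must fetch the latest embedding
        embedding, cached_epoch = self.cache[vertex]
        if current_epoch - cached_epoch > self.staleness_bound:
            return None                             # too stale: refresh or block first
        return embedding
```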
7.2.2 Realization of Asynchronous GNN Communication Protocols. As described above, historical information can be used in aggregation during the GA or ∇GA stages. Some protocols only take the GA stage into consideration, supporting asynchronous embedding aggregation. Other protocols are designed to use historical information during both the GA and ∇GA stages, supporting both asynchronous embedding and gradient aggregation.
Asynchronous Embedding Aggregation. Peng et al. [94] consider the broadcast-based training paradigm of CAGNET and design a skip-broadcast mechanism under a 1D partition. This mechanism automatically checks the staleness of the partitioned vertex features on each worker and skips the broadcast if they are not too stale. If the broadcast of a partition is skipped, the other workers use the previously cached historical vertex embeddings to perform the forward computation. Three staleness check algorithms are further proposed, each of which ensures that the staleness of the cached vertex embeddings is bounded.
Chai et al. [18] adopt a similar idea while using parameter servers to maintain the historical embeddings of all
workers. The vertex hidden embeddings are pushed to these servers every several epochs, and the workers pull
the historical embeddings to their local cache in the next epoch after the push. The staleness of historical vertex
embeddings is thus bounded by the push-pull period.
The cd-r algorithms introduced by Md et al. [89] overlap the communication of partial aggregation results from each worker with the forward computation of the GNN. Specifically, the partial aggregation in iteration $t$ is transmitted asynchronously to the target vertex, and the final aggregation is performed in iteration $t + r$. Under this algorithm, the staleness is bounded by $r$.
Asynchronous Embedding and Gradient Aggregation. As we introduced in Section 6.2.4, GA and ∇GA are two synchronization points in the GNN execution model. While the above protocols only use historical vertex embeddings during the GA stage, Wan et al. [117] also take the synchronization point ∇GA into consideration and use stale vertex gradients in back propagation. During training, both embeddings and gradients from the neighborhood are asynchronously sent in a point-to-point fashion in the last epoch and received by the target vertex in the current epoch. Therefore, the communication is overlapped with both forward and backward computation, and a training pipeline with a fixed-epoch gap is thus constructed, in which workers are only allowed to use features or gradients from exactly one epoch ago, so the staleness is bounded. Thorpe et al. [108] design a finer and more flexible pipeline. Similarly to the method proposed by Wan et al. [117], both stale vertex embeddings and vertex gradients are used in the pipeline. Moreover, it also removes the type II synchronization, so a trainer may use stale weight parameters and start the next epoch immediately. The staleness in the pipeline is explicitly bounded by $s$, which is set by users. The bounded staleness $s$ ensures that the trainer that moves fastest in the pipeline is allowed to use historical embeddings or gradients at most $s$ epochs old. With this staleness bound, the staleness of the weight parameters is also bounded accordingly.
Vertex-level Asynchronization. Without using stale embeddings or gradients, asynchronization can also be designed at the vertex level [158], where each vertex starts the computation of the next layer as soon as all its neighbors' embeddings are received. Different from the traditional synchronous method, in which all vertices on a worker start the computation of a layer together, during vertex-level asynchronous processing different vertices on one worker may be computing different layers at the same time. Note that this asynchronization does not make any difference in the aggregation result, since all the information required is the latest and no embeddings or gradients from previous epochs are used.
by most systems. Therefore, we only note the execution models of systems with special designs. For a detailed introduction of each system and their relations, please refer to our online supplemental materials.
9 FUTURE DIRECTION
As a general solution for training on large-scale graphs, distributed GNN training has gained widespread attention in recent years. In addition to the techniques and systems discussed above, there are other interesting and emerging research topics in distributed GNN training. We discuss some of these directions in the following.
Benchmark for distributed GNN training. Many efforts have been made in benchmarking traditional deep learning models. For instance, DAWNBench [26] provides a standard evaluation criterion to compare different training algorithms. It focuses on the end-to-end training time required to converge to a state-of-the-art accuracy, as well as the inference time. Both single-machine and distributed computing scenarios are considered. Furthermore, many benchmark suites have been developed for traditional DNN training to profile and characterize the computation [33, 88, 172]. As for GNNs, Dwivedi [35] attempts to benchmark the GNN pipeline and compare the performance of different GNN models on medium-size datasets. Meanwhile, larger datasets [34, 42, 56] for graph machine learning tasks have been published. However, to our knowledge, few efforts have been made to compare the efficiency of different GNN training algorithms, especially in distributed computing scenarios. GNNMark [8] is the first benchmark suite developed specifically for GNN training. It leverages the NVIDIA nvprof profiler [13] to characterize kernel-level computations, uses the NVBit framework [114] to profile memory divergence, and further modifies the PyTorch source code to collect data sparsity. However, it lacks flexibility for quick profiling of different GNN models and does not pay much attention to profiling distributed and multi-machine training scenarios. Therefore, it would be valuable to design a new benchmark for large-scale distributed GNN training.
Large-scale dynamic graph neural networks. In many applications, graphs are not static. The vertex attributes or graph structure often evolve, which requires the representations to be updated in time. Li and Chen [74] proposed a general cache-based GNN system to accelerate representation updating. It maintains a cache for hidden representations and selects valuable representations to save updating time. DynaGraph [51] efficiently trains GNNs via cached message passing and timestep fusion. Furthermore, it optimizes the graph partitioning in order to train GNNs in a data-parallel manner. Although dynamic GNNs have long been an interesting area of research, as far as we know, there are no other works that specifically focus on distributed dynamic GNN training. The dynamism of both features and structure poses new challenges to the ordinary solutions in a distributed environment. Graph partitioning has to quickly adjust to the changes of vertices and edges while meeting the requirements of load balance and communication reduction. The update of the graph structure drops the cache hit ratio, which significantly influences the end-to-end performance of GNN training.
Large-scale heterogeneous graph neural networks. Many heterogeneous GNN architectures have been proposed
in recent years [19, 44, 77, 125, 148, 159]. However, few distributed systems take the unique characteristics of
heterogeneous graphs into consideration when supporting heterogeneous GNNs. Since the feature size and the
number of neighbors may vary greatly across vertices of different types, processing heterogeneous graphs in a
distributed manner may cause severe problems such as load imbalance and high network overhead. The Paddle Graph
Learning (PGL) framework (https://github.com/PaddlePaddle/PGL) provides easy and fast programming of message-passing-based graph learning on
heterogeneous graphs with distributed computing. DistDGLv2 [165] takes imbalanced workload partitioning into
consideration and leverages the multi-constraint technique in METIS to mitigate this problem. More attention
should be paid to this research topic to address the above problems.
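As a simple illustration of the imbalance problem, the sketch below assigns each vertex a multi-constraint weight vector (feature bytes to transfer and an aggregation-work proxy), which is the kind of input a multi-constraint partitioner such as METIS balances; the vertex types, feature dimensions, and cost model are illustrative assumptions, not DistDGLv2's actual configuration.

```python
# Illustrative multi-constraint vertex weights for a heterogeneous graph:
# one weight per balancing objective (feature bytes moved, aggregation work).
# The type names and feature sizes are made-up examples.
FEATURE_DIM = {"user": 64, "item": 256, "review": 768}  # per-type feature size
BYTES_PER_FLOAT = 4

def vertex_weights(vtype, degree):
    feat_bytes = FEATURE_DIM[vtype] * BYTES_PER_FLOAT    # communication cost
    agg_work = degree * FEATURE_DIM[vtype]               # aggregation-work proxy
    return (feat_bytes, agg_work)

def imbalance(partitions):
    """Max-over-average load per constraint; 1.0 means perfectly balanced."""
    num_constraints = len(next(iter(partitions[0])))
    ratios = []
    for c in range(num_constraints):
        loads = [sum(w[c] for w in part) for part in partitions]
        ratios.append(max(loads) / (sum(loads) / len(loads)))
    return ratios

# Example: two partitions holding (type, degree) vertices. A partitioner that
# only balances vertex counts can still be far from 1.0 on both constraints.
p0 = [vertex_weights("user", 10), vertex_weights("review", 3)]
p1 = [vertex_weights("item", 50), vertex_weights("user", 5)]
print(imbalance([p0, p1]))
```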
GNN model compression techniques. Although model compression, including pruning, quantization, weight
sharing, etc., is widely used in deep learning, it has not been extensively applied in distributed GNN training.
Compression of the network structure, such as pruning [169], can be combined with a graph sampling strategy to address the
out-of-memory problem. Model quantization is another promising approach to improving the scalability of GNN
models. SGQuant [38] is a GNN-tailored quantization algorithm that develops multi-granularity quantization and
automatic bit selection. Degree-Quant [106] stochastically protects (high-degree) vertices from quantization to improve
weight-update accuracy. BinaryGNN [5] applies a binarization strategy inspired by the latest developments in
binary neural networks for images and knowledge distillation for graph networks. For the distributed setting,
Song et al. [103] recently proposed EC-Graph, a distributed GNN training system for CPU clusters that aims to
reduce communication costs through message compression. It adopts lossy compression and designs compensation
methods to mitigate the induced errors, while a Bit-Tuner keeps the balance between model
accuracy and message size. GNN model compression is orthogonal to the aforementioned distributed GNN
optimization techniques and deserves more attention as a way to improve the efficiency of distributed GNN
training.
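The error-compensation idea behind such compressed communication can be sketched as generic error-feedback quantization: each sender keeps the quantization residual of its last message locally and adds it back before compressing the next one, so the error stays bounded instead of accumulating. The code below is a minimal PyTorch sketch of this general technique, not EC-Graph's Bit-Tuner; the bit width and tensor shapes are arbitrary examples.

```python
# A generic error-feedback (error-compensated) quantization sketch for the
# messages exchanged in distributed GNN training. Only the integer codes and
# a scale would go on the wire; the residual stays on the sender.
import torch

class ErrorFeedbackQuantizer:
    def __init__(self, num_bits=4):
        self.levels = 2 ** (num_bits - 1) - 1     # symmetric signed code range
        self.residual = None                      # locally kept compression error

    def compress(self, msg):
        if self.residual is None:
            self.residual = torch.zeros_like(msg)
        corrected = msg + self.residual           # add back last round's error
        scale = corrected.abs().max().clamp(min=1e-12)
        codes = torch.round(corrected / scale * self.levels)
        dequantized = codes / self.levels * scale
        self.residual = corrected - dequantized   # remember the new error locally
        return codes.to(torch.int8), scale        # compressed message

    def decompress(self, codes, scale):
        return codes.float() / self.levels * scale

# Usage sketch: one quantizer per destination worker; messages shrink to a
# few bits per value at the cost of a bounded, compensated error.
quant = ErrorFeedbackQuantizer(num_bits=4)
msg = torch.randn(8, 16)                          # e.g., boundary-vertex features
codes, scale = quant.compress(msg)
approx = quant.decompress(codes, scale)
print(float((msg - approx).abs().mean()))
```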
10 CONCLUSION
Distributed GNN training is one of the successful approaches to scaling GNN models to large graphs. In this
survey, we systematically reviewed the existing distributed GNN training techniques, from graph data processing
to distributed model execution, covering the life-cycle of end-to-end distributed GNN training. We divided the
distributed GNN training pipeline into three stages, namely data partition, batch generation, and GNN model training,
which heavily influence GNN training efficiency. To clearly organize the new technical contributions that
optimize these stages, we proposed a new taxonomy consisting of four orthogonal categories: GNN data
partition, GNN batch generation, GNN execution model, and GNN communication protocol. In the GNN data
partition category, we described the data partition techniques for distributed GNN training; in the GNN batch
generation category, we presented the techniques for fast batch generation in mini-batch distributed
GNN training; in the GNN execution model category, we discussed the execution models used in mini-batch and
full-graph training, respectively; in the GNN communication protocol category, we discussed both synchronous
and asynchronous protocols for distributed GNN training. After carefully reviewing the techniques in these four
categories, we summarized existing representative distributed GNN systems for multi-GPUs, GPU clusters, and
CPU clusters, respectively, and discussed future directions in optimizing distributed GNN
training.
ACKNOWLEDGEMENTS
This work is supported by the National Key R&D Program of China (2022ZD0116315), National Natural Science
Foundation of China (Nos. 62272054, 62192784, U23B2048, U22B2037), Beijing Nova Program (No. 20230484319),
and Xiaomi Young Talents Program. Lei Chen’s work is partially supported by National Science Foundation of
China (NSFC) under Grant No. U22B2060, the Hong Kong RGC GRF Project 16209519, CRF Project C6030-18G,
C2004-21GF, AOE Project AoE/E-603/18, RIF Project R6020-19, Theme-based project TRS T41-603/20R, Guangdong
Basic and Applied Basic Research Foundation 2019B151530001, Hong Kong ITC ITF grants MHX/078/21 and
PRP/004/22FX, Microsoft Research Asia Collaborative Research Grant, HKUST-Webank joint research lab grant
and HKUST Global Strategic Partnership Fund (2021 SJTU-HKUST).
REFERENCES
[1] Sergi Abadal, Akshay Jain, Robert Guirado, Jorge López-Alonso, and Eduard Alarcón. 2021. Computing Graph Neural Networks: A
Survey from Algorithms to Accelerators. ACM Comput. Surv. 54, 9 (2021), 1ś38.
[2] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jefrey Dean, Matthieu Devin, Sanjay Ghemawat, Geofrey
Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,
Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine
learning. In 12th USENIX Symposium on Operating Systems Design and Implementation. 265ś283.
[3] David Ahmedt-Aristizabal, Mohammad Ali Armin, Simon Denman, Clinton Fookes, and Lars Petersson. 2021. Graph-Based Deep
Learning for Medical Diagnosis and Analysis: Past, Present and Future. Sensors 21, 14 (2021), 4758.
[4] Alexandra Angerd, Keshav Balasubramanian, and Murali Annavaram. 2020. Distributed training of graph convolutional networks
using subgraph approximation. arXiv preprint arXiv:2012.04930 (2020), 1ś14.
[5] Mehdi Bahri, Gaetan Bahl, and Stefanos Zafeiriou. 2021. Binary Graph Neural Networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 9492ś9501.
[6] Youhui Bai, Cheng Li, Zhiqi Lin, Yufei Wu, Youshan Miao, Yunxin Liu, and Yinlong Xu. 2021. Eicient Data Loader for Fast Sampling-
Based GNN Training on Large Graphs. IEEE Transactions on Parallel and Distributed Systems 32, 10 (2021), 2541ś2556.
[7] Ziv Bar-Yossef and Li-Tal Mashiach. 2008. Local approximation of pagerank and reverse pagerank. In Proceedings of the 17th ACM
conference on Information and knowledge management. 279ś288.
[8] Trinayan Baruah, Kaustubh Shivdikar, Shi Dong, Yifan Sun, Saiful A Mojumder, Kihoon Jung, José L Abellán, Yash Ukidave, Ajay Joshi,
John Kim, et al. 2021. GNNMark: A benchmark suite to characterize graph neural network training on GPUs. In 2021 IEEE International
Symposium on Performance Analysis of Systems and Software. 13ś23.
[9] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea
Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks.
arXiv preprint arXiv:1806.01261 (2018), 1ś40.
[10] Maciej Besta and Torsten Hoeler. 2022. Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. arXiv
preprint arXiv:2205.09702 (2022), 1ś27.
[11] Erik G Boman, Karen D Devine, and Sivasankaran Rajamanickam. 2013. Scalable matrix computations on large scale-free graphs using
2D graph partitioning. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
1ś12.
[12] Pietro Bongini, Monica Bianchini, and Franco Scarselli. 2021. Molecular generative graph neural networks for drug discovery.
Neurocomputing 450 (2021), 242ś252.
[13] Thomas Bradley. 2012. GPU performance analysis and optimisation. NVIDIA Corporation (2012), 1ś117.
[14] Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan Yu. 2021. DGCL: An eicient communication library for distributed
GNN training. In Proceedings of the European Conference on Computer Systems. 130ś144.
[15] Zhenkun Cai, Qihui Zhou, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, and George Karypis. 2023. DSP:
Eicient GNN training with multiple GPUs. In Proceedings of the ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel
Programming. 392ś404.
[16] Umit V Catalyurek and Cevdet Aykanat. 1999. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector
multiplication. IEEE Transactions on parallel and distributed systems 10, 7 (1999), 673ś693.
[17] Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Peng Zhang, Guohao Dai,
et al. 2021. Cogdl: An extensive toolkit for deep learning on graphs. arXiv preprint arXiv:2103.00959 (2021), 1ś11.
[18] Zheng Chai, Guangji Bai, Liang Zhao, and Yue Cheng. 2022. Distributed Graph Neural Network Training with Periodic Historical
Embedding Synchronization. arXiv preprint arXiv:2206.00057 (2022), 1ś20.
[19] Yaomin Chang, Chuan Chen, Weibo Hu, Zibin Zheng, Xiaocong Zhou, and Shouzhi Chen. 2022. Megnn: Meta-path extracted graph
neural network for heterogeneous graph representation learning. Knowledge-Based Systems 235 (2022), 107611.
[20] Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In
International Conference on Learning Representations. 1ś15.
[21] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv
preprint arXiv:1604.00981 (2016), 1ś10.
[22] Jianfei Chen, Jun Zhu, and Le Song. 2018. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In Proceedings
of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80). 942ś950.
[23] Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Abanti Basak, Ling Liang, Mingyu Yan, Lei Deng, Yufei Ding, Zidong Du, et al.
2022. Rubik: A Hierarchical Architecture for Eicient Graph Neural Network Training. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems 41, 4 (2022), 936ś949.
[24] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An Eicient Algorithm for Training
Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. 257ś266.
[25] Edward Choi, Zhen Xu, Yujia Li, Michael Dusenberry, Gerardo Flores, Emily Xue, and Andrew Dai. 2020. Learning the graphical
structure of electronic health records with graph convolutional transformer. In Proceedings of the AAAI conference on artiicial intelligence.
606ś613.
[26] Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and
Matei Zaharia. 2017. Dawnbench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.
[27] Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. 2020. Minimal variance sampling with provable guarantees for
fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining. 1393ś1403.
[28] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Velickovic. 2020. Principal Neighbourhood Aggregation for
Graph Nets. In Advances in Neural Information Processing Systems. 13260ś13271.
[29] Gunduz Vehbi Demirci, Aparajita Haldar, and Hakan Ferhatosmanoglu. 2023. Scalable Graph Convolutional Network Training on
Distributed-Memory Systems. In Proceedings of the VLDB Endowment, Vol. 16. 711ś724.
[30] Xiang Deng and Zhongfei Zhang. 2021. Graph-Free Knowledge Distillation for Graph Neural Networks. In Proceedings of the Thirtieth
International Joint Conference on Artiicial Intelligence. 2321ś2327.
[31] Kien Do, Truyen Tran, and Svetha Venkatesh. 2019. Graph Transformation Policy Network for Chemical Reaction Prediction. In
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 750ś760.
[32] Jialin Dong, Da Zheng, Lin F Yang, and George Karypis. 2021. Global Neighbor Sampling for Mixed CPU-GPU Training on Giant
Graphs. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 289ś299.
[33] Shi Dong and David Kaeli. 2017. DNNMark: A Deep Neural Network Benchmark Suite for GPUs. In Proceedings of the General Purpose
GPUs. 63ś72.
[34] Yuanqi Du, Shiyu Wang, Xiaojie Guo, Hengning Cao, Shujie Hu, Junji Jiang, Aishwarya Varala, Abhinav Angirekula, and Liang Zhao.
2021. Graphgt: Machine learning datasets for graph generation and transformation. In Thirty-ifth Conference on Neural Information
Processing Systems Datasets and Benchmarks Track (Round 2). 1ś17.
[35] Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2020. Benchmarking graph neural
networks. arXiv preprint arXiv:2003.00982 (2020), 1ś47.
[36] Wenfei Fan, Ruochun Jin, Muyang Liu, Ping Lu, Xiaojian Luo, Ruiqi Xu, Qiang Yin, Wenyuan Yu, and Jingren Zhou. 2020. Application
Driven Graph Partitioning. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1765ś1779.
[37] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph Neural Networks for Social Recommendation.
In The World Wide Web Conference. 417ś426.
[38] Boyuan Feng, Yuke Wang, Xu Li, Shu Yang, Xueqiao Peng, and Yufei Ding. 2020. SGQuant: Squeezing the Last Bit on Graph Neural
Networks with Specialized Quantization. In 2020 IEEE 32nd International Conference on Tools with Artiicial Intelligence (ICTAI).
1044ś1052.
[39] Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428
(2019), 1ś9.
[40] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. 2017. Protein Interface Prediction Using Graph Convolutional Networks. In
Proceedings of the 31st International Conference on Neural Information Processing Systems. 6533ś6542.
[41] Fabrizio Frasca, Emanuele Rossi, Davide Eynard, Ben Chamberlain, Michael Bronstein, and Federico Monti. 2020. Sign: Scalable
inception graph neural networks. arXiv preprint arXiv:2004.11198 (2020), 1ś17.
[42] Scott Freitas, Yuxiao Dong, Joshua Neil, and Duen Horng Chau. 2021. A Large-Scale Database for Graph Representation Learning. In
Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). 1ś13.
[43] Qiang Fu, Yuede Ji, and H Howie Huang. 2022. TLPGNN: A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network
Computation on GPU. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing.
122ś134.
[44] Xinyu Fu, Jiani Zhang, Ziqiao Meng, and Irwin King. 2020. Magnn: Metapath aggregated graph neural network for heterogeneous
graph embedding. In Proceedings of The Web Conference 2020. 2331ś2341.
[45] Yasuhiro Fujiwara, Yasutoshi Ida, Atsutoshi Kumagai, Masahiro Nakano, Akisato Kimura, and Naonori Ueda. 2023. Eicient Network
Representation Learning via Cluster Similarity. Data Sci. Eng. 8, 3 (2023), 279ś291.
[46] Swapnil Gandhi and Anand Padmanabha Iyer. 2021. P3: Distributed deep graph learning at scale. In 15th USENIX Symposium on
Operating Systems Design and Implementation. 551ś568.
[47] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural Message Passing for Quantum
Chemistry. In Proceedings of the 34th International Conference on Machine Learning. 1263ś1272.
[48] Zhangxiaowen Gong, Houxiang Ji, Yao Yao, Christopher W Fletcher, Christopher J Hughes, and Josep Torrellas. 2022. Graphite:
optimizing graph neural networks on CPUs through cooperative software-hardware techniques. In Proceedings of the 49th Annual
International Symposium on Computer Architecture. 916ś931.
[49] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel
Computation on Natural Graphs. In 10th USENIX symposium on operating systems design and implementation (OSDI 12). 17ś30.
[50] Daniele Grattarola and Cesare Alippi. 2021. Graph neural networks in TensorFlow and keras with spektral [application notes]. IEEE
Computational Intelligence Magazine 16, 1 (2021), 99ś106.
[51] Mingyu Guan, Anand Padmanabha Iyer, and Taesoo Kim. 2022. DynaGraph: dynamic graph neural networks at scale. In Proceedings of
the 5th ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data
Analytics (NDA). 1ś10.
[52] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st
International Conference on Neural Information Processing Systems. 1025ś1035.
[53] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph
Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development
in Information Retrieval. 639ś648.
[54] Loc Hoang, Xuhao Chen, Hochan Lee, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2021. Eicient Distribution for Deep
Learning on Large Graphs. In Proceedings of the First MLSys Workshop on Graph Neural Networks and Systems. 1ś9.
[55] Linmei Hu, Siyong Xu, Chen Li, Cheng Yang, Chuan Shi, Nan Duan, Xing Xie, and Ming Zhou. 2020. Graph neural news recommendation
with unsupervised preference disentanglement. In Proceedings of the 58th annual meeting of the association for computational linguistics.
4255ś4264.
[56] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open
graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems 33 (2020), 22118ś22133.
[57] Yuwei Hu, Zihao Ye, Minjie Wang, Jiali Yu, Da Zheng, Mu Li, Zheng Zhang, Zhiru Zhang, and Yida Wang. 2020. Featgraph: A lexible
and eicient backend for graph neural network systems. In SC20: International Conference for High Performance Computing, Networking,
Storage and Analysis. 1ś13.
[58] Guyue Huang, Guohao Dai, Yu Wang, and Huazhong Yang. 2020. Ge-spmm: General-purpose sparse matrix-matrix multiplication on
gpus for graph neural networks. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
1ś12.
[59] Kezhao Huang, Jidong Zhai, Zhen Zheng, Youngmin Yi, and Xipeng Shen. 2021. Understanding and Bridging the Gaps in Current GNN
Performance Optimizations. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
119ś132.
[60] Linyong Huang, Zhe Zhang, Zhaoyang Du, Shuangchen Li, Hongzhong Zheng, Yuan Xie, and Nianxiong Tan. 2022. EPQuant: A Graph
Neural Network compression approach based on product quantization. Neurocomputing 503 (2022), 49ś61.
[61] Tinglin Huang, Yuxiao Dong, Ming Ding, Zhen Yang, Wenzheng Feng, Xinyu Wang, and Jie Tang. 2021. MixGCF: An Improved
Training Method for Graph Neural Network-Based Recommender Systems. In Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining. 665ś674.
[62] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. 2018. Adaptive Sampling towards Fast Graph Representation Learning.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 4563ś4572.
[63] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, Hyoukjoong Lee, Jiquan Ngiam, Quoc V Le,
Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Eicient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in
Neural Information Processing Systems, Vol. 32. 1ś10.
[64] Abhinav Jangda, Sandeep Polisetty, Arjun Guha, and Marco Seraini. 2021. Accelerating graph sampling for graph machine learning
using GPUs. In Proceedings of the Sixteenth European Conference on Computer Systems. 311ś326.
[65] Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2020. Improving the Accuracy, Scalability, and Performance of Graph
Neural Networks with Roc. In Proceedings of Machine Learning and Systems. 187ś198.
[66] Peng Jiang and Masuma Akter Rumi. 2021. Communication-eicient sampling for distributed training of graph convolutional networks.
arXiv preprint arXiv:2101.07706 (2021), 1ś11.
[67] Weiwei Jiang and Jiayun Luo. 2022. Graph neural network for traic forecasting: A survey. Expert Systems with Applications 207 (2022),
117921.
[68] Chaitanya K. Joshi. 2022. Recent Advances in Efficient and Scalable Graph Neural Networks. https://www.chaitjo.com/post/efficient-
gnns/. (2022).
[69] Tim Kaler, Alexandros Iliopoulos, Philip Murzynowski, Tao Schardl, Charles E Leiserson, and Jie Chen. 2023. Communication-Eicient
Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching. In Proceedings of Machine Learning and
Systems. 1ś14.
[70] George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on
scientiic Computing 20, 1 (1998), 359ś392.
[71] Yunyong Ko, Kibong Choi, Jiwon Seo, and Sang Wook Kim. 2021. An in-depth analysis of distributed training of deep neural networks.
In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. 994ś1003.
[72] Süreyya Emre Kurt, Jinghua Yan, Aravind Sukumaran-Rajam, Prashant Pandey, and P Sadayappan. 2023. Communication Optimization
for Distributed Execution of Graph Neural Networks. In IEEE International Parallel and Distributed Processing Symposium. 512ś523.
[73] Matthias Langer, Zhen He, Wenny Rahayu, and Yanbo Xue. 2020. Distributed Training of Deep Learning Models: A Taxonomic
Perspective. IEEE Transactions on Parallel and Distributed Systems 31, 12 (2020), 2802ś2818.
[74] Haoyang Li and Lei Chen. 2021. Cache-Based GNN System for Dynamic Graphs. In Proceedings of the 30th ACM International Conference
on Information & Knowledge Management. 937ś946.
[75] Houyi Li, Yongchao Liu, Yongyong Li, Bin Huang, Peng Zhang, Guowei Zhang, Xintan Zeng, Kefeng Deng, Wenguang Chen, and
Changhua He. 2021. GraphTheta: A distributed graph neural network learning system with lexible training strategy. arXiv preprint
arXiv:2104.10569 (2021), 1ś18.
[76] Hongzheng Li, Yingxia Shao, Junping Du, Bin Cui, and Lei Chen. 2022. An I/O-eicient disk-based graph system for scalable second-order
random walk of large graphs. Proceedings of the VLDB Endowment 15, 8 (2022), 1619ś1631.
[77] Longhai Li, Lei Duan, Junchen Wang, Chengxin He, Zihao Chen, Guicai Xie, Song Deng, and Zhaohang Luo. 2023. Memory-Enhanced
Transformer for Representation Learning on Temporal Heterogeneous Graphs. Data Sci. Eng. 8, 2 (2023), 98ś111.
[78] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jef Smith, Brian Vaughan, Pritam
Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proceedings of the
VLDB Endowment 13, 12 (2020), 3005ś3018.
[79] Shuangli Li, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, and Hui Xiong. 2021. Structure-
Aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Ainity. In Proceedings of the 27th ACM SIGKDD
Conference on Knowledge Discovery & Data Mining. 975ś985.
[80] Zhiqi Lin, Cheng Li, Youshan Miao, Yunxin Liu, and Yinlong Xu. 2020. PaGraph: Scaling GNN Training on Large Graphs via
Computation-Aware Caching. In Proceedings of the 11th ACM Symposium on Cloud Computing. 401ś415.
[81] Meng Liu, Youzhi Luo, Limei Wang, Yaochen Xie, Hao Yuan, Shurui Gui, Haiyang Yu, Zhao Xu, Jingtun Zhang, Yi Liu, et al. 2021. DIG:
A Turnkey Library for Diving into Graph Deep Learning Research. Journal of Machine Learning Research 22 (2021), 1ś9.
[82] Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, and Chuanxiong
Guo. 2023. BGL: GPU-Eicient GNN Training by Optimizing Graph Data I/O and Preprocessing. In Proceedings of the 20th USENIX
Symposium on Networked Systems Design and Implementation. 103ś118.
[83] Xin Liu, Mingyu Yan, Lei Deng, Guoqi Li, Xiaochun Ye, Dongrui Fan, Shirui Pan, and Yuan Xie. 2022. Survey on Graph Neural Network
Acceleration: An Algorithmic Perspective. In Proceedings of the Thirty-First International Joint Conference on Artiicial Intelligence.
5521ś5529.
[84] Zirui Liu, Kaixiong Zhou, Fan Yang, Li Li, Rui Chen, and Xia Hu. 2021. EXACT: Scalable graph neural networks training via extreme
activation compression. In International Conference on Learning Representations. 1ś32.
[85] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2012. Distributed GraphLab:
A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (2012), 716ś727.
[86] Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. 2019. NeuGraph: Parallel Deep Neural
Network Computation on Large Graphs. In 2019 USENIX Annual Technical Conference. 443ś458.
[87] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010.
Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
135ś146.
[88] Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon
Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debo Dutta, Udit Gupta, Kim Hazelwood, Andy Hock, Xinyuan Huang,
Daniel Kang, David Kanter, Naveen Kumar, Jefery Liao, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost,
Vijay Janapa Reddi, Taylor Robie, Tom St John, Carole-Jean Wu, Lingjie Xu, Clif Young, and Matei Zaharia. 2020. MLPerf Training
Benchmark. In Proceedings of Machine Learning and Systems. 336ś349.
[89] Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar,
Nesreen K Ahmed, and Sasikanth Avancha. 2021. Distgnn: Scalable distributed training for large-scale graph neural networks. In
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1ś14.
[90] Seung Won Min, Kun Wu, Mert Hidayetoglu, Jinjun Xiong, Xiang Song, and Wen-mei Hwu. 2022. Graph Neural Network Training and
Data Tiering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3555ś3565.
[91] Hesham Mostafa. 2022. Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on
Large Graphs. In Proceedings of Machine Learning and Systems. 265ś275.
[92] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and
Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on
Operating Systems Principles. 1ś15.
[93] Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S. Li, and Hang Liu. 2020. C-SAW: A Framework for Graph Sampling and Random
Walk on GPUs. In International Conference for High Performance Computing, Networking, Storage and Analysis. 1ś15.
[94] Jingshu Peng, Zhao Chen, Yingxia Shao, Yanyan Shen, Lei Chen, and Jiannong Cao. 2022. Sancus: staleness-aware communication-
avoiding full-graph decentralized training in large-scale graph neural networks. Proceedings of the VLDB Endowment 15, 9 (2022),
1937ś1950.
[95] Md Khaledur Rahman, Majedul Haque Sujon, and Ariful Azad. 2021. Fusedmm: A uniied sddmm-spmm kernel for graph embedding
and graph neural networks. In 2021 IEEE International Parallel and Distributed Processing Symposium. 256ś266.
[96] Morteza Ramezani, Weilin Cong, Mehrdad Mahdavi, Mahmut T Kandemir, and Anand Sivasubramaniam. 2021. Learn locally, correct
globally: A distributed algorithm for training graph neural networks. arXiv preprint arXiv:2111.08202 (2021), 1ś32.
[97] Jiahua Rao, Xiang Zhou, Yutong Lu, Huiying Zhao, and Yuedong Yang. 2021. Imputing single-cell RNA-seq data by combining graph
convolution and autoencoder neural networks. Iscience 24, 5 (2021), 102393.
[98] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic
Gradient Descent. In Advances in Neural Information Processing Systems. 1ś9.
[99] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia.
2018. Graph Networks as Learnable Physics Engines for Inference and Control. In Proceedings of the 35th International Conference on
Machine Learning. 4470ś4479.
[100] Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2021. Distributed-memory parallel algo-
rithms for sparse times tall-skinny-dense matrix multiplication. In Proceedings of the ACM International Conference on Supercomputing.
431ś442.
[101] Shihui Song and Peng Jiang. 2022. Rethinking Graph Data Placement for Graph Neural Network Training on Multiple GPUs. In
Proceedings of the 36th ACM International Conference on Supercomputing. 1ś10.
[102] Zheng Song, Fengshan Bai, Jianfeng Zhao, and Jie Zhang. 2021. Spammer Detection Using Graph-level Classiication Model of Graph
Neural Network. In 2021 IEEE 2nd International Conference on Big Data, Artiicial Intelligence and Internet of Things Engineering. 531ś538.
[103] Zhen Song, Yu Gu, Jianzhong Qi, Zhigang Wang, and Ge Yu. 2022. EC-Graph: A Distributed Graph Neural Network System with
Error-Compensated Compression. In 2022 IEEE 38th International Conference on Data Engineering. 648ś660.
[104] Isabelle Stanton and Gabriel Kliot. 2012. Streaming Graph Partitioning for Large Distributed Graphs. In Proceedings of the 18th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. 1222ś1230.
[105] Jie Sun, Li Su, Zuocheng Shi, Wenting Shen, Zeke Wang, Lei Wang, Jie Zhang, Yong Li, Wenyuan Yu, Jingren Zhou, and Fei Wu. 2023.
Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training. In Proceedings of the USENIX Annual
Technical Conference. 165ś179.
[106] Shyam A Tailor, Javier Fernandez-Marques, and Nicholas D Lane. 2020. Degree-quant: Quantization-aware training for graph neural
networks. arXiv preprint arXiv:2008.05000 (2020), 1ś22.
[107] Qiaoyu Tan, Ninghao Liu, and Xia Hu. 2019. Deep Representation Learning for Social Network Analysis. Frontiers in Big Data 2 (2019),
2.
[108] John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung
Kim, et al. 2021. Dorylus: Afordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. In
USENIX Symposium on Operating Systems Design and Implementation. 495ś514.
[109] Chao Tian, Lingxiao Ma, Zhi Yang, and Yafei Dai. 2020. PCGCN: Partition-Centric Processing for Accelerating Graph Convolutional
Network. In 2020 IEEE International Parallel and Distributed Processing Symposium. 936ś945.
[110] Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2020. Reducing communication in graph neural network training. In SC20:
International Conference for High Performance Computing, Networking, Storage and Analysis. 1ś14.
[111] Jana Vatter, Ruben Mayer, and Hans-Arno Jacobsen. 2023. The Evolution of Distributed Systems for Graph Neural Networks and their
Origin in Graph Processing and Deep Learning: A Survey. Comput. Surveys (2023), 1ś35.
[112] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention
Networks. In International Conference on Learning Representations. 1ś12.
[113] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. A Survey on
Distributed Machine Learning. Acm computing surveys 53, 2 (2020), 1ś33.
[114] Oreste Villa, Mark Stephenson, David Nellans, and Stephen W Keckler. 2019. Nvbit: A dynamic binary instrumentation framework for
nvidia gpus. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture. 372ś383.
[115] Borui Wan, Juntao Zhao, and Chuan Wu. 2023. Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN
Training. In Proceedings of Machine Learning and Systems. 1ś15.
[116] Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, and Yingyan Lin. 2022. BNS-GCN: Eicient full-graph training of graph convolutional
networks with partition-parallelism and random boundary node sampling. In Proceedings of Machine Learning and Systems. 673ś693.
[117] Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, and Yingyan Lin. 2022. PipeGCN: Eicient Full-
Graph Training of Graph Convolutional Networks with Pipelined Feature Communication. In International Conference on Learning
Representations. 1ś24.
[118] Xinchen Wan, Kaiqiang Xu, Xudong Liao, Yilun Jin, Kai Chen, and Xin Jin. 2023. Scalable and Eicient Full-Graph GNN Training for
Large Graphs. In Proceedings of the 2023 ACM SIGMOD International Conference on Management of Data. 1ś23.
[119] Hanchen Wang, Defu Lian, Ying Zhang, Lu Qin, Xiangjian He, Yiguang Lin, and Xuemin Lin. 2021. Binarized graph neural network. In
World Wide Web. 825ś848.
[120] Junfu Wang, Yunhong Wang, Zhen Yang, Liang Yang, and Yuanfang Guo. 2021. Bi-GCN: Binary Graph Convolutional Network. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1561ś1570.
[121] Lei Wang, Qiang Yin, Chao Tian, Jianbang Yang, Rong Chen, Wenyuan Yu, Zihang Yao, and Jingren Zhou. 2021. FlexGraph: a lexible
and eicient distributed framework for GNN training. In Proceedings of the Sixteenth European Conference on Computer Systems. 67ś82.
[122] M. Wang, W. Fu, X. He, S. Hao, and X. Wu. 2022. A Survey on Large-Scale Machine Learning. IEEE Transactions on Knowledge & Data
Engineering 34, 06 (2022), 2574ś2594.
[123] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng
Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. 2019. Deep Graph Library: Towards
Eicient and Scalable Deep Learning on Graphs. In ICLR workshop on representation learning on graphs and manifolds. 1ś7.
[124] Pengyu Wang, Chao Li, Jing Wang, Taolei Wang, Lu Zhang, Jingwen Leng, Quan Chen, and Minyi Guo. 2021. Skywalker: Eicient
alias-method-based graph sampling and random walk on gpus. In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques. 304ś317.
[125] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. In
The world wide web conference. 2022ś2032.
[126] Yuke Wang, Boyuan Feng, and Yufei Ding. 2022. QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core. In
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 107ś119.
[127] Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding. 2021. GNNAdvisor: An Adaptive and Eicient
Runtime System for GNN Acceleration on GPUs. In 15th USENIX Symposium on Operating Systems Design and Implementation. 515ś531.
[128] Max Welling and Thomas N Kipf. 2017. Semi-supervised classiication with graph convolutional networks. In J. International Conference
on Learning Representations. 1ś14.
[129] Cameron R Wolfe, Jingkang Yang, Arindam Chowdhury, Chen Dun, Artun Bayer, Santiago Segarra, and Anastasios Kyrillidis. 2021.
GIST: Distributed training for large-scale graph convolutional networks. arXiv preprint arXiv:2102.10424 (2021), 1ś28.
[130] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying Graph Convolutional
Networks. In Proceedings of the 36th International Conference on Machine Learning. 6861ś6871.
[131] Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. 2022. Graph Neural Networks in Recommender Systems: A Survey. Comput.
Surveys (2022), 1ś37.
[132] Yongji Wu, Defu Lian, Yiheng Xu, Le Wu, and Enhong Chen. 2020. Graph Convolutional Networks with Markov Random Field
Reasoning for Social Spammer Detection. In Proceedings of the AAAI Conference on Artiicial Intelligence. 1054ś1061.
[133] Yidi Wu, Kaihao Ma, Zhenkun Cai, Tatiana Jin, Boyang Li, Chenguang Zheng, James Cheng, and Fan Yu. 2021. Seastar: vertex-centric
programming for graph neural networks. In Proceedings of the Sixteenth European Conference on Computer Systems. 359ś375.
[134] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2021. A Comprehensive Survey on Graph
Neural Networks. IEEE transactions on neural networks and learning systems 32, 1 (2021), 4ś24.
[135] Shuo Xiao, Dongqing Zhu, Chaogang Tang, and Zhenzhen Huang. 2023. Combining Graph Contrastive Embedding and Multi-head
Cross-Attention Transfer for Cross-Domain Recommendation. Data Sci. Eng. 8, 3 (2023), 247ś262.
[136] Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji. 2022. Self-Supervised Learning of Graph Neural Networks:
A Uniied Review. IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (2022), 1ś1.
[137] Zhiqiang Xie, Minjie Wang, Zihao Ye, Zheng Zhang, and Rui Fan. 2022. Graphiler: Optimizing Graph Neural Networks with Message
Passing Data Flow Graph. In Proceedings of Machine Learning and Systems. 515ś528.
[138] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International
Conference on Learning Representations. 1ś17.
[139] Ning Xu, Bin Cui, Lei Chen, Zi Huang, and Yingxia Shao. 2015. Heterogeneous Environment Aware Streaming Graph Partitioning. In
IEEE Transactions on Knowledge and Data Engineering. 1560ś1572.
[140] Zihui Xue, Yuedong Yang, Mengtian Yang, and Radu Marculescu. 2022. SUGAR: Eicient Subgraph-level Training via Resource-aware
Graph Partitioning. arXiv preprint arXiv:2202.00075 (2022), 1ś16.
[141] Bencheng Yan, Chaokun Wang, Gaoyang Guo, and Yunkai Lou. 2020. TinyGNN: Learning Eicient Graph Neural Networks. In
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1848ś1856.
[142] Jianbang Yang, Dahai Tang, Xiaoniu Song, Lei Wang, Qiang Yin, Rong Chen, Wenyuan Yu, and Jingren Zhou. 2022. GNNLab: A
Factored System for Sample-Based GNN Training over GPUs. In Proceedings of the Seventeenth European Conference on Computer
Systems. 417ś434.
[143] Ke Yang, MingXing Zhang, Kang Chen, Xiaosong Ma, Yang Bai, and Yong Jiang. 2019. Knightking: a fast distributed graph random
walk engine. In Proceedings of the ACM symposium on operating systems principles. 524ś537.
[144] Hongbo Yin, Yingxia Shao, Xupeng Miao, Yawen Li, and Bin Cui. 2022. Scalable Graph Sampling on GPUs with Compressed Graph. In
Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2383ś2392.
[145] Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. 2020. L2-GCN: Layer-Wise and Learned Eicient Training of Graph
Convolutional Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1ś9.
[146] Hanqing Zeng and Viktor Prasanna. 2020. GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms. In 2020
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 255ś265.
[147] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. GraphSAINT: Graph Sampling
Based Inductive Learning Method. In International Conference on Learning Representations. 1ś19.
[148] Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. 2019. Heterogeneous graph neural network. In
Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 793ś803.
[149] Dalong Zhang, Xin Huang, Ziqi Liu, Jun Zhou, Zhiyang Hu, Xianzheng Song, Zhibang Ge, Lin Wang, Zhiqiang Zhang, and Yuan Qi.
2020. AGL: A Scalable System for Industrial-Purpose Graph Machine Learning. Proceedings of the VLDB Endowment 13, 12 (2020),
3125ś3137.
[150] Guo Zhang, Hao He, and Dina Katabi. 2019. Circuit-GNN: Graph Neural Networks for Distributed Circuit Design. In Proceedings of the
36th International Conference on Machine Learning. 7364ś7373.
[151] Hengrui Zhang, Zhongming Yu, Guohao Dai, Guyue Huang, Yufei Ding, Yuan Xie, and Yu Wang. 2022. Understanding GNN
Computational Graph: A Coordinated Computation, IO, and Memory Perspective. In Proceedings of Machine Learning and Systems.
467ś484.
[152] Weijia Zhang, Hao Liu, Yanchi Liu, Jingbo Zhou, and Hui Xiong. 2020. Semi-Supervised Hierarchical Recurrent Graph Neural Network
for City-Wide Parking Availability Prediction. In The Thirty-Fourth AAAI Conference on Artiicial Intelligence. 1186ś1193.
[153] Wentao Zhang, Xupeng Miao, Yingxia Shao, Jiawei Jiang, Lei Chen, Olivier Ruas, and Bin Cui. 2020. Reliable Data Distillation on
Graph Convolutional Network. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1399ś1414.
[154] Xin Zhang, Yanyan Shen, Yingxia Shao, and Lei Chen. 2023. DUCATI: A Dual-Cache Training System for Graph Neural Networks on
Giant Graphs with the GPU. In Proceedings of the ACM on Management of Data. 1ś24.
[155] Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. 2021. Graph Neural Networks and Their Current Applications in Bioinfor-
matics. Frontiers in Genetics 12 (2021), 1ś22.
[156] Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. 2020. Every Document Owns Its Structure: Inductive
Text Classiication via Graph Neural Networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
334ś339.
[157] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2022. Deep Learning on Graphs: A Survey. IEEE Transactions on Knowledge and Data
Engineering 34, 1 (2022), 249ś270.
[158] Guoyi Zhao, Tian Zhou, and Lixin Gao. 2021. CM-GCN: A Distributed Framework for Graph Convolutional Networks using Cohesive
Mini-batches. In 2021 IEEE International Conference on Big Data. 153ś163.
[159] Jianan Zhao, Xiao Wang, Chuan Shi, Binbin Hu, Guojie Song, and Yanfang Ye. 2021. Heterogeneous graph structure learning for graph
neural networks. In Proceedings of the AAAI Conference on Artiicial Intelligence. 4697ś4705.
[160] Taige Zhao, Xiangyu Song, Jianxin Li, Wei Luo, and Imran Razzak. 2021. Distributed Optimization of Graph Convolutional Network
using Subgraph Variance. arXiv preprint arXiv:2110.02987 (2021), 1ś12.
[161] Yiren Zhao, Duo Wang, Daniel Bates, Robert Mullins, Mateja Jamnik, and Pietro Lio. 2020. Learned low precision graph neural networks.
arXiv preprint arXiv:2009.09232 (2020), 1ś14.
[162] Chenguang Zheng, Hongzhi Chen, Yuxuan Cheng, Zhezheng Song, Yifan Wu, Changji Li, James Cheng, Hao Yang, and Shuai Zhang.
2022. ByteGNN: eicient graph neural network training at large scale. Proceedings of the VLDB Endowment 15, 6 (2022), 1228ś1242.
[163] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2020. GMAN: A Graph Multi-Attention Network for Traic Prediction.
In Proceedings of the AAAI Conference on Artiicial Intelligence. 1234ś1241.
[164] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. 2020. DistDGL:
Distributed Graph Neural Network Training for Billion-Scale Graphs. In 10th IEEE/ACM Workshop on Irregular Applications: Architectures
and Algorithms. 36ś44.
[165] Da Zheng, Xiang Song, Chengru Yang, Dominique LaSalle, and George Karypis. 2022. Distributed Hybrid CPU and GPU Training for
Graph Neural Networks on Billion-Scale Heterogeneous Graphs. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining. 4582ś4591.
[166] Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. 2020. Learning Sparse Nonparametric DAGs. In The 23rd
International Conference on Artiicial Intelligence and Statistics. 3414ś3425.
[167] Shanna Zhong, Jiahui Wang, Kun Yue, Liang Duan, Zhengbao Sun, and Yan Fang. 2023. Few-Shot Relation Prediction of Knowledge
Graph via Convolutional Neural Network with Self-Attention. Data Sci. Eng. 8, 4 (2023), 385ś395.
[168] Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2020. Reasoning Over
Semantic-Level Graph for Fact Checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
6170ś6180.
[169] Hongkuan Zhou, Ajitesh Srivastava, Hanqing Zeng, Rajgopal Kannan, and Viktor Prasanna. 2021. Accelerating large scale real-time
GNN inference using channel pruning. Proceedings of the VLDB Endowment 14, 9 (2021), 1597ś1605.
[170] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun.
2020. Graph neural networks: A review of methods and applications. AI Open (2020), 57ś81.
[171] Zhe Zhou, Cong Li, Xuechao Wei, and Guangyu Sun. 2021. Gcnear: A hybrid architecture for eicient gcn training with near-memory
processing. arXiv preprint arXiv:2111.00680 (2021), 1ś15.
[172] Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady
Pekhimenko. 2018. Benchmarking and Analyzing Deep Neural Network Training. In 2018 IEEE International Symposium on Workload
Characterization. 88ś100.
[173] Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. 2019. AliGraph: a comprehensive
graph neural network platform. Proceedings of the VLDB Endowment 12, 12 (2019), 2094ś2105.
[174] Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. 2019. Layer-dependent importance sampling for training
deep and large graph convolutional networks. In Proceedings of the 33rd International Conference on Neural Information Processing
Systems. 11249ś11259.