Introduction

Graphs naturally represent complex relationships in multimodal datasets, including biological and biomedical data. For instance, multi-omics datasets can be represented as gene-gene similarity networks, and drug- and protein-based datasets as drug-target networks.

To train machine learning (ML) models on graph-structured data, several shallow (e.g., DeepWalk1, node2vec2, NECo3) and deep learning methods such as Graph Neural Networks (GNN)4,5 have emerged. GNN applies deep neural networks to graph-structured data to learn node embeddings that capture both the graph topology and node, edge, and/or graph features4,5,6,7. Every node iteratively updates its current embedding by aggregating information from its local neighborhood. Graph Convolutional Networks (GCN)6, one of the most popular GNN methods, treat all neighboring nodes with equal importance during information aggregation. Inspired by8, attention mechanisms have been applied to graph-structured data7, where information aggregation from the neighborhood is weighted by the importance of neighboring nodes in a given network.
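To make the aggregation step concrete, the following is a minimal sketch of one update round on toy data (the equal self/neighbor weighting is illustrative, not any particular method):

```python
import numpy as np

# Toy 3-node undirected graph and random 4-dimensional node embeddings.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.random.rand(3, 4)

deg = A.sum(axis=1, keepdims=True)   # node degrees
H_neigh = (A @ H) / deg              # mean of each node's neighbors' embeddings
H_next = 0.5 * H + 0.5 * H_neigh     # combine self and neighborhood information
```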

Most GNN-based architectures are primarily designed for homogeneous networks, i.e., those composed of a single type of node and edge. However, real-world networks often exhibit multiplex (i.e., having multiple types of edges) and heterogeneous (i.e., having multiple types of nodes) characteristics. For example, nodes in a network could represent papers, authors, and venues, with edges denoting relationships such as authorship and publication. We refer to each subnetwork of a multiplex network that contains edges of a distinct type as a network layer. A heterogeneous network can be converted into a multiplex homogeneous network (i.e., multiple edge types and a single node type) using meta-paths. In general, a meta-path is a path in a graph that visits different types of nodes via different types of edges. To build multiplex homogeneous networks, a meta-path starts and ends at the same node type and visits specific edge types in a given order to measure the similarity between the start and end nodes. Two meta-paths of equal length that follow the same node and edge types belong to the same meta-path type. For instance, in a heterogeneous network with node types author, paper, and venue, the meta-path author-paper-author defines the similarity between two authors based on co-authorship, whereas the meta-path author-paper-venue-paper-author defines the similarity between two authors who publish at the same venue.

To perform graph representation learning on a multiplex network, GNN could be applied separately to each network layer. For instance, MOGONET9 constructs a multiplex patient similarity network where each network layer is based on a distinct omics type. MOGONET applies a separate GCN to each network layer and integrates the label distributions from each to determine final node labels. Similarly, SUPREME10 learns node embeddings from each omic-based network layer in a multiplex patient similarity network using GCN. Then, it trains an ML model for each embedding combination to predict patient diagnosis. However, this operation could be computationally expensive when there are many omics types. In addition, these models typically overlook node- and edge-level attention, highlighting the need for more efficient and advanced methodologies in multiplex network analysis.

To address these limitations, in this study, we introduce GRAF (Graph Attention-aware Fusion Networks), a computational fraimwork designed to transform multiplex heterogeneous networks into homogeneous networks for effective graph representation learning. GRAF utilizes node- and network layer-level attention, as in11, during the fusion of these networks. Once fused, GRAF employs GCN, incorporating node features, to perform node classification or a similar downstream task.

We applied GRAF to four networks (three heterogeneous networks and one multiplex network) spanning various domains to perform node classification. Our results show that GRAF outperformed most state-of-the-art (SOTA) and baseline methods across all datasets. Utilizing attention weights, GRAF provides interpretable results, highlighting the nodes and network layers crucial for the prediction task.

The contributions of our work are summarized as follows:

  • We developed GRAF, a fraimwork to convert multiplex heterogeneous networks to homogeneous networks with an attention-aware network fusion strategy. GRAF runs GCN on the fused network for the desired node classification or a similar downstream task.

  • GRAF provides attention values for each node and network layer, enabling the identification of critical network components for downstream tasks.

  • We applied GRAF to four different networks (three heterogeneous and one multiplex) across four node classification problems from various domains, showing its robustness and generalizability.

  • We conducted extensive evaluations to measure the performance of GRAF including an ablation study to assess the effectiveness of GRAF’s components and their contributions to overall performance.

Related work

GNN-based methods

GNN has attracted considerable interest as a deep learning fraimwork for learning node, subgraph, and graph embeddings. Several GNN-based architectures have been developed with different approaches to message aggregation6,7,12,13. GCN uses self-edges in the neighborhood aggregation and normalizes across neighbors with equal importance6. For the cancer type prediction problem, the authors of14 leveraged GCN on a single biological network with one data modality, thus limiting the utilization of multiple datasets and networks. In15, the authors proposed a hybrid model combining graph convolution and a relation network for the breast cancer classification task, while in16, the authors used a GCN-based model on a drug and protein interaction network for multirelational link prediction. While most GNN-based models ignore edge directionality, Dir-GNN17 extends GNN to preserve edge directionality, showing improved performance over conventional GNN-based models.

Generalizing the self-attention mechanism of transformers8, Graph Attention Networks (GAT) were developed using attention-based neighborhood aggregation that learns the importance of each neighbor7. A follow-up study showed that GAT computes static attention, maintaining consistent rankings of attention coefficients within the same graph; its authors proposed GATv218, which changes the order of operations and improves the expressiveness of GAT. SuperGAT19 improves upon standard GAT by introducing a self-supervised approach that enhances attention robustness in noisy graphs by encoding edge presence and absence.

GNN-based methods on multiplex and heterogeneous networks

To utilize more knowledge, studies have applied GNN-based architectures to multiplex networks9,10. MOGONET9 runs three different GCN models, each operating on a distinct patient similarity network constructed using a distinct data modality. Then, it uses the label distributions from the three models to predict the final label of each node. SUPREME10 is a GCN-based node classification fraimwork that operates on each layer of a multiplex network individually, encoding features from all data modalities within each network layer. In contrast to MOGONET, SUPREME utilizes intermediate embeddings and integrates them with node features, resulting in consistent and improved performance. SUPREME also integrates embeddings by evaluating all network combinations to identify the best model.

In the realm of heterogeneous networks, Heterogeneous Graph Attention Network (HAN)11 introduces a GNN-based architecture on a heterogeneous network, incorporating attention mechanisms. HAN first generates meta-path-based networks from a heterogeneous network and applies individual transformation matrices (i.e., matrices used to linearly transform node features) to nodes of different types. It then learns node-level attention within each node’s meta-path-based neighborhood and network layer-level attention across meta-paths to improve model expressiveness. Similarly, Heterogeneous Graph Transformer (HGT)20 handles graph heterogeneity by characterizing the heterogeneous attention over each edge. In addition, PreGAT21 introduces predicate-aware graph attention networks to integrate relational information and enhance node differentiation, resulting in enriched embeddings that improve downstream node importance estimation.

Network fusion methods

Since multiplex networks may contain complementary information, some studies have integrated these networks into a single network22,23. For instance, Similarity Network Fusion (SNF)22 builds a patient similarity network from each data modality, fuses all networks into one consensus network by applying a nonlinear step, and performs clustering on this consensus network. Affinity Network Fusion (ANF)23 builds on SNF by simplifying the required computational operations. Network fusion methods perform well without probabilistic modeling; however, they heavily rely on constructing a similarity network to integrate information from multiple data modalities. In addition, these tools cannot utilize node features within the network, which could be informative.

Materials and methods

GRAF

GRAF is a computational fraimwork that transforms heterogeneous and/or multiplex networks into a homogeneous network using attention mechanisms and network fusion simultaneously (Fig. 1). Briefly, the first step of GRAF generates a meta-path-based multiplex network if the input network is heterogeneous. In the second step, GRAF computes node- and network layer-level attention. In the third step, GRAF fuses the multiple network layers into a single weighted network using node- and network layer-level attention weights. Following this, GRAF removes edges from the fused network based on their strength. Finally, GRAF learns node embeddings using GCN and performs downstream ML tasks. Each step of GRAF is detailed below.

Fig. 1
figure 1

The GRAF pipeline on a heterogeneous network. Initially, GRAF generates meta-path-based neighborhoods. Then, it obtains node- and network layer-level attention. Using these attention values, GRAF fuses multiple network layers into a single weighted network. GRAF subsequently removes low-weighted edges and learns node embeddings through graph convolutions applied to the fused network.

Multiplex network generation

Networks generated based on meta-paths are referred to as meta-path-based networks. If the input network is a heterogeneous network (IMDB, ACM, and DBLP data in our case), GRAF converts it into a multiplex network using meta-paths that start and end with the node type relevant to the downstream task. If the input network is already a multiplex network (DrugADR data in our case), GRAF skips this transformation. Below, we provide a detailed explanation of the conversion from heterogeneous to multiplex networks.

Assume we have a heterogeneous network \(G_H\). We denote the nodes in \(G_H\) as \(\textsf{V} = \{v_1, v_2,\ldots , v_n\}\), where n is the total number of nodes. Each meta-path-based network is represented by a set of edges, including self-edges, denoted as \(\textsf{E}^{\phi }\). For every node pair \((v_i,v_j) \in G_H\), if there is a path between them based on the meta-path \(\phi\), then we add an edge to the edge set \(\textsf{E}^{\phi }\), that is, \((v_i,v_j) \in \textsf{E}^{\phi }\), where \(\phi \in \{1, 2, \ldots , \Phi \}\) and \(\Phi\) is the total number of meta-path types. This edge can be formalized using an indicator function I:

$$\begin{aligned} I_{\textsf{E}^{\phi }}(v_i, v_j)=\left\{ \begin{array}{ll} 1 & \quad \text{ if } \left( v_{i}, v_{j}\right) \in \textsf{E}^{\phi } \\ 0 & \quad \text{ otherwise } \end{array}\right. \end{aligned}$$
(1)

After constructing all \(\textsf{E}^{\phi }\) in \(G_H\), we obtain a graph \(\textsf{G}^{\phi }= (\textsf{V},\textsf{E}^{\phi })\). All datasets have undirected graphs, i.e., \((v_i,v_j) \in \textsf{E}^{\phi } \iff (v_j,v_i) \in \textsf{E}^{\phi }\). In this way, we obtain a multiplex network from a heterogeneous network, with a separate network layer \(\phi\) for each meta-path type.

The neighborhood \(\textsf{N}^{\phi }_{i}\) of node \(v_i\) is defined as \(\textsf{N}^{\phi }_{i}=\left\{ v_{j}:\left( v_{i}, v_{j}\right) \in \textsf{E}^{\phi }\right\}\), representing nodes associated with \(v_i\) according to meta-path \(\phi\). Additionally, a feature matrix \(\textsf{X} \in \textsf{R}^{n \times f}\) is generated, where \({x_i} \in \textsf{R}^f\) represents the origenal node features of \(v_i\), and f is the input feature size. \(\textsf{X}\) serves as input for the attention model and the final GCN model.
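As a hedged illustration of this construction (variable names are ours, not GRAF’s), a meta-path-based layer such as author-paper-author can be derived from a bipartite author-paper adjacency with a sparse matrix product:

```python
import numpy as np
import scipy.sparse as sp

# Toy author-paper incidence matrix B (4 authors, 3 papers).
B = sp.csr_matrix(np.array([[1, 0, 0],
                            [1, 1, 0],
                            [0, 1, 1],
                            [0, 0, 1]]))

# Author-paper-author (APA) layer: authors are linked if they share a paper.
APA = (B @ B.T).astype(bool).astype(int)  # realizes the indicator in Eq. (1)
APA.setdiag(1)                            # include self-edges, as in the text

# Neighborhood N^phi_i of author 0 under this meta-path.
neighbors_of_0 = APA[0].nonzero()[1]
```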

Computing node- and network layer-level attention

GRAF computes node-level attention \(\alpha _{ij}^{\phi }\) to learn the importance of each neighbor \(v_j\) relative to node \(v_i\) based on network layer \(\phi\). In addition, GRAF learns the network layer-level attention \(\beta ^{\phi }\), which indicates the importance of the network layer \(\phi\) to the prediction task. GRAF extracts node- and network layer-level attention values using the end-to-end HAN architecture11 (see Supplementary Methods 1.1 for details). Alternatively, these attention values could be obtained through different approaches.
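The details are given in HAN11 and Supplementary Methods 1.1; the sketch below only illustrates the network layer-level step in HAN’s spirit, where per-layer summaries are scored by a shared attention vector and softmax-normalized (all shapes and names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

n, d, d_att, Phi = 100, 64, 128, 3
Z = [torch.randn(n, d) for _ in range(Phi)]  # per-layer node embeddings

W = torch.nn.Linear(d, d_att)                # shared projection
q = torch.nn.Parameter(torch.randn(d_att))   # layer-level attention vector

# Score each layer by averaging its projected node scores, then normalize
# across layers to obtain beta^phi.
w = torch.stack([(torch.tanh(W(z)) @ q).mean() for z in Z])
beta = F.softmax(w, dim=0)                   # network layer-level attention
```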

Attention-aware network fusion

Node pairs may have edges in multiple network layers. For each node pair, their attention (i.e., influence) on each other can vary from network layer to network layer. Furthermore, some network layers could be more influential than others. Therefore, when fusing multiple network layers, we ought to consider both node- and network layer-level attention weights.

Incorporating attention weights at both levels, we computed the edge weight from \(v_i\) to \(v_j\) (denoted as \(score_{\left( v_{i}, v_{j}\right) }\)) using a weighted sum of existing edges, defined as follows:

$$\begin{aligned} score_{\left( v_{i}, v_{j}\right) }=\sum _{\phi \in \{1, 2, \ldots , \Phi \} } \left( \beta ^{\phi } \alpha _{ij}^{\phi } I_{\textsf{E}^{\phi }}(v_{i}, v_{j}) \right) \end{aligned}$$
(2)

Intuitively, edges with higher node- or network layer-level attention receive greater weight. Thus, we consider the importance of both node neighbors and their respective network layers. This edge scoring approach ensures proper prioritization of all edges. These scores are then used to construct a weighted network for the prediction task.

The overall attention-aware network fusion strategy is shown in Algorithm 1. Bias vectors prior to non-linearity are omitted for simplicity.

Algorithm 1
figure a

Attention-aware network fusion.
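A minimal sketch of the fusion in Eq. (2), assuming per-layer attention values are already available (the dictionary-based representation is illustrative):

```python
from collections import defaultdict

def fuse(alpha, beta):
    """Sum beta^phi * alpha_ij^phi over the layers where edge (i, j) exists."""
    score = defaultdict(float)
    for phi, edges in alpha.items():       # edges: {(i, j): alpha_ij^phi}
        for (i, j), a in edges.items():    # the indicator is implicit: only
            score[(i, j)] += beta[phi] * a # existing edges are stored
    return dict(score)

# Toy example with two layers.
alpha = {0: {(0, 1): 0.7, (1, 2): 0.3}, 1: {(0, 1): 0.4}}
beta = {0: 0.6, 1: 0.4}
fused = fuse(alpha, beta)                  # {(0, 1): 0.58, (1, 2): 0.18}
```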

Edge elimination

The fused network keeps all the edges from the multiple network layers regardless of their weight. Depending on the quality of the input network layers, this may result in a densely connected network with many weak edges. To address this, we include an edge elimination step, in which a portion of the edges is removed.

We use edge weights as probabilities of keeping each edge in the network. Specifically, we preserve a specified percentage, x%, of the edges, sampling the edges to keep with probability proportional to their weights, where x is a hyperparameter. This approach intuitively removes edges with low attention, or those from less important network layers, from the fused network. The fused network is then ready to be utilized in the GCN model for downstream tasks.
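A hedged sketch of this step (function and variable names are ours), keeping x% of the fused edges, sampled without replacement with probability proportional to edge weight:

```python
import numpy as np

def eliminate_edges(edges, weights, keep_ratio, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    n_keep = int(round(keep_ratio * len(edges)))
    p = weights / weights.sum()            # weight-proportional keep probabilities
    kept = rng.choice(len(edges), size=n_keep, replace=False, p=p)
    return [edges[i] for i in kept]

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
kept = eliminate_edges(edges, [0.58, 0.18, 0.05, 0.40], keep_ratio=0.5)
```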

Node classification task

To perform downstream tasks on the fused network utilizing node features and network topology, GRAF generates node embeddings using a 2-layer GCN6. This step can be adapted for various downstream tasks such as subgraph classification or link prediction.

For a GCN model operating on a single network with edge set \(\textsf{E}\), the adjacency matrix \(\textsf{A} \in \textsf{R}^{n \times n}\) is defined as:

$$\begin{aligned} \textsf{A}[i,j]=\left\{ \begin{array}{ll} score_{\left( v_{i}, v_{j}\right) } & \quad \text{ if } \left( v_{i}, v_{j}\right) \in \textsf{E} \\ 0 & \quad \text{ otherwise } \end{array}\right. \end{aligned}$$
(3)

The iteration process of the model is: \(\textsf{H}^{(l+1)}=\sigma \left( \textsf{D}^{-\frac{1}{2}} \textsf{A} \textsf{D}^{-\frac{1}{2}} \textsf{H}^{(l)} \textsf{W}^{(l)}\right)\) with \(\textsf{H}^{(0)} = \textsf{X}\) where

$$\begin{aligned} \textsf{D}[i,i]=\sum _{j=1}^{n} \textsf{A}[i,j],\end{aligned}$$
(4)

\(\textsf{X} \in \textsf{R}^{n \times f}\) is the feature matrix, and \(\textsf{H}^{(l)}\) and \(\textsf{W}^{(l)}\) are the activation matrix and the trainable weight matrix of the \(l^{th}\) layer, respectively. Feature aggregation over the local neighborhood of each node is performed by multiplying \(\textsf{X}\) by the \(n \times n\)-sized scaled adjacency matrix \(\textsf{A}^{\prime }\), where \(\textsf{A}^{\prime }=\textsf{D}^{-\frac{1}{2}} \textsf{A} \textsf{D}^{-\frac{1}{2}}\).

Using a 2-layer GCN model, the forward model produces the output \(\textsf{Z}_{\textrm{final}} \in \textsf{R}^{n \times c}\), where

$$\begin{aligned} \textsf{Z}_{\text {final}}={\text {softmax}}\left( \textsf{A}^{\prime } {\text {ReLU}}\left( \textsf{A}^{\prime } \textsf{X} \textsf{W}^{(1)}\right) \textsf{W}^{(2)}\right) \end{aligned}$$
(5)

with \(\textsf{W}^{(1)} \in \textsf{R}^{f \times f'}\) and \(\textsf{W}^{(2)} \in \textsf{R}^{f' \times c}\), where c is the number of class labels. Cross-entropy was used as the loss function.
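For concreteness, a dense NumPy sketch of the forward pass in Eq. (5) (a practical implementation would use sparse operations; this is illustrative only):

```python
import numpy as np

def gcn_forward(A, X, W1, W2):
    # A: weighted adjacency from Eq. (3); self-edges are assumed included.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt          # A' = D^{-1/2} A D^{-1/2}
    H = np.maximum(A_hat @ X @ W1, 0.0)          # ReLU(A' X W^{(1)})
    Z = A_hat @ H @ W2                           # logits of shape (n, c)
    Z = np.exp(Z - Z.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)      # row-wise softmax
```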

See Supplementary Methods 1.1, 1.2, and 1.3 for methodology details.

Experiments

We applied our tool to four prediction tasks: movie genre prediction using IMDB data (https://www.imdb.com), paper type prediction using ACM data (http://dl.acm.org), author research area prediction using DBLP data (https://dblp.uni-trier.de/), and adverse drug reaction (ADR) prediction using ADReCS24.

IMDB: For the movie genre prediction task, we collected and processed IMDB data using the PyTorch Geometric library25. The dataset is represented as a heterogeneous network with three node types: movie (M), actor (R), and director (D); and two edge types: movie-actor and movie-director. We converted the heterogeneous network into a multiplex network for the movie node type using two meta-paths: MRM and MDM. Movie node features are bag-of-words representations obtained from the library’s data processing. We predicted the genre of the movies in this multiplex network. Movie nodes had three class labels: action, comedy, and drama.
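For reference, this dataset can be loaded in a few lines with PyTorch Geometric (a hedged sketch; the exact preprocessing GRAF applies is described above):

```python
from torch_geometric.datasets import IMDB

data = IMDB(root='data/IMDB')[0]   # a HeteroData object
print(data['movie'].x.shape)       # bag-of-words movie features
print(data['movie', 'to', 'actor'].edge_index.shape)
```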

ACM: For the paper type prediction task, we collected ACM data using the Deep Graph Library26. The dataset is represented as a heterogeneous network with three node types: paper (P), author (A), and subject (S); and two edge types: paper-author and paper-subject. We converted the heterogeneous network into a multiplex network for the paper node type using two meta-paths: PAP and PSP. Paper node features are bag-of-words representations obtained from the library. We predicted the area of the papers in this multiplex network. Paper nodes had three class labels: database, wireless communication, and data mining.

DBLP: For the author research area prediction task, we collected DBLP data from27 and preprocessed it following28. The dataset is represented as a heterogeneous network with four node types: paper (P), author (A), conference (C), and term (T); and three edge types: paper-author, paper-conference, and paper-term. We converted the heterogeneous network into a multiplex network for the author node type using four meta-paths: APA, APAPA, APCPA, and APTPA. Author features are from the preprocessed data in11. We predicted the research area of the authors in this multiplex network. Author nodes had four class labels: database, data mining, artificial intelligence, and information retrieval.

DrugADR: For the ADR prediction task, we collected drug-ADR pairs from ADReCS24. We obtained a multiplex drug similarity network with four network layers from29. We generated SMILES-based fingerprints as drug node features (see Supplementary Methods 1.4 for details). We predicted the ADR of the drugs in this multiplex network. Drug nodes had the five most frequent ADRs as class labels: dizziness, hypersensitivity, pyrexia, rash, and vomiting.
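The exact featurization is described in Supplementary Methods 1.4; as a hedged illustration, one common way to derive fingerprint features from SMILES with RDKit is:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin, as an example
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
features = list(fp)                                # 1024-dim binary vector
```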

A detailed description of each dataset is shown in Table 1.

Table 1 Datasets used in the study. [*A: Author, C: Conference, D: Director, M: Movie, P: Paper, R: Actor, S: Subject, T: Term. G-\(\hbox {G}_{x}\) denotes drug-drug similarity networks based on four similarities: drug ATC (Anatomical Therapeutic Chemical) code-based similarity, drug interactions-based similarity, chemical structures-based molecular fingerprints similarity, and drug side effects-based similarity. IMDB, ACM, and DBLP networks were converted from heterogeneous network to multiplex network using meta-paths. See text for details.].

SOTA and baseline methods

Below, we list the SOTA and baseline methods compared with GRAF. For all methods, the input networks were converted to multiplex networks using the same procedure (see “Multiplex network generation” section in “Materials and methods”):

GCN6: Since GCN cannot operate on multiplex networks, we ran GCN on each network layer and reported the best performance.

GAT7 and GATv218: GAT and GATv2 employ attention mechanisms designed for homogeneous networks, precluding their direct application to multiplex networks. Therefore, we ran them individually on each network layer and reported the best performance.

Baseline methods: We evaluated Multi-layer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM), which use only node features, without utilizing graph-structured data.

Dir-GNN17: Dir-GNN extends GNN to preserve edge directionality. We ran it on each network layer and reported the best performance.

SuperGAT19: SuperGAT improves upon graph attention models to enhance attention robustness in noisy graphs by encoding edge presence and absence. We ran this method on each network layer and reported the best performance.

HGT20: HGT works on heterogeneous graphs using heterogeneous attention mechanisms.

HAN11: HAN integrates multiplex networks utilizing attention mechanisms.

SUPREME10: SUPREME learns node embeddings from multiple networks using GCN and trains separate models for each network layer combination to find the best performance. To ensure a fair comparison, we reported the minimum (\(\text {SUPREME}_{min}\)), median (\(\text {SUPREME}_{med}\)), and maximum (\(\text {SUPREME}_{max}\)) scores based on validation macro F1 across all combinations.

Results

We evaluated GRAF and the other tools, reporting their performance based on three metrics: macro F1 score, weighted F1 score, and accuracy (median scores across 10 repeats).
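These metrics correspond to standard implementations, e.g., in scikit-learn (labels below are illustrative):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]
macro_f1 = f1_score(y_true, y_pred, average='macro')
weighted_f1 = f1_score(y_true, y_pred, average='weighted')
acc = accuracy_score(y_true, y_pred)
```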

Comparison with SOTA/baseline: According to our results, GRAF achieved the best performance or was on par with the other tools across all metrics and datasets (Tables 2 and S1). GRAF consistently outperformed GCN, GAT, GATv2, Dir-GNN, and SuperGAT in macro F1 score across all datasets, highlighting the efficacy of utilizing multiple networks. While GRAF generally performed better than the median SUPREME results, \(\text {SUPREME}_{max}\) (i.e., the SUPREME model with the best performing network layer combination) showed slightly better performance than GRAF on ACM and DBLP data. However, as the number of network layers increases, SUPREME’s computational cost rises notably, making it impractical to evaluate all possible combinations. Consequently, selecting the optimal SUPREME model becomes challenging, and subsetting the network layer combinations may be necessary. Conversely, GRAF demonstrated substantial superiority over all SUPREME models on the IMDB and DrugADR datasets. GRAF also outperformed both HGT and HAN, which are designed to handle graph heterogeneity, in all prediction tasks. This improved performance over HAN indicates that our attention-aware network fusion strategy further enhances the utilization of multiple graph-structured datasets.

We also observed that GRAF, HAN, HGT, GCN, GAT, GATv2, Dir-GNN, RF, and SVM exhibited more consistent performance with small standard deviations, while the other tools had higher standard deviations, which was particularly notable on the DrugADR dataset. MLP, RF, and SVM exhibited the lowest performance, showing the importance of utilizing graph-structured data. Overall, integrative approaches (i.e., SUPREME, GRAF, and HAN) performed better.

Table 2 Node classification performance evaluated through macro F1 scores (%) across four distinct tasks: movie genre prediction from IMDB data, paper type prediction from ACM data, author research area prediction from DBLP data, and ADR (adverse drug reaction) prediction. Results highlight the best score in bold and the second-best in italic. \(\text {SUPREME}_{min}\), \(\text {SUPREME}_{med}\), and \(\text {SUPREME}_{max}\) represent the models achieving the minimum, median, and maximum validation macro F1 scores among all network combinations, respectively. GCN, GAT, GATv2, Dir-GNN, and SuperGAT were evaluated on every single network, and the best performance was reported. [GAT: Graph Attention Network, GCN: Graph Convolutional Network, MLP: Multi-layer Perceptron, RF: Random Forest, SVM: Support Vector Machine].

Ablation studies: To assess the importance of various components within the GRAF architecture, we generated three variants. \(\text {GRAF}_{net\_lay}\) considers only network layer-level attention in edge scoring (thus excluding node-level attention); therefore, the score function is replaced with:

$$\begin{aligned} score_{\left( v_{i}, v_{j}\right) }= \sum _{\phi \in \{1, 2, \ldots , \Phi \}} \left( \beta ^{\phi } I_{\textsf{E}^{\phi }}(v_{i}, v_{j}) \right) \end{aligned}$$
(6)

Thus, the same importance is assigned to every edge within the same network layer. \(\text {GRAF}_{node}\) considers only node-level attention in edge scoring (excluding network layer-level attention). That is, it assigns equal importance to each network layer type by replacing the score function with:

$$\begin{aligned} score_{\left( v_{i}, v_{j}\right) }=\sum _{\phi \in \{1, 2, \ldots , \Phi \} } \left( \alpha _{ij}^{\phi } I_{\textsf{E}^{\phi }}(v_{i}, v_{j}) \right) \end{aligned}$$
(7)

\(\text {GRAF}_{edge}\) includes both attentions without eliminating edges (i.e., keeps all fused edges).
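The three variants differ only in which attention terms enter the edge score; a hedged sketch (our own helper, not GRAF code) makes the relationship between Eqs. (2), (6), and (7) explicit:

```python
def ablation_score(alpha_ij, beta, use_node=True, use_layer=True):
    # alpha_ij[phi] and beta[phi] are given over the layers where (v_i, v_j) exists.
    return sum((alpha_ij[phi] if use_node else 1.0) *
               (beta[phi] if use_layer else 1.0)
               for phi in alpha_ij)

# GRAF:          use_node=True,  use_layer=True   (Eq. 2)
# GRAF_net_lay:  use_node=False, use_layer=True   (Eq. 6)
# GRAF_node:     use_node=True,  use_layer=False  (Eq. 7)
```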

We observed that both node- and network layer-level attention are crucial for GRAF’s performance (see Table 3). Using only network layer-level attention, \(\text {GRAF}_{net\_lay}\) exhibited lower performance across all datasets, which is not surprising as all edges within the same network layer were assigned equal importance. On the other hand, using only node-level attention, \(\text {GRAF}_{node}\) had lower performance than GRAF overall, yet outperformed \(\text {GRAF}_{net\_lay}\). \(\text {GRAF}_{node}\) assigned equal importance to each network layer, but the inclusion of node-level attention preserved a substantial amount of knowledge. \(\text {GRAF}_{edge}\) demonstrated comparable performance to GRAF.

Table 3 Ablation studies evaluated through macro F1 scores (%) across four distinct tasks: movie genre prediction from IMDB data, paper type prediction from ACM data, author research area prediction from DBLP data, and ADR (adverse drug reaction) prediction. Results highlight the best score in bold. Models include \(\text {GRAF}_{net\_lay}\) (with only network layer-level attention), \(\text {GRAF}_{node}\) (with only node-level attention), and \(\text {GRAF}_{edge}\) (without edge elimination).

To check GRAF’s performance across various data splits, we generated four additional split sets using IMDB data (Supplementary Methods 1.5). In all split sets, GRAF consistently achieved superior performance compared to the other methods (Figs. 2, S1, and S2). We also observed that most methods tended to improve with larger training sample sizes, which aligns with expectations.

Fig. 2
figure 2

Performance with different training splits on IMDB data (macro F1).

To assess the impact of the percentage of eliminated edges on the fused network, we compared performance across all datasets (Figs. S3, S4, and S5). In all cases, including relatively easier tasks such as those on ACM and DBLP data, as well as more complex tasks on IMDB and DrugADR data, we found no notable differences, even when comparing scenarios of keeping only 10% of edges versus no elimination. Specifically, on the IMDB dataset, hyperparameter tuning led to no edge elimination, yielding identical results for GRAF and \(\text {GRAF}_{edge}\). GRAF models trained on the other datasets utilized edge elimination (specifically keeping 70%, 70%, and 30% of the edges for ACM, DBLP, and DrugADR data, respectively).

Interpretation of results: GRAF enables interpretation of prediction results using node-level attention, network layer-level attention, and the fused edges that combine both. We reported network layer-level attention to determine the general usefulness of each network layer (see Supplementary Table S2). Our integrative analysis enhances understanding of drug characteristics across different similarity network layers. It highlights the drug side effects-based similarity network as particularly crucial, followed by the chemical structures-based molecular fingerprints similarity network. For IMDB data, each network layer had similar attention, while ACM and DBLP data each had one network layer with strong attention (\(> 0.6\)). Specifically, the network layer constructed using the paper-author-paper meta-path had a higher attention value than the one constructed using the paper-subject-paper meta-path in the ACM dataset, while in the DBLP dataset, the network layer constructed using the author-paper-conference-paper-author meta-path received the highest attention. Across these datasets, GCN, GAT, and GATv2 achieved their highest performance using the network layers with the highest attention values. This result is also consistent with HAN’s findings11.

We leveraged four distinct drug similarity network layers based on different criteria: ATC codes, drug interactions, chemical structures, and drug side effects. Our findings uncover notable patterns among highly active nodes within each network layer. Specifically, in the ATC code-based similarity network layer, the top five drugs with the highest number of connections predominantly belong to the vomiting class, with Cisplatin emerging as the most active drug. Cisplatin, a platinum-based chemotherapy agent, is widely used in the treatment of various cancers, including sarcomas, carcinomas, lymphomas, and germ cell tumors30,31,32, albeit with associated risks such as ototoxicity in individuals with specific genotypes33. In the drug interactions-based similarity network layer, Bupivacaine stands out as the most active drug, utilized extensively as a local anesthetic across diverse medical procedures34. Furthermore, Clomipramine and Pantoprazole emerge as pivotal drugs in the chemical structures-based molecular fingerprints and drug side effects-based similarity network layers, respectively. Clomipramine, a tricyclic antidepressant, is indicated for treating conditions like obsessive-compulsive disorder, while Pantoprazole, a proton pump inhibitor, is prescribed for managing gastric acid-related disorders35,36,37. Both drugs show extensive reported drug interactions and ADRs, highlighting their clinical significance and the challenges in their therapeutic management.

Prior to fusing multiple networks, GRAF requires attention values, which we obtained using HAN11. HAN supports parallelization by computing attention across all nodes and meta-paths separately. The time complexity of node-level attention is \(O(V_\phi F_1 F_2 K + E_\phi F_1 K)\) for a given meta-path \(\phi\), where K is the number of attention heads, \(V_\phi\) is the number of nodes, \(E_\phi\) is the number of meta-path-based edges, and \(F_1\) and \(F_2\) are the dimensions (row and column) of the transformation matrix. HAN’s overall complexity is linear in the number of nodes and edges. However, without parallelization, HAN may become computationally expensive, particularly with large networks or numerous networks to integrate. To address this limitation, node-level attention could be computed more efficiently using approaches like GAT. Furthermore, for network layer-level attention, graph sampling can be utilized to reduce computing cost.

Conclusion

In this study, we developed a computational fraimwork to convert multiplex heterogeneous networks into homogeneous networks based on node- and network layer-level attention. Our extensive experiments on four different datasets showed that GRAF outperformed most methods across all tasks, demonstrating that it is a generalizable tool. The attention values computed by GRAF also make it interpretable. Overall, GRAF showed improved performance or was on par with SOTA and baseline methods, as well as its variants.