Article

MGKGR: Multimodal Semantic Fusion for Geographic Knowledge Graph Representation

1 School of Computer Science, China University of Geosciences, Wuhan 430074, China
2 State Key Laboratory of Biogeology and Environmental Geology, China University of Geosciences, Wuhan 430074, China
3 Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan 430078, China
4 School of Future Technology, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Algorithms 2024, 17(12), 593; https://doi.org/10.3390/a17120593
Submission received: 30 October 2024 / Revised: 11 December 2024 / Accepted: 20 December 2024 / Published: 23 December 2024
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract

Geographic knowledge graph representation learning embeds the entities and relationships of geographic knowledge graphs into a low-dimensional continuous vector space, serving as a basic method that bridges geographic knowledge graphs and geographic applications. Previous geographic knowledge graph representation methods learn the vectors of entities and relationships mainly from spatial attributes and spatial relationships, ignoring the diverse semantics of entities and thus producing poor embeddings. This study proposes a two-stage multimodal geographic knowledge graph representation (MGKGR) model that integrates multiple kinds of semantics to improve embedding learning. Specifically, in the first stage, a spatial feature fusion method for modality enhancement is proposed to combine the structural features of geographic knowledge graphs with the semantic features of the text and image modalities. In the second stage, a multi-level modality feature fusion method is proposed to integrate heterogeneous features from different modalities. By fusing the semantics of text and images, the quality of geographic knowledge graph representation is improved, providing accurate representations for downstream geographic intelligence tasks. Extensive experiments on two datasets show that the proposed MGKGR model outperforms the baselines. Moreover, the results demonstrate that integrating textual and image data into geographic knowledge graphs can effectively enhance the model's performance.

1. Introduction

A geographic knowledge graph (GeoKG) organizes entities, attributes, and spatial relationships into triples of the form <h, r, t>, where h and t represent entities and r represents the relationship between them. GeoKGs provide basic knowledge for researchers and decision-makers to understand the interactions between complex geographic environments and human activities [1]. Knowledge graph representation learning (KRL), which transforms the entities and relationships of a knowledge graph into low-dimensional embeddings, is the emerging method that makes GeoKGs usable in geographic applications. Because KRL not only represents geographic data effectively but also facilitates further exploration of potential associations between entities in a GeoKG, it has become an essential task and has driven progress in geographic applications such as spatial analysis [2,3], spatial planning [4], POI recommendation [5,6,7], geographic entity retrieval [8], and human activity trajectory mining [9,10,11].
Classical KRL methods primarily learn entity embeddings by capturing the structural features of a KG with translation-based models [12], bilinear models [13], and neural network-based models [14]. Building on these traditional KRL methods, geographic knowledge graph representation methods have integrated geospatial distance constraints into the models, enabling them to explicitly encode and represent spatial relationships within geographic data [15]. Furthermore, to capture relative positional relationships between entities, some researchers have integrated spatial attributes, such as point coordinates or entity boundaries, directly into the GeoKG embedding space, thereby achieving more precise geographic knowledge representations [16].
However, previous studies ignore the rich multimodal semantics of geographic entities, even though such semantics could enrich GeoKGs from different semantic views. For example, traditional geographic knowledge graph representation methods make decisions based only on structural information when performing geographic attribute prediction, and without multimodal semantic information (e.g., text and images) they do not always produce accurate predictions. As demonstrated in Figure 1, it is difficult to infer from the GeoKG alone whether a cafe offers parking or delivery services; the image, however, suggests a high likelihood of parking services, while the text indicates that delivery services are unavailable. In this example, the semantic information contained in the multimodal data enhances the ability to infer additional attributes of the cafe, enriching the learned representation of the geographic knowledge graph.
To address this issue, this paper introduces a multimodal geographic knowledge graph representation method (MGKGR) that integrates multiple kinds of semantics to improve the embedding learning of geographic knowledge graph representation. MGKGR employs a two-stage fusion strategy that combines the structural features of geographic knowledge graphs with the semantics from multimodal information of geographic entities. Specifically, in the first stage, a spatial feature fusion for modality enhancement method is employed to strengthen image and text features using structural information. To capture common features and address distributional discrepancies between modalities, contrastive learning [17] is applied to align their distributions. In the second stage, text and image features enhanced by structural features are further integrated to form multimodal features. To address the semantic divergence between images and text, a multi-level modality feature fusion method is employed to unify these heterogeneous data. To the best of our knowledge, this paper is the first to explore multimodal geographic knowledge representation learning. Compared to traditional methods, MGKGR introduces text and image modal data into a geographic knowledge graph representation, enhancing the feature encoding of GeoKG by integrating both structural and modality features. The contributions of this paper are as follows:
  • This paper highlights the importance of multimodal semantics in the representation of geographic knowledge graphs and proposes introducing them to enhance the model's capability to represent GeoKGs.
  • This paper proposes the MGKGR method, which mitigates the impact of heterogeneity between different modal features through a two-stage fusion process, providing methodological references for the multimodal feature fusion of GeoKG.
  • We constructed two multimodal geographic knowledge datasets and conducted extensive experiments to evaluate MGKGR. The experiments confirmed that multimodal data effectively improves the feature quality of geographic knowledge graphs.

2. Related Work

In this section, we review related work across three key domains: (1) Traditional Knowledge Graph, (2) Multimodal Knowledge Graph, and (3) Geographic Knowledge Graph. These areas are integral to the foundation and objectives of this paper.

2.1. Traditional Knowledge Graph

Research on traditional knowledge graphs primarily focuses on their representation. Knowledge graph representation aims to embed the entities and relations of a knowledge graph into a low-dimensional embedding space, enabling tasks such as link prediction and knowledge inference. Current methods of knowledge graph representation are typically classified into translation-based models, bilinear-based models, and neural network-based models. Among translation-based models, TransE [12] is a foundational method, relying on the principle that h + r ≈ t, where the embedding of the head entity, when translated by the relation, approximates the embedding of the tail entity. However, TransE struggles with modeling complex relations, which led to the development of more advanced models such as TransH [18], TransR [19], and TransD [20]. In bilinear-based models, RESCAL [21] calculates the correlation of triples by multiplying a relation matrix with the head and tail entity vectors. To reduce model complexity, DistMult [13] simplifies the relation matrix to a diagonal form, while ComplEx [22] extends DistMult by employing a complex vector space, enabling better handling of non-symmetric relations. Neural network models utilize deep learning techniques like CNNs and GNNs to learn entity and relation embeddings. ConvE [23] leverages CNNs to capture the intricate interactions between entities through entity and relation embeddings. CompGCN [24] and RGCN [25], both extensions of GCN [26], represent different methods of knowledge graph modeling. CompGCN enhances entity-relation interactions by applying compositional operations on relations, integrating both entity-level convolutional aggregation and relation-specific modeling. RGCN focuses on learning entity representations by aggregating information from neighboring nodes across different relations, making it particularly effective for handling complex multi-relational knowledge graphs.
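For intuition, the scoring functions behind the translation-based and bilinear families each fit in a line or two. The following is a minimal NumPy sketch (not taken from any of the cited implementations) in which a lower TransE distance and a higher DistMult score both indicate a more plausible triple; the toy embeddings are illustrative only.

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    # TransE: the translated head h + r should lie close to the tail t,
    # so a smaller distance ||h + r - t|| means a more plausible triple.
    return np.linalg.norm(h + r - t, ord=norm)

def distmult_score(h, r_diag, t):
    # DistMult: the relation matrix is diagonal, so the bilinear form
    # reduces to an element-wise product followed by a sum.
    return np.sum(h * r_diag * t)

# Toy 4-dimensional embeddings (illustrative values only).
h = np.array([0.1, 0.3, -0.2, 0.5])
r = np.array([0.0, 0.1, 0.2, -0.1])
t = np.array([0.1, 0.4, 0.0, 0.4])
print(transe_score(h, r, t), distmult_score(h, r, t))
```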
However, traditional knowledge graph representation methods are limited in their ability to effectively utilize features from multimodal data. Consequently, there has been an increasing amount of research dedicated to the development of multimodal knowledge graphs [27,28]. These studies integrate multimodal data into knowledge graphs, fusing features from various modalities to enhance the overall quality of the graph.

2.2. Multimodal Knowledge Graph

Current research on multimodal knowledge graphs (MMKGs) primarily focuses on their construction and completion. The construction of MMKGs involves integrating data such as text and images into traditional knowledge graphs, enriching them with more comprehensive semantic information. MMKG [29] introduced two multimodal knowledge graphs, DBpedia15K and YAGO15K, which are based on DBpedia [30] and YAGO [31], respectively. These graphs utilize web crawlers to collect multimodal data, which are subsequently linked to entities as attribute nodes.
The completion of MMKG is a key task in multimodal representation learning, effectively evaluating the representational capabilities of these graphs. IMF [32] encodes the graph structure, images, and text separately, and fuses multimodal features using Tucker decomposition [33]. CMGNN [34] directly integrates multimodal data with node features and aggregates messages from subgraphs representing different relations within the knowledge graph using GCNs. However, a common challenge in MMKGs is the semantic misalignment between images and text. To address this, MKGFormer [35] employs a multi-level feature fusion strategy and reformats textual representations, converting the knowledge graph completion task into a mask prediction task [36].
In geographic knowledge graphs, the structural features of geographic entities encode the spatial distribution of entities in the real world, and the data from different modalities for the same entity often exhibit low correlations because of the heterogeneity across modalities. The aforementioned multimodal KRL methods struggle to capture these structural features while also addressing this cross-modal heterogeneity. In contrast, the MGKGR model proposed in this study accounts for both the spatial characteristics of GeoKGs and the heterogeneity among multimodal features.

2.3. Geographic Knowledge Graph

Current research on GeoKGs focuses primarily on their construction. With the growth of the Internet, platforms such as OpenStreetMap (OSM) and Wikidata contain large amounts of data that can be used to extract and construct geographic knowledge. Building on these sources, several studies have constructed GeoKGs. CrowdGeoKG [37] extracts different types of entities from OSM and enriches them with geographic knowledge from Wikidata. WORLDKG [38] refines the class hierarchy of OSM elements by analyzing large amounts of heterogeneous OSM data tags, combined with manual filtering, to create a top-down hierarchical classification of geographic entities covering a wide range of geographic categories. It also links geographic entities to specific classes in the Wikidata and DBpedia [30] ontologies. UUKG [39] models the hierarchical relationships between geographic entities, constructing a multi-level urban knowledge graph that provides strong support for spatiotemporal urban forecasting.
Current research on geographic knowledge graph representation primarily focuses on incorporating spatial location information from geographic knowledge into traditional knowledge graph representation methods. In earlier studies [15], researchers used spatial distance as a constraint, employing translation-based models [18,19,20] to encode this distance constraint along with entities and relations in a low-dimensional embedding space. However, constraints based on spatial distance only capture the separation between entities and do not reflect their relative spatial positions. To address this limitation, researchers have begun to encode spatial information, such as point coordinates and entity boundaries, directly into the GeoKG embedding space [16].
However, current geographic knowledge graph representation methods focus primarily on modeling the structure of GeoKG, while failing to take into account the multimodal data. Therefore, the MGKGR method proposed in this paper not only models the structural features of geographic knowledge graphs but also captures the features from multimodal data associated with geographic entities. By employing a two-stage multimodal feature fusion method, it integrates heterogeneous features from different modalities, extracting both common and complementary characteristics, thereby enhancing the model’s ability to represent geographic knowledge graphs.

3. Methods

In this section, the multimodal semantic fusion for geographic knowledge graph representation (MGKGR) method is presented. As illustrated in Figure 2, MGKGR consists of two components: (A) Multimodal GeoKG Encoding and (B) Two-Stage Multimodal Feature Fusion. The Multimodal GeoKG Encoding module leverages a knowledge graph embedding model, a text encoder, and an image encoder to separately encode the structural, textual, and image data within the multimodal GeoKG. The Two-Stage Multimodal Feature Fusion module first employs the Spatial Feature Fusion for Modality Enhancement method to enhance the representation of text and visual features using the structural features of the multimodal GeoKG. Subsequently, the Multi-level Modality Feature Fusion method is applied to integrate the enhanced text and visual features, producing the multimodal features of the multimodal GeoKG. The rest of this section describes MGKGR in three parts: Section 3.1 Multimodal GeoKG Encoding, Section 3.2 Two-Stage Multimodal Feature Fusion, and Section 3.3 Model Optimization.

3.1. Multimodal GeoKG Encoding

The multimodal geographic knowledge graph incorporates diverse geographic data, including entity IDs, coordinates, attributes, and multimodal data such as text and images. The multimodal geographic knowledge graph is represented as G_geo = {V, R, E, M}, where V = {V_geo, V_att} denotes the set of geographic and attribute entities, R = {R_adj, R_att} represents spatial and attribute relations, E = {(s, r, t) | s, t ∈ V, r ∈ R} is the set of triples, and M = {M_text, M_image} includes the multimodal data.
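To make the notation concrete, the following is a minimal Python sketch of how such a multimodal GeoKG could be held in memory; the container fields mirror G_geo = {V, R, E, M}, while the example cafe entity, attribute names, and file paths are illustrative assumptions rather than part of the released datasets.

```python
from dataclasses import dataclass, field

@dataclass
class MultimodalGeoKG:
    """G_geo = {V, R, E, M}: entities, relations, triples, and per-entity modal data."""
    geo_entities: set = field(default_factory=set)    # V_geo: places with coordinates
    attr_entities: set = field(default_factory=set)   # V_att: attribute values
    relations: set = field(default_factory=set)       # R = R_adj plus R_att
    triples: list = field(default_factory=list)       # E: (head, relation, tail)
    texts: dict = field(default_factory=dict)         # M_text: entity id -> description
    images: dict = field(default_factory=dict)        # M_image: entity id -> image path

# Hypothetical example entity and attribute triple.
kg = MultimodalGeoKG()
kg.geo_entities.add("cafe_42")
kg.attr_entities.add("True")
kg.relations.add("has_parking")
kg.triples.append(("cafe_42", "has_parking", "True"))
kg.texts["cafe_42"] = "A small cafe with outdoor seating ..."
kg.images["cafe_42"] = "images/cafe_42.jpg"
```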
1. Structural Feature Encoding
The structural features of a GeoKG provide spatial topological information between geographic entities, as well as relational features between geographic entities and attribute entities. For the geographic knowledge graph G_geo, MGKGR utilizes a KRL model to encode the GeoKG and capture its structural features. Specifically, we obtain an initial embedding for each entity, H^0 = {h_1^0, h_2^0, h_3^0, ..., h_n^0}. These entity embeddings are then encoded by the KRL model. To effectively capture the spatial features of a KG with multiple relations, we adopt RGCN [25] as the KRL model in MGKGR, which captures the structural features H_g of the geographic knowledge graph by aggregating the neighboring node features h_i of each entity. The layer-wise update is presented in Equation (1).
h_i^{l+1} = \sigma\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{|N_i^r|} W_r^l h_j^l + W_0^l h_i^l \right),
where N_i^r is the set of first-order neighbors of entity i under relation r, and the normalization 1/|N_i^r| balances the influence of varying numbers of neighbors. h_i^l is the l-th layer representation of node i, and the final layer output is the structural feature of the geographic knowledge graph, denoted H_g.
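As a rough PyTorch sketch of the update in Equation (1) (a simplified re-implementation, not the authors' code), one layer can average per-relation neighbor messages and add a self-loop transform; the adjacency-list input format and the toy sizes below are assumptions.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """One propagation step of Equation (1): per-relation mean aggregation plus a self-loop term."""
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.w_rel = nn.Parameter(torch.randn(num_relations, in_dim, out_dim) * 0.01)  # W_r^l
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)                           # W_0^l

    def forward(self, h, neighbors):
        # h: (num_nodes, in_dim); neighbors[r] maps node i -> list of its neighbors under relation r.
        rows = []
        for i in range(h.size(0)):
            msg = self.w_self(h[i])                                   # W_0^l h_i^l
            for r, adj in enumerate(neighbors):
                nbrs = adj.get(i, [])
                if nbrs:                                              # (1/|N_i^r|) sum_j W_r^l h_j^l
                    msg = msg + (h[nbrs] @ self.w_rel[r]).mean(dim=0)
            rows.append(torch.relu(msg))                              # sigma(...)
        return torch.stack(rows)

# Toy usage: 4 nodes, 2 relation types, 8-dimensional features (assumed sizes).
h0 = torch.randn(4, 8)
neighbors = [{0: [1, 2], 1: [0]}, {2: [3]}]   # adjacency lists per relation
h1 = RGCNLayer(8, 8, num_relations=2)(h0, neighbors)
```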
2. Text Feature Encoding
To capture the rich semantic information contained in the text, MGKGR employs a text encoder to encode the textual data. The text encoder is composed of the first L t layers of BERT [36]. To more effectively leverage the capabilities of the Transformer encoder for better text feature representation, we adopted the concept from MKGFormer [35] by converting the link prediction task into a Masked Language Modeling (MLM) task. The predicted tail entity in the link prediction is treated as the token masked with [MASK] in the MLM task. Specifically, we organized the triples in the form presented in Equation (2).
\mathrm{Text}(e_h, r, ?) = [\mathrm{CLS}] \; e_h \; d_{e_h} \; [\mathrm{SEP}] \; r \; [\mathrm{SEP}] \; [\mathrm{MASK}] \; [\mathrm{SEP}],
where e_h ∈ V, r ∈ R, and d_{e_h} ∈ M_text is the textual modality data corresponding to the head entity e_h. [CLS] is a special token used to aggregate features from the entire input, while [SEP] separates the different parts of the input. The tail entity to be predicted is replaced by [MASK], and the "?" in Text(e_h, r, ?) denotes this tail entity. This format not only makes more effective use of the textual data of entities but also incorporates relations in textual form, allowing rich semantic relation expressions that enhance the model's ability to capture and represent relationships. The text Text(e_h, r, ?) is encoded by the text encoder via Equation (3) to obtain the textual features of the triple, denoted H_t.
H_t = \mathrm{T\text{-}Encoder}\left( \mathrm{Text}(e_h, r, ?) \right),
where h_t ∈ H_t represents the textual features of the triple.
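As an illustration of Equation (2), the serialization of a triple into a masked input could look like the following sketch; the helper name and example strings are ours, and with an off-the-shelf BERT tokenizer the [CLS]/[SEP] tokens would normally be added automatically rather than written by hand.

```python
def triple_to_mlm_text(head_name: str, head_description: str, relation_name: str) -> str:
    """Serialize (e_h, r, ?) as '[CLS] e_h d_{e_h} [SEP] r [SEP] [MASK] [SEP]' (Equation (2))."""
    return f"[CLS] {head_name} {head_description} [SEP] {relation_name} [SEP] [MASK] [SEP]"

# The model then predicts the tail entity at the [MASK] position.
print(triple_to_mlm_text("Blue Bottle Cafe",
                         "a cozy coffee shop near the river",
                         "has_delivery"))
```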
3. Vision Feature Encoding
To capture the rich semantic information contained in visual images, MGKGR employs a vision encoder to encode the visual data. As shown in Equation (4), the visual image Pic_{e_h} of a geographic entity is embedded by the vision encoder to obtain the image features H_v. The vision encoder is composed of the first L_v layers of ViT [40].
H_v = \mathrm{V\text{-}Encoder}\left( \mathrm{Pic}_{e_h} \right)
The features from the textual, visual, and structural modalities will be fused within the Two-Stage Multimodal Feature Fusion module.
It is worth noting that the initialization parameters of the text encoder and vision encoder used in this study were obtained through pre-training. The purpose of the pre-training is to align the features of geographic entities with their corresponding textual and visual features. The organization of the textual description during pre-training is shown in Equation (5). Through pre-training, the entity e_h is aligned with its textual data d_{e_h}. Additionally, in MGKGR, the text encoder is used to encode triples involving relations of type R_att, which carry rich semantic information, enabling the model to better capture and represent these relations. Meanwhile, the structural encoder captures the overall structural features of G_geo.
\mathrm{Text}_{e_h} = [\mathrm{CLS}] \; d_{e_h} \; \text{is the description of} \; [\mathrm{MASK}] \; [\mathrm{SEP}]

3.2. Two-Stage Multimodal Feature Fusion

This paper combines the features from text, image, and structure through a two-stage multimodal feature fusion process that comprises Spatial Feature Fusion for Modality Enhancement and a Multi-level Modality Feature Fusion Encoder.
Spatial Feature Fusion for Modality Enhancement (SFF). The SFF module enhances the text embedding H_t and the visual embedding H_v with the structural embedding H_g. Although the text and visual embeddings follow different distributions from the structural embedding of the knowledge graph, their underlying correlations enable complementary feature integration. To leverage this correlation, an adaptive feature fusion method is utilized. For the text features, a learnable parameter matrix W_1^g is used to adaptively extract relevant information from the structural features H_g, which is then fused with the text features H_t. For the visual features, another matrix W_2^g is used to extract relevant information from the structural features and fuse it with the visual features H_v, as shown in Equations (6) and (7). This method takes into account the underlying correlations between modalities, adaptively extracting structural features and fusing them with the text and visual features.
H_{gt} = H_t + W_1^g \times H_g,
H_{gv} = H_v + W_2^g \times H_g,
where H_{gt} and H_{gv} represent the text and image features after fusing the structural features, respectively.
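A possible reading of Equations (6) and (7) as a small PyTorch module is sketched below; implementing W_1^g and W_2^g as linear layers and assuming all three feature spaces share one dimension are our simplifications, not details fixed by the paper.

```python
import torch
import torch.nn as nn

class SpatialFeatureFusion(nn.Module):
    """Eq. (6)-(7): inject structural features into the text and vision branches."""
    def __init__(self, dim):
        super().__init__()
        self.w1_g = nn.Linear(dim, dim, bias=False)   # W_1^g (text branch)
        self.w2_g = nn.Linear(dim, dim, bias=False)   # W_2^g (vision branch)

    def forward(self, h_t, h_v, h_g):
        h_gt = h_t + self.w1_g(h_g)   # H_gt = H_t + W_1^g x H_g
        h_gv = h_v + self.w2_g(h_g)   # H_gv = H_v + W_2^g x H_g
        return h_gt, h_gv

# Usage with batched 768-dimensional features (the shared dimension is an assumption).
sff = SpatialFeatureFusion(768)
h_gt, h_gv = sff(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768))
```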
To further mitigate the impact of distributional heterogeneity among different modal features during fusion, while also capturing the underlying common features across modalities, this paper utilizes contrastive learning to align the distributions of the multimodal semantic embeddings with the structural embeddings of geographic entities. The contrastive learning adopts the InfoNCE loss function [41], as given in Equations (8) and (9). Specifically, by constructing positive and negative sample pairs, the method pulls the structural features h_i^g of entity i closer to its text features h_i^t and visual features h_i^v (positive pairs) while pushing away the text and visual features of other entities (negative pairs). This yields a more uniform distribution of samples, mitigates the impact of heterogeneity, and further enhances the model's ability to extract the underlying common features.
L_c^t = - \sum_{i \in N} \log \frac{\exp\left( \cos(h_i^g, h_i^t) / t \right)}{\exp\left( \cos(h_i^g, h_i^t) / t \right) + \sum_{j \in N'} \exp\left( \cos(h_i^g, h_j^t) / t \right)}
L_c^v = - \sum_{i \in N} \log \frac{\exp\left( \cos(h_i^g, h_i^v) / t \right)}{\exp\left( \cos(h_i^g, h_i^v) / t \right) + \sum_{j \in N'} \exp\left( \cos(h_i^g, h_j^v) / t \right)}
In the equations, h_i^t ∈ H_t and h_i^v ∈ H_v; N represents the total number of samples, N' denotes the number of negative samples, and t is a temperature hyperparameter. The function cos(·, ·) computes the cosine similarity between two embeddings.
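Equations (8) and (9) follow the standard InfoNCE form, so a compact sketch can reuse the cross-entropy-over-similarities trick; treating the other samples in the batch as the negative set N' is an assumption about the sampling scheme.

```python
import torch
import torch.nn.functional as F

def info_nce(h_struct, h_modal, temperature=0.07):
    """Eq. (8)/(9): pull each entity's structural feature toward its own modal feature
    (positive pair) and away from the modal features of other entities (negatives)."""
    z1 = F.normalize(h_struct, dim=-1)           # h_i^g
    z2 = F.normalize(h_modal, dim=-1)            # h_i^t or h_i^v
    sim = z1 @ z2.t() / temperature              # pairwise cosine similarities / t
    targets = torch.arange(z1.size(0))           # the positive pair lies on the diagonal
    return F.cross_entropy(sim, targets)

h_g, h_t = torch.randn(16, 768), torch.randn(16, 768)
loss_ct = info_nce(h_g, h_t)                     # L_c^t; L_c^v uses the image features instead
```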
The Multi-level Modality Feature Fusion Encoder (M-Encoder) integrates the enhanced text embedding H_{gt} and visual embedding H_{gv}. The weak correlation between text and visual data in multimodal geographic knowledge graphs introduces modality heterogeneity, which can undermine the extraction of meaningful information during feature fusion. To address this issue, drawing on insights from MKGFormer [35], the M-Encoder performs two-level feature fusion between the text and visual modalities through two rounds of feature correlation selection. Specifically, the M-Encoder consists of the final L_m layers of BERT [36] and ViT [40], which combine with the preceding Text Encoder and Vision Encoder to form the complete BERT and ViT models.
The first level of fusion reduces the impact of modal heterogeneity by computing multi-head attention at each layer. Specifically, as shown in Equations (10) and (11):
\mathrm{head}_i^t = \mathrm{Attn}\left( h_{gt} W_q^{gt},\ h_{gt} W_k^{gt},\ h_{gt} W_v^{gt} \right)
\mathrm{head}_i^v = \mathrm{Attn}\left( h_{gv} W_q^{gv},\ [h_{gv} W_k^{gv},\ h_{gt} W_k^{gt}],\ [h_{gv} W_v^{gv},\ h_{gt} W_v^{gt}] \right)
In these formulas, Attn computes the attention scores of the i-th attention head for the text and image features. When computing the visual attention, head_i^v also incorporates the text features, thereby enhancing the model's ability to capture common features between images and text. After concatenating the multiple attention heads, the result is passed through a residual connection and serves as the input of the second-level fusion, as shown in Equations (12) and (13).
\bar{h}_l^{gt} = \bar{h}_{l-1}^{gt} + \mathrm{LN}\left( \left[ \mathrm{head}_1^t; \mathrm{head}_2^t; \ldots; \mathrm{head}_h^t \right] \right),
\bar{h}_l^{gv} = \bar{h}_{l-1}^{gv} + \mathrm{LN}\left( \left[ \mathrm{head}_1^v; \mathrm{head}_2^v; \ldots; \mathrm{head}_h^v \right] \right),
where l = 1, 2, \ldots, L_m, and LN(·) denotes a linear layer.
The second-level fusion computes the similarity between the text and image features to select the visual features that are highly correlated with the text and integrates these selected visual features into the text features. Specifically, as shown in Equation (14), the aggregated visual features are weighted by the similarity between the visual and text features.
\mathrm{Agg}_v = \cos\left( \bar{h}_l^{gt}, \bar{h}_l^{gv} \right) \cdot \bar{h}_l^{gv}
To obtain the output of the second level fusion, Agg v is fused with the text feature by a feedforward neural network (FFN), as shown in Equation (15).
\mathrm{FFN}\left( \bar{h}^{gt} \right) = \left( \mathrm{ReLU}\left( \bar{h}^{gt} W_1 + b_1 \right) + \mathrm{Agg}_v W_3 \right) W_2 + b_2
Finally, the above process is iterated L_m times, and the text features fused with H_{gv} are used as the multimodal fused features H_m.
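A sketch of the second-level fusion in Equations (14) and (15) is given below: the visual features are re-weighted by their cosine similarity to the text features and injected into the feed-forward branch. The hidden size and the exact placement of W_3 reflect our reconstruction of Equation (15) and should be read as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationFFN(nn.Module):
    """Second-level fusion: Agg_v = cos(h_gt, h_gv) * h_gv (Eq. (14)), then the FFN of
    Eq. (15) mixes the aggregated visual feature into the text branch."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim)              # W_1, b_1
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # W_3 projects Agg_v
        self.w2 = nn.Linear(hidden_dim, dim)              # W_2, b_2

    def forward(self, h_gt, h_gv):
        corr = F.cosine_similarity(h_gt, h_gv, dim=-1).unsqueeze(-1)  # text-image correlation
        agg_v = corr * h_gv                                           # Eq. (14)
        return self.w2(torch.relu(self.w1(h_gt)) + self.w3(agg_v))    # Eq. (15)

fused = CorrelationFFN(dim=768, hidden_dim=3072)(torch.randn(8, 768), torch.randn(8, 768))
```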

3.3. Model Optimization

The overall loss function, shown in Equation (16), consists of the contrastive learning losses and a task-specific loss.
L = L_{BCE} + L_c^t + L_c^v
For the task-specific loss, MGKGR reformulates the link prediction task as a Masked Language Modeling (MLM) task. Consequently, the binary cross-entropy loss L_{BCE} is used as the task-specific loss function.
L_{BCE} = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_{y_i}(H_m) + \left( 1 - y_i \right) \log\left( 1 - p_{y_i}(H_m) \right) \right],
where y_i represents the true label of the [MASK] entity, and p_{y_i}(H_m) denotes the probability that the sample belongs to this label, computed from the probability distribution of H_m at the [MASK] position. In the MGKGR model, this task loss is trained jointly with the two contrastive learning losses.
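One way the task loss could be computed from the fused feature at the [MASK] position is sketched below; scoring every candidate tail entity against a shared entity embedding table is an assumption about the classifier head, not a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def masked_bce_loss(h_mask, entity_embeddings, target_ids):
    """Binary cross-entropy over candidate tail entities at the [MASK] position.
    h_mask: (batch, dim) fused features H_m at [MASK]; entity_embeddings: (num_entities, dim)."""
    logits = h_mask @ entity_embeddings.t()                        # score every candidate entity
    targets = F.one_hot(target_ids, entity_embeddings.size(0)).float()
    return F.binary_cross_entropy_with_logits(logits, targets)

h_m, ent = torch.randn(4, 768), torch.randn(1000, 768)
loss = masked_bce_loss(h_m, ent, torch.tensor([3, 17, 42, 7]))     # combined with L_c^t and L_c^v during training
```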
MGKGR encodes the structure, text, and image features of the multimodal GeoKG separately, followed by a two-stage feature fusion process. In the first stage, the Adaptive Feature Fusion method is used to selectively integrate structural features into the text and visual features. During the fusion process, contrastive learning is employed to mitigate the noise introduced by distributional differences among the features. In the second stage, features with a higher text-image correlation are fused to alleviate the impact of modal heterogeneity. Finally, MGKGR utilizes the fused multimodal features as the feature representation of the multimodal geographic knowledge graph and performs a link prediction task. The overall training procedure of MGKGR is provided in Algorithm 1.
Algorithm 1 The training procedure of MGKGR
Input: Multimodal geographic knowledge graph G_geo = {V, R, E, M}
Output: Model parameters Θ, multimodal geographic knowledge graph embedding H_m
 1: Initialize model parameters Θ
 2: Construct the text of each triple in the format Text(e_h, r, ?)
 3: while not converged do
 4:     Encode G_geo, Text(e_h, r, ?), and Pic_{e_h} as H_g, H_t, and H_v
 5:     Fuse H_g with H_t and H_v to obtain H_{gt}, H_{gv}
 6:     Compute the contrastive learning losses L_c^t and L_c^v with Equations (8) and (9)
 7:     Compute multi-head attention over H_{gt}, H_{gv} with Equations (10) and (11)
 8:     Obtain the fused features \bar{h}^{gt}, \bar{h}^{gv} with Equations (12) and (13)
 9:     Aggregate the vision features: Agg_v = cos(\bar{h}_l^{gt}, \bar{h}_l^{gv}) · \bar{h}_l^{gv}
10:     Compute the multimodal fusion features H_m with Equation (15)
11:     Compute the task loss L_{BCE}
12:     Update Θ by minimizing L = L_{BCE} + L_c^t + L_c^v
13: end while
14: return Θ, H_m

4. Experiment

This section first introduces the two multimodal GeoKG datasets we constructed. It then outlines the experimental settings, including parameter configurations, the evaluation task, and metrics. Finally, the experimental results and analysis on the geographic attribute prediction task are presented to evaluate the model's performance.

4.1. Dataset

Currently, there are no publicly available datasets for multimodal GeoKG. Therefore, we constructed two multimodal GeoKG datasets based on publicly accessible data from the Internet. We used the attributes, latitude and longitude, textual data, and image data of geographic entities, such as stores, restaurants, and hotels from the Yelp (https://www.yelp.com/dataset, accessed on 30 October 2024) dataset as the foundational data for constructing the geographic knowledge graph. The dataset covers 27 states in the United States, and we selected data from Pennsylvania and Florida to construct two multimodal geographic knowledge graph datasets: PA-30k and FL-25k.
In the two datasets, triples are categorized into two types: adjacency triples and attribute triples. For adjacency triples, entities within a distance of 50 m are considered adjacent. Additionally, we incorporate the entity category information into the adjacency relations as rich-semantic adjacency relations, enriching the types and information within these relations. In attribute triples, the head and tail entities represent the geographic entity and its attribute value, respectively, while the relation corresponds to the attribute type. Table 1 provides the statistical details of the PA and FL geographic knowledge graph datasets, including the number of entities, relations, and triples. The entity count is divided into attribute and geographic entities, while relations and triples are categorized as attribute and adjacency types, respectively.
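As an illustration of the adjacency-triple construction described above, the following sketch applies the 50 m rule with a haversine distance and emits rich-semantic relation names of the form "Category-adj-Category"; the record fields and the brute-force pairwise loop are assumptions about the preprocessing, not the released scripts.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in metres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def adjacency_triples(entities, max_dist_m=50.0):
    """entities: list of dicts with 'id', 'lat', 'lon', 'category'.
    Emits rich-semantic adjacency triples for pairs closer than max_dist_m."""
    triples = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_dist_m:
                rel = f'{a["category"]}-adj-{b["category"]}'   # rich-semantic adjacency relation
                triples.append((a["id"], rel, b["id"]))
    return triples
```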

4.2. Experimental Setup

In this paper, geographic attribute prediction is used as the experimental evaluation task. The purpose of geographic attribute prediction is to predict the missing entity in the attribute triple. For example, given (h, r, ?), the task is to predict the tail entity t. In the test set, the results are computed by ranking the scores predicted by the score function. In the geographic attribute prediction task, adjacency triples will be used as supplementary information to assist model training.
In this work, Hit@k and MRR are used as evaluation metrics. Hit@k measures the proportion of correct predictions that appear in the top k results, while MRR (Mean Reciprocal Rank) calculates the average of the reciprocal ranks of the correct predictions.
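Given the rank of the correct entity for each test triple (1-indexed), both metrics reduce to a few lines; the ranks below are illustrative.

```python
def hit_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean of the reciprocal rank of the correct entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 2, 8, 1, 15]           # illustrative ranks of correct tail entities
print(hit_at_k(ranks, 1), hit_at_k(ranks, 10), mrr(ranks))
```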
To evaluate the performance of the MGKGR model, we compare it with two categories of models: traditional KRL models and multimodal knowledge graph models in the general domain. For traditional knowledge graph representation models, we select TransE [12], DistMult [13], ComplEx [22], ConvE [23], CompGCN [24], and RGCN [25] as baseline models. For multimodal knowledge graph models, we choose the MKGFormer [35] as the baseline.
  • TransE is a translation-based embedding model that represents relationships between entities by interpreting them as translations in the embedding space.
  • DistMult is a bilinear model that represents relationships as diagonal matrices, capturing interactions between entities through element-wise multiplication.
  • ComplEx extends DistMult by using complex-valued embeddings, enabling it to capture asymmetric relations between entities.
  • ConvE employs convolutional neural networks to learn interactions between entities and relations, enabling more expressive feature learning.
  • CompGCN integrates graph convolutional networks with various composition operations to capture complex interactions between entities and relations.
  • RGCN extends traditional GCNs to handle multi-relational data by incorporating relation-specific transformations.
  • MKGFormer is a transformer-based model for multimodal knowledge graphs that integrates textual, visual, and structural information, enabling comprehensive feature fusion and enhanced representation learning.
Experiments on MGKGR and other baseline models were conducted on a server equipped with an Intel(R) Core(TM) i7-10700 CPU and a GeForce RTX 3090 GPU (24GB memory). The experimental environment was based on Ubuntu 18.04 and CUDA 11.1.

4.3. Results and Analysis

The experimental results are shown in Table 2. It can be observed that MGKGR achieves the best performance across all metrics. This demonstrates that multimodal semantic information effectively enriches the feature representation of geographic knowledge graphs, thereby enhancing the model’s performance in geographic attribute prediction.
Among the traditional KRL models, TransE, ComplEx, and RGCN achieve relatively strong performance on both datasets. This can be attributed to the fact that TransE aims to minimize the distance between entities, a direct optimization objective that helps the model more accurately capture relation features within the knowledge graph. ComplEx is highly effective in modeling asymmetric relations and demonstrates strong performance in capturing rich-semantic adjacency relations. The design of RGCN is inherently suited to handling complex relationships, which enables it to comprehensively understand the information in the knowledge graph. Therefore, these three models are able to effectively leverage the enriched adjacency information in the training set, resulting in better feature representations. In contrast, models such as ConvE and CompGCN struggle to capture the connections between adjacency relations and attribute relations. The abundance of adjacency information even acts as noise for attribute prediction, leading to lower performance.
The multimodal model MKGFormer [35] achieved the second highest results in multiple metrics, which validates the importance of multimodal data for geographic knowledge graph representation. However, since it does not model the structural information of the knowledge graph, it still has certain limitations. In contrast, our proposed model MGKGR captures the structural features of the knowledge graph using the RGCN, and by integrating these with multimodal semantic features, MGKGR obtains a more comprehensive feature representation. The experimental results show that our approach achieves optimal performance across all metrics. This demonstrates that the two-stage multimodal feature fusion method proposed in this paper can effectively mitigate the impact of heterogeneity across different modalities. Additionally, it enhances the model’s ability to represent the knowledge graph by integrating features from multiple modalities.

5. Discussion

5.1. Ablation Study

This section discusses the impact of each module on the overall model performance. The MGKGR captures structural features from the GeoKG and fuses them with multimodal semantic features through a two-stage multimodal feature fusion process. It further constrains the distribution between semantic and structural features through contrastive learning, thereby achieving favorable results. Therefore, the ablation experiments are conducted under the following settings:
  • w/o Multimodal: MGKGR without the use of multimodal data. The input data are G geo defined in Section 3.1 excluding the multimodal dataset M.
  • w/o Structure: MGKGR without the fusion of structural features captured by KRL model.
  • w/o CL: MGKGR without the contrastive learning.
  • w/o SFF: MGKGR without the spatial feature fusion for modality enhancement.
  • w/o CL and SFF: MGKGR without both the contrastive learning and the spatial feature fusion for modality enhancement.
The experiments were conducted on the two datasets we constructed, and the results are presented in Table 3. Considering all metrics across the two datasets, MGKGR achieved excellent overall performance. Notably, compared to the outcomes without using multimodal data (w/o Multimodal), it is evident that multimodal semantic features significantly enhance the feature representation of geographic knowledge graphs, thereby improving the model’s performance. This indicates that, in scenarios similar to the example shown in Figure 1, combining multimodal semantic information from text and images can effectively improve the model’s ability to predict geographic entity attributes. In the PA-30K dataset, the ablation of the SFF and contrastive learning mechanisms leads to improvements in certain metrics, which can be attributed to the relatively dense distribution of geographic entities and the substantial proportion of adjacency triples present. The fusion of adjacency and attribute features within the geographic knowledge graph may introduce model bias during training, potentially interfering with the precise prediction of attribute triples. However, the abundance of adjacency triples enriches the information contained within the geographic knowledge graph’s features, resulting in enhanced overall performance and comparatively higher Hit@3 and Hit@10 metrics. Conversely, the FL-25K dataset has a sparser distribution of geographic entities, resulting in fewer adjacency triples. This prevents the model from developing a bias due to the large number of adjacency triples during the training process. Consequently, the model demonstrates improved performance in average rank predictions, yielding a higher MRR score.

5.2. Performance Comparison Across Different Relation Scenarios

In this paper, the triples in multimodal GeoKG are divided into attribute relation triples and adjacency relation triples. In the previous experiment, we focused on inferring attribute relations and did not evaluate the model in the adjacency relation scenario. This is based on the assumption that predicting geographic attributes holds greater significance in a geographic knowledge graph. To verify that our method also performs well in adjacency scenarios, we conducted experiments under different scenarios in this section and provided an analysis and comparison. Specifically, one-fifth of adjacency triples from the training set were included in the test set. To prevent information leakage, these adjacency triples were removed from the training set. During testing, adjacency triples and attribute triples were evaluated together. The results were separately recorded for adjacency scenarios, attribute scenarios, and mixed scenarios. The experiments were conducted using MGKGR and MKGFormer across two datasets. The results are shown in Figure 3.
The experimental results indicate that both models achieve significantly higher prediction scores in adjacency scenarios than in attribute scenarios. This supports the assumption that predicting geographic attributes is the harder task and therefore holds greater significance within a geographic knowledge graph. Furthermore, MGKGR consistently outperforms MKGFormer across various metrics, particularly in predicting adjacency scenarios, thereby validating the effectiveness of our method in integrating multimodal semantic features with the structural features of geographic knowledge graphs. Comparing the outcomes across the two datasets, the prediction results in adjacency scenarios are higher on the PA-30K dataset than on the FL-25K dataset, while results in attribute scenarios are comparatively lower on PA-30K. This discrepancy is ascribed to the higher proportion of adjacency triples in the PA-30K dataset, where excessive adjacency information can lead to biased learning within the model. Furthermore, compared to the results under attribute scenarios in Section 4, the outcomes in this section are relatively lower. This arises from the splitting of the dataset, where certain adjacency triples were moved from the training set to the test set, reducing the completeness of adjacency information in the training data and consequently weakening the model's predictive performance.

5.3. The Impact of Diversity in Relation Semantics

This section discusses the impact of rich-semantic and sparse-semantic adjacency relations on the experimental results. A rich-semantic adjacency relation is represented by concatenating the categories of the head and tail entities with the adjacency relation, forming the pattern "EntityCategory-adj-EntityCategory". A sparse-semantic adjacency relation is simply represented by "adj". In the experiments above, rich-semantic adjacency relations are used for training; they provide rich information and enable the model to capture the latent relationships between entity categories and spatial topology. However, while rich-semantic adjacency relations enrich the information available to the model, they also increase the number of relations, raising the complexity of model training. Sparse-semantic adjacency triples, on the other hand, contain only adjacency information, providing solely the spatial topological relations between geographic entities. We selected several models for this comparative experiment, including the translation-based TransE, the bilinear model DistMult, the neural network-based RGCN, and the multimodal models MKGFormer and MGKGR. The experimental results are shown in Table 4.
It can be observed that for the DistMult model, sparse-semantic adjacency leads to better prediction results. This is because DistMult excels at learning symmetric relationships, and replacing rich-semantic adjacency relations with the plain "adj" relation converts a large number of asymmetric relations into symmetric ones, resulting in improved performance. For the other models, the results with rich-semantic adjacency relations, which contain more detailed information, are superior. This indicates that combining rich semantic information with adjacency relationships can effectively enhance the model's performance in representing and understanding knowledge graphs.

6. Conclusions

This paper proposes a multimodal geographic knowledge graph representation learning method called MGKGR. The method employs a two-stage fusion strategy to integrate the structural features of the geographic knowledge graph with the multimodal semantic information of geographic entities, effectively enhancing the representation of geographic knowledge graphs. Experimental results demonstrate that incorporating multimodal semantic information significantly improves the quality of geographic knowledge graph representation. Furthermore, this paper verifies that adding semantic information to adjacency relations can effectively enhance the model's capability. However, this study still has some limitations, such as not accounting for missing modality data for some entities and not determining the optimal ratio between attribute triples and adjacency triples. These issues will be explored in future work.

Author Contributions

Conceptualization, J.Z. and R.C.; Data curation, J.Z. and T.L.; Formal analysis, J.Z., R.C., S.L. and H.Y.; Investigation, J.Z.; Methodology, J.Z. and R.C.; Project administration, J.Z. and R.C.; Software, J.Z.; Supervision, J.Z., H.Y. and S.L.; Validation, J.Z. and R.C.; Visualization, J.Z.; Writing—original draft, J.Z., R.C. and S.L.; Writing—review and editing, J.Z., R.C. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available at https://github.com/zhang-jian-qiang/MGKGR, accessed on 30 October 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, X.; Huang, Y.; Zhang, C.; Ye, P. Geoscience knowledge graph (GeoKG): Development, construction and challenges. Trans. GIS 2022, 26, 2480–2494. [Google Scholar] [CrossRef]
  2. Ijumulana, J.; Ligate, F.; Bhattacharya, P.; Mtalo, F.; Zhang, C. Spatial analysis and GIS mapping of regional hotspots and potential health risk of fluoride concentrations in groundwater of northern Tanzania. Sci. Total. Environ. 2020, 735, 139584. [Google Scholar] [CrossRef] [PubMed]
  3. Casali, Y.; Aydin, N.Y.; Comes, T. Machine learning for spatial analyses in urban areas: A scoping review. Sustain. Cities Soc. 2022, 85, 104050. [Google Scholar] [CrossRef]
  4. Meng, M.; Dabrowski, M.; Stead, D. Enhancing flood resilience and climate adaptation: The state of the art and new directions for spatial planning. Sustainability 2020, 12, 7864. [Google Scholar] [CrossRef]
  5. Werneck, H.; Silva, N.; Viana, M.C.; Mourão, F.; Pereira, A.C.; Rocha, L. A survey on point-of-interest recommendation in location-based social networks. In Proceedings of the Brazilian Symposium on Multimedia and the Web, São Luís, Brazil, 30 November–4 December 2020; pp. 185–192. [Google Scholar]
  6. Islam, M.A.; Mohammad, M.M.; Das, S.S.S.; Ali, M.E. A survey on deep learning based Point-of-Interest (POI) recommendations. Neurocomputing 2022, 472, 306–325. [Google Scholar] [CrossRef]
  7. Zhao, S.; Zhao, T.; King, I.; Lyu, M.R. Geo-teaser: Geo-temporal sequential embedding rank for point-of-interest recommendation. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, 3–7 May 2017; pp. 153–162. [Google Scholar]
  8. Grbovic, M.; Cheng, H. Real-time personalization using embeddings for search ranking at airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 311–320. [Google Scholar]
  9. Liu, X.; Liu, Y.; Li, X. Exploring the context of locations for personalized location recommendations. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI), New York, NY, USA, 9–15 July 2016; pp. 1188–1194. [Google Scholar]
  10. Bijalwan, V.; Semwal, V.B.; Gupta, V. Wearable sensor-based pattern mining for human activity recognition: Deep learning approach. Ind. Robot. Int. J. Robot. Res. Appl. 2022, 49, 21–33. [Google Scholar] [CrossRef]
  11. Rodrigues, R.; Bhargava, N.; Velmurugan, R.; Chaudhuri, S. Multi-timescale trajectory prediction for abnormal human activity detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2626–2634. [Google Scholar]
  12. Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, NV, USA, 5–8 December 2013; Volume 26. [Google Scholar]
  13. Yang, B.; Yih, W.t.; He, X.; Gao, J.; Deng, L. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  14. Chen, J.; Hou, H.; Gao, J.; Ji, Y.; Bai, T. RGCN: Recurrent graph convolutional networks for target-dependent sentiment analysis. In Proceedings of the International Conference on Knowledge Science, Engineering and Management, Athens, Greece, 28–30 August 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 667–675. [Google Scholar]
  15. Qiu, P.; Gao, J.; Yu, L.; Lu, F. Knowledge embedding with geospatial distance restriction for geographic knowledge graph completion. ISPRS Int. J. Geo-Inf. 2019, 8, 254. [Google Scholar] [CrossRef]
  16. Mai, G.; Janowicz, K.; Cai, L.; Zhu, R.; Regalia, B.; Yan, B.; Shi, M.; Lao, N. SE-KGE: A location-aware knowledge graph embedding model for geographic question answering and spatial semantic lifting. Trans. GIS 2020, 24, 623–655. [Google Scholar] [CrossRef]
  17. Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive Representation Learning: A Framework and Review. IEEE Access 2020, 8, 193907–193934. [Google Scholar] [CrossRef]
  18. Wang, Z.; Zhang, J.; Feng, J.; Chen, Z. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; Volume 28. [Google Scholar]
  19. Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; Zhu, X. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
  20. Ji, G.; He, S.; Xu, L.; Liu, K.; Zhao, J. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 687–696. [Google Scholar]
  21. Nickel, M.; Tresp, V.; Kriegel, H.P. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; Volume 11, pp. 3104482–3104584. [Google Scholar]
  22. Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; Bouchard, G. Complex embeddings for simple link prediction. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 2071–2080. [Google Scholar]
  23. Dettmers, T.; Minervini, P.; Stenetorp, P.; Riedel, S. Convolutional 2d knowledge graph embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  24. Vashishth, S.; Sanyal, S.; Nitin, V.; Talukdar, P. Composition-based multi-relational graph convolutional networks. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  25. Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; Van Den Berg, R.; Titov, I.; Welling, M. Modeling relational data with graph convolutional networks. In Proceedings of the The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, 3–7 June 2018; Proceedings 15; Springer: Berlin/Heidelberg, Germany, 2018; pp. 593–607. [Google Scholar]
  26. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  27. Usmani, A.; Khan, M.J.; Breslin, J.G.; Curry, E. Towards Multimodal Knowledge Graphs for Data Spaces. In Proceedings of the Companion Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1494–1499. [Google Scholar]
  28. Kannan, A.V.; Fradkin, D.; Akrotirianakis, I.; Kulahcioglu, T.; Canedo, A.; Roy, A.; Yu, S.Y.; Arnav, M.; Al Faruque, M.A. Multimodal knowledge graph for deep learning papers and code. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Galway, Ireland, 19–23 October 2020; pp. 3417–3420. [Google Scholar]
  29. Liu, Y.; Li, H.; Garcia-Duran, A.; Niepert, M.; Onoro-Rubio, D.; Rosenblum, D.S. MMKG: Multi-modal knowledge graphs. In Proceedings of the The Semantic Web: 16th International Conference, ESWC 2019, Portorož, Slovenia, 2–6 June 2019; Proceedings 16. Springer: Berlin/Heidelberg, Germany, 2019; pp. 459–474. [Google Scholar]
  30. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. Dbpedia: A nucleus for a web of open data. In Proceedings of the International Semantic Web Conference, Busan, Republic of Korea, 11–15 November 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 722–735. [Google Scholar]
  31. Fabian, M.; Gjergji, K.; Gerhard, W. Yago: A core of semantic knowledge unifying wordnet and wikipedia. In Proceedings of the 16th International World Wide Web Conference (WWW), Banff, AB, Canada, 8–12 May 2007; pp. 697–706. [Google Scholar]
  32. Li, X.; Zhao, X.; Xu, J.; Zhang, Y.; Xing, C. IMF: Interactive multimodal fusion model for link prediction. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2572–2580. [Google Scholar]
  33. Ben-Younes, H.; Cadene, R.; Cord, M.; Thome, N. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2612–2620. [Google Scholar]
  34. Fang, Q.; Zhang, X.; Hu, J.; Wu, X.; Xu, C. Contrastive multi-modal knowledge graph representation learning. IEEE Trans. Knowl. Data Eng. 2022, 35, 8983–8996. [Google Scholar] [CrossRef]
  35. Chen, X.; Zhang, N.; Li, L.; Deng, S.; Tan, C.; Xu, C.; Huang, F.; Si, L.; Chen, H. Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 904–915. [Google Scholar]
  36. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  37. Chen, J.; Deng, S.; Chen, H. Crowdgeokg: Crowdsourced geo-knowledge graph. In Proceedings of the Knowledge Graph and Semantic Computing. Language, Knowledge, and Intelligence: Second China Conference, CCKS 2017, Chengdu, China, 26–29 August 2017; Revised Selected Papers 2; Springer: Berlin/Heidelberg, Germany, 2017; pp. 165–172. [Google Scholar]
  38. Dsouza, A.; Tempelmeier, N.; Yu, R.; Gottschalk, S.; Demidova, E. Worldkg: A world-scale geographic knowledge graph. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Gold Coast, QLD, Australia, 1–5 November 2021; pp. 4475–4484. [Google Scholar]
  39. Ning, Y.; Liu, H.; Wang, H.; Zeng, Z.; Xiong, H. UUKG: Unified urban knowledge graph dataset for urban spatiotemporal prediction. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2024; Volume 36. [Google Scholar]
  40. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  41. Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
Figure 1. Multimodal data in the geographic knowledge graph provides semantic information for geographic attribute prediction.
Figure 2. The framework of the proposed MGKGR. (A) Multimodal GeoKG Encoding module processes the multimodal data of multimodal GeoKG for effective encoding. (B) Two-Stage Multimodal Feature Fusion module integrates features from multiple modalities to generate the multimodal features of multimodal GeoKG.
Figure 3. Model performance on attribute relations, adjacency relations, and mixed relations.
Table 1. Dataset Statistics.

Dataset  | #Entity                      | #Relation                        | #Triple
         | #Attr_Entity    #Geo_Entity  | #Attr_Relation    #Adj_Relation  | #Attr_Triple    #Adj_Triple
PA-30K   | 218             31,267       | 87                514            | 223,987         186,049
FL-25K   | 222             24,037       | 87                360            | 249,222         76,326
Table 2. Overall results of MGKGR and baselines. The bold indicates the highest value in the experiment, and the underlined values represent the second highest.

Model         | PA-30K                         | FL-25K
              | Hit@1   Hit@3   Hit@10   MRR   | Hit@1   Hit@3   Hit@10   MRR
TransE        | 30.66   48.37   50.58    39.51 | 33.51   49.88   52.08    41.82
DistMult      | 23.81   32.48   34.67    28.52 | 33.06   43.26   45.17    38.46
ComplEx       | 32.04   43.75   45.33    38.12 | 31.83   44.12   45.62    38.13
ConvE         | 19.49   26.49   30.46    24.08 | 20.18   25.88   29.87    24.08
CompGCN       | 18.89   24.89   25.84    22.11 | 19.83   25.40   26.46    22.88
RGCN          | 32.22   46.99   49.89    39.93 | 35.36   47.72   49.82    41.83
MKGFormer     | 34.04   48.75   51.72    41.85 | 36.24   48.93   51.14    42.99
MGKGR (ours)  | 35.06   49.93   52.43    42.96 | 36.63   50.16   52.73    43.92
Table 3. Ablation Study of MGKGR. The bold indicates the highest value in the experiment.

Model           | PA-30K                         | FL-25K
                | Hit@1   Hit@3   Hit@10   MRR   | Hit@1   Hit@3   Hit@10   MRR
MGKGR           | 35.06   49.93   52.43    42.96 | 36.63   50.16   52.73    43.92
w/o Multimodal  | 32.22   46.99   49.89    39.93 | 35.36   47.72   49.82    41.83
w/o Structure   | 35.00   49.15   51.55    42.38 | 37.03   49.87   51.61    43.76
w/o CL          | 35.26   49.86   52.34    43.04 | 35.97   49.62   52.98    43.49
w/o SFF         | 35.72   49.53   52.21    43.15 | 36.91   49.95   52.21    43.91
w/o CL & SFF    | 35.41   49.88   52.38    43.16 | 34.41   48.71   52.71    43.32
Table 4. Performance comparison on different diversity of relation semantics. The bold indicates the highest value in the experiment.

Model          | Relations | PA-30K                         | FL-25K
               |           | Hit@1   Hit@3   Hit@10   MRR   | Hit@1   Hit@3   Hit@10   MRR
TransE         | "Rich"    | 30.66   48.37   50.58    39.51 | 33.51   49.83   52.08    41.82
               | "Sparse"  | 30.47   48.34   50.55    39.37 | 33.14   49.38   50.94    40.15
DistMult       | "Rich"    | 23.81   32.48   34.67    28.52 | 33.06   42.36   45.17    38.46
               | "Sparse"  | 28.41   35.25   38.56    33.64 | 44.05   45.98   49.38    38.58
RGCN           | "Rich"    | 32.22   46.99   49.89    39.93 | 35.36   47.62   49.82    41.83
               | "Sparse"  | 31.78   44.99   49.37    39.54 | 34.76   47.24   50.61    40.54
MKGFormer      | "Rich"    | 34.04   48.75   51.72    41.85 | 36.24   48.93   51.14    42.99
               | "Sparse"  | 31.51   47.22   51.43    40.56 | 36.39   48.94   50.88    40.44
MGKGR (ours)   | "Rich"    | 35.06   49.93   52.43    42.96 | 36.63   50.16   52.73    43.92
               | "Sparse"  | 34.98   49.89   52.22    42.89 | 36.26   50.11   52.83    43.56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, J.; Chen, R.; Li, S.; Li, T.; Yao, H. MGKGR: Multimodal Semantic Fusion for Geographic Knowledge Graph Representation. Algorithms 2024, 17, 593. https://doi.org/10.3390/a17120593

