Multimodal Tweet Classification in Disaster Response Systems Using Transformer-Based Bidirectional Attention Model
https://doi.org/10.1007/s00521-022-07790-5
ORIGINAL ARTICLE
Abstract
The goal of this research is to use social media to gain situational awareness in the wake of a crisis. With the developments
in information and communication technologies, social media became the de facto norm for gathering and disseminating
information. We present a method for classifying informative tweets from the massive volume of user tweets on social
media. Once the informative tweets have been found, emergency responders can use them to gain situational awareness so
that recovery actions can be carried out efficiently. The majority of previous research has focused on either text data or
images in tweets. A thorough review of the literature illustrates that text and image carry complementary information. The
proposed method is a deep learning framework which utilizes multiple input modalities, specifically text and image from a
user-generated tweet. We mainly focused on devising an improved multimodal fusion strategy. The proposed system has
transformer-based image and text models. The main building blocks include a fine-tuned RoBERTa model for text, a Vision
Transformer model for images, a BiLSTM and an attention mechanism. We put forward a multiplicative fusion strategy for
image and text inputs. Extensive experiments have been done on various network architectures with seven datasets
spanning different types of disasters, including wildfire, hurricane, earthquake and flood. Several state-of-the-art
approaches were surpassed by our system. It showed good accuracy in the range of 94–98%. The results showed that
identifying the interaction between multiple related modalities will enhance the quality of a deep learning classifier.
Keywords: Disaster tweet classification · Multimodal data fusion · BiLSTM · Attention · RoBERTa · Vision transformer
Neural Computing and Applications
Physical sensors, human sensors, and social media users are all data sources for social media analysis [37]. In the
proposed system, we focus on data supplied by social media users, particularly the general public. Early warning
and event detection, situational awareness, resource collection and dissemination and post-disaster analysis are
some ways that social media can be used in disaster informatics.
In the proposed work, we concentrate on getting situational awareness from the available data. For this, we have
to identify whether the social media content is informative or not informative. Situational awareness means being
aware of what is going on around you [11] or, in more technical terms, “the perception of the elements in the
environment within a volume of time and space, the comprehension of their meaning and the projection of their
status in the near future” [10].
Social media analysis has numerous obstacles in this era of data deluge. Social media is overflowing with noisy
data. When a crisis occurs, social media becomes overburdened with messages, making it impossible for rescuers to
keep track of them. It is challenging to pick out messages that require immediate attention from the massive
volume of texts. The nonstandard terminology and brevity of social media material also make it difficult to find
important information. In addition to textual material, users can contribute images and videos from the crisis
region, which can improve situational awareness. The studies [2, 40–42, 50, 51] report that single-modality analysis
is not sufficient for getting good results.
In the past, the majority of social media research focused on text-only or image-only data. However, when
image and text data are combined, they reveal complementary information. Multimodal data analysis is an active
open research area. Using various modalities provides more contextual information, allowing more robust learning.
This calls for an information processing system which can automatically identify disaster-relevant tweets
by considering both text and image. Figure 1 displays some tweets from recent emergency situations that include text
and related images.
This research aims to develop a classification system that considers both the text and the associated images to
determine whether a tweet is informative or not. Identifying disaster-relevant information is an essential
requirement for authorities to deliver efficient and on-time recovery operations.
Multimodality can be dealt with in a variety of ways, including integrating various models at the output
level (referred to as decision-level or late fusion) or at the feature level (referred to as feature-level or
early fusion) [14]. Late Fusion: Fusion takes place at the decision-making level. For each input modality, there
will be a different classification model. Individual classification model decisions are merged in some way to
arrive at a final judgement. Early Fusion: Here, fusion takes place at the feature level. Different features are
merged in some manner and subsequently processed.
We explored a variety of text and image models and ran experiments to better understand how text and image
features interact. For text and image processing, we used cutting-edge transformer-based models. To identify the
interaction between different input modalities, we put forward an early fusion strategy.
We performed extensive experiments with CrisisMMD [2], a real-world dataset collected from Twitter
during seven major natural disasters that occurred in different parts of the world. The experimental setup is the
same for all the experiments to get comparable results. We tested our system in two scenarios to confirm its
viability in real-world use. In-domain Classification: The same dataset’s fragments are used for both training and
testing. Cross-domain Classification: The classifier is tested on a different dataset after being trained on one.
The results showed that the proposed approach is more robust than baseline models and single-modality systems.
Even though the proposed approach showed good results, there is still room for improvement. With
greater hardware resources, we can expand the system’s capacity. Deep learning architectures need a lot of
resources, including greater memory and computing power. The experiments are constrained by the resources available.
The proposed system helps to improve the situational awareness for disaster response and recovery operations,
which benefits people’s quality of life. Our findings are useful to research disciplines that require diverse inputs
from several modalities, such as fake news detection, question-answering systems and so on.
The contributions of this research are listed below:
• An extensive comparative analysis of various textual, visual and multimodal systems for the classification task in
in-domain and cross-domain scenarios is performed.
• Transformer-based image and text processing models are utilized for multimodal fusion systems.
• A robust deep learning neural network architecture is proposed that has a novel multimodal feature fusion
layer and some of the most recent deep learning techniques such as BiLSTM, the ViT model and the RoBERTa
model as the building blocks. The proposed model can exploit the hidden interaction between multiple modalities
for the automatic detection of informative tweets for a disaster response system. Later, the tweet can be
used for disaster recovery and mitigation.
The arrangement of the paper is as follows: Sect. 2 gives a quick summary of some of the most noteworthy disaster-
related works. The necessary background details for building up the multimodal fusion system are elaborated in
Sect. 3. Section 4 illustrates the approach and architecture of the proposed system. Section 5 lays out the
experimental setup and assesses the outcomes. Finally, Sect. 6 gives a conclusion and prospects of the system.

2 Related works

In this section, we discuss previous research initiatives that are relevant to our work. There are text-only
analyses, image-only analyses and multimodal analyses. In some multimodal systems, inputs from multiple sources
have been taken, e.g., different sensors (temperature sensors, humidity sensors), data from weather departments
(wind speed, rainfall level) etc. In the proposed approach, we take multiple inputs of different modalities from the
tweet. It provides the advantage of quick and immediate data availability in the event of a calamity. There is no
need to use data from other sources. Several studies have shown the utility of Twitter text and images in disaster
recovery operations.

2.1 Text-based classification

Sreenivasulu et al. [25] presented a random forest classifier for identifying damage-related tweets.
They utilized lexical features, syntactic features and the frequency of words related to damage assessment. Linear
regression and support vector regression are used to weigh different features. They achieved accuracy up to 94% on
datasets covering the Italy and Chile earthquakes and the India floods. Madichetty et al. [29] proposed a stacked
Convolutional Neural Network (CNN) for recognising tweets conveying resource requirement and availability. Crisis
word embedding is used here. They concatenated the output of a K-nearest neighbour (KNN) classifier and a CNN
classifier. The concatenated output is finally classified by a support vector machine (SVM) classifier. They
achieved accuracy in the range 67–77% on datasets for different disasters.
Snyder et al. [45] put forward an interactive learning
framework for situational awareness classification. The user can iteratively interact with the system. Whenever the
classification goes wrong, the user corrects the classifier. Only the tweet text is considered here. As the building
blocks, they employed Word2Vec embedding, CNN, and Long Short Term Memory (LSTM). This system is integrated
with the SMART2.0 toolkit.
To aid victims who require medical assistance, Sreenivasulu et al. designed a majority voting-based ensemble
classifier which can identify medical resource tweets [28]. They achieved 82.4% accuracy. Zahra et al. [52] used a
random forest classifier with linguistic characteristics to identify eye-witness messages. They defined features
exclusive to a direct eye-witness, such as words indicating perceptual senses, first-person pronouns and adjectives.
Ghafarian et al. [13] proposed an approach based on a linguistic concept known as the ‘distributional hypothesis’.
A tweet is modelled as a distribution of words. An SVM classifier predicts a label for a distribution. They
showed that the idea is superior to the bag-of-words (BOW) model. They tested their system with several
datasets and achieved an accuracy of 74–80%. Kejriwal et al. [19] proposed a system to detect disaster-related
urgent messages using a minimally-supervised approach. They trained the system with labelled and unlabelled
tweets. This approach is suitable for adapting to a new crisis.

2.2 Image-based classification

Alam et al. [1] developed Image4Act, a deep neural network framework for image classification. They utilized
social media images posted during a disaster to gain situational awareness. They fine-tuned VGG16 and
achieved 67% accuracy on relevancy classification.
Alam et al. [3] implemented image classification for damage level assessment. The images from disaster areas
can be very messy and difficult to understand, even for humans. They implemented real-time image capturing,
deduplication and information extraction. For image classification and deduplication, they used the VGG16 model
and the perceptual hashing technique, respectively. Their results showed that by utilizing social media imagery,
emergency responders can deliver relief efforts effectively.
Chaudhuri et al. [7] developed a CNN model to classify images showing human body parts in the debris. With
this system, the emergency responders can get information about trapped survivors. Their system was suited for a
smart city environment. They achieved an accuracy of 83.2%.
Valdez and Godmalin [49] proposed a lightweight CNN with two classification heads for identifying the type and
intensity level of a natural calamity. They used a fine-tuned MobileNetV3 with a feed forward neural network (FFN)
added for classification. They developed a dataset for natural disasters (wildfire, flood, earthquake and volcanic
eruption) and disaster intensity levels (severe, moderate and insignificant). They achieved an accuracy of 96.8% for
disaster type and 93.2% for intensity level.
Kyrkou and Theocharides [21] developed ERnet, a computationally efficient CNN model with residual connections
for aerial image classification suitable for unmanned aerial vehicles (UAV). They introduced an
aerial image dataset, including images of fire, flooding, ruined buildings and traffic accidents. They achieved an
accuracy of 90.1% with a reduced memory requirement.

2.3 Multimodal approach

Rizk et al. [39] developed a multimodal framework with two stages suitable for energy-constrained devices to
process social media tweet texts and images. Level 1 classifiers process image and text. These classifiers’ decisions
are combined and used to train the level 2 classifier. They attained an accuracy of 92.43%.
Mouzannar et al. [32] developed a multimodal deep learning model to identify damage-related information. For
image processing, a pretrained Inception model is used, and for text processing, a CNN-based neural network is used.
Finally, the text features and image features are combined and given to the FFN classifier. They achieved an accuracy
of 92.62%.
Mohanty et al. [31] conducted a case study of Hurricane Irma. For a relevancy classifier, they proposed a
multimodal technique. They used four different classifiers: a classifier to identify tweets with geospatial
attributes, an image classification model, a user authenticity classification model and a text classification model
to find tweets that were just about Hurricane Irma. The results of the four models are combined. If the score exceeds
a predetermined level, the social media post is classified as disaster relevant. They used a decision-level fusion
method.
Another multimodal approach was proposed by Kumar et al. [20] to classify disaster-related informative content.
The feature vectors generated by LSTM and VGG16 are concatenated and further passed through an FFN. They
achieved F1-scores ranging from 0.61 to 0.92 for various datasets.
Madichetty et al. [26] proposed a multimodal approach utilizing the BERT language model and the DenseNet image
model to analyze the tweet text and the associated image. They implemented a late fusion approach. The output
probability vectors from the text model and image model are averaged. This value is taken for prediction.
Ofli et al. [34] proposed a deep learning neural network that realizes a multimodal approach with image
classification using VGG16 and text classification using word2vec and CNN. The image feature vector and text feature
vector are concatenated and given to an FFN for final classification. They achieved an accuracy of 78.4%.
Existing multimodal fusion approaches are not promising for handling complex multimodal and high-dimensional
data. This is due to the fact that:
• Existing systems consider only a single modality input and learn the pattern in that modality. They cannot
identify relationships across different input modalities.
• The existing systems are unable to prioritise different features in the order of importance. They value all
features equally.
As a solution, we require a framework for classification techniques based on feature fusion that identifies crucial
latent associations in input data from different modalities. It must place varying degrees of emphasis on various
elements (e.g., different words, different regions of images, etc.) depending on their importance.

3 Preliminaries

This section provides an overview of the key components of the proposed system. The proposed system comprises a
text preprocessing module, the RoBERTa (Robustly Optimized BERT Pretraining Approach) text model, the ViT
(Vision Transformer) image model, a BiLSTM (Bi-directional Long Short Term Memory) and an attention module.

3.2 Word embedding

Word embedding is a distributed learned representation for text in the form of a real-valued vector. Words having
the same meaning will have a representation that is more or less similar. This is one of the key breakthroughs in
deep learning techniques for Natural Language Processing (NLP).

3.2.1 RoBERTa

BERT (Bidirectional Encoder Representations from Transformers), introduced by Google in 2018 [8], is
a milestone in the realm of NLP. It is pretrained on BookCorpus and Wikipedia. It outperformed several NLP
systems. The basic building block of BERT is the transformer, which is an improvement over traditional
encoder-decoder systems. Later, several improvements over BERT in terms of performance and training speed were
proposed, such as XLNet, RoBERTa, DistilBERT etc. RoBERTa, introduced by Facebook, outperformed BERT on
benchmarks including GLUE, SQuAD and RACE.
In the proposed approach, the RoBERTa [22] model is used to generate word embedding. We experimented with
several word embedding techniques, and the results are shown in Table 2. RoBERTa showed promising results.
RoBERTa is a transformer-based NLP model. It generates dynamic word embedding.

3.3 Fine-tuned RoBERTa
trained model can be considered for solving the current problem.
In [48], the authors have used the technique of fine-tuning to utilize the ALBERT language model for
cyberbullying analysis of social media data. Their system achieved state-of-the-art results with an F1 score of 95%.
In [33], the authors used a fine-tuned BERT model for the sentiment analysis of Indonesian user reviews about
mobile apps. Their experiments showed overfitting behaviour when the model was trained from scratch. They got
state-of-the-art results with Indo-BERT-Base, a BERT variant pretrained in the Indonesian language. In [35] the
authors implemented an NLP system for multilingual grammatical error correction. The results showed that
fine-tuning produces better models with a relatively smaller dataset and reduced computational power consumption.
In our use case, a properly labelled dataset is lacking at the onset of a disaster. We used the pre-trained RoBERTa
model, which is later fine-tuned for our task-specific dataset.
The key characteristics of RoBERTa are:
• It generates contextualized word embedding.
• It takes into account the bidirectional transformer concept so that it can utilize past and future contexts.
• It is pretrained with 10 times more data and 8 times larger batches than BERT.
• Byte-pair encoding is used for tokenization. So, it can recognize rare words and out-of-vocabulary tokens.
• In BERT, the randomly chosen 15% of tokens to be masked is fixed once for the entire pretraining process. In
RoBERTa, the masks are changed dynamically during training.

3.4 BiLSTM

The basic building block of Bi-directional Long Short Term Memory (BiLSTM) is Long Short Term Memory
(LSTM). LSTM was introduced as a replacement for the Recurrent Neural Network (RNN) to solve the problem of
vanishing gradient.
LSTM has three blocks: forget gate, input gate and output gate [17]. Other than the addition of the gating
mechanism, the concept of LSTM is similar to RNN. The gating mechanism helps to remove irrelevant parts and retain
relevant parts from the previous states. Consequently, LSTM is able to mitigate the vanishing gradient problem
(i.e. information fading over long sequences).
Figure 2 shows a high-level architecture of LSTM. Equations 1 to 7 show the mathematical formulae for the
forget gate, input gate and output gate. $f_t$, $i_t$ and $o_t$ represent the forget gate, input gate and output gate
at time step $t$. $W_f$, $W_i$, $W_o$, $b_f$, $b_i$ and $b_o$ indicate the weight and bias parameters of the forget
gate, input gate and output gate of a single cell in LSTM. $\sigma$, $\odot$, $+$ and $\Vert$ denote the sigmoid
function, element-wise multiplication, element-wise addition and concatenation operation respectively. $x_t$ denotes
the input data at time $t$. $h_t$ is the hidden state at time $t$. $\tilde{C}_t$ and $C_t$ represent the candidate
cell state and the final cell state respectively at time $t$.
The forget gate determines what information is to be removed and what information is to be retained from the
previous states, the input gate determines what information is to be taken into consideration from the current inputs,
and the output gate determines what information is to be passed to the next time step. An LSTM cell has two states:
hidden state and cell state. All of these are realized with the help of sigmoid and tanh functions.

$$I = h_{t-1} \,\Vert\, x_t \qquad (1)$$
$$f_t = \sigma(W_f I + b_f) \qquad (2)$$
$$i_t = \sigma(W_i I + b_i) \qquad (3)$$
$$\tilde{C}_t = \tanh(W_c I + b_c) \qquad (4)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad (5)$$
$$o_t = \sigma(W_o I + b_o) \qquad (6)$$
$$h_t = o_t \odot \tanh(C_t) \qquad (7)$$

BiLSTM is a sequence processing neural network model. It is an improved variant of RNN and LSTM. It is capable of
mitigating the vanishing and exploding gradient problem. BiLSTM is made up of two LSTMs. One LSTM processes
text in the forward direction (from left to right), while the other processes text in the backward direction
(from right to left). A high-level view of BiLSTM is shown in Fig. 3, where:
$w_t$: word embedding of the $t$-th word
$w_{ft}$: representation of $w_t$ by the forward LSTM
$w_{bt}$: representation of $w_t$ by the backward LSTM
In the forward direction, the output of the forward LSTM is determined by the current input word vector and the
preceding hidden vector. Similarly, in the backward direction, the output of the backward LSTM is determined by the
current input word vector and the succeeding hidden vector. The outputs of the forward and backward LSTMs are then
concatenated to produce the final feature vector. As a result, the available contextual information is improved.
This architecture materializes the idea that the meaning of a word is determined by the words that come before and
after it. Hence, BiLSTM is thought to generate a feature representation that captures more information. This leads
to improved learning.
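Equations 1 to 7 and the forward/backward concatenation can be traced in a few lines of code. The following is a minimal sketch in plain Python, not the implementation used in the experiments: the helper names (`lstm_step`, `run_lstm`, `bilstm`, `init_params`), the toy dimensions and the random initialization are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. 1-7 (p is a hypothetical parameter dict)."""
    I = h_prev + x_t                                             # Eq. 1: h_{t-1} || x_t
    f = [sigmoid(z) for z in affine(p["Wf"], I, p["bf"])]        # Eq. 2: forget gate
    i = [sigmoid(z) for z in affine(p["Wi"], I, p["bi"])]        # Eq. 3: input gate
    c_hat = [math.tanh(z) for z in affine(p["Wc"], I, p["bc"])]  # Eq. 4: candidate state
    c = [ft * cp + it * ch for ft, cp, it, ch in zip(f, c_prev, i, c_hat)]  # Eq. 5
    o = [sigmoid(z) for z in affine(p["Wo"], I, p["bo"])]        # Eq. 6: output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]             # Eq. 7: hidden state
    return h, c

def init_params(input_dim, hidden_dim, rng):
    """Random weights and zero biases for the four affine maps (illustrative only)."""
    p = {}
    for g in "fico":
        p["W" + g] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + input_dim)]
                      for _ in range(hidden_dim)]
        p["b" + g] = [0.0] * hidden_dim
    return p

def run_lstm(seq, p, hidden_dim):
    """Run the cell over a sequence and collect the hidden state at every step."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    outputs = []
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs

def bilstm(seq, p_fwd, p_bwd, hidden_dim):
    """Concatenate forward and backward hidden states position by position."""
    fwd = run_lstm(seq, p_fwd, hidden_dim)
    bwd = run_lstm(seq[::-1], p_bwd, hidden_dim)[::-1]  # re-align to the original order
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
seq = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 toy "word vectors"
p_fwd, p_bwd = init_params(4, 3, rng), init_params(4, 3, rng)
features = bilstm(seq, p_fwd, p_bwd, 3)
print(len(features), len(features[0]))  # 5 6
```

Each position ends up with a vector twice the hidden size, because the forward and backward states are concatenated rather than summed.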
123
Neural Computing and Applications
Fig. 2 LSTM
v is a trainable parameter which is randomly initialized and jointly learned by the system to identify the most
attentive word. Over time, several variants of the attention mechanism have emerged, differing in how the score is
calculated and how the context vector is generated.
Figure 6 shows the detailed architecture of the proposed system.
The proposed system achieved superior performance over state-of-the-art systems.
4.2.2 Visual feature extraction

In the proposed approach, we are using Vision Transformer (ViT) [9], a transformer-based image classification model.
The authors of [9] claim that ViT achieved 4 times better results when compared with the state-of-the-art CNN
models.
We have experimented with CNN models and ViT. The results are shown in Table 3. ViT delivered a good
performance.
In the proposed approach, we used the ViT_b32 model. It has 12 transformer encoder layers, each with 12
attention heads. A transformer accepts only a 1D sequence as input. Therefore, the image, $X$, is converted from
the dimension of $H \times W \times C$ to a sequence of flattened 2D patches, $X_p$, having the dimension of
$N \times (P^2 \cdot C)$ [9], where $P$ is the side length of a patch and $N$ is the number of patches.

$$X \in \mathbb{R}^{H \times W \times C} \qquad (11)$$
$$X_p \in \mathbb{R}^{N \times (P^2 \cdot C)} \qquad (12)$$
$$N = \frac{HW}{P^2} \qquad (13)$$

Multimodal fusion is the process of combining data from many input modalities into a single unit. It exploits the
complementarity of heterogeneous inputs.
We propose a multiplicative fusion approach with a BiLSTM followed by an attention mechanism.

4.3.1 Multimodal interaction layer

This layer combines inputs of multiple modalities.
In the proposed system, early fusion with multiplicative merging is done. Early fusion can be considered as
generating fine quality features.
Both feature extraction modules generate feature vectors of dimension 768. To reduce the dimensionality of the
feature vectors, a nonlinear mapping is done by a feed forward neural network. Then merging is done by multiplying each
pair of elements from both feature vectors. Equations 14–16 show the fusion operations implemented.

$$T' = f(T W_0 + b_0) \qquad (14)$$
$$I' = f(I W_1 + b_1) \qquad (15)$$
$$TI = T' \otimes I' \qquad (16)$$
$$T, I \in \mathbb{R}^{768}, \quad T', I' \in \mathbb{R}^{u}, \quad TI \in \mathbb{R}^{u \times u}, \quad W_0, W_1 \in \mathbb{R}^{u \times 768}, \quad b_0, b_1 \in \mathbb{R}^{u}$$

where $T$ and $I$ are the feature vectors of the tweet text and the image, $u$ is the dimension
of the nonlinearly-mapped feature vectors, and $T'$ and $I'$ are the mapped text and image feature vectors,
respectively. $\otimes$ denotes the outer product. $W_0$, $W_1$, $b_0$ and $b_1$ are the weight matrices and
bias vectors.
The weight matrices and bias vectors are jointly learned in the network. Later, $TI$ is passed to a BiLSTM with an
attention layer.

$$C = \sum_{i=1}^{n} a_i m_i \qquad (19)$$

4.4 Classification module

The contextual feature vector, $C$, is fed to a fully connected FFN having a softmax activation function. The
classifier is defined as follows:

$$\tilde{y} = \operatorname*{argmax}_{y \in Y} \, p(y \mid C) \qquad (20)$$
$$p(y \mid C) = \mathrm{softmax}(W C + b) \qquad (21)$$

where $W$ is the weight matrix, $b$ is the bias vector, $Y$ is the set of classes, and $\tilde{y}$ is the predicted
label for the feature vector $C$, i.e. informative or not_informative.
The whole model is trained end-to-end with a supervised learning procedure.
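The data flow of Eqs. 14–16 and 19–21 can be sketched end to end in plain Python. This is a hedged illustration, not the trained model: the weights are random, the nonlinearity f is assumed to be ReLU, the rows of TI stand in for the BiLSTM outputs m_i (the BiLSTM itself is skipped), and the attention score `v · tanh(m_i)` is an assumed scoring function, since the score equations are not reproduced here. All helper names are hypothetical.

```python
import math
import random

def dense(x, W, b, act):
    """Return act(W x + b); W holds one row per output unit (u rows of length len(x))."""
    return [act(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]

def relu(z):
    return max(0.0, z)

def softmax(z):
    m = max(z)                       # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def outer(a, b):
    """Outer product: every pairwise product a_i * b_j (Eq. 16)."""
    return [[ai * bj for bj in b] for ai in a]

def attend(rows, v):
    """Context vector C = sum_i a_i m_i (Eq. 19); the scoring function is an assumption."""
    scores = [sum(vk * math.tanh(mk) for vk, mk in zip(v, m)) for m in rows]
    a = softmax(scores)
    C = [sum(ai * m[k] for ai, m in zip(a, rows)) for k in range(len(v))]
    return C, a

rng = random.Random(0)

def rand_vec(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

def rand_mat(r, c):
    return [rand_vec(c) for _ in range(r)]

d, u = 768, 4                                  # 768 from the text; u = 4 is a toy value
classes = ["informative", "not_informative"]
T, I = rand_vec(d), rand_vec(d)                # stand-ins for RoBERTa and ViT features
W0, W1 = rand_mat(u, d), rand_mat(u, d)
b0, b1 = [0.0] * u, [0.0] * u

T_m = dense(T, W0, b0, relu)                   # Eq. 14
I_m = dense(I, W1, b1, relu)                   # Eq. 15
TI = outer(T_m, I_m)                           # Eq. 16, shape u x u
C, a = attend(TI, rand_vec(u))                 # rows of TI used in place of BiLSTM outputs
W, b = rand_mat(len(classes), u), [0.0] * len(classes)
p = softmax([sum(w * c for w, c in zip(row, C)) + bi for row, bi in zip(W, b)])  # Eq. 21
y = classes[p.index(max(p))]                   # Eq. 20
print(y, [round(x, 3) for x in p])
```

The outer product is what distinguishes multiplicative fusion from concatenation: every text feature is paired with every image feature, so the later layers can weight cross-modal interactions directly.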
5.1 Dataset
the classification layers are relu and softmax functions. The batch size used is 8. All the details of the proposed
system are given in Sect. 4. To counter overfitting, we used dropout regularization with a rate of 0.2 and an early
stopping mechanism based on validation accuracy.

5.3 Baseline models

To demonstrate the effectiveness of our system, we compared the system with certain baseline models for text
classification, image classification and multimodal classification.

5.3.1 Text-only classification

We built tweet classification models for tweet text only.
• Gaussian Naive Bayes [38]: Gaussian Naive Bayes is a variant of the Naive Bayes classifier. It is a simple
algorithm but has good predictive power. It is commonly applied for text classification. We implemented a
Naive Bayes classifier for tweet text classification with TF-IDF vectorization.
• SVM [25]: We implemented an SVM (Support Vector Machine) classifier for tweet text classification with
TF-IDF vectorization. SVM has also proved to be successful in several text classification tasks. The algorithm
tries to learn the hyperplane that clearly separates the classes with a good kernel function. But choosing the
right kernel is not an easy task.
• Random Forest [25]: We have built an RF classifier for text classification along with TF-IDF vectorization. RF
is a highly successful machine learning algorithm. It is based on a collection of decision trees, known as a
‘forest’, learned using the ‘bagging’ method. One peculiarity of RF is that it can measure the relative
importance of each feature on prediction. By randomly selecting data samples and features, multiple decision
trees are built, and either the average or the majority vote of the results is taken.
• Stacking Ensemble [24, 29]: Multiple heterogeneous base learners are simultaneously trained. The meta
learner is trained using the base learners’ predictions as features. An ensemble learner can outperform any
single base model. We implemented a stacking ensemble with decision tree, random forest, KNN and
XGBoost as base learners and logistic regression as the meta learner.
• Majority Voting Ensemble [28]: Multiple base models are built, and their predictions are combined. Finally,
the system predicts the class with the most votes. Ideally, the system will be better than any single model
used. We implemented a majority voting classifier with
AdaBoost, XGBoost, Random Forest and SVM as the contributing models.
• Deep learning models: We implemented some deep learning models with the following techniques as word
embedding methods.
– GloVe [23]
– Word2Vec [45]
– BERT [23, 26]
– RoBERTa
– XLNet

5.4 Image-only analysis

We built tweet classification models based on image classification using the following pre-trained image models.
• VGG16 [1, 3, 20, 27, 43]
• VGG19 [12]
• InceptionV3 [32, 46]
• ResNet50 [12, 16]
• DenseNet [18, 26]
• EfficientNetV2 [47]
• ViT [9]

5.5 Multimodal analysis

We experimented with various multimodal fusion techniques and compared the proposed approach with some
existing systems. We carried out both in-domain and cross-domain classifications.
• Additive fusion: Different inputs are fused by an addition operation. This is suitable for applications that are
not strongly affected by the joint values of the inputs.
• Concatenative fusion: Multiple inputs are fused by concatenating them. The argument in support of
concatenation is that the inputs are not modified, or are modified only to a limited extent, so their original form
is preserved. But it cannot capture the interaction between multiple modalities.
• Averaging: The inputs are averaged to obtain the fused vector.
• Multiplicative fusion: Multiple inputs are fused by a multiplication operation. Multiplicative fusion is good
at learning the interaction between multiple modalities.
• Sreenivasulu et al. (2021) [26]: The authors have implemented a multimodal tweet classification model
with fine-tuned BERT and DenseNet. They used a late fusion strategy with an averaging operation.
• S. Madichetty et al. (2020) [27]: (Multimodal additive fusion with VGG16 and CNN) A multimodal tweet
classification model with a CNN and fine-tuned VGG16 is implemented. They used a late fusion strategy
with an additive operation.
• Abhinav Kumar (2020) [20]: Multimodal informative tweet classification is done with LSTM and
fine-tuned VGG16 with a concatenative feature fusion.
• Gautam et al. (2019) (mean probability) [12]: ResNet50 for image classification and
BiLSTM, CNN + GloVe for text classification are used. The late fusion strategy is done by averaging
the class probabilities.
• Gautam et al. (2019) (custom decision policy) [12]: ResNet50 for image classification and
BiLSTM, CNN + GloVe for text classification are used. The class probabilities are averaged and given
to a classification system with ReLU and softmax functions.
• Gautam et al. (2019) (logistic regression decision policy) [12]: Image feature extraction is done with
ResNet50 and text feature extraction is done with BiLSTM, CNN with GloVe. Later, the features are
concatenated and used for classification.

6 Evaluation metrics

Accuracy, macro-average precision, macro-average recall, macro-average F1-Score and ROC-AUC score are used as
the evaluation metrics for our classification model.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (22)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (23)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (24)$$
$$\mathrm{F1\ Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (25)$$

where TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.
We have used macro-averaged precision, recall and F1-Score to give equal importance to all the classes. Each is the
arithmetic mean of the respective per-class values over all distinct classes. So, we get an average measurement
per class.

$$\mathrm{Macro\ averaged\ Precision} = \frac{\sum_{i=1}^{n} \mathrm{Precision}_i}{n} \qquad (26)$$
$$\mathrm{Macro\ averaged\ Recall} = \frac{\sum_{i=1}^{n} \mathrm{Recall}_i}{n} \qquad (27)$$
$$\mathrm{Macro\ averaged\ F1\ Score} = \frac{\sum_{i=1}^{n} \mathrm{F1\ Score}_i}{n} \qquad (28)$$
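Equations 22–28 can be checked on a small hand-made example. The sketch below is illustrative only: the function name `macro_metrics`, the zero-division convention (a score of 0.0 when a denominator is zero) and the toy labels are assumptions, not part of the reported experiments.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (Eqs. 22-28)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        # Treat class c as "positive" and every other class as "negative".
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0                 # Eq. 23
        recall = tp / (tp + fn) if tp + fn else 0.0                    # Eq. 24
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)                          # Eq. 25
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true),  # Eq. 22
        "macro_precision": sum(precisions) / n,                        # Eq. 26
        "macro_recall": sum(recalls) / n,                              # Eq. 27
        "macro_f1": sum(f1s) / n,                                      # Eq. 28
    }

# Toy labels: 1 = informative, 0 = not_informative.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
scores = macro_metrics(y_true, y_pred)
print(scores)  # accuracy = 4/6; macro precision = macro recall = macro F1 = 0.625
```

In this example the majority class scores 0.75 and the minority class 0.5 on every per-class metric, so the macro average (0.625) sits below the accuracy (about 0.667), which shows how macro averaging keeps the smaller class from being drowned out.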
Table 7 Results of multimodal fusion strategies on Hurricane Maria dataset

Model no.   Acc.    Precision   Recall   F1-score   ROC-AUC score

Table 8 Results of multimodal fusion strategies on Iraq-Iran Earthquake dataset

Model no.   Acc.    Precision   Recall   F1-score   ROC-AUC score
M1          83.0    74.0        63.0     66.0       –
M2          68.18   –           –        67         –
M3          –       79.0        79.0     79.0       –
M4          73.5    –           –        –
M5          80.2    –           –
M6          75.2    –           –        –
M7          93.94   94.0        94.0     94.0       0.9889
M8          96.46   97          96.0     96.0       0.9720
M9          94.95   95.0        95.0     95.0       0.9917
M10         97.98   98.0        98.0     98.0       0.9833

Table 10 Results of multimodal fusion strategies on Srilanka Floods Dataset

Model no.   Acc.    Precision   Recall   F1-score   ROC-AUC score

Table 11 F1 scores of cross domain classification with proposed system; D1: Hurricane Irma, D2: Hurricane Maria,
D3: Hurricane Harvey, D4: California Wildfire, D5: Mexico Earthquake, D6: Sri Lanka Floods, D7: Iraq-Iran Earthquake

Train set   Test set
            D1     D2     D3     D4     D5     D6     D7
D1          95.0   84.0   88.0   81.0   78.0   90.0   82.0
D2          73.0   98.0   87.0   80.0   76.0   85.0   79.0
D3          76.0   84.0   97.0   82.0   76.0   90.0   81.0
D4          74.0   76.0   81.0   97.0   79.0   85.0   78.0
D5          74.0   72.0   74.0   72.0   98.0   70.0   79.0
D6          77.0   78.0   81.0   73.0   80.0   97.0   82.0
D7          70.0   78.0   80.0   78.0   75.0   69.0   98.0
Table 9 Results of multimodal fusion strategies on Mexico Earth- We implemented tweet classification with the pretrained
quake Dataset image classification models, including VGG16, VGG19,
Model no. Acc. Precision Recall F1- ROC-AUC score InceptionV3, ResNet50, EfficientNet, DenseNet and a
score transformer-based vision model, ViT. Each of the models
differs in computational complexity. Table 3 shows the
M1 83.0 76.0 81.0 78.0 –
results of various image classification models. ROC-AUC
M2 74.29 – – 74.25 –
curves are also shown in Fig. 9. The authors of ViT
M3 – 73.0 72.0 72.0 –
claimed that ViT achieved 88.55% accuracy in ImageNet
M4 77.3 – – –
1k dataset. We implemented a ViT base-32 based image
M5 74.6 – –
classification model.
M6 77.9 – – – In our experiments, ViT shows better results. It achieved
M7 92.76 93.0 93.0 93.0 .9772 an accuracy gain in the range 0–8% and F1-score gain in
M8 92.52 93.0 93.0 93.0 .9768 the range 0–8% over other models. The strength of ViT is
M9 93.22 93 93 93 .979 that it is a transformer-based model. The embedded image
M10 97.90 98 98 98 .9930 patches are output to a transformer encoder.
ViT enjoys all the benefits of transformers. It has some
residual connections to gain long-range dependency. The
multihead self-attention layer in the transformer enables information to be embedded globally over the entire image. A CNN, on the other hand, is based on local filters, which results in poor performance in capturing useful patterns globally. Also, transformers offer better parallelization than CNNs.

The proposed approach uses ViT for visual feature extraction. It shows a clear margin on all performance metrics over the ViT-only model, which emphasizes the necessity of multimodal systems.

7.3 Multimodal classification

If either the text or the image in a tweet is informative, the tweet is deemed informative. This assumption avoids information loss.

In addition to the proposed method, we implemented additive, concatenative, and averaged fusion strategies together with RoBERTa and ViT. The proposed method uses multiplicative fusion.

Our experiments included in-domain and cross-domain scenarios.

• In-domain classification We performed a comparative analysis of various fusion methods and existing systems. Details are given in Sect. 5.5.

Tables 4, 5, 6, 7, 8, 9 and 10 and Fig. 11 show the performance metrics of different multimodal systems tested on seven datasets: Hurricanes Harvey, Irma and Maria, the Iraq-Iran and Mexico earthquakes, the Sri Lanka floods and the California wildfires. Figure 10 shows the ROC-AUC curves for different fusion strategies tested on Hurricane
Maria dataset. The naming convention for models used in the tables and figures is as follows:
– M1: Sreenivasulu et al. [26] (Multimodal additive fusion with fine-tuned BERT and DenseNet)
– M2: Madichetty et al. [27] (Multimodal additive fusion
with VGG16 and CNN)
– M3: Kumar [20]
– M4: Gautam et al. [12] (Mean Probability)
– M5: Gautam et al. [12] (Custom Decision Policy)
– M6: Gautam et al. [12] (Logistic Regression Decision
Policy)
– M7: Additive fusion with RoBERTa and ViT
– M8: Concatenative fusion with RoBERTa and ViT
– M9: Averaged fusion with RoBERTa and ViT
– M10: Proposed System
In all the experiments, the proposed approach shows consistent performance.
The proposed approach is a multimodal data fusion system for tweet classification that takes the tweet text and the associated image as input. Text and image are complementary to each other. Each input modality is processed in the context of the other, which yields more contextual information. We used RoBERTa for text feature extraction and ViT for visual feature extraction.

We tested additive, concatenative, averaged and multiplicative fusion strategies. In concatenation-based fusion, no interaction between the input modalities is explored. In additive and averaged fusion, joint value of features

Fig. 10 ROC AUC curves for different fusion approaches on Hurricane Maria dataset
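The four fusion strategies compared in this section can be sketched on two toy feature vectors of equal length. This is a minimal illustration only: the function names and the vectors are hypothetical, and reading "multiplication operation" as an element-wise product is an assumption. The proposed system fuses learned RoBERTa and ViT features inside a network with biLSTM and attention layers, which this sketch does not reproduce.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _check_same_length(text_vec: Vector, image_vec: Vector) -> None:
    """Element-wise fusion needs both modality vectors to have the same size."""
    if len(text_vec) != len(image_vec):
        raise ValueError("text and image feature vectors must have equal length")


def additive_fusion(text_vec: Vector, image_vec: Vector) -> List[float]:
    """Fuse by element-wise addition."""
    _check_same_length(text_vec, image_vec)
    return [t + v for t, v in zip(text_vec, image_vec)]


def averaged_fusion(text_vec: Vector, image_vec: Vector) -> List[float]:
    """Fuse by element-wise mean of the two modalities."""
    _check_same_length(text_vec, image_vec)
    return [(t + v) / 2 for t, v in zip(text_vec, image_vec)]


def multiplicative_fusion(text_vec: Vector, image_vec: Vector) -> List[float]:
    """Fuse by element-wise product (assumed form of multiplicative fusion)."""
    _check_same_length(text_vec, image_vec)
    return [t * v for t, v in zip(text_vec, image_vec)]


def concatenative_fusion(text_vec: Vector, image_vec: Vector) -> List[float]:
    """Fuse by placing the unmodified vectors side by side."""
    return list(text_vec) + list(image_vec)


# Hypothetical 3-dimensional features for one tweet.
text_features = [0.5, 2.0, 0.0]
image_features = [4.0, 0.5, 3.0]

fused_add = additive_fusion(text_features, image_features)
fused_avg = averaged_fusion(text_features, image_features)
fused_mul = multiplicative_fusion(text_features, image_features)
fused_cat = concatenative_fusion(text_features, image_features)
```

In this toy case the third fused value is 3.0 under addition but 0.0 under multiplication: a product is large only when both modalities respond on the same feature, which is one way to read the claim that multiplicative fusion captures cross-modal interaction. Concatenation keeps all six input values unchanged and leaves any interaction to later layers.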
8 Conclusion
43. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
44. Singh T, Kumari M (2016) Role of text pre-processing in twitter sentiment analysis. Procedia Comput Sci 89:549–554
45. Snyder LS, Lin YS, Karimzadeh M et al (2019) Interactive learning for identifying relevant tweets to support real-time situational awareness. IEEE Trans Vis Comput Graph 26(1):558–568
46. Szegedy C, Liu W, Jia Y et al (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
47. Tan M, Le Q (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, PMLR, pp 6105–6114
48. Tripathy JK, Chakkaravarthy SS, Satapathy SC et al (2020) Albert-based fine-tuning model for cyberbullying analysis. Multimed Syst 2020:1–9
49. Valdez DB, Godmalin RAG (2021) A deep learning approach of recognizing natural disasters on images using convolutional neural network and transfer learning. In: Proceedings of the international conference on artificial intelligence and its applications, pp 1–7
50. Yu Y, Tang S, Aizawa K et al (2018) Category-based deep cca for fine-grained venue discovery from multimodal data. IEEE Trans Neural Netw Learn Syst 30(4):1250–1258
51. Yu Y, Tang S, Raposo F et al (2019) Deep cross-modal correlation learning for audio and lyrics in music retrieval. ACM Trans Multimed Comput Commun Appl 15(1):1–16
52. Zahra K, Imran M, Ostermann FO (2020) Automatic identification of eyewitness messages on twitter during disasters. Inf Process Manag 57(1):102,107

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.