Review
Review of Visual Saliency Prediction: Development Process
from Neurobiological Basis to Deep Models
Fei Yan 1 , Cheng Chen 1 , Peng Xiao 1 , Siyu Qi 1 , Zhiliang Wang 1 and Ruoxiu Xiao 1,2, *
1 The School of Computer and Communication Engineering, University of Science and Technology Beijing,
Beijing 100083, China; b20140367@xs.ustb.edu.cn (F.Y.); b20170310@xs.ustb.edu.cn (C.C.);
xp_0311@163.com (P.X.); m202120772@xs.ustb.edu.cn (S.Q.); wzl@ustb.edu.cn (Z.W.)
2 Beijing Engineering and Technology Center for Convergence Networks and Ubiquitous Services,
School of Computer and Communication Engineering, University of Science and Technology Beijing,
Beijing 100083, China
* Correspondence: xiaoruoxiu@ustb.edu.cn
Abstract: The human attention mechanism can be understood and simulated by closely associating
the saliency prediction task to neuroscience and psychology. Furthermore, saliency prediction is
widely used in computer vision and interdisciplinary subjects. In recent years, with the rapid
development of deep learning, deep models have made amazing achievements in saliency prediction.
Deep learning models can automatically learn features, thus solving many drawbacks of the classic
models, such as handcrafted features and task settings, among others. Nevertheless, the deep
models still have some limitations, for example in tasks involving multi-modality and semantic
understanding. This study focuses on summarizing the relevant achievements in the field of saliency
prediction, including the early neurological and psychological mechanisms and the guiding role
of classic models, followed by the development process and data comparison of classic and deep
saliency prediction models. This study also discusses the relationship between the model and human vision, the factors that cause semantic gaps, the influences of attention in cognitive research, the limitations of saliency models, and the emerging applications, in order to provide help and advice for follow-up work on saliency prediction.

Keywords: visual attention; visual saliency; saliency prediction; deep learning
model based on the bottom-up model, from which the deep learning models have gradually
flourished. Compared with the classic models, the performance of these newly developed
models has been greatly improved, and their performance is gradually approaching that of the human inter-observer baseline. The significance of research on visual saliency detection lies
in two aspects: first, as a verifiable prediction, it can be used as a model-based hypothe-
sis test to understand human attention mechanisms at the behavioral and neural levels.
Second, the saliency prediction model based on the attention mechanism has been widely
used in numerous ways, such as target prediction [4], target tracking [5], image segmen-
tation [6], image classification [7], image stitching [8], video surveillance [9], image or
video compression [10], image or video retrieval [11], salient object detection [12], video
segmentation [13], image cropping [14], visual SLAM (Simultaneous Localization and Map-
ping) [15], end-to-end driving [16], video question answering [17], medical diagnosis [18],
health monitoring [19] and so on.
The current research on saliency detection mainly involves two types of tasks, namely,
saliency prediction (or eye fixation prediction) and Salient Object Detection (SOD). Both
types of tasks aim to detect the most significant area of a picture or a video. However,
differences exist between these two types of models and their application scenarios. Saliency prediction is informed by the human visual attention mechanism and predicts the probability that the human eye will fixate on a certain position in the scene. By contrast, salient object detection, as the other branch, focuses on perception and description at the object level and is a pure computer vision task. The two types of tasks are shown in Figure 1.
Figure 1. Two saliency detection tasks: (a) Original image, (b) Saliency prediction task, (c) Salient
object detection task.
applications, the saliency prediction task is related to human vision itself, and it is closely
related to neuroscience and psychology. Consequently, saliency prediction has been widely
used in interdisciplinary and emerging subjects. The main contributions of this study are
as follows:
• This research focused on the task of saliency prediction, analyzed the psychological
and physiological mechanisms related to saliency prediction, introduced the classic models that have shaped saliency prediction, and determined the impact of these theories on deep learning models.
• The visual saliency model based on deep learning was analyzed in detail, and the
performance evaluation measures of the representative experimental datasets and
the model under static and dynamic conditions were discussed and summarized,
respectively.
• The limitations of the current deep learning model were analyzed, the possible di-
rections for improvement were proposed, new application areas based on the latest
progress of deep learning were discussed, and the contribution and significance of
saliency prediction with respect to future development trends were presented.
contrast model is based on the physiological and psychological principles of FIT and the
center surround antagonism, and it defines a certain mechanism when selecting salient
areas in an image to realize the simulation of the visual attention mechanism. For example,
the earliest model of Itti [4] could simulate the process of shifting human visual attention
without any prior information. According to the features captured from images, the model
analyzed visual stimuli, allocated computing resources, selected the salient areas in the
scene according to the saliency intensity of different positions, and simulated the process of
human visual attention transfer. Although the performance of the model was modest, it was the first successful attempt built directly on a neurobiological model, which is of great significance.
Since then, other researchers have contributed improvements. Harel et al. [28] proposed the graph-based visual saliency (GBVS) model, which forms activation maps on certain feature channels using Markov chains defined over image graphs, and then normalizes them in a way that highlights conspicuity and admits combination with other maps. Ma and Zhang [29] used local contrast analysis to extract the
saliency maps of an image, and on this basis, Tie Liu et al. [30] used 9 × 9 neighborhoods
and adopted a conditional random field (CRF) learning model. Borji [31] analyzed local
rarity based on sparse coding. Zhang and Sclaroff [32] proposed a saliency prediction model
based on Boolean Map. In addition, researchers have used other models to predict saliency
by using local or global contrast. Some of the notable examples include the pixel-level
contrast saliency model proposed by Zhai and Shah [33], the sliding windows-based model
for global contrast calculation proposed by Wei [34], the color contrast linear fusion model
proposed by Margolin [35], the frequency tuning model proposed by Achanta [36], and the
color space quantization model proposed by Cheng [37]. Other researchers have used the
superpixel [38–40] as the processing unit to calculate the variance of color space distribution
as a means of improving the computational efficiency.
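To make the centre-surround contrast idea behind these classic models concrete, the following minimal sketch (an illustration under simplifying assumptions, not a faithful reimplementation of [4] or of any model above) computes a single-channel contrast map as the difference between a fine-scale and a coarse-scale Gaussian-blurred version of the image; a full model would compute such maps per feature channel, normalize them, and sum them.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_saliency(gray, center_sigma=1.0, surround_sigma=8.0):
    """Toy centre-surround contrast map in the spirit of Itti-style models.

    gray: 2-D float array in [0, 1]; the two sigmas are illustrative choices.
    """
    center = gaussian_filter(gray, center_sigma)      # fine-scale ("centre") response
    surround = gaussian_filter(gray, surround_sigma)  # coarse-scale ("surround") response
    contrast = np.abs(center - surround)              # centre-surround difference
    # Simple global normalization so that maps from different feature channels
    # (intensity, colour opponency, orientation) could be combined additively.
    contrast -= contrast.min()
    if contrast.max() > 0:
        contrast /= contrast.max()
    return contrast
```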
Some models have been based on information theory and image transformation. The
essence of these models based on information theory is to calculate the maximum infor-
mation sampling from the visual environment, select the richest part from the scene, and
discard the remaining part. Among them, the Attention based on Information Maximization (AIM) model of Bruce and Tsotsos [41] was influential. The AIM model used Shannon's self-information measure to calculate the saliency of an image. Firstly, a certain number of natural image patches were randomly selected for training to learn a set of basis functions. Then, the image was divided into patches of the same size, the basis coefficients of each patch were extracted as its features through Independent Component Analysis (ICA), the distribution of each feature was obtained through probability density estimation, and finally the self-information of each feature was computed as its saliency.
Other notable models included the incremental coding length model proposed by Hou [42],
the rare linear combination model proposed by Mancas [43], the self-similarity prediction
model proposed by Seo [44] and the Mahalanobis distance calculation model proposed
by Rosenholtz [45]. As for the use of image transformation models for saliency prediction, the spectral residual model proposed by Hou and Zhang [46] did not model the foreground characteristics but rather the statistics of the background; the areas that did not match these background features were taken as the areas of interest. After the spectral residual was calculated, it was mapped back to the spatial domain by the inverse Fourier transform to obtain the saliency map. On this basis, Guo [47] proposed a model that used the phase spectrum to
obtain the saliency map and Holtzman-Gazit [48] extracted a variety of resolutions for the
picture. Zhang and Sclaroff [49] proposed a Boolean map based saliency model (BMS) that discovers surrounded regions via Boolean maps and obtains saliency maps by analyzing their topological structure. Despite being simple to implement and efficient to run, BMS performed well among the classic models.
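As an illustration of the image-transformation approach, the sketch below follows the spectral residual recipe of Hou and Zhang [46]: subtract a locally averaged log-amplitude spectrum from the log-amplitude spectrum, then return to the spatial domain with the original phase. The filter size and smoothing parameter here are illustrative assumptions rather than the values of the original paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray, avg_size=3, blur_sigma=2.5):
    """Sketch of spectral residual saliency; gray is a 2-D float array."""
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)
    # Spectral residual = log amplitude minus its local average.
    residual = log_amplitude - uniform_filter(log_amplitude, size=avg_size)
    # Back to the spatial domain with the original phase, then smooth.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(saliency, blur_sigma)
```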
visual saliency model is mainly based on the Bayesian model. In addition, the Bayesian
model can be regarded as a special case of the decision theoretical model, as both simulate
the biological calculation process of human visual saliency.
The Bayesian model in saliency prediction is a probabilistic combination model that combines scene information and prior information according to Bayes' rule. The model proposed by Torralba et al. [51] multiplied the bottom-up and top-down saliency
maps to obtain the final saliency map. On this basis, Ehinger et al. [52] integrated the
feature prior information of the target into the above framework. Xie et al. [53] proposed
a saliency prediction model based on posterior probability. The SUN model proposed by
Zhang et al. [54] used visual features and spatial location as the prior knowledge.
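A minimal sketch of the multiplicative combination used by these Bayesian models follows; the function and the centre-bias prior in the usage example are illustrative assumptions, not the formulation of any specific paper cited above.

```python
import numpy as np

def bayesian_combination(bottom_up, top_down_prior):
    """Pointwise combination of a bottom-up saliency map with a top-down prior,
    followed by renormalization to a probability distribution."""
    combined = bottom_up * top_down_prior        # posterior ∝ likelihood × prior
    return combined / (combined.sum() + 1e-12)

# Usage (illustrative): a Gaussian centre-bias prior as the top-down term.
h, w = 60, 80
ys, xs = np.mgrid[0:h, 0:w]
center_prior = np.exp(-(((ys - h / 2) / (0.3 * h)) ** 2 + ((xs - w / 2) / (0.3 * w)) ** 2))
saliency = bayesian_combination(np.random.rand(h, w), center_prior)
```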
The decision-theoretic models in saliency prediction are strategy models that choose an optimal plan according to the available information and evaluation criteria, i.e., they ask how to make optimal decisions about the perceptual information of the surrounding environment. Gao and Vasconcelos [55,56] argued that salient features in the recognition process are those that best discriminate a class of interest from the others, and they defined top-down attention as a classification problem with the smallest expected error. Kim et al. [57] proposed a
temporal and spatial saliency model based on motion perception grouping. Gu et al. [58]
proposed a model based on the decision theory mechanism to predict regions of interest.
Early machine learning models often used a variety of machine learning methods, such as Convolutional Neural Networks (CNNs), Support Vector Machines (SVMs), or kernel density estimation, and they mostly combined bottom-up and top-down cues. Notable examples included the nonlinear mapping model proposed by
Kienzle et al. [59], the regression classifier model proposed by Peters et al. [60], and the
linear SVM model proposed by Judd et al. [61]. Those early machine learning models had
a certain exploratory nature for subsequent deep learning models, and they played an
important guiding role for the subsequently developed deep learning models.
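To illustrate this learning-based recipe, the sketch below trains a linear SVM to separate fixated from non-fixated locations using handcrafted per-location features, in the spirit of Judd et al. [61]; the feature matrices and labels here are random placeholders standing in for real extracted features and recorded fixations.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: one row of handcrafted features per image location
# (e.g., local contrast, colour statistics, distance to centre), labelled
# 1 for fixated locations and 0 for sampled non-fixated ones.
X_train = np.random.rand(5000, 12)
y_train = np.random.randint(0, 2, 5000)

clf = LinearSVC(C=1.0, max_iter=10000)
clf.fit(X_train, y_train)

# At test time, the signed decision value at every location of an image is
# reshaped into a saliency map; larger values mean "more likely fixated".
X_test = np.random.rand(60 * 80, 12)
saliency_map = clf.decision_function(X_test).reshape(60, 80)
```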
Although these classical models were designed in a variety of ways, their performance
gradually reached a bottleneck due to handcrafted features. The development process of
neurobiological models and classic models is shown in Figure 2.
researchers have used deep models to study saliency prediction, and the application of
deep models in static and dynamic saliency prediction has achieved better results.
The success of such models indicated that the choice of an appropriate loss function can itself be a means of improving prediction performance.
In recent years, some excellent models have been proposed for saliency prediction.
Jia et al. [77] proposed a saliency model called EML-Net based on the similarities between
images and the integration of Extreme Learning Machines (ELMs). Wang et al. [78] pro-
posed the Deep Visual Attention (DVA) model in which the architecture was trained in mul-
tiple scales to predict pixel saliency based on a skip-layer network. The model proposed by
Gorji [79] used shared attention to enhance saliency prediction. Dodge et al. [80] proposed
a model called MxSalNet, which was formulated as a mixture of experts. Mahdi et al. [81]
proposed a deep feature-based saliency (DeepFeat) model to utilize features by combining
bottom-up and top-down saliency maps. Kroner et al. [82] proposed MSI-Net, which is based on an encoder–decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features.
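As a structural illustration of such encoder–decoder designs, the PyTorch sketch below implements a multi-scale context module with parallel dilated convolutions followed by a 1×1 fusion layer, loosely in the spirit of the module described above; the channel sizes and dilation rates are illustrative assumptions, and the encoder and decoder themselves are omitted.

```python
import torch
import torch.nn as nn

class MultiScaleContext(nn.Module):
    """Parallel dilated convolutions over encoder features, fused into a
    single-channel saliency logit map (illustrative sketch only)."""

    def __init__(self, in_ch=512, branch_ch=128, dilations=(1, 4, 8, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(branch_ch * len(dilations), 1, kernel_size=1)

    def forward(self, features):
        multi_scale = torch.cat([torch.relu(b(features)) for b in self.branches], dim=1)
        return self.fuse(multi_scale)

# Usage: logits = MultiScaleContext()(torch.randn(1, 512, 30, 40))
```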
applied generative adversarial simulation models to train deep models. The development
process of saliency prediction models is shown in Figure 3.
with a total duration of 19,420 s. The DHF1K dataset also provides calibration for
movement mode and number of objects, among others, thus providing convenience
for studying high-level information of the dynamic attention mechanism.
• LEDOV dataset: The LEDOV dataset [101] was established by Jiang et al. in 2018. It includes daily activities, sports, social activities, art performances, and other content. Its 538 videos, with a resolution of 720p, contain a total of 179,336 frames and 5,058,178 gaze locations.
For early dynamic models, the DIEM, Hollywood-2, and UCF-sports datasets were
the three most widely used datasets in video saliency research. In recent years, with the continuous updating of datasets, more models have also adopted the DHF1K dataset for training and testing, as it offers a large amount of data and covers a wide range of scene content.
where P is the normalized predicted saliency map, Q marks the gaze points of the human eye, i indexes the pixels, and N is the total number of gaze points. A positive NSS indicates consistency between the predicted map and the fixations, whereas a negative NSS indicates the opposite. The NSS value is positively correlated with model performance: a higher NSS means a better prediction.
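The NSS equation itself is not included in this excerpt; for reference, the standard definition (as used, for example, by the MIT saliency benchmark) averages the z-score-normalized prediction over the fixated pixels:

NSS(P, Q) = (1/N) ∑_i P̄_i × Q_i,  with  P̄ = (P − μ(P)) / σ(P)  and  N = ∑_i Q_i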
• Linear Correlation Coefficient (CC): The CC is a statistic used to measure the linear correlation between two random variables. For saliency prediction evaluation, the predicted saliency map (P) and the ground-truth map (G) can be regarded as the two random variables. The calculation formula of CC is

CC(P, G) = cov(P, G) / (σ(P) × σ(G))    (2)

where cov is the covariance and σ is the standard deviation. The CC treats false positives and false negatives symmetrically and takes values in (−1, 1); a value close to 1 indicates better model performance.
• Earth Mover's Distance (EMD): EMD [105] represents the distance between the two
2D maps denoted by G and S, and it calculates the minimum cost of converting the
estimated probability distribution of the saliency map S into the probability distribu-
tion of the GT map denoted by G. Therefore, a low EMD corresponds to a high-quality
saliency map. In saliency prediction, EMD represents the minimum cost of converting
the probability distribution of the saliency map into human-eye attention maps called
the fixation map.
• Kullback–Leibler (KL) Divergence: KL divergence is a general information theory
measurement corresponding to the difference between two probability distributions.
The calculation formula of KL is
KL(P, G) = ∑_i G_i log(ε + G_i / (ε + P_i))    (3)
Similar to other distribution-based measures, KL divergence takes the predicted saliency map (P) and the ground-truth map (G) as the input and evaluates the loss of information when P is used to approximate G, where ε is a regularization constant. Furthermore, KL divergence is an asymmetric dissimilarity measure. A low score indicates that the saliency map is close to the ground truth.
• Similarity Metric (SIM): SIM measures the similarity between two distributions. After normalizing the input maps, SIM is calculated as the sum of the minimum values at each pixel. The calculation formula of SIM is

SIM(P, G) = ∑_i min(P_i, G_i)    (4)

Given the predicted saliency map (P) and the ground-truth map (G), a SIM of 1 means that the distributions are identical, whereas a SIM of 0 means no overlap. SIM penalizes predictions that fail to account for all of the ground-truth density.
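For concreteness, the following sketch implements the distribution-based measures of Equations (2)–(4) together with NSS in NumPy; the normalization conventions (e.g., the exact ε and whether maps are re-scaled to sum to one) vary slightly between benchmarks, so this is an illustrative reference rather than the official benchmark code.

```python
import numpy as np

EPS = np.finfo(np.float64).eps  # small regularization constant, cf. Eq. (3)

def as_distribution(s):
    """Shift and scale a saliency map so that it is non-negative and sums to 1."""
    s = s - s.min()
    return s / (s.sum() + EPS)

def cc(pred, gt):
    """Linear correlation coefficient, Eq. (2)."""
    p, g = pred - pred.mean(), gt - gt.mean()
    return (p * g).sum() / (np.sqrt((p ** 2).sum() * (g ** 2).sum()) + EPS)

def kl_div(pred, gt):
    """KL divergence between the two distributions, Eq. (3); lower is better."""
    p, g = as_distribution(pred), as_distribution(gt)
    return np.sum(g * np.log(EPS + g / (p + EPS)))

def sim(pred, gt):
    """Similarity metric, Eq. (4): histogram intersection of the two maps."""
    p, g = as_distribution(pred), as_distribution(gt)
    return np.minimum(p, g).sum()

def nss(pred, fixation_map):
    """Normalized scanpath saliency: mean z-scored prediction at fixated pixels."""
    z = (pred - pred.mean()) / (pred.std() + EPS)
    return z[fixation_map > 0].mean()
```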
In general, these evaluation measures are complementary. A good model should perform well under a variety of evaluation measures, because the measures reflect different aspects of the saliency map. Usually, several evaluation measures are selected when evaluating a model. As a widely used location-based measure, AUC is essential. At the same time, distribution-based measures such as CC and SIM should also be selected to reflect other aspects of the saliency map, such as the relative weighting of salient regions or the similarity between distributions.
Thus far, we have summarized the six common evaluation measures above according to whether they treat the ground truth (GT) as discrete fixation locations or as a continuous distribution, and whether they are based on statistics or on classification. The details are shown in Table 1.
Table 2. Performance of saliency models on the MIT300 dataset.

Model Name AUC-Judd AUC-Borji sAUC SIM EMD CC NSS KL
infinite humans 0.92 0.88 0.81 1 0 1 3.29 0
Deep Gaze II [68] 0.88 0.86 0.72 0.46 3.98 0.52 1.29 0.96
EML-NET [77] 0.88 0.77 0.7 0.68 1.84 0.79 2.47 0.84
DeepFix [66] 0.87 0.8 0.71 0.67 2.04 0.78 2.26 0.63
SALICON [69] 0.87 0.85 0.74 0.6 2.62 0.74 2.12 0.54
SAM-ResNet [75] 0.87 0.78 0.7 0.68 2.15 0.78 2.34 1.27
SAM-VGG [75] 0.87 0.78 0.71 0.67 2.14 0.77 2.3 1.13
SalGAN [76] 0.86 0.81 0.72 0.63 2.29 0.73 2.04 1.07
ML-Net [74] 0.85 0.75 0.7 0.59 2.63 0.67 2.05 1.1
Deep Gaze I [63] 0.84 0.83 0.66 0.39 4.97 0.48 1.22 1.23
SalNet [71] 0.83 0.82 0.69 0.52 3.31 0.58 1.51 0.81
eDN [62] 0.82 0.81 0.62 0.41 4.56 0.45 1.14 1.1
Judd Model [61] 0.81 0.8 0.6 0.42 4.45 0.47 1.18 1.12
GBVS [28] 0.81 0.8 0.63 0.48 3.51 0.48 1.24 0.87
AIM [41] 0.77 0.75 0.66 0.4 4.73 0.31 0.79 1.18
IttiKoch2 [4] 0.75 0.74 0.63 0.44 4.26 0.37 0.97 1.03
SUN saliency [54] 0.67 0.66 0.61 0.38 5.1 0.25 0.68 1.27
Thus far, the CAT2000 dataset comprises a total of 31 evaluated models, 10 of which
are neural network-based models. The obtained results are shown in Table 3.
Table 3. Performance of saliency models on the CAT2000 dataset.

Model Name AUC-Judd AUC-Borji sAUC SIM EMD CC NSS KL
infinite humans 0.9 0.84 0.62 1 0 1 2.85 0
SAM-ResNet [75] 0.88 0.8 0.58 0.77 1.04 0.89 2.38 0.56
SAM-VGG [75] 0.88 0.79 0.58 0.76 1.07 0.89 2.38 0.54
MSI-Net [82] 0.88 0.82 0.59 0.75 1.07 0.87 2.3 0.36
EML-NET [77] 0.87 0.79 0.59 0.75 1.05 0.88 2.38 0.96
DeepFix [66] 0.87 0.81 0.58 0.74 1.15 0.87 2.28 0.37
BMS [49] 0.85 0.84 0.59 0.61 1.95 0.67 1.67 0.83
eDN [62] 0.85 0.84 0.55 0.52 2.64 0.54 1.3 0.97
iSEEL [106] 0.84 0.81 0.59 0.62 1.78 0.66 1.67 0.92
Judd Model [61] 0.84 0.84 0.56 0.46 3.6 0.54 1.3 0.94
EYMOL [107] 0.83 0.76 0.51 0.61 1.91 0.72 1.78 1.67
LDS [108] 0.83 0.79 0.56 0.58 2.09 0.62 1.54 0.79
FES [109] 0.82 0.76 0.54 0.57 2.24 0.64 1.61 2.1
Aboudib Magn [110] 0.81 0.77 0.55 0.58 2.1 0.64 1.57 1.41
GBVS [28] 0.8 0.79 0.58 0.51 2.99 0.5 1.23 0.8
Context-Aware saliency [111] 0.77 0.76 0.6 0.5 3.09 0.42 1.07 1.04
IttiKoch2 [4] 0.77 0.76 0.59 0.48 3.44 0.42 1.06 0.92
AWS [112] 0.76 0.75 0.61 0.49 3.36 0.42 1.09 0.94
AIM [41] 0.76 0.75 0.6 0.44 3.69 0.36 0.89 1.13
WMAP [113] 0.75 0.69 0.6 0.47 3.28 0.38 1.01 1.65
Torralba saliency [51] 0.72 0.71 0.58 0.45 3.44 0.33 0.85 1.6
Murray model [72] 0.7 0.7 0.59 0.43 3.79 0.3 0.77 1.14
SUN saliency [54] 0.7 0.69 0.57 0.43 3.42 0.3 0.77 2.22
Achanta [36] 0.57 0.55 0.52 0.33 4.46 0.11 0.29 2.31
IttiKoch [4] 0.56 0.53 0.52 0.34 4.66 0.09 0.25 6.71
Table 2 shows the results on the MIT300 dataset, with the models arranged in descending order of AUC-Judd. The top models are all based on deep learning. EML-NET performed best overall, achieving the highest scores under several measures. Based on the AUC-Judd measure, DeepGaze II and EML-NET share the top ranks with a score of 0.88. DeepGaze II ranks first in AUC-Borji with a score of 0.86. Based on the sAUC measure, SALICON performed best with a score of 0.74. The rankings produced by different evaluation measures vary greatly: DeepGaze II and DeepFix perform well on AUC, but their other scores are average. Although SAM-ResNet, SAM-VGG, EML-NET and SalGAN did not achieve the highest AUC scores, these models are outstanding on the distribution-based measures.
Table 3 shows the results on the CAT2000 dataset, again arranged in descending order of AUC-Judd. Based on the AUC-Judd measure, SAM-ResNet, SAM-VGG and MSI-Net are tied at the top rank with 0.88 (the infinite-humans baseline scores 0.90). Among the classic models, BMS performs best: its AUC-Borji score is among the highest in the table, and most of its other scores exceed those of eDN. In general, models that perform well on the MIT300 dataset also perform well on the CAT2000 dataset.
The saliency maps of the model over the CAT2000 database are shown in Figure 4.
AUC-Judd, sAUC, NSS, CC, and SIM are used as the five evaluation measures to
judge the performance of the model over the DHF1K dataset. The average is taken after
calculating the score for each frame. The evaluation results are mainly based on the public
results of the DHF1K dataset. The model performance is shown in Table 4.
Table 4. Performance of static and dynamic saliency models on the DHF1K dataset.

Model Name AUC-Judd sAUC CC NSS SIM
Static models:
DVA [78] 0.86 0.595 0.358 2.013 0.262
SALICON [69] 0.857 0.59 0.327 1.901 0.232
JuntingNet [71] 0.855 0.592 0.331 1.775 0.201
Shallow-Net [71] 0.833 0.529 0.295 1.509 0.182
GBVS [28] 0.828 0.554 0.283 1.474 0.186
ITTI [4] 0.774 0.553 0.233 1.207 0.162
Dynamic models:
ACLNet [92] 0.89 0.601 0.434 2.354 0.315
OM-CNN [90] 0.856 0.583 0.344 1.911 0.256
Two-stream [85] 0.834 0.581 0.325 1.632 0.197
FANG [114] 0.819 0.537 0.273 1.539 0.198
RUDOY [115] 0.769 0.501 0.285 1.498 0.214
STRA-Net [88] 0.895 0.663 0.458 2.588 0.355
AWS-D [116] 0.703 0.513 0.174 0.94 0.157
PQFT [117] 0.699 0.562 0.137 0.749 0.139
OBDL [118] 0.638 0.5 0.117 0.495 0.171
SEO [119] 0.635 0.499 0.07 0.334 0.142
STRA-Net ranks first on all measures, followed by ACLNet; among the remaining dynamic models, OM-CNN performs best. Among the static models, DVA and SALICON perform best. The results indicate that deep models outperform classic models even when the latter are augmented with temporal information.
of SALICON improved the results, the gap with respect to datasets in related fields (e.g., ImageNet) remains an order of magnitude. The JFT-300M dataset recently collected by Sun et al. [121] contains 300 million images, and a target recognition model trained on it performs well. The difference in performance between models trained on eye-tracking data and on SALICON-style data collected with mouse tracking also remains controversial.
2. Multi-modal approaches: With the development of saliency prediction in the dynamic
field, an increasing number of features in different modes, such as vision, hearing,
and subtitles, can be used to train models. This multi-modal feature input mode has
proven to be an effective way to improve model performance. Coutrot et al. [122] used
audio data to help video prediction. The shared attention proposed by Gorji et al. [79]
could effectively improve model performance.
3. Visualization: The black-box nature of deep learning models is difficult to present in a manner that humans can understand. However, saliency prediction itself is a representation of visual concepts. Visualizing CNNs has many benefits for understanding models, including revealing the meaning of filters, visual patterns, or visual concepts. Bylinskii et al. [123] designed a visual dataset and found that a specific type of database may be better suited for training. Visualization can help us better understand a model, and it also opens up the possibility of proposing better models and databases.
4. Understand high-level semantics: The deep saliency models are good at extracting
common features, such as humans and textures, among others. The saliency pre-
dictor can also be used to handle these features. However, as shown in Figure 5,
the most interesting or significant parts of an image are not necessarily all of these
features. Human visual models often entail a reasoning process based on sensory
stimuli. To establish the reason behind the relative importance of image regions on
the saliency model, researchers can use higher-level features, such as emotions, gaze
direction, and body posture. Moreover, aiming to approach the human-level saliency
prediction, researchers need to carry out cognitive attention research to help over-
come the aforementioned limitations. A few useful explorations have been offered.
For example, Zhao [98] showed through his experimental results that emotion has
a priority effect. Nonetheless, the existing saliency model still cannot fully explain
the high-level semantics in the scene. The concept of “semantic gap” and the process
of determining the relative importance of objects still cannot be resolved; moreover,
whether the saliency in natural scenes is guided by objects or low-level features is a
matter of debate [124]. Research on the saliency prediction task is closely related to the cognitive disciplines, and its findings can help to improve subsequent visual research.
With the great success of the deep model in saliency prediction, new developments
in deep learning have also provided the possibility for new applications and tasks of
saliency models. For example, Aksoy et al. [16] proposed a novel attention-based model
for making braking decisions and other driving decisions like steering and acceleration.
Jia et al. [19] proposed a multimodal salient wave detection network for sleep staging called
SalientSleepNet, which translated the time-series classification problem into a saliency detection problem and applied it to sleep stage classification. Wei et al. [125] used a saliency model in their research on autism spectrum disorder (ASD). They found that children with ASD attended more to objects of special interest and less to social objects (e.g., faces), and that applying saliency models is helpful for monitoring and evaluating their condition. O’Shea et al. [126] proposed a
model for detecting seizure events from raw electroencephalogram (EEG) signals with less
dependency on the availability of precise clinical labels. This work opens new avenues
for the application of deep learning to neonatal EEG. Theis et al. [127] used dense networks and Fisher pruning to increase the speed of saliency computation by roughly ten times, providing ideas for applications with strict real-time requirements.
Fan et al. [128] proposed a model to infer shared attention in third-person social scene videos, which is significant for studying human social interactions. They also introduced a new video dataset, VACATION [129], and a spatio-temporal graph reasoning model to explicitly represent the diverse gaze interactions in social scenes and to infer atomic-level gaze communication by message passing.
9. Conclusions
The development of visual saliency prediction tasks has produced numerous methods,
and all of them have played an important role in various research directions. Deep networks
can automatically capture features and effectively combine feature extraction and saliency
prediction. Furthermore, performance can be significantly improved with respect to the
classic model that uses handcrafted features. However, the features extracted by the
deep saliency model may not fully represent the salient objects and regions in an image,
especially in complex scenes that contain advanced information, such as emotion, text,
or symbolic information. To further improve model performance, the reasoning process of the human visual system (HVS) must be imitated so that the relatively important areas in a scene can be discriminated.
In this review, we have summarized the literature about saliency prediction, including
the early psychological and physiological mechanisms, the classic models affected by
this task, the introduction of visual saliency models based on deep learning, and the
data comparisons and summaries in the static and dynamic fields. The reasons for the
superiority and the limitations of the saliency model are also analyzed, and the ways of
improvement and possible development directions are identified. Although the visual
saliency model based on deep learning has made great progress, there is still room for
exploration in the aspects of visualization and multi-modality and the understanding of
high-level semantics, especially the research on attention mechanisms and the application
related to cognitive science.
Author Contributions: Conceptualization, F.Y., C.C. and R.X.; investigation, F.Y. and P.X.; resources,
S.Q. and C.C.; writing—original draft preparation, F.Y.; writing—review and editing, R.X.; supervi-
sion, Z.W.; project administration, R.X.; funding acquisition, R.X. and Z.W. All authors have read and
agreed to the published version of the manuscript.
Funding: This research was funded by National Key Research and Development Program
(2019YFB2101902), National Natural Science Foundation of China (62176268), Non-profit Central
Research Institute Fund of Chinese Academy of Medical Sciences (2020-JKCS-008), Major Science and
Technology Project of Zhejiang Province Health Commission (WKJ-ZJ-2112), and the Fundamental
Research Funds for the Central Universities (FRF-BD-20-11A, FRF-DF-20-05).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Itti, L.; Koch, C. Computational modelling of visual attention. Nat. Rev. Neurosci. 2001, 2, 194–203. [CrossRef] [PubMed]
2. Sziklai, G.C. Some studies in the speed of visual perception. IRE Trans. Inf. Theory 1956, 76, 125–128. [CrossRef]
3. Koch, K.; Mclean, J.; Segev, R.; Freed, M.A.; Michael, I.I.; Balasubramanian, V.; Sterling, P. How Much the Eye Tells the Brain. Curr.
Biol. 2006, 16, 1428–1434. [CrossRef] [PubMed]
4. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [CrossRef]
5. Han, J.; Ngan, K.N.; Li, M.; Zhang, H.J. Unsupervised extraction of visual attention objects in color images. IEEE Trans. Circuits
Syst. Video Technol. 2005, 16, 141–145. [CrossRef]
6. Jung, C.; Kim, C. A Unified Spectral-Domain Approach for Saliency Detection and Its Application to Automatic Object Segmenta-
tion. IEEE Trans. Image Process. A Publ. IEEE Signal Process. Soc. 2012, 21, 1272–1283. [CrossRef]
7. Siagian, C.; Itti, L. Biologically Inspired Mobile Robot Vision Localization. IEEE Trans. Robot. 2009, 25, 861–873. [CrossRef]
8. Koch, C.; Ullman, S. Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry. Hum. Neurobiol. 1987, 4,
219–227.
9. Tong, Y.; Cheikh, F.A.; Guraya, F.; Konik, H.; Trémeau, A. A Spatiotemporal Saliency Model for Video Surveillance. Cogn. Comput.
2011, 3, 241–263.
10. Itti, L. Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Trans. Image Process.
2004, 13, 1304–1318. [CrossRef]
11. Monga, V.; Evans, B.L. Perceptual Image Hashing Via Feature Points: Performance Evaluation and Tradeoffs. IEEE Trans. Image
Process. A Publ. IEEE Signal Process. Soc. 2006, 15, 3452–3465. [CrossRef]
12. Wang, W.; Shen, J.; Dong, X.; Borji, A.; Yang, R. Inferring Salient Objects from Human Fixations. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1913–1927. [CrossRef]
13. Wang, W.; Shen, J.; Lu, X.; Hoi, S.C.H.; Ling, H. Paying Attention to Video Object Pattern Understanding. IEEE Trans. Pattern
Anal. Mach. Intell. 2020, 43, 2413–2428. [CrossRef]
14. Wang, W.; Shen, J.; Ling, H. A Deep Network Solution for Attention and Aesthetics Aware Photo Cropping. IEEE Trans. Pattern
Anal. Mach. Intell. 2018, 41, 1531–1544. [CrossRef]
15. Wang, K.; Ma, S.; Chen, J.; Lu, J. Salient Bundle Adjustment for Visual SLAM. IEEE Trans. Instrum. Meas. 2020, 70, 1–9. [CrossRef]
16. Aksoy, E.; Yazc, A.; Kasap, M. See, Attend and Brake: An Attention-based Saliency Map Prediction Model for End-to-End Driving.
arXiv 2020, arXiv:2002.11020.
17. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical co-attention for visual question answering. Adv. Neural Inf. Process. Syst. 2016, 29,
289–297.
18. Wang, S.; Jiang, M.; Duchesne, X.; Laugeson, E.; Kennedy, D.; Adolphs, R.; Zhao, Q. Atypical Visual Saliency in Autism Spectrum
Disorder Quantified through Model-Based Eye Tracking. Neuron 2015, 88, 604–616. [CrossRef]
19. Jia, Z.; Lin, Y.; Wang, J.; Wang, X.; Xie, P.; Zhang, Y. SalientSleepNet: Multimodal Salient Wave Detection Network for Sleep
Staging. arXiv 2021, arXiv:2105.13864.
20. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Yang, R. Salient Object Detection in the Deep Learning Era: An In-depth Survey. IEEE Trans.
Pattern Anal. Mach. Intell. 2021, 1448–1457. [CrossRef]
21. Wang, W.; Shen, J.; Cheng, M.M.; Shao, L. An Iterative and Cooperative Top-Down and Bottom-Up Inference Network for Salient
Object Detection. In Proceedings of the CVPR19, Long Beach, CA, USA, 16–20 June 2019.
22. Wang, W.; Zhao, S.; Shen, J.; Hoi, S.; Borji, A. Salient Object Detection With Pyramid Attention and Salient Edges. In Proceedings
of the CVPR19, Long Beach, CA, USA, 16–20 June 2019.
23. Zhang, J.; Dai, Y.; Yu, X.; Harandi, M.; Barnes, N.; Hartley, R. Uncertainty-Aware Deep Calibrated Salient Object Detection. arXiv
2020, arXiv:2012.06020.
24. Zhang, P.; Liu, W.; Zeng, Y.; Lei, Y.; Lu, H. Looking for the Detail and Context Devils: High-Resolution Salient Object Detection.
IEEE Trans. Image Process. 2021, 30, 3204–3216. [CrossRef]
25. Treisman, A.M.; Gelade, G. A feature-integration theory of attention. Cogn. Psychol. 1980, 12, 97–136. [CrossRef]
26. Treisman, A. Feature binding, attention and object perception. Philos. Trans. R. Soc. B Biol. Sci. 1998, 353, 1295–1306. [CrossRef]
27. Wolfe, J.M. Guided Search 2.0 A revised model of visual search. Psychon. Bull. Rev. 1994, 1, 202–238. [CrossRef]
28. Harel, J.; Koch, C.; Perona, P. Graph-Based Visual Saliency. In Proceedings of the IEEE Conference on Advances in Neural
Information Processing Systems, Vancouver, BC, Canada, 4–9 December 2006.
29. Ma, Y.F.; Zhang, H.J. Contrast-based image attention analysis by using fuzzy growing. In Proceedings of the 11th Annual ACM International
Conference on Multimedia, Berkeley, CA, USA, 2–8 November 2003.
30. Liu, T.; Yuan, Z.; Sun, J.; Wang, J.; Zheng, N.; Tang, X.; Shum, H.-Y. Learning to detect a salient object. IEEE Trans. Pattern Anal.
Mach. Intell. 2010, 33, 353–367.
31. Borji, A.; Itti, L. Exploiting local and global patch rarities for saliency detection. In Proceedings of the Conference on Computer
Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012.
32. Zhang, J.; Sclaroff, S. Saliency Detection: A Boolean Map Approach. In Proceedings of the 2013 IEEE International Conference on
Computer Vision, Sydney, Australia, 1–8 December 2013.
33. Zhai, Y.; Shah, M. Visual attention detection in video sequences using spatiotemporal cues. In Proceedings of the 14th ACM
International Conference on Multimedia, Santa Barbara, CA, USA, 23–27 October 2006.
34. Wei, Y.; Jie, F.; Tao, L.; Jian, S. Salient object detection by composition. In Proceedings of the IEEE International Conference on
Computer Vision, ICCV 2011, Barcelona, Spain, 6–13 November 2011.
35. Margolin, R.; Tal, A.; Zelnik-Manor, L. What Makes a Patch Distinct? In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). Portland, OR, USA, 23–28 June 2013.
36. Achanta, R.; Hemami, S.; Estrada, F.; Süsstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE
Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604.
37. Cheng, M.M.; Zhang, G.X.; Mitra, N.J.; Huang, X.; Hu, S.M. Global Contrast Based Salient Region Detection. In Proceedings of
the Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
38. Zhi, L.; Zhang, X.; Luo, S.; Meur, O.L. Superpixel-Based Spatiotemporal Saliency Detection. IEEE Trans. Circuits Syst. Video
Technol. 2014, 24, 1522–1540.
39. Ren, Z.; Hu, Y.; Chia, L.T.; Rajan, D. Improved saliency detection based on superpixel clustering and saliency propagation. In
Proceedings of the Acm International Conference on Multimedia, Firenze, Italy, 25–29 October 2010.
40. Huang, G.; Pun, C.M.; Lin, C. Unsupervised video co-segmentation based on superpixel co-saliency and region merging.
Multimed. Tools Appl. 2016, 76, 12941–12964. [CrossRef]
41. Bruce, N.D.B.; Tsotsos, J.K. Saliency Based on Information Maximization. In Proceedings of the Advances in Neural Information
Processing Systems 18, Vancouver, BC, Canada, 5–8 December 2005.
42. Hou, X. Dynamic visual attention: Searching for coding length increments. In Proceedings of the Advances in Neural Information
Processing Systems (NIPS, 2008), Vancouver, BC, Canada, 8–10 December 2008; pp. 681–688.
43. Mancas, M.; Mancas-Thillou, C.; Gosselin, B.; Macq, B.M. A Rarity-Based Visual Attention Map–Application to Texture Description.
In Proceedings of the International Conference on Image Processing, ICIP 2006, Atlanta, GA, USA, 8–11 October 2006.
44. Seo, H.J.; Milanfar, P. Nonparametric bottom-up saliency detection by self-resemblance. In Proceedings of the 2009 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition Workshops, Miami, FL, USA, 20–25 June 2009; pp. 45–52.
45. Rosenholtz, R.; Nagy, A.L.; Bell, N.R. The effect of background color on asymmetries in color search. J. Vis. 2004, 4, 224–240.
[CrossRef] [PubMed]
46. Hou, X.; Zhang, L. Saliency Detection: A Spectral Residual Approach. In Proceedings of the IEEE Conference on Computer
Vision & Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007.
47. Guo, C.; Qi, M.; Zhang, L. Spatio-temporal Saliency detection using phase spectrum of quaternion fourier transform. In
Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008.
48. Holtzman-Gazit, M.; Zelnik-Manor, L.; Yavneh, I. Salient Edges: A Multi Scale Approach. In Proceedings of the 11th European
Conference on Computer Vision, Crete, Greece, 5–11 September 2010; p. 4310.
49. Zhang, J.; Sclaroff, S. Exploiting Surroundedness for Saliency Detection: A Boolean Map Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 889–902.
50. Borji, A.; Itti, L. State-of-the-Art in Visual Attention Modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 185–207. [CrossRef]
51. Oliva, A.; Torralba, A.; Castelhano, M.S.; Henderson, J.M. Top-down control of visual attention in object detection. In Proceedings
of the International Conference on Image Processing, Barcelona, Spain, 14–18 September 2003; pp. I 253–256.
52. Ehinger, K.A.; Hidalgo-Sotelo, B.; Torralba, A.; Oliva, A. Modelling search for people in 900 scenes: A combined source model of
eye guidance. Vis. Cogn. 2009, 17, 945–978. [CrossRef]
53. Xie, Y.; Lu, H.; Yang, M.H. Bayesian Saliency via Low and Mid Level Cues. IEEE Trans. Image Process. A Publ. IEEE Signal Process.
Soc. 2013, 22, 1689–1698.
54. Zhang, L.; Tong, M.; Marks, H.; Tim, K.; Shan, H.; Cottrell, G. SUN: A Bayesian framework for saliency using natural statistics. J.
Vis. 2008, 8, 32. [CrossRef]
55. Gao, D.; Vasconcelos, N. Discriminant Saliency for Visual Recognition from Cluttered Scenes. In Proceedings of the Advances in
Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004], Vancouver, BC, Canada, 12–18
December 2004.
56. Gao, D.; Vasconcelos, N. Decision-Theoretic Saliency: Computational Principles, Biological Plausibility, and Implications for
Neurophysiology and Psychophysics. Neural Comput. 2014, 21, 239–271. [CrossRef]
57. Kim, H.; Kim, Y.; Sim, J.Y.; Kim, C.S. Spatiotemporal saliency in dynamic scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 32,
171–177.
58. Gu, E.; Wang, J.; Badler, N.I. Generating Sequence of Eye Fixations Using Decision-theoretic Attention Model. In Proceedings of
the IEEE Computer Society Conference on Computer Vision & Pattern Recognition, San Diego, CA, USA, 20–26 June 2005.
59. Kienzle, W.; Franz, M.O.; Scholkopf, B.; Wichmann, F.A. Center-surround patterns emerge as optimal predictors for human
saccade targets. J. Vis. 2009, 9, 1–15. [CrossRef]
60. Peters, R.J.; Itti, L. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention.
In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007.
61. Judd, T.; Ehinger, K.; Durand, F.; Torralba, A. Learning to Predict Where Humans Look. In Proceedings of the IEEE 12th
International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, 27 September–4 October 2009.
62. Vig, E.; Dorr, M.; Cox, D. Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images. In
Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28
June 2014.
63. Kümmerer, M.; Theis, L.; Bethge, M. Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet. arXiv
2014, arXiv:1411.1045.
64. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]
65. Jia, D.; Wei, D.; Socher, R.; Li, L.J.; Kai, L.; Li, F.F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009
IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
66. Kruthiventi, S.; Ayush, K.; Babu, R.V. DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations.
IEEE Trans. Image Process. 2017, 26, 4446–4456. [CrossRef]
67. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
68. Kümmerer, M.; Wallis, T.; Bethge, M. DeepGaze II: Reading fixations from deep features trained on object recognition. arXiv 2016,
arXiv:1610.01563.
69. Jiang, M.; Huang, S.; Duan, J.; Zhao, Q. SALICON: Saliency in Context. In Proceedings of the Computer Vision & Pattern Recognition,
Boston, MA, USA, 7–12 June 2015.
70. Azam, S.; Gilani, S.O.; Jeon, M.; Yousaf, R.; Kim, J.-B. A Benchmark of Computational Models of Saliency to Predict Human
Fixations in Videos. In VISIGRAPP (4: VISAPP); SCITEPRESS—Science and Technology Publications, Lda.: Setúbal, Portugal,
2016; pp. 134–142.
71. Pan, J.; Mcguinness, K.; Sayrol, E.; O’Connor, N.; Giro-I-Nieto, X. Shallow and Deep Convolutional Networks for Saliency
Prediction. arXiv 2016, arXiv:1603.00845.
72. Jetley, S.; Murray, N.; Vig, E. End-to-end saliency mapping via probability distribution prediction. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5753–5761.
73. Liu, N.; Han, J. A Deep Spatial Contextual Long-Term Recurrent Convolutional Network for Saliency Detection. IEEE Trans.
Image Process. A Publ. IEEE Signal Process. Soc. 2018, 27, 3264–3274. [CrossRef]
74. Cornia, M.; Baraldi, L.; Serra, G.; Cucchiara, R. A Deep Multi-Level Network for Saliency Prediction. In Proceedings of the
International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016.
75. Cornia, M.; Baraldi, L.; Serra, G.; Cucchiara, R. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. IEEE Trans. Image Process. 2018, 27, 5142–5154.
76. Pan, J.; Canton, C.; Mcguinness, K.; O’Connor, N.E.; Giro-I-Nieto, X. SalGAN: Visual Saliency Prediction with Generative
Adversarial Networks. arXiv 2017, arXiv:1701.01081.
77. Jia, S.; Bruce, N.D.B. EML-NET:An Expandable Multi-Layer NETwork for Saliency Prediction. arXiv 2018, arXiv:1805.01047.
78. Wang, W.; Shen, J. Deep Visual Attention Prediction. IEEE Trans. Image Process. 2018, 27, 2368–2378.
79. Gorji, S.; Clark, J.J. Attentional Push: A Deep Convolutional Network for Augmenting Image Salience with Shared Attention
Modeling in Social Scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Honolulu, HI, USA, 21–26 July 2017.
80. Dodge, S.; Karam, L. Visual Saliency Prediction Using a Mixture of Deep Neural Networks. IEEE Trans. Image Process. 2017, 27,
4080–4090. [CrossRef]
81. Mahdi, A.; Qin, J.; Crosby, G. DeepFeat: A bottom-up and top-down saliency model based on deep features of convolutional
neural networks. IEEE Trans. Cogn. Dev. Syst. 2019, 12, 54–63. [CrossRef]
82. Kroner, A.; Senden, M.; Driessens, K.; Goebel, R. Contextual encoder–decoder network for visual saliency prediction. Neural Netw. 2020, 129, 261–270.
83. Gao, D.; Mahadevan, V.; Vasconcelos, N. The discriminant center-surround hypothesis for bottom-up saliency. In Proceedings
of the Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural
Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007.
84. Seo, H.J.; Milanfar, P. Using local regression kernels for statistical object detection. In Proceedings of the IEEE International
Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008.
85. Bak, C.; Kocak, A.; Erdem, E.; Erdem, A. Spatio-temporal saliency networks for dynamic saliency prediction. IEEE Trans. Multimed.
2017, 20, 1688–1698. [CrossRef]
86. Chaabouni, S.; Benois-Pineau, J.; Amar, C.B. Transfer learning with deep networks for saliency prediction in natural vide. In
Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016.
87. Leifman, G.; Rudoy, D.; Swedish, T.; Bayro-Corrochano, E.; Raskar, R. Learning Gaze Transitions from Depth to Improve Video
Saliency Estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29
October 2017.
88. Lai, Q.; Wang, W.; Sun, H.; Shen, J. Video Saliency Prediction using Spatiotemporal Residual Attentive Networks. IEEE Trans.
Image Process. 2019, 29, 1113–1126. [CrossRef] [PubMed]
89. Bazzani, L.; Larochelle, H.; Torresani, L. Recurrent Mixture Density Network for Spatiotemporal Visual Attention. arXiv 2016,
arXiv:1603.08199.
90. Jiang, L.; Xu, M.; Wang, Z. Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM. arXiv
2017, arXiv:1709.06316.
91. Gorji, S.; Clark, J.J. Going from Image to Video Saliency: Augmenting Image Salience with Dynamic Attentional Push. In
Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA,
18–23 June 2018.
92. Wang, W.; Shen, J.; Fang, G.; Cheng, M.M.; Borji, A. Revisiting Video Saliency: A Large-Scale Benchmark and a New Model. In
Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA,
18–23 June 2018.
93. Zhang, K. A Spatial-Temporal Recurrent Neural Network for Video Saliency Prediction. IEEE Trans. Image Process. 2020, 30,
572–587. [CrossRef]
94. Xu, M.; Yang, L.; Tao, X.; Duan, Y.; Wang, Z. Saliency Prediction on Omnidirectional Image With Generative Adversarial Imitation
Learning. IEEE Trans. Image Process. 2021, 30, 2087–2102. [CrossRef]
95. Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; Yang, M.H. Saliency Detection via Graph-Based Manifold Ranking. In Proceedings of the
Computer Vision & Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
96. Borji, A.; Itti, L. CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research. arXiv 2015, arXiv:1505.03581.
97. Borji, A.; Tavakoli, H.R.; Sihite, D.N.; Itti, L. Analysis of Scores, Datasets, and Models in Visual Saliency Prediction. In Proceedings
of the IEEE International Conference on Computer Vision, Columbus, OH, USA, 23–28 June 2014.
98. Fan, S.; Shen, Z.; Jiang, M.; Koenig, B.L.; Zhao, Q. Emotional Attention: A Study of Image Sentiment and Visual Attention. In
Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June
2018.
99. Mital, P.K.; Smith, T.J.; Hill, R.L.; Henderson, J.M. Clustering of Gaze During Dynamic Scene Viewing is Predicted by Motion.
Cogn. Comput. 2011, 3, 5–24. [CrossRef]
100. Mathe, S.; Sminchisescu, C. Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition. IEEE
Trans. Pattern Anal. Mach. Intell. 2015, 37, 1408–1424. [CrossRef]
101. Jiang, L.; Xu, M.; Liu, T.; Qiao, M.; Wang, Z. Deepvs: A deep learning based video saliency prediction approach. In Proceedings
of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 602–617.
102. Judd, T.; Durand, F.; Torralba, A. A Benchmark of Computational Models of Saliency to Predict Human Fixations; Technical Report
MIT-CSAIL-TR-2012-001; MIT Libraries: Cambridge, MA, USA, 2012.
103. Borji, A.; Sihite, D.N.; Itti, L. Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative
Study. IEEE Trans. Image Process. 2013, 22, 55–69. [CrossRef]
104. Peters, R.J.; Iyer, A.; Itti, L.; Koch, C. Components of bottom-up gaze allocation in natural images. Vis. Res. 2005, 45, 2397–2416.
[CrossRef]
105. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40,
99–121. [CrossRef]
106. Tavakoli, H.R.; Borji, A.; Laaksonen, J.; Rahtu, E. Exploiting inter-image similarity and ensemble of extreme learners for fixation
prediction using deep features. Neurocomputing 2017, 244, 10–18. [CrossRef]
107. Zanca, D.; Gori, M. Variational Laws of Visual Attention for Dynamic Scenes. In Proceedings of the NIPS 2017, Long Beach, CA,
USA, 4–9 December 2017.
108. Shu, F.; Jia, L.; Tian, Y.; Huang, T.; Chen, X. Learning Discriminative Subspaces on Random Contrasts for Image Saliency Analysis.
IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 1095–1108.
109. Tavakoli, H.R.; Rahtu, E.; Heikkilä, J. Fast and efficient saliency detection using sparse sampling and kernel density estimation. In
Proceedings of the Scandinavian Conference on Image Analysis, Ystad, Sweden, 23–27 May 2011; pp. 666–675.
110. Aboudib, A.; Gripon, V.; Coppin, G. A model of bottom-up visual attention using cortical magnification. In Proceedings of the
IEEE International Conference on Acoustics, South Brisbane, QLD, Australia, 19–24 April 2015.
111. Goferman, S.; Zelnik-Manor, L.; Tal, A. Context-aware saliency detection. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34,
1915–1926. [CrossRef]
112. Garcia-Diaz, A.; Leboran, V.; Fdez-Vidal, X.R.; Pardo, X.M. On the relationship between optical variability, visual saliency, and
eye fixations: A computational approach. J. Vis. 2012, 12, 17. [CrossRef]
113. Lopez-Garcia, F.; Fdez-Vidal, X.R.; Pardo, X.M.; Dosil, R. Scene recognition through visual attention and image features: A
comparison between sift and surf approaches. Object Recognit. 2011, 4, 185–200.
114. Fang, Y.; Wang, Z.; Lin, W. Video Saliency Incorporating Spatiotemporal Cues and Uncertainty Weighting. In Proceedings of the
2013 IEEE International Conference on Multimedia and Expo (ICME), San Jose, CA, USA, 15–19 July 2013.
115. Rudoy, D.; Dan, B.G.; Shechtman, E.; Zelnik-Manor, L. Learning Video Saliency from Human Gaze Using Candidate Selection. In
Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
116. Leboran, V.; Garcia-Diaz, A.; Fdez-Vidal, X.R.; Pardo, X.M. Dynamic whitening saliency. IEEE Trans. Pattern Anal. Mach. Intell.
2016, 39, 893–907. [CrossRef]
117. Guo, C.; Zhang, L. A Novel Multiresolution Spatiotemporal Saliency Detection Model and Its Applications in Image and Video Compression. IEEE Trans. Image Process. 2010, 19, 185–198.
118. Khatoonabadi, S.H.; Vasconcelos, N.; Bajic, I.V.; Shan, N.Y. How many bits does it take for a stimulus to be salient? In Proceedings
of the 2015 IEEE Conference on Computer Vision & Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
119. Seo, H.J.; Milanfar, P. Static and space-time visual saliency detection by self-resemblance. J. Vis. 2009, 9, 15. [CrossRef]
120. Bruce, N.D.B.; Wloka, C.; Frosst, N.; Rahman, S.; Tsotsos, J.K. On computational modeling of visual saliency: Examining what’s
right, and what’s left. Vis. Res. 2015, 116, 95–112. [CrossRef]
121. Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In Proceedings
of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
122. Coutrot, A.; Guyader, N. How saliency, faces, and sound influence gaze in dynamic social scenes. J. Vis. 2014, 14, 5. [CrossRef]
123. Bylinskii, Z.; Alsheikh, S.; Madan, S.; Recasens, A.; Zhong, K.; Pfister, H.; Durand, F.; Oliva, A. Understanding Infographics
through Textual and Visual Tag Prediction. arXiv 2017, arXiv:1709.09215.
124. Stoll, J.; Thrun, M.; Nuthmann, A.; Einhäuser, W. Overt attention in natural scenes: Objects dominate features. Vis. Res. An. Int. J.
Vis. Sci. 2015, 107, 36–48. [CrossRef]
125. Wei, W.; Liu, Z.; Huang, L.; Nebout, A.; Meur, O.L. Saliency Prediction via Multi-Level Features and Deep Supervision for
Children with Autism Spectrum Disorder. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo
Workshops (ICMEW), Shanghai, China, 8–12 July 2019.
126. O’Shea, A.; Lightbody, G.; Boylan, G.; Temko, A. Neonatal seizure detection from raw multi-channel EEG using a fully
convolutional architecture. arXiv 2021, arXiv:2105.13854. [CrossRef]
127. Theis, L.; Korshunova, I.; Tejani, A.; Huszár, F. Faster gaze prediction with dense networks and Fisher pruning. arXiv 2018,
arXiv:1801.05787.
128. Fan, L.; Chen, Y.; Wei, P.; Wang, W.; Zhu, S.C. Inferring Shared Attention in Social Scene Videos. In Proceedings of the IEEE CVPR,
Salt Lake City, UT, USA, 18–23 June 2018.
129. Fan, L.; Wang, W.; Huang, S.; Tang, X.; Zhu, S.C. Understanding Human Gaze Communication by Spatio-Temporal Graph
Reasoning. arXiv 2019, arXiv:1909.02144.