Spatial-Temporal Knowledge-Embedded
Transformer for Video Scene Graph Generation
Tao Pu, Tianshui Chen, Hefeng Wu, Yongyi Lu, Liang Lin
arXiv:2309.13237v3 [cs.CV] 15 Dec 2023
Abstract: Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered on the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between visual representation and the knowledge to generate spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving the mR@50 by 8.1%, 4.7%, and 2.1% on different settings over current algorithms.
Index Terms: Video Scene Graph Generation, Spatial-Temporal Knowledge Learning, Vision and Language
[Figure 1: frames of a video captioned "A woman takes a bottle, then eats medicine and drinks water.", each annotated with a scene graph over person, cup, and sandwich using predicates such as look at, not look at, in front of, touch, eat, drink from, and not contact.]
Fig. 1. Qualitative results of our proposed STKET. The green color indicates visual relationships that emerge in the current image. The gray color indicates visual relationships that disappeared in the current image.
IEEE TRANSACTIONS ON IMAGE PROCESSING
This work was supported in part by National Key R&D Program of China under Grant No. 2021ZD0111601, National Natural Science Foundation of China (NSFC) under Grant No. 62206060, 62272494, and 61836012, GuangDong Basic and Applied Basic Research Foundation under Grant No. SL2022A04J01626, 2023A1515012845, and 2023A1515011374, and Fundamental Research Funds for the Central Universities, Sun Yat-sen University, under Grant No. 23ptpy111, GuangDong Province Key Laboratory of Information Security Technology. (Corresponding author: Tianshui Chen)
Tao Pu, Hefeng Wu, and Liang Lin are with the School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China. (e-mail: putao3@mail2.sysu.edu.cn, wuhefeng@mail.sysu.edu.cn, linliang@ieee.org)
Tianshui Chen and Yongyi Lu are with the School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China. (e-mail: tianshuichen@gmail.com, yylu1989@gmail.com).
I. INTRODUCTION
Scene graph generation (SGG) [1]–[3] aims to depict a visual scene as a structured graph, where nodes correspond to semantic objects and edges refer to the corresponding relationships. It is treated as a promising approach to bridge the significant gap between vision and natural language domains, due to its capacity to represent accurately the semantics of visual contents in the holistic scene. Recently, lots of efforts have been dedicated to demonstrating its efficacy in numerous visual reasoning tasks, including video action recognition [4]–[6], video segmentation [7]–[9], video captioning [10]–[13], and video question answering [14]–[17]. However, the intricate spatial interactions among semantic objects pose a significant challenge to learning visual relationships in static images [18]–[20]. An even more substantial challenge exists in the task of generating scene graphs in videos (VidSGG) [21]–[23], as it further requires exploring temporal motions and interactions, hence making VidSGG a formidable yet unresolved task.
Current works [24], [25] primarily focus on aggregating object-level visual information from spatial and temporal perspectives to learn relationship representation for VidSGG. In contrast, humans rely on not only visual cues but also accumulated prior knowledge of spatial-temporal correlations to discern ambiguous visual relationships. As illustrated in Figure 1, prior knowledge consists of two aspects. 1) Spatial co-occurrence correlations: the relationship between certain object categories tends toward specific interactions. Given the subject and object of person and sandwich, their relationship tends to be eating or holding instead of wiping. 2) Temporal consistency/transition correlations: the relationship of a given pair tends to be consistent across a continuous video clip or has a high probability of transitioning to another specific relationship. Given the subject and object of person and cup with the relationship of holding in the current frame, it is likely to keep the holding relationship or transition to the relationship of drinking. Consequently, integrating these correlations can effectively regularize spatial prediction space within each image and sequential variation space across temporal frames, thereby reducing ambiguous predictions.
In this work, we find that initializing spatial-temporal knowledge embeddings with statistical correlations can better guide the model in learning spatial-temporal correlations, and the multi-head cross-attention mechanisms can better integrate spatial-temporal correlations with visual information. To this end, we propose a novel spatial-temporal knowledge-embedded transformer (STKET), which introduces multi-head cross-attention layers to incorporate the prior spatial-temporal knowledge for aggregating spatial-temporal contextual information. Specifically, we first initialize the spatial co-occurrence and temporal transition correlations via statistical matrices from the training set and then embed these correlations into learnable spatial-temporal knowledge representations. Then, we design a spatial-knowledge embedded layer to exploit the within-image co-occurrence correlation to guide aggregating spatial contextual information and a temporal-knowledge embedded layer to incorporate cross-image transition correlations to help extract temporal contextual information, which can generate spatial- and temporal-embedded relationship representations, respectively. Finally, we aggregate both spatial- and temporal-embedded relationship representations of each object pair to predict the semantic labels and relationships. Compared with current leading competitors, STKET enjoys two appealing advantages: 1) Integrating these correlations can help to better aggregate spatial and temporal contextual information and thus learn more representative relationship representation to facilitate VidSGG. 2) Incorporating these correlations can effectively regularize relationship prediction, which can evidently reduce the dependencies on training samples and thus dramatically improve the performance, especially for the relationships with limited samples.
The contributions of this work are threefold. First, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates spatial co-occurrence and temporal transition correlations to guide aggregating spatial and temporal contextual information, which, on the one hand, facilitates learning more representative relationship representation and, on the other hand, regularizes prediction space to reduce the dependencies on training samples. To our knowledge, this is the first attempt to explicitly integrate spatial and temporal knowledge to promote VidSGG. Second, we introduce unified multi-head cross-attention mechanisms to integrate the spatial and temporal correlations via the spatial and temporal knowledge-embedded layers, respectively. Finally, we conduct experiments on the Action Genome dataset [26] to demonstrate the superiority of STKET. It obtains an obvious performance improvement over current state-of-the-art algorithms, especially for the relationships with limited samples, e.g., with the 9.98%-23.98% R@50 improvement for the top-10 least frequent relationships compared with the previous best-performing algorithm. Codes are available at https://github.com/HCPLab-SYSU/STKET.
II. RELATED WORK
A. Image Scene Graph Generation
Over the past decade, scene graph generation (SGG) [27]–[30] has attracted considerable interest across various communities due to its ability to precisely present the semantics of visual contents of complex visual scenes. It aims to identify all objects and their visual relationships within an image, necessitating visual perception and natural language understanding. Hence, much effort has been invested in aligning visual and semantic spaces for relationship representation learning [31]–[33]. Further research has highlighted the importance of each subject-object pair for inferring ambiguous relationships. Xu et al. [1] propose to iteratively refine predictions by passing contextual messages, and Zellers et al. [34] capture a global context using a bidirectional LSTM. Their impressive performance demonstrates that spatial context is crucial for recognizing visual relationships. Therefore, subsequent research underlines the role of spatial context in generating scene graphs. To this end, many works adopt graph convolutional networks [35] or similar architectures to pass messages among different objects. Chen et al. [36] build a graph that associates detected objects according to statistical correlations and employ it to learn the context among different objects to regularize prediction. Tang et al. [37] design a dynamic tree structure to encode the context among different object regions efficiently. With the impressive progress of transformer [38], more and more works propose utilizing this kind of model to learn more representative features from spatial context. Cong et al. [39] introduce an encoder-decoder architecture to reason about the visual feature context and visual relationships, using different types of attention mechanisms with coupled subject and object queries. Kundu et al. [40] propose contextualized relational reasoning using a two-stage transformer-based architecture for effective reasoning over cluttered, complex semantic structures. Despite impressive progress on static images, leading algorithms may suffer significant performance drops when applied to recognize dynamic visual relationships in videos because this requires an in-depth exploration of temporal consistency/transition correlations across different images.
B. Video Scene Graph Generation
Building on the successful exploration of spatial context within images, researchers have explored the spatial context and temporal correlation simultaneously in video scene graph generation. With the advent of ImageNet-VidVRD [21], a benchmark of video visual relation detection, numerous approaches [22], [23], [41]–[43] have been proposed to employ object-tracking mechanisms to dive into temporal correlations among different image frames. Qian et al. [22] propose to use a graph convolution network to pass messages and conduct reasoning in the fully-connected spatial-temporal graphs. Similarly, Tsai et al. [41] design a gated spatiotemporal energy graph that exploits the statistical dependency between relational entities spatially and temporally. Recently, Teng et al. [43] propose a new detect-to-track paradigm by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Zheng et al. [23] propose a unified one-stage model that exploits static queries and recurrent queries to enable efficient object pair tracking with spatiotemporal contexts. However, introducing object-tracking models results in high computational cost and memory consumption and quickly overwhelms the valuable information due to many irrelevant frames, leading to sub-optimal performance.
An alternative line of research tries to address the task of video scene graph generation based on detected object proposals. Compared with tracking-based VidSGG algorithms, this kind of method focuses on modeling the temporal context in the narrow sliding window, avoiding the prediction shift resulting from the inconsistency among tracking proposals. Recently, Cong et al. [25] propose a strong baseline that adopts a spatial encoder and a temporal decoder to extract implicitly spatial-temporal contexts. This work demonstrates that the temporal correlation is crucial for inferring dynamic visual relationships. Li et al. [24] propose a novel anticipatory pre-training paradigm to model the temporal correlation implicitly. Wang et al. [44] propose to explore temporal continuity by extracting the entire co-occurrence patterns. Kumar et al. [45] further propose to spatially and temporally localize subjects and objects connected via an unseen predicate with the help of only a few support set videos sharing the common predicate. These works underscore the critical role of temporal correlation in inferring dynamic visual relationships but overlook the essential prior knowledge of spatial-temporal correlations. Differently, we propose to incorporate spatial-temporal knowledge with multi-head cross-attention layers to learn more representative and discriminative feature representation to facilitate VidSGG.
Notably, exploiting an object detector to facilitate video understanding is common. Recently, Wang et al. [46] propose a novel SAOA framework to introduce the spatial location provided by the detector for egocentric action recognition, which aims to reason the interaction between humans and objects from an egocentric perspective. Although it aligns local object features and location proposals to capture the spatial context, the SAOA framework ignores the crucial temporal correlation across continuous frames, which is the main distinction between it and our proposed STKET.
C. Knowledge Representation Learning
Recent advances in deep learning have allowed neural networks to learn potent representations from raw training data for various tasks [38], [47]–[49]. However, solely using these vanilla networks may achieve poor performance, especially in weakly supervised learning [50], [51] and domain adaptation [52]. To address this challenge, lots of efforts have been made to integrate domain prior knowledge into deep representation learning, resulting in remarkable progress in numerous computer vision tasks, such as few-shot image recognition, facial expression recognition, visual navigation, scene graph generation [36], [53]–[57]. Specifically, Peng et al. [53] propose a novel knowledge transfer network that jointly incorporates visual feature learning, knowledge inferring, and classifier learning to fully explore prior knowledge in the few-shot task. Similarly, Chen et al. [54] propose a knowledge-guided graph routing framework, which unifies prior knowledge of statistical label correlations with deep neural networks for multi-label few-shot learning. Pu et al. [55] introduce the prior knowledge between action unit and facial expression to facilitate facial expression recognition. Yang et al. [57] incorporate the prior semantic knowledge into a deep reinforcement learning framework to address the semantic navigation task. Chen et al. [36] incorporate statistical correlations into deep neural networks to facilitate scene graph generation, in which these statistical correlations between object pairs and their relationships can effectively regularize semantic space and make prediction less ambiguous. However, most of these works merely consider prior spatial knowledge of static images, whereas VidSGG involves spatial-temporal contextual information. In this work, we propose to learn spatial-temporal knowledge and incorporate it into the multi-head cross-attention mechanism to learn more representative relationship representations to facilitate VidSGG.
III. METHOD
Overview. In this section, we first introduce the preliminary of video scene graph generation and then describe the process of extracting spatial co-occurrence and temporal transition correlations from the dataset. Finally, the details of our proposed framework, Spatial-Temporal Knowledge-Embedded Transformer (STKET), are given.
A. Preliminary
Notation. Given a video $V = \{I_1, I_2, \ldots, I_T\}$, VidSGG aims to detect all visual relationships between objects, presented as a triplet $\langle$subject, predicate, object$\rangle$, in each frame to generate a scene graph sequence $G = \{G_1, G_2, \ldots, G_T\}$, where $G_t$ is the corresponding scene graph of the frame $I_t$. Specifically, define $G_t = \{B_t, O_t, R_t\}$, where $B_t = \{b_t^1, b_t^2, \ldots, b_t^{N(t)}\}$, $O_t = \{o_t^1, o_t^2, \ldots, o_t^{N(t)}\}$ and $R_t = \{r_t^1, r_t^2, \ldots, r_t^{K(t)}\}$ indicate the bounding box set, the object set and the predicate set, respectively. In the frame $I_t$, $N(t)$ is the number of objects, and $K(t)$ is the number of relationships between all objects.
Relationship Representation. For the frame $I_t$, we employ Faster R-CNN [58] to provide visual feature representation $\{v_t^1, \ldots, v_t^{N(t)}\}$, bounding boxes $\{b_t^1, \ldots, b_t^{N(t)}\}$ and object category distribution $\{d_t^1, \ldots, d_t^{N(t)}\}$ of object proposals. The relationship representation $x_t^k$ of the relationship between the i-th and j-th object proposals contains visual appearances, spatial information, and semantic embeddings, which can be formulated as
$$x_t^k = \langle f_s(v_t^i), f_o(v_t^j), f_u(\varphi(u_t^{i,j} \oplus f_{box}(b_t^i, b_t^j))), s_t^i, s_t^j \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the concatenation operation, $\varphi$ is the flattening operation and $\oplus$ is element-wise addition. $f_s$ and $f_o$ are each implemented by one fully-connected layer which maps a 2048-dimension vector to a 512-dimension vector, and $f_u$ is implemented by one fully-connected layer which maps a 12544-dimension vector to a 512-dimension vector. $u_t^{i,j} \in \mathbb{R}^{256 \times 7 \times 7}$ is the feature map of the corresponding union box generated by RoIAlign [59] while $f_{box}$ is the function transforming the bounding boxes of subject and object to
an entire feature with the same shape as $u_t^{i,j}$. The semantic embedding vectors $s_t^i, s_t^j \in \mathbb{R}^{200}$ are determined by the object category distribution of subject and object. For brevity, we denote all relationship representations in the frame $I_t$ by $X_t = \{x_t^1, \ldots, x_t^{K(t)}\}$.
[Figure 2: block diagram of the pipeline: Faster RCNN, Spatial Knowledge-Embedded Layer, Temporal Knowledge-Embedded Layer, and Spatial-Temporal Aggregation. Each knowledge-embedded layer applies multi-head attention over Q, K, and V followed by an FFN.]
Fig. 2. An overall illustration of the proposed STKET framework (left) and its corresponding outputs (right). It first exploits spatial and temporal knowledge-embedded layers to incorporate the spatial co-occurrence and temporal transition correlations into the multi-head cross-attention mechanism to learn spatial- and temporal-embedded representations. Then, it employs a spatial-temporal aggregation module to aggregate these representations for each object pair to predict the final semantic labels and relationships.
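As a quick sanity check on the relationship representation, the component widths quoted in the text can be summed. The sketch below is illustrative only: the constant and function names are hypothetical, and the observation that the concatenated width equals the 1936-dimension output quoted for the knowledge-embedding layers is our reading of the numbers, not a statement made by the paper.

```python
from math import prod

# Component widths quoted in the text (names are hypothetical).
SUBJECT_DIM = 512    # f_s maps 2048 -> 512
OBJECT_DIM = 512     # f_o maps 2048 -> 512
UNION_DIM = 512      # f_u maps 12544 -> 512
SEMANTIC_DIM = 200   # each semantic embedding s_t^i, s_t^j


def union_input_dim(shape=(256, 7, 7)):
    """Length of the flattened union-box feature map fed to f_u."""
    return prod(shape)


def relation_dim():
    """Width of the concatenated relationship representation x_t^k."""
    return SUBJECT_DIM + OBJECT_DIM + UNION_DIM + 2 * SEMANTIC_DIM


def concat_relation(f_s, f_o, f_u, s_i, s_j):
    """Concatenate the five components in the order used by the formula."""
    return list(f_s) + list(f_o) + list(f_u) + list(s_i) + list(s_j)


x = concat_relation([0.0] * 512, [0.0] * 512, [0.0] * 512, [0.0] * 200, [0.0] * 200)
```

Under these assumptions the flattened union feature has 256 × 7 × 7 = 12544 entries, matching the input width stated for $f_u$, and the concatenation has 3 × 512 + 2 × 200 = 1936 entries.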
B. Spatial-Temporal Knowledge Representation
When inferring visual relationships, humans leverage not only visual cues but also accumulated prior knowledge [60]. This approach has been validated on various vision tasks [54], [56], [61]. Inspired by this, we propose to distill the prior spatial-temporal knowledge directly from the training set for facilitating the VidSGG task. Specifically, we learn spatial co-occurrence and temporal transition correlations in a statistical manner, as shown in Figure 4.
Spatial Prior Knowledge. Between different object category pairs, a large difference exists in the spatial co-occurrence correlation of their relationships. To account for this, we construct a spatial co-occurrence matrix $E^{i,j} \in \mathbb{R}^{C}$ for the i-th object category and j-th object category by measuring the frequency of each predicate in the relation set of this pair. C is the total number of types of relationships. For example, let us consider the object category pair of person and cup (assume their indices are i and j, respectively). At each frame, if both objects are present, the total number of co-occurrences of this pair $N^{i,j}$ is incremented by 1. Next, if their predicates contain holding (assume the index is x), then $e_x^{i,j}$ is incremented by 1. Finally, between person and cup, the spatial co-occurrence probability of each predicate is calculated as:
$$e_x^{i,j} = e_x^{i,j} / N^{i,j}, \quad x \in \{1, \ldots, C\}. \tag{1}$$
To distill spatial co-occurrence correlations, the model learns the spatial knowledge embedding based on the spatial co-occurrence matrix. Specifically, the model generates the corresponding spatial knowledge embedding $s_t^k$ for the k-th relationship representation in the frame $I_t$; the generating process can be formulated as
$$s_t^k = f_{spa}(E^{s(k,t),o(k,t)}), \tag{2}$$
where $s(k,t)$ and $o(k,t)$ denote the category of subject and object of the k-th relationship representation in the frame $I_t$, and $f_{spa}(\cdot)$ is implemented by four fully-connected layers which map a C-dimension vector to a 1936-dimension vector. For brevity, we denote the corresponding spatial knowledge embeddings for all object pairs in the frame $I_t$ by $S_t = \{s_t^1, \ldots, s_t^{K(t)}\}$.
[Figure 3: two panels, (a) and (b), each listing three steps. Step 1: choose the frames (a) or frame pairs (b) that contain the target pair. Step 2: update the matrix according to their predicates (a) or their predicates and sequence (b). Step 3: map the matrix to a spatial (a) or temporal (b) context embedding.]
Fig. 3. Illustration of learning (a) spatial prior knowledge representation and (b) temporal prior knowledge representation. For better readability, we take the object category pair of person and cup as an example.
Since the distribution of real-world relationships is seriously unbalanced, directly utilizing these spatial knowledge
[Figure 4: video frames of a person and a cup annotated with the predicates hold and drink from.]
Fig. 4. An example of building spatial co-occurrence and temporal transition matrices. Given the video frames and their corresponding relation annotations (left), we first increase the spatial co-occurrence probability of the corresponding predicates (i.e., “hold” and “drink from”). Then, for any two consecutive frames, we take the predicate in the previous and current frames as the row and column index separately and increase the temporal transition probability of the corresponding transition pair (i.e., “hold → drink from”).
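The counting procedure illustrated in Fig. 4 can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function name, the input layout, and the toy video are hypothetical, and the denominator of the transition probability is assumed to be the raw count of the previous-frame predicate.

```python
from collections import defaultdict


def build_prior_matrices(video, num_predicates):
    """Count-based spatial and temporal priors for one annotated video.

    `video` is a list of frames; each frame maps a (subject_category,
    object_category) pair to the set of predicate indices annotated for it.
    This input layout is hypothetical, chosen only for the sketch.
    """
    pair_count = defaultdict(int)  # N^{i,j}: frames in which the pair appears
    cooc = defaultdict(lambda: [0.0] * num_predicates)  # raw predicate counts
    trans = defaultdict(
        lambda: [[0.0] * num_predicates for _ in range(num_predicates)]
    )  # raw counts of (previous predicate, current predicate)

    for t, frame in enumerate(video):
        previous = video[t - 1] if t > 0 else {}
        for pair, predicates in frame.items():
            pair_count[pair] += 1
            for y in predicates:
                cooc[pair][y] += 1
                # Row index: predicate in the previous frame; column: current.
                for x in previous.get(pair, ()):
                    trans[pair][x][y] += 1

    spatial, temporal = {}, {}
    for pair, counts in cooc.items():
        # Spatial prior: divide each predicate count by N^{i,j}.
        spatial[pair] = [c / pair_count[pair] for c in counts]
        # Temporal prior, ASSUMING the denominator is the raw count of x.
        temporal[pair] = [
            [v / counts[x] if counts[x] else 0.0 for v in trans[pair][x]]
            for x in range(num_predicates)
        ]
    return spatial, temporal


HOLD, DRINK_FROM = 0, 1
PAIR = ("person", "cup")
toy_video = [{PAIR: {HOLD}}, {PAIR: {HOLD}}, {PAIR: {DRINK_FROM}}]
spatial, temporal = build_prior_matrices(toy_video, num_predicates=2)
```

In the three-frame toy video, the pair appears in every frame with hold twice and drink from once, and the two observed transitions are hold → hold and hold → drink from. A real implementation would accumulate these counts over the whole training set rather than a single video.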
embeddings may degrade the model performance on the less frequent relationships. Thus, we train these embeddings to predict predicates by using the binary cross-entropy loss as the objective function:
$$L_{spk} = \sum_{t=1}^{T} \sum_{k=1}^{K(t)} \ell(f(s_t^k), y_t^k), \tag{3}$$
where $\ell(\cdot, \cdot)$ denotes the binary cross-entropy loss function, $f(\cdot)$ denotes the predicate classifier, and $y_t^k$ denotes the ground-truth predicate labels of the k-th relationship in the frame $I_t$.
Temporal Prior Knowledge. In daily life, the interaction between people and objects is characterized by temporal transitions. To identify relationships at different stages, we construct a temporal transition matrix $\hat{E}^{i,j} \in \mathbb{R}^{C \times C}$ for the i-th object category and j-th object category. C is the total number of types of relationships. For instance, let us consider the object category pair of person and cup (assume the indices are i and j, respectively). If the contacting predicate between person and cup in the previous frame is holding (assume the index is x) and in the current frame is drinking (assume the index is y), then $\hat{E}_{x,y}^{i,j}$ is incremented by 1. Finally, between person and cup, the temporal transition probability of predicates is calculated as:
$$\hat{e}_{x,y}^{i,j} = \hat{e}_{x,y}^{i,j} / e_x^{i,j}, \quad x, y \in \{1, \ldots, C\}. \tag{4}$$
To explore temporal transition correlations, the model learns the corresponding temporal knowledge embedding based on the temporal transition matrix. Specifically, given the predicted object labels and relation labels, the model generates the corresponding temporal knowledge embedding $t_t^k$ for the k-th relationship representation in the frame $I_t$; the generating process can be formulated as
$$t_t^k = f_{tem}(\hat{E}_{r(k,t)}^{s(k,t),o(k,t)}), \tag{5}$$
where $s(k,t)$ and $o(k,t)$ denote the category of subject and object of the k-th relationship representation in the frame $I_t$, $r(k,t)$ denotes the predicate of the k-th relationship representation in the frame $I_t$, and $f_{tem}(\cdot)$ is implemented by four fully-connected layers which map a C-dimension vector to a 1936-dimension vector. Specifically, we utilize the spatial contextualized representation to coarsely predict the relation labels. For brevity, we denote the corresponding temporal knowledge embeddings for all object pairs in the frame $I_t$ by $T_t = \{t_t^1, \ldots, t_t^{K(t)}\}$.
Since complicated predicate transition exists for different object category pairs, the temporal knowledge embedding may contain inaccurate temporal correlation, resulting in sub-optimal performance. Therefore, we train these embeddings to predict predicates at the next frame by using the binary cross-entropy loss as the objective function:
$$L_{tpk} = \sum_{t=1}^{T-1} \sum_{k=1}^{K(t)} \ell(f(t_t^k), y_{t+1}^k), \tag{6}$$
where we assume that the k-th relationship representations in the frame $I_t$ and the frame $I_{t+1}$ belong to the same subject-object pair for ease of understanding.
C. Knowledge-Embedded Attention Layer
Between object pairs and their relationships, there are apparent spatial co-occurrence correlations within each image and strong temporal transition correlations across different images. Thus, we propose incorporating spatial-temporal knowledge into the multi-head cross-attention mechanism to learn spatial- and temporal-embedded representations.
Spatial knowledge often encapsulates information about positions, distances, and relationships among entities. On the other hand, temporal knowledge concerns the sequence, duration, and intervals between actions. Given their unique properties, treating them separately allows specialized modeling to capture the inherent patterns more accurately. Therefore, we design spatial and temporal knowledge-embedded layers that thoroughly explore the interaction between visual representation and spatial-temporal knowledge.
Exploring Spatial Context. As shown in Figure 2, we first employ the spatial knowledge-embedded layers (SKEL) to explore spatial co-occurrence correlations within each image.
Specifically, we take all relationship representations of the current frame $I_t$ as the input, i.e., $F_S^{0,t} = X_t = \{x_t^1, \ldots, x_t^{K(t)}\}$. Then, the SKEL incorporates corresponding spatial knowledge embeddings in the keys and queries to fuse the information from the relationship representation and its corresponding spatial prior. Since we stack $N_s$ identical spatial knowledge-embedded layers, the input of the n-th layer is the output of the (n−1)-th layer. Thus, the queries Q, keys K, values V and output of the n-th SKEL are presented as:
$$Q = W_Q F_S^{(n-1),t} + S_t, \tag{7}$$
$$K = W_K F_S^{(n-1),t} + S_t, \tag{8}$$
$$V = W_V F_S^{(n-1),t}, \tag{9}$$
$$F_S^{n,t} = Att_{spa.}(Q, K, V), \tag{10}$$
where $W_Q$, $W_K$ and $W_V$ denote the linear transformations and $Att_{spa.}(\cdot)$ denotes the cross-attention layer in the SKEL. For simplicity, we remove the subscript n and denote the final output of the SKEL as $F_S^t$.
Parsing Temporal Correlation. As claimed in prior works [24], [25], the dynamic visual relation can be easily recognized with the given temporal information in the previous frame. Thus, we design the temporal knowledge-embedded layer (TKEL) to explore the temporal correlation in the current frame $I_t$ and the previous frame $I_{t-1}$. At first, we adopt a sliding window over the sequence of spatial contextualized representation $[F_S^1, \ldots, F_S^T]$, and the input of the frame $I_t$ in TKEL is presented as:
$$F_T^{0,t} = [F_S^{t-1}, F_S^t], \quad t \in \{2, \ldots, T\}. \tag{11}$$
Then, TKEL incorporates corresponding spatial and temporal knowledge embeddings in the keys and queries to fuse the information from the relationship representation and its [...]
where $E_f$ is the learned frame encoding vector, $Att_{tem.}(\cdot)$ denotes the cross-attention layer in TKEL, and $W_Q$, $W_K$ and $W_V$ denote the linear transformations. The intention of adding $[T_{t-1}, S_t]$ into the queries and keys is to incorporate the temporal prior about the previous frame and the spatial prior about the current frame. In this way, the TKEL can effectively capture spatiotemporal context in the sliding window.
Considering that the relationships in a frame have various representations in different batches, we choose the earliest representation appearing in the sliding window. For simplicity, we remove the subscript n and denote the final output of TKEL as $F_T^t$.
D. Spatial-Temporal Aggregation Module
As aforementioned, SKEL explores the spatial co-occurrence correlations within each image, and TKEL explores the temporal transition correlations across different images. Though fully exploring the interaction between visual representation and spatial-temporal knowledge, these two layers generate spatial- and temporal-embedded representations, respectively. To explore the long-term context information, we further design the spatial-temporal aggregation (STA) module to aggregate these representations for each object pair to predict the final semantic labels and their relationships. It takes the spatial- and temporal-embedded relationship representations of the identical subject-object pair in different frames as the input. Specifically, we concatenate these representations of the same object pair to generate context representation:
$$c_t^k = \mathrm{cat}(f_{k,t}^S, f_{k,t}^T), \tag{16}$$
where $f_{k,t}^S$ and $f_{k,t}^T$ denote the spatial- and temporal-embedded representation for the k-th relationship in the frame $I_t$. Then, to find the same subject-object pair in different frames, we adopt the predicted object label and the IoU (i.e., intersection over union) to match the same subject-object pairs detected in frames $\{I_{t-\tau+1}, \ldots, I_t\}$ (more details in the supplementary material). Thus, the input of the k-th relationship in the frame $I_t$ in the spatial-temporal aggregation module is presented as
$$f_{k,t}^C = [c_{t-\tau+1}^k, \ldots, c_t^k], \tag{17}$$
where we assume that the k-th relationship representations in frames $\{I_{t-\tau+1}, \ldots, I_t\}$ belong to the same subject-object pair for ease of understanding. Its corresponding output of the spatial-temporal aggregation module is presented as:
$$Q = W_Q f_{k,t}^C + E_f', \tag{18}$$
[...]
E. Loss Function
In the real world, there exist different kinds of relationships between two objects at the same time. Thus, we introduce the binary cross-entropy loss as the objective function for predicate classification as follows:
$$\ell(r) = \sum_{p \in P^+} \log(\phi(r, p)) + \sum_{q \in P^-} \log(1 - \phi(r, q)). \tag{22}$$
For a subject-object pair r, $P^+$ are the annotated predicates, while $P^-$ is the set of the predicates not in the annotation. $\phi(r, p)$ indicates the computed confidence score of the p-th predicate.
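To make the multi-label objective concrete, the sum in Eq. (22) can be evaluated for a single subject-object pair as follows. This is a sketch under stated assumptions: the function names and the toy scores are hypothetical, and treating the minimized loss as the negative of the printed sum follows the standard definition of binary cross-entropy rather than anything stated in the text.

```python
import math


def predicate_log_likelihood(scores, positives):
    """Sum of Eq. (22) as printed, for one subject-object pair.

    `scores` maps each predicate index to its confidence phi in (0, 1);
    `positives` is the set of annotated predicate indices (P+). Every other
    key of `scores` is treated as a member of P-.
    """
    total = 0.0
    for predicate, phi in scores.items():
        if predicate in positives:
            total += math.log(phi)
        else:
            total += math.log(1.0 - phi)
    return total


def bce_loss(scores, positives):
    """ASSUMPTION: the quantity minimized is the negative of the sum above."""
    return -predicate_log_likelihood(scores, positives)


toy_scores = {0: 0.9, 1: 0.2, 2: 0.6}
good = bce_loss(toy_scores, positives={0, 2})  # annotations agree with scores
bad = bce_loss(toy_scores, positives={1})      # annotations contradict scores
```

Because every predicate contributes its own term, a pair may carry several positive predicates at once, which is the situation the section describes.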
IEEE TRANSACTIONS ON IMAGE PROCESSING 7
TABLE I
COMPARISON WITH STATE-OF-THE-ART SCENE GRAPH GENERATION METHODS ON ACTION GENOME [26].
TABLE II
COMPARISON WITH STATE-OF-THE-ART SCENE GRAPH GENERATION METHODS ON ACTION GENOME [26].
During training, we adopt the binary cross-entropy loss as the objective function for supervising SKEL, TKEL, and STA and denote the corresponding losses by $\ell_s(r)$, $\ell_t(r)$, and $\ell_c(r)$, respectively. Therefore, the final classification loss is defined as the sum of the three losses over all samples, formulated as

$$L_{cls} = \sum_{t=1}^{T} \sum_{r=1}^{K(t)} [\ell_s(r) + \ell_t(r) + \ell_c(r)]. \tag{23}$$

Therefore, the total objective is formulated as:

$$L = L_{cls} + L_{spk} + L_{tpk}. \tag{24}$$

In all experiments, we set the loss weights of the three losses to be equal, primarily to ensure that each loss component has an equal contribution to the overall optimization objective.

IV. EXPERIMENTS

A. Experiment Setting

Dataset As the most widely used dataset for evaluating video scene graph generation, Action Genome [26] contains 476,229 bounding boxes of 35 object classes (without the person) and 1,715,568 instances of 26 relationship classes annotated for 234,253 frames. These 26 relationships are subdivided into three different types: (1) attention, (2) spatial, and (3) contact, whose numbers of categories are 3, 6, and 17, respectively. In all experiments, we use the same training and testing split as in previous works [24], [25].

Task Following prior arts [25], [36], [43], we evaluate our proposed method and other state-of-the-art methods under three kinds of experiment setups: Predicate Classification (Pred Cls): predict the predicates of object pairs with given ground-truth bounding boxes and category labels. Scene Graph Classification (SG Cls): predict both the predicates and the category labels of objects with given ground-truth bounding boxes. Scene Graph Generation (SG Gen): simultaneously detects objects appearing in the image and predicts the predicates of each object pair. In the SG Gen, an object bounding box is considered to be correctly detected only if the predicted object bounding box has at least 0.5 IoU (Intersection over Union) overlap with the ground-truth object bounding box. We use the No Constraint strategy of generating a scene graph to evaluate the models. This strategy allows each subject-object pair to have multiple predicates simultaneously.

Evaluation Metric All tasks are evaluated with the Recall@K (R@K for short) metric (K = [10, 20, 50]), which measures the ratio of correct instances among the top-K predicted instances with the highest confidence. We also report the results by using the mean Recall@K (mR@K for short) metric that averages R@K over all relationships.

Training Details Following previous works [24], [25], we adopt the Faster RCNN [58] with a ResNet-101 [47] backbone as the object detector. We first train the detector on the training set of Action Genome [26] and get 24.6 mAP at 0.5 IoU with COCO metrics. The detector is applied to all baselines for fair comparisons. The parameters of the object detector (the object classifier excluded) are fixed when training scene graph generation models. Per-class non-maximal suppression at 0.4 IoU (Intersection over Union) is applied to reduce region proposals provided by RPN.

We use an AdamW [62] optimizer with initial learning rate 2e−5 and batch size 1 to train our model. Moreover, gradient clipping is applied with a maximal norm of 5. All experiments are implemented by PyTorch [63]. In the spatial-temporal aggregation module, we set the size of the sliding window τ to 4. In KEAL, we stack two identical spatial knowledge-embedded layers to explore spatial co-occurrence correlations and then stack two identical temporal knowledge-embedded layers to explore temporal transition correlations. The cross-attention and self-attention layers in our proposed framework have 8 heads with d = 1936 and dropout = 0.1. In SKEL and TKEL, the 1936-dimension input is projected to 2048-dimension by the feed-forward network, then projected to 1936-dimension
Fig. 5. Qualitative results in SG Gen task with top-10 confident predictions. The blue and red colors indicate correct relationships and objects, respectively.
The gray colors indicate wrong relationships and objects. For fairness, the same object detector is used for all methods.
again after ReLU activation. In STA, the 3872-dimension input is projected to 1936-dimension.

Frame Encoding Unlike the SKEL, the TKEL adopts a sliding window over the sequence of spatial contextualized representations as its input. Therefore, we need to introduce the learned frame encoding $E_f$ into the TKEL to help it fully understand the temporal dependencies among different frames. Specifically, we construct the frame encodings $E_f$ using learned embedding parameters: $E_f = [e_1, e_2]$, where $e_1, e_2 \in \mathbb{R}^{1936}$ are the learned vectors. Similarly, we also introduce frame encodings $E'_f$ in STA, where $E'_f = [e'_1, \ldots, e'_\tau]$, and each encoding is a learned vector with a length of 3872.

Pair Tracking As described in the manuscript, STA takes the spatial- and temporal-embedded representations of the same subject-object pair in different frames as input. We first use the predicted object labels to distinguish different pairs to match the subject-object pairs detected in different frames. If multiple entities of the same category exist, we calculate the intersection over union (IoU) between the two objects across different images to match the subject-object pair. Specifically, we compute the IoU between the bounding box of the target object in the previous frame and that of each object with the same category label in the current frame. If the IoU is higher than 0.8, we consider them to be the same object. We choose the one with the highest IoU if there are multiple candidates.

B. Comparison with State-Of-The-Art Methods

To evaluate the effectiveness of STKET, we compare it with existing state-of-the-art VidSGG algorithms, including STTran [25], TPI [44], and APT [24]. Previous works [25] also adapt image-based SGG algorithms to address the VidSGG task by applying the inference process to each frame. We also follow these works to include the image-based SGG algorithms for more comprehensive evaluations and comparisons, including VCTREE [37], KERN [36], RelDN [27], and GPS-Net [28].

We first present the comparison on R@K in Table I. As shown, recent video-based algorithms (e.g., STTran, TPI, and APT) obtain quite a marginal improvement over the image-based SGG algorithms as they further introduce temporal contextual information. By introducing spatial-temporal prior knowledge to guide aggregating spatial and temporal contextual information, the proposed STKET framework outperforms these algorithms in nearly all settings. For example, it improves the R@10 from 78.5% to 82.6%, 55.1% to 57.1%, and 25.7% to 27.9% on the three tasks, with improvements of 4.1%, 2.0%, and 2.2%, respectively. It also obtains similar improvements on the R@20 and R@50 metrics.

Considering the mR@K metric offers a better performance measure under uneven distribution [36], we also present the comparison results on this metric. As presented in Table II, STKET obtains even more significant improvement, enhancing mR@50 by 8.1%, 4.7%, and 2.1% compared with the current best-performing APT algorithm. To highlight the necessity of introducing the mR@K metric, we present the distribution across different relationships on the AG dataset in Figure 6a. As depicted, the distribution of relationships is exceedingly long-tailed, in which the top-10 most frequent relationships occupy 90.9% of the samples while the top-10 least frequent relationships merely occupy 2.1%. In Figure 6b, we
further provide the detailed performance of each relationship to better understand the performance variance across different relationships. Evidently, current algorithms like STTran and APT deliver competitive performance for relationships with abundant training samples (e.g., “in front of”, “not looking at”) but falter significantly for relationships with limited training samples (e.g., “wiping”, “twisting”). In contrast, for the top-10 least frequent relationships, STKET improves the R@50 from 9.98% to 23.98% compared to the second-best APT and from 12.10% to 30.65% compared to the baseline STTran. These comparisons demonstrate that STKET can effectively regularize VidSGG training and thus reduce the dependencies on training samples by explicitly incorporating spatial-temporal prior knowledge.

[Figure 6 panels: (a) proportion (%) of each relationship; (b) R@50 (%) per relationship for STTran, APT, and Ours.]
Fig. 6. (a) The distribution of different relationships on Action Genome [26]. (b) The R@50 results in Pred Cls task of our method, STTran and APT on Action Genome [26].

In Figure 5, we visualize the qualitative results of our method and current leading VidSGG methods (i.e., STTran and APT). The results show that existing state-of-the-art VidSGG algorithms tend to predict numerous false-positive relationships, while our method successfully predicts nearly all relationships with high accuracy. This highlights the strength of integrating spatial-temporal knowledge in discerning dynamic relationships, especially within intricate interactions. For instance, across all frames, our STKET accurately identifies the attentional relationship among person, cup, and laptop, where current leading VidSGG algorithms stumble. It is also noteworthy that our method performs well not only on high-frequency relationships but also on low-frequency relationships. For instance, in the last frame, our method correctly predicts the predicate “drink from”, whose samples merely occupy 0.355% of the total samples, while other methods miss it. This success is largely due to our method’s explicit incorporation of spatial-temporal correlations, which helps to lessen the dependency on training samples significantly.

[Figure 7 panels: top, two matrices with Relationship Index on both axes; bottom, Entropy per predicate.]
Fig. 7. [...] (right). For brevity, we print the relation index instead of the relation label (see the corresponding labels below the figure, i.e., 0 denotes “looking at” and 25 denotes “writing on”). Bottom: The entropy of spatial-temporal knowledge across different predicates.

C. Ablative Study

In this section, we conduct comprehensive experiments to analyze the actual contribution of each crucial component. Here, we mainly present the R@10, R@20, mR@10, and mR@20 on Pred Cls and SG Cls as they can better describe the performance.

1) Analysis of STKET: As aforementioned, STKET integrates spatial-temporal knowledge into the multi-head cross-attention mechanism to aggregate spatial and temporal contextual information. In this way, it effectively regularizes the spatial prediction space within each image and the sequential variation space across temporal frames, thereby reducing ambiguous predictions. To verify the effectiveness of exploring these spatial-temporal correlations, we implement the baseline STTran method for comparison purposes, setting the layers of its spatial encoder and temporal decoder to two for fair comparisons. As shown in Table III, the baseline STTran method obtains the R@10 and R@20 values of 77.8% and 94.1% on Pred Cls and the R@10 and R@20 values of 53.8% and 63.6% on SG Cls. By incorporating spatial-temporal correlations to regularize training and inference, the STKET boosts the R@10 and R@20 values to 82.6% and 96.3% on Pred Cls and the R@10 and R@20 values to 57.1% and 65.3% on SG Cls. Similarly, it consistently outperforms the baseline method on other metrics (i.e., the mR@10 and mR@20 metrics), as shown in Table III.
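The R@K and mR@K values quoted in these comparisons can be illustrated with a small sketch. It assumes the common reading of the metrics: R@K is the fraction of ground-truth triplets recovered among the K most confident predictions, and mR@K averages that fraction per predicate class. The function names, the per-frame scope, and the tuple layout `(subject, predicate, object)` are assumptions made for illustration; the benchmark's actual evaluation code may accumulate results differently.

```python
from collections import defaultdict


def top_k_triplets(predictions, k):
    """Return the set of triplets among the k highest-scoring predictions.

    predictions -- list of (score, (subject, predicate, object)) tuples
    """
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    return {triplet for _, triplet in ranked[:k]}


def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found in the top-k predictions."""
    if not ground_truth:
        return 0.0
    retrieved = top_k_triplets(predictions, k)
    return len(retrieved & set(ground_truth)) / len(ground_truth)


def mean_recall_at_k(predictions, ground_truth, k):
    """Average of per-predicate recalls, so rare predicates weigh as much as common ones."""
    retrieved = top_k_triplets(predictions, k)
    total = defaultdict(int)
    hit = defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        total[predicate] += 1
        if triplet in retrieved:
            hit[predicate] += 1
    if not total:
        return 0.0
    return sum(hit[p] / total[p] for p in total) / len(total)


if __name__ == "__main__":
    gt = {
        ("person", "looking at", "cup"),
        ("person", "holding", "cup"),
        ("person", "holding", "phone"),
    }
    preds = [
        (0.9, ("person", "looking at", "cup")),
        (0.8, ("person", "holding", "cup")),
        (0.7, ("person", "touching", "cup")),
        (0.2, ("person", "holding", "phone")),
    ]
    print(recall_at_k(preds, gt, 2), mean_recall_at_k(preds, gt, 2))
```

Averaging per predicate is what makes mR@K sensitive to the long-tailed classes discussed above, since a frequent predicate can no longer dominate the score.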
TABLE III
COMPARISON OF THE BASELINE STTRAN METHOD (STTRAN), THE BASELINE STTRAN METHOD MERELY USING THE SPATIAL ENCODER (STTRAN SE), THE BASELINE STTRAN METHOD MERELY USING THE TEMPORAL DECODER (STTRAN TD), OUR FRAMEWORK MERELY USING SKEL (OUR SKEL), OUR FRAMEWORK MERELY USING SKEL WITHOUT THE LOSS $L_{spk}$ (OURS SKEL W/O $L_{spk}$), OUR FRAMEWORK MERELY USING TKEL (OUR TKEL), OUR FRAMEWORK MERELY USING TKEL WITHOUT THE LOSS $L_{tpk}$ (OURS TKEL W/O $L_{tpk}$), OUR FRAMEWORK REMOVING STA (OURS W/O STA) AND OUR FRAMEWORK (OURS).
Further, to emphasize the effectiveness of spatial-temporal knowledge, we conduct qualitative experiments that visualize two relationship transition matrices of different subject-object pairs (i.e., “man-tv” and “man-cup”) in Figure 7, where a lighter color indicates a higher probability of relationship transition. It is worth noting that transition probabilities vary widely depending on the predicates and object categories involved, e.g., the transition likelihood for common predicates like “looking at” is much higher than for rare predicates such as “writing on”. This variance within the transition matrix highlights the importance of incorporating spatial-temporal knowledge for effective relationship prediction regularization. Additionally, we evaluate the entropy of spatial-temporal knowledge embeddings generated by our STKET model. Lower entropy values in these embeddings, particularly those for infrequent relationship categories like “eating”, “writing on”, and “twisting”, indicate a significant amount of prior information. This suggests that these embeddings offer substantial prediction regularization, potentially explaining why our STKET model excels over existing leading algorithms in predicting less common relationships.

Since the STKET framework consists of three complementary modules, i.e., the SKEL module, the TKEL module, and the STA module, in the following, we further conduct more ablation experiments to analyze the actual contribution of each module for a more in-depth understanding.

2) Analysis of SKEL: To evaluate the actual contribution of the SKEL module, we compare the performance of our STKET framework merely using this module (namely, “Ours SKEL”) with the performance of the baseline STTran method merely using its spatial encoder (namely, “STTran SE”). As shown in Table III, the SKEL module improves the R@10 from 75.3% to 78.0% and the R@20 from 91.1% to 94.3% on Pred Cls, with improvements of 2.7% and 3.2%, respectively. Similarly, it improves the R@10 from 51.6% to 54.2% and the R@20 from 62.1% to 64.1% on SG Cls, with improvements of 2.6% and 2.0%, respectively. It is worth noting that the SKEL module achieves performance improvement not only on the R@10 and R@20 metrics but also on the mR@10 and mR@20 metrics. Specifically, this module obtains an mR@10 improvement of 1.7% and 1.8% and an mR@20 improvement of 0.9% and 1.3% on Pred Cls and SG Cls, respectively. These results demonstrate that the spatial co-occurrence correlation within each image can help to regularize the spatial prediction space within each image effectively.

In the SKEL module, the loss $L_{spk}$ helps to learn accurate spatial knowledge embedding, thereby effectively regularizing the spatial prediction space within each image. To evaluate the contribution of this loss, we conduct experiments that only utilize the SKEL module without the loss (namely, “Ours SKEL w/o $L_{spk}$”) for comparison purposes. As presented in Table III, it decreases the performance by 0.7%/0.4% on Pred Cls with R@10/20 and 0.6%/0.4% on SG Cls with R@10/20. Similarly, it degrades performance by 0.6%/0.5% on Pred Cls with mR@10/20 and 0.8%/0.7% on SG Cls with mR@10/20.

3) Analysis of TKEL: To evaluate the actual contribution of the TKEL module, we conduct experiments that remove the SKEL and STA modules in STKET while only using this module (namely, “Our TKEL”) and compare it with the baseline STTran method merely using its temporal decoder (namely, “STTran TD”). Different from “STTran SE”, which explores the spatial context within a single frame, “STTran TD” introduces a sliding window to capture the temporal dependencies between frames and thus achieves obvious performance improvement, as presented in Table III. Specifically, it improves the performance by 1.4%/2.1% on Pred Cls with R@10/20 and 1.5%/0.9% on SG Cls with R@10/20, which demonstrates the importance of temporal correlations in recognizing dynamic visual relationships. However, “Our TKEL” performs better than this baseline “STTran TD”, with an R@10/20 improvement of 4.5% and 4.4% on Pred Cls and an R@10/20 improvement of 2.9% and 1.4% on SG Cls. Furthermore, “Our TKEL” consistently outperforms the baseline “STTran TD” in terms of the mR@10 and mR@20 metrics, i.e., it provides an mR@10/20 improvement of 4.2% and 5.9% on Pred Cls and an mR@10/20 improvement of 4.3% and 3.6% on SG Cls. These results demonstrate that integrating temporal knowledge helps regularize the model toward learning correct temporal correlations, thereby effectively reducing ambiguous predictions.

In the TKEL module, the loss $L_{tpk}$ helps to learn accurate temporal knowledge embedding, thereby effectively regularizing the sequential variation space across temporal frames. To evaluate the contribution of this loss, we conduct experiments that merely use TKEL without the loss (namely, “Ours TKEL w/o $L_{tpk}$”) for comparison purposes. As presented, it decreases the performance by 0.5%/0.4% on Pred Cls with R@10/20 and 0.7%/0.7% on SG Cls with R@10/20. Similarly, it degrades performance by 0.5%/0.7% on Pred Cls with mR@10/20 and 0.7%/0.4% on SG Cls with mR@10/20.

4) Analysis of STA: Because the sliding window may bring in many irrelevant representations that can easily overwhelm the valuable information, it is difficult for the TKEL module to capture the long-term context of each subject-object pair. Thus, we design the STA module that aggregates spatial- and temporal-embedded representations of the identical subject-object pair across different frames to explore the long-term context. To evaluate its actual contribution, we conduct experiments that remove STA (namely, “Ours w/o STA”). As shown in Table III, it decreases the performance by 0.7%/0.5% on Pred Cls with R@10/20 and 0.6%/0.4% on SG Cls with R@10/20. It also decreases performance by 0.7%/0.8% on Pred Cls with mR@10/20 and 0.6%/0.5% on SG Cls with mR@10/20.

5) Analysis of Sliding Window: In the TKEL and STA modules, we introduce the sliding window to explore the temporal context contained in previous frames, in which the frame number is a crucial threshold that controls the sequence length. Setting it to a small value may miss temporal correlations, while setting it to a large value may introduce much irrelevant information and result in high computation costs. To figure out the optimal settings, we conduct experiments with different frame numbers. As shown in Figure 8, introducing more previous frames can significantly improve performance compared with using only the current frame (i.e., the frame number is one). Besides, it is worth noting that a large frame number does not always mean better performance because long-term frame sequences contain complicated temporal contexts, which may mislead the model.

[Figure 8 panels: (a) Impact of sliding window in TKEL; (b) Impact of sliding window in STA; y-axis: R@50 (%); series: SG Cls, SG Gen.]
Fig. 8. Ablation analysis of the frame number of sliding window in (a) TKEL and (b) STA modules. Evaluated on Action Genome [26].

D. Limitation

As aforementioned, our STKET framework leverages spatial-temporal prior knowledge to effectively regularize relationship predictions, thereby lessening reliance on training samples and mitigating the imbalance problem in Action Genome. The significant performance boost within tail relationship categories demonstrates it, as shown in Figure 6b. Nevertheless, in extreme scenarios where subject-predicate-object triplets are exceedingly rare, the STKET framework struggles to provide accurate statistical regularization and thus may predict false predicates. As illustrated in Figure 9, the bias from the skewed long-tail distribution leads to a common but incorrect predicate “sit on” instead of the rare but correct predicate “lean on” in the relationship “person-lean on-sofa”. The same type of misclassification also occurs with “person-carry-cloth”. For these challenges, we conjecture that integrating spatial-temporal knowledge from various distributions (e.g., ImageNet-VidVRD [21], Home Action Genome [64]) can provide more accurate and robust regularization. Furthermore, we argue that leveraging the abundant contextual information in web text data could enhance the regularization of such rare relationships.

[Figure 9 annotations: GT: person-lean on-sofa, Our: person-sit on-sofa; GT: person-carry-cloth, Our: person-hold-cloth.]
Fig. 9. Instances of failure cases resulting from relationship imbalance. For [...] that have higher frequency instead of the ground truth predicate, which occurs less frequently in the training set.

V. CONCLUSION

In this work, we propose to explore spatial-temporal prior knowledge to facilitate VidSGG via a spatial-temporal knowledge-embedded transformer (STKET) framework. It contains the spatial and temporal knowledge-embedded layers that integrate spatial co-occurrence correlations to guide aggregating spatial contextual information and temporal transition correlations to help extract temporal contextual information. In this way, it can, on the one hand, aggregate spatial and temporal information to learn more representative relationship
representation and, on the other hand, effectively regularize relationship prediction and thus reduce the dependencies on training samples. Extensive experiments illustrate its superiority over current state-of-the-art algorithms.

REFERENCES

[1] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5410–5419.
[2] G. Ren, L. Ren, Y. Liao, S. Liu, B. Li, J. Han, and S. Yan, “Scene graph generation with hierarchical context,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 909–915, 2020.
[3] L. Tao, L. Mi, N. Li, X. Cheng, Y. Hu, and Z. Chen, “Predicate correlation learning for scene graph generation,” IEEE Transactions on Image Processing, vol. 31, pp. 4173–4185, 2022.
[4] Z. Tu, H. Li, D. Zhang, J. Dauwels, B. Li, and J. Yuan, “Action-stage emphasized spatiotemporal vlad for video action recognition,” IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2799–2812, 2019.
[5] T. Han, W. Xie, and A. Zisserman, “Temporal alignment networks for long-term video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2906–2916.
[6] Z. Zhou, C. Ding, J. Li, E. Mohammadi, G. Liu, Y. Yang, and Q. M. J. Wu, “Sequential order-aware coding-based robust subspace clustering for human action recognition in untrimmed videos,” IEEE Transactions on Image Processing, vol. 32, pp. 13–28, 2023.
[7] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu, “Human-centric spatio-temporal video grounding with visual transformers,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8238–8249, 2021.
[8] Z. Ding, T. Hui, J. Huang, X. Wei, J. Han, and S. Liu, “Language-bridged spatial-temporal interaction for referring video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4964–4973.
[9] T. Hui, S. Liu, Z. Ding, S. Huang, G. Li, W. Wang, L. Liu, and J. Han, “Language-aware spatial-temporal collaboration for referring video segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[10] T. Nishimura, A. Hashimoto, Y. Ushiku, H. Kameko, and S. Mori, “State-aware video procedural captioning,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1766–1774.
[11] Y. Huang, H. Xue, J. Chen, H. Ma, and H. Ma, “Semantic tag augmented xlanv model for video captioning,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4818–4822.
[12] H. Wang, G. Lin, S. C. H. Hoi, and C. Miao, “Cross-modal graph with meta concepts for video captioning,” IEEE Transactions on Image Processing, vol. 31, pp. 5150–5162, 2022.
[13] X. Hua, X. Wang, T. Rui, F. Shao, and D. Wang, “Adversarial reinforcement learning with object-scene relational graph for video captioning,” IEEE Transactions on Image Processing, vol. 31, pp. 2004–2016, 2022.
[14] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura, “Bert representations for video question answering,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1556–1565.
[15] P. Zeng, H. Zhang, L. Gao, J. Song, and H. T. Shen, “Video question answering with prior knowledge and object-sensitive learning,” IEEE Transactions on Image Processing, vol. 31, pp. 5936–5948, 2022.
[16] L. Gao, Y. Lei, P. Zeng, J. Song, M. Wang, and H. T. Shen, “Hierarchical representation network with auxiliary tasks for video captioning and video question answering,” IEEE Transactions on Image Processing, vol. 31, pp. 202–215, 2022.
[17] Y. Liu, X. Zhang, F. Huang, B. Zhang, and Z. Li, “Cross-attentional spatio-temporal semantic graph networks for video question answering,” IEEE Transactions on Image Processing, vol. 31, pp. 1684–1696, 2022.
[18] D. Liu, M. Bober, and J. Kittler, “Visual semantic information pursuit: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 4, pp. 1404–1422, 2019.
[19] P. Xu, X. Chang, L. Guo, P.-Y. Huang, X. Chen, and A. G. Hauptmann, “A survey of scene graph: Generation and application,” IEEE Trans. Neural Netw. Learn. Syst, vol. 1, 2020.
[20] X. Chang, P. Ren, P. Xu, Z. Li, X. Chen, and A. Hauptmann, “A comprehensive survey of scene graphs: Generation and application,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 1–26, 2021.
[21] X. Shang, T. Ren, J. Guo, H. Zhang, and T.-S. Chua, “Video visual relation detection,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1300–1308.
[22] X. Qian, Y. Zhuang, Y. Li, S. Xiao, S. Pu, and J. Xiao, “Video relation detection with spatio-temporal graph,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 84–93.
[23] S. Zheng, S. Chen, and Q. Jin, “Vrdformer: End-to-end video visual relation detection with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18836–18846.
[24] Y. Li, X. Yang, and C. Xu, “Dynamic scene graph generation via anticipatory pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13874–13883.
[25] Y. Cong, W. Liao, H. Ackermann, B. Rosenhahn, and M. Y. Yang, “Spatial-temporal transformer for dynamic scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16372–16382.
[26] J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles, “Action genome: Actions as compositions of spatio-temporal scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10236–10247.
[27] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro, “Graphical contrastive losses for scene graph parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11535–11543.
[28] X. Lin, C. Ding, J. Zeng, and D. Tao, “Gps-net: Graph property sensing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3746–3753.
[29] G. Ren, L. Ren, Y. Liao, S. Liu, B. Li, J. Han, and S. Yan, “Scene graph generation with hierarchical context,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 909–915, 2020.
[30] Y. Lu, H. Rai, J. Chang, B. Knyazev, G. Yu, S. Shekhar, G. W. Taylor, and M. Volkovs, “Context-aware scene graph generation with seq2seq transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15931–15941.
[31] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in European conference on computer vision. Springer, 2016, pp. 852–869.
[32] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scene graph generation from objects, phrases and region captions,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1261–1270.
[33] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 670–685.
[34] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs: Scene graph parsing with global context,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5831–5840.
[35] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[36] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded routing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6163–6171.
[37] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu, “Learning to compose dynamic tree structures for visual contexts,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6619–6628.
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[39] Y. Cong, M. Y. Yang, and B. Rosenhahn, “Reltr: Relation transformer for scene graph generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[40] S. Kundu and S. N. Aakur, “Is-ggt: Iterative scene graph generation with generative transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6292–6301.
[41] Y.-H. H. Tsai, S. Divvala, L.-P. Morency, R. Salakhutdinov, and A. Farhadi, “Video relationship reasoning using gated spatio-temporal energy graph,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10424–10433.
[42] C. Liu, Y. Jin, K. Xu, G. Gong, and Y. Mu, “Beyond short-term snippet: Video relation detection with spatio-temporal global context,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10840–10849.
[43] Y. Teng, L. Wang, Z. Li, and G. Wu, “Target adaptive context aggregation for video scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 688–13 697.
[44] S. Wang, L. Gao, X. Lyu, Y. Guo, P. Zeng, and J. Song, “Dynamic scene graph generation via temporal prior inference,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5793–5801.
[45] Y. Kumar and A. Mishra, “Few-shot referring relationships in videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2289–2298.
[46] X. Wang, L. Zhu, Y. Wu, and Y. Yang, “Symbiotic attention for egocentric action recognition with object-centric alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[48] H. Li, C. Li, A. Zheng, J. Tang, and B. Luo, “MSKAT: Multi-scale knowledge-aware transformer for vehicle re-identification,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 19 557–19 568, 2022.
[49] Z. Wang, J. Zhang, T. Chen, W. Wang, and P. Luo, “RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[50] T. Chen, T. Pu, H. Wu, Y. Xie, and L. Lin, “Structured semantic transfer for multi-label recognition with partial labels,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 339–346.
[51] T. Pu, T. Chen, H. Wu, and L. Lin, “Semantic-aware representation blending for multi-label image recognition with partial labels,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2091–2098.
[52] Y. Xie, T. Chen, T. Pu, H. Wu, and L. Lin, “Adversarial graph representation adaptation for cross-domain facial expression recognition,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[53] Z. Peng, Z. Li, J. Zhang, Y. Li, G.-J. Qi, and J. Tang, “Few-shot image recognition with knowledge transfer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[54] T. Chen, L. Lin, R. Chen, X. Hui, and H. Wu, “Knowledge-guided multi-label few-shot learning for general image recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1371–1384, 2022.
[55] T. Pu, T. Chen, Y. Xie, H. Wu, and L. Lin, “AU-expression knowledge constrained representation learning for facial expression recognition,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11 154–11 161.
[56] T. Chen, T. Pu, H. Wu, Y. Xie, L. Liu, and L. Lin, “Cross-domain facial expression recognition: A unified evaluation benchmark and adversarial graph learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 9887–9903, 2022.
[57] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” in Proceedings of International Conference on Learning Representations (ICLR), 2019.
[58] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[59] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[60] A. R. Vandenbroucke, J. Fahrenfort, J. Meuwese, H. Scholte, and V. Lamme, “Prior knowledge about objects determines neural color representation in human visual cortex,” Cerebral Cortex, vol. 26, no. 4, pp. 1401–1408, 2016.
[61] C.-W. Lee, W. Fang, C.-K. Yeh, and Y.-C. F. Wang, “Multi-label zero-shot learning with structured knowledge graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1576–1585.
[62] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[63] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[64] N. Rai, H. Chen, J. Ji, R. Desai, K. Kozuka, S. Ishizaka, E. Adeli, and J. C. Niebles, “Home action genome: Cooperative compositional action understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 184–11 193.

Tao Pu received a B.E. degree from the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, in 2020, where he is currently pursuing a Ph.D. degree in computer science. He has authored and coauthored approximately 10 papers published in top-tier academic journals and conferences, including T-PAMI, AAAI, ACM MM, etc.

Tianshui Chen received a Ph.D. degree in computer science at the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China, in 2018. Prior to earning his Ph.D., he received a B.E. degree from the School of Information and Science Technology in 2013. He is currently an associate professor at the Guangdong University of Technology. His current research interests include computer vision and machine learning. He has authored and coauthored approximately 40 papers published in top-tier academic journals and conferences, including T-PAMI, T-NNLS, T-IP, T-MM, CVPR, ICCV, AAAI, IJCAI, ACM MM, etc. He has served as a reviewer for numerous academic journals and conferences. He was the recipient of the Best Paper Diamond Award at IEEE ICME 2017.

Hefeng Wu received a B.S. degree in computer science and technology and a Ph.D. degree in computer application technology from Sun Yat-sen University, Guangzhou, China. He is currently a research professor at the School of Computer Science and Engineering, Sun Yat-sen University, China. His research interests include computer vision, multimedia, and machine learning. He has published works in and served as a reviewer for many top-tier academic journals and conferences, including T-PAMI, T-IP, T-MM, CVPR, ICCV, AAAI, ACM MM, etc.

Yongyi Lu is currently an Associate Professor at Guangdong University of Technology (GDUT). He received his Ph.D. in Computer Science and Engineering at Hong Kong University of Science and Technology (HKUST) in 2018. From 2019 to 2022, he was a Postdoctoral Fellow in Prof. Alan Yuille’s group at Johns Hopkins University. He received his B.E. and Master's degrees, both from Sun Yat-sen University, Guangzhou, China, in 2011 and 2014, respectively. His research focuses on computer vision, medical image analysis, and their interdisciplinary fields. He has published over 30 peer-reviewed articles in conferences and journals such as CVPR, ICCV, ECCV, NeurIPS, MICCAI, BMVC, and T-IP, with over 2500 citations in total. He won 4th place in the ImageNet DET challenge in 2014 and 5th place in the ImageNet VID challenge in 2015. He serves as a reviewer for conferences and journals including IJCV, T-IP, T-MM, T-CSVT, Nature Machine Intelligence, Neurocomputing, CVPR, ICCV, ECCV, NeurIPS, ICLR, and BMVC. He is a member of IEEE.