scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations. Existing methods encode entity masks tracked across the temporal dimension (mask tubes) and then predict their relations with a temporal pooling operation, which does not fully utilize the motion indicative of the entities' relation. To overcome this limitation, we propose a motion-aware contrastive learning framework that operates directly on the mask tubes of segmented entities.

[Figure 1: R@50 per predicate (on, standing on, sitting on, kicking, running on, opening) for our method and baselines.]

[Figure 2 contents. Left video — groundtruth: child kicking ball_13, child running on grass; IPS+T - Convolution: child_6 next_to ball_13, child_6 on grass_1; Ours: child_6 kicking ball_13, child_6 running on grass_1. Right video — groundtruth: person opening door, person closing door; PSG4DFormer: person_2 next_to door_1, person_2 next_to door_1; Ours: person_2 opening door_1, person_2 closing door_1.]
Figure 2: Examples of temporal panoptic scene graph generation by state-of-the-art methods (Yang et al. 2023, 2024) and our method.
target entities into visual representations, which might not benefit relation classification in panoptic scene graph generation.

In this paper, to encourage representation learning to capture motion patterns for temporal panoptic scene graph generation, we propose a novel contrastive learning framework that focuses on the mask tubes of segmented entities. First, we force a mask tube and a tube of a similar subject-relation-object triplet from a different video to obtain close representations. Since positive mask tubes originate from distinct video clips, the model cannot rely upon visual semantics to optimize the contrastive objective, but instead depends on the evolution of the motion trajectory, which is our target component for representation learning. Second, we push away negative mask tubes generated by temporally shuffling the original tubes. Moreover, we also push apart representations of mask tubes from the same video but belonging to different triplets. Because mask tubes of different triplets from a common video share close visual features yet are pushed apart, we once again motivate the model to generate representations that rely less on visual semantics and more on motion-sensitive features. In addition, the visually similar negative mask tubes can play the role of hard negative samples, thus accelerating the contrastive learning process (Chen, Zheng, and Song 2024).

Moreover, in order to implement our motion-aware contrastive learning framework, there is a need to quantify the relationship between mask tubes. This quantification is a challenging problem, since mask tubes are sequences of segmentation masks that span the sequence of video frames. Furthermore, the mask tubes of two triplets might exhibit different lengths, since two events often occur at different speeds. Unfortunately, the popular pipeline of temporal pooling followed by similarity estimation flattens the temporal dimension of the mask tubes and neglects their motion features. To resolve this problem, we consider the mask tubes of two triplets as two distributions, seek the optimal transportation map between them, and utilize the transport distance as the distance between the two triplets' tubes. Such a transport scheme synchronizes the motion states of the two triplets and takes advantage of the mask tubes' evolutionary trajectory.

To sum up, our contributions are as follows:
• We propose a novel contrastive learning framework for temporal panoptic scene graph generation which pulls together entity mask tubes with similar motion patterns and pushes apart those with distinct motion patterns.
• We utilize the optimal transport distance to estimate the relationship between two events' mask tubes for the proposed contrastive framework.
• Comprehensive experiments demonstrate that our framework outperforms state-of-the-art methods on both natural and 4D video datasets, especially on recognizing dynamic subject-object relations.

Related Work
Temporal panoptic scene graph generation. Traditional research mainly focuses on generating scene graphs for natural videos, in which nodes represent objects and edges represent relations between objects. To localize object instances, the nodes are grounded by bounding boxes (Wang et al. 2024a; Nag et al. 2023; Pu et al. 2023). Despite this progress, traditional video scene graph generation is limited by noisy grounding from coarse bounding box annotations and by a trivial relation taxonomy. Recent works have addressed these issues by proposing panoptic video and 4D scene graph generation (Yang et al. 2023, 2024). Classic video scene graph generation methods have been dominated by the two-stage pipeline consisting of object detection and pairwise predicate classification (Rodin et al. 2024; Nag et al. 2023). This design has been generalized to panoptic approaches (Yang et al. 2022, 2023, 2024), in which the pipeline comprises panoptic segmentation followed by a predicate classification step.
Video representation learning. Video representation learning has gained popularity in recent years (Davtyan, Sameni, and Favaro 2023; Nguyen et al. 2024d; Shen et al. 2024; Nguyen et al. 2024f). Most approaches can be categorized into two groups: pretext-based and contrastive-based. Pretext-based methods mostly leverage pretext learning tasks such as optical flow prediction (Dong and Fu 2024; Davtyan, Sameni, and Favaro 2023) and temporal order prediction (Shen et al. 2024; Ren et al. 2024). However, these tasks are considerably influenced by low-level features and are incapable of delving into the high-level semantics of the video (Nguyen et al. 2024c). Contrastive-based methods primarily construct positive samples (Nguyen and Luu 2021; Nguyen et al. 2024a, 2022, 2024e, 2023a; Wu et al. 2024) by sampling video clips from the same video (Nguyen et al. 2024b; Liu et al. 2024) or by applying various frame-based data augmentation techniques (Wang et al. 2024b; Song et al. 2024; Rosa 2024), thereby increasing the similarity of positive pairs while simultaneously decreasing that of negative pairs. Nevertheless, these methods focus on whole video frames and are not suitable for object masks tracked across the temporal dimension.
Problem Formulation
Temporal panoptic scene graph generation (TPSGG) is a task to generate a dynamic scene graph given an input video. In the generated scene graph, each node corresponds to an entity and each edge corresponds to a spatial-temporal relation between two entities. Formally, the input of a TPSGG model is a video clip V, particularly V ∈ R^{T×H×W×3} for a natural video, V ∈ R^{T×H×W×4} for a 4D RGB-D video, and V ∈ R^{T×M×6} for a 4D point cloud video, where T denotes the number of frames, M the number of points of interest, and the frame size H × W remains consistent across the video. The output of the model is a dynamic scene graph G. The TPSGG task can be formulated as follows:

... masked cross-attention. Receiving a video V, the model produces a set of queries {q_i}_{i=1}^N, where each query q_i corresponds to one entity. Subsequently, every query is forwarded to two multi-layer perceptrons (MLPs) to project the queries into mask classification and mask regression outputs.
Training and inference. During training, each query is matched to a groundtruth mask through mask-based bipartite matching to calculate the segmentation loss. During inference, IPS+T generates panoptic segmentation masks for each frame and uses the tracker to obtain N tracked mask tubes. In contrast, VPS employs the query embeddings of the target and reference frames and performs query-wise similarity tracking to obtain N tracked mask tubes.
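To make the input shapes and outputs above concrete, the following is a minimal, hypothetical data-layout sketch in Python/NumPy. The class and function names are our own illustration rather than the paper's code; only the stated tensor shapes follow the formulation above, and the channel layout of the point cloud input is an assumption.

```python
# Illustrative data layout for TPSGG; names are ours, shapes follow the formulation.
from dataclasses import dataclass
import numpy as np

@dataclass
class MaskTube:
    entity_id: int           # e.g. 6 for "child_6"
    category: str            # e.g. "child"
    masks: np.ndarray        # boolean array of shape (T, H, W), one mask per frame

@dataclass
class RelationTriplet:
    subject: MaskTube
    obj: MaskTube
    predicate: str           # e.g. "kicking"
    frame_span: tuple        # (start_frame, end_frame) where the relation holds (illustrative field)

def make_natural_video(T: int, H: int, W: int) -> np.ndarray:
    """A natural video clip V with shape (T, H, W, 3)."""
    return np.zeros((T, H, W, 3), dtype=np.float32)

def make_rgbd_video(T: int, H: int, W: int) -> np.ndarray:
    """A 4D RGB-D video clip V with shape (T, H, W, 4)."""
    return np.zeros((T, H, W, 4), dtype=np.float32)

def make_point_cloud_video(T: int, M: int) -> np.ndarray:
    """A 4D point cloud video clip V with shape (T, M, 6);
    the 6 channels are commonly xyz + rgb, which is an assumption here."""
    return np.zeros((T, M, 6), dtype=np.float32)

# A dynamic scene graph G can then be represented as a list of RelationTriplet
# objects whose nodes (MaskTube) are grounded by per-frame segmentation masks.
```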
[Figure 3 contents: a panoptic segmentation model and a relation model applied to video 0020_10793023296, producing "adult - running on - grass" (frames #0–#23) and "adult - throwing - ball" (frames #45–#49); positive sampling of "adult - throwing - ball" across videos 0020_10793023296 and 0028_4021064662; tubes pooled and treated as distributions, with an optimal transport distance and pull/push contrastive objectives.]
Figure 3: Framework overview of contrastive learning for temporal scene graph generation.
where sim denotes the similarity function defined upon a pair of mask tube representations. The formulation shows that what the model learns is largely dependent upon how positive and negative samples are generated.
Positive sampling. To satisfy our motion-aware requirement for contrastive learning, we extract mask tube representations from entities of the same subject and object category that exhibit a similar groundtruth relation in another video. Since the two videos possess distinct visual features, the model must rely on the shared motion pattern of similar subject-relation-object triplets to associate the anchor and the positive sample.
Negative sampling. For negative sampling, we design two strategies, which result in two contrastive approaches, i.e., shuffle-based and triplet-based contrastive learning.

Shuffle-based contrastive learning
In our shuffle-based approach, we create negative samples by applying a series of temporal permutations π to the anchor tube, i.e., shuffling:

    H^n = π(H^a).   (6)

As such, the contrastive objective forces the model to push representations of the anchor tube, which is in the normal order, away from the shuffled tube, which exhibits a distorted motion due to the shuffled order. This makes the learned representation sensitive to frame ordering, i.e., motion-aware, as the anchor H^a and the negative tube H^n share visual semantics and can only be distinguished using motion information.
Selecting strong-motion mask tubes. However, there exists a potential risk: for static relations such as on, next to, and in, mask tubes might involve almost no motion. As a result, the shuffled tube would become identical to the anchor, and the model would not be able to differentiate them and learn reasonably. To address this problem, we propose a strategy to select strong-motion tubes for shuffling, which we illustrate in Figure 4. Given a video, our aim is to select mask tubes that carry strong motion for shuffling. To measure the motion of a mask tube, we utilize optical flow edges (Xiao, Tighe, and Modolo 2021). We estimate flow edges by applying a Sobel filter (Sobel et al. 2022) to the flow magnitude map and take the median over the flow edge pixels within the entity masks. Then, we select mask tubes whose maximum value across the optical flow surpasses a threshold γ.

[Figure 4 contents: motion amplitude per frame for two example tubes, "adult - throwing - ball" and "adult - touching - oven".]
Figure 4: Proposed strategy to select strong-motion tubes.

Triplet-based contrastive learning
To take advantage of motion-aware signals from triplets of similar subject-relation-object categories, we design a triplet-based approach to create negative samples. A naive approach would be to sample mask tubes of any subject-relation-object triplet distinct from the anchor sample. However, if we run into triplets with all distinct subject, relation, and object categories, the negative pair would be trivial for the model to distinguish, resulting in less effective learning. In order to create harder negative samples, we choose negative mask tubes from the same video as the anchor. We create a multinomial distribution in which triplets that share more subject, relation, or object categories with the anchor are more likely to be drawn. Hence, our negative samples hold close visual semantics with the anchor sample, increasing the likelihood that the model depends on motion semantics to push them apart. From a contrastive learning perspective, these samples form hard negative samples that accelerate the learning process (Chen, Zheng, and Song 2024).
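The sampling strategies and the strong-motion filter described above lend themselves to a short sketch. Below is a minimal NumPy/SciPy illustration, assuming per-frame optical-flow magnitude maps and boolean entity masks are already available. The function names, the triplet dictionary layout, and the simple "1 + shared" multinomial weighting are our own illustrative choices, not the paper's implementation; the default γ = 9.0 mirrors the reported threshold, although the score scale of this sketch need not match the paper's.

```python
# Illustrative sampling utilities for the motion-aware contrastive framework.
import numpy as np
from scipy.ndimage import sobel

def flow_edge_motion_scores(flow_mag, masks):
    """Per-frame motion score of one mask tube.
    flow_mag: (T, H, W) optical-flow magnitude maps.
    masks:    (T, H, W) boolean entity masks.
    Returns a length-T array of median flow-edge values inside the entity mask."""
    scores = np.zeros(len(flow_mag))
    for t, (mag, m) in enumerate(zip(flow_mag, masks)):
        edges = np.hypot(sobel(mag, axis=0), sobel(mag, axis=1))  # Sobel flow edges
        vals = edges[m]
        scores[t] = np.median(vals) if vals.size else 0.0
    return scores

def is_strong_motion(flow_mag, masks, gamma=9.0):
    """Keep a tube for shuffling only if its peak motion score exceeds gamma."""
    return flow_edge_motion_scores(flow_mag, masks).max() > gamma

def shuffle_negative(tube_repr, rng):
    """Shuffle-based negative, H^n = pi(H^a): a random temporal permutation of the
    anchor's per-frame representations, shaped (T, D)."""
    return tube_repr[rng.permutation(len(tube_repr))]

def sample_positive(anchor, bank, rng):
    """Positive sampling: a tube with the same subject-relation-object categories
    drawn from a different video than the anchor."""
    cands = [t for t in bank
             if (t["subject"], t["relation"], t["object"])
             == (anchor["subject"], anchor["relation"], anchor["object"])
             and t["video_id"] != anchor["video_id"]]
    return cands[rng.integers(len(cands))] if cands else None

def triplet_negative_weights(anchor, candidates):
    """Triplet-based negatives: candidates from the anchor's own video are weighted
    by how many of (subject, relation, object) they share with the anchor, so that
    visually/semantically closer triplets are drawn more often."""
    shared = np.array([(c["subject"] == anchor["subject"])
                       + (c["relation"] == anchor["relation"])
                       + (c["object"] == anchor["object"]) for c in candidates], float)
    weights = 1.0 + shared  # one simple choice; the exact weighting is not specified here
    return weights / weights.sum()

# Example: draw one triplet-based negative from a multinomial over candidates.
rng = np.random.default_rng(0)
anchor = {"video_id": "0020_10793023296", "subject": "adult",
          "relation": "throwing", "object": "ball"}
cands = [{"video_id": "0020_10793023296", "subject": "adult",
          "relation": "touching", "object": "oven"},
         {"video_id": "0020_10793023296", "subject": "adult",
          "relation": "holding", "object": "ball"}]
neg_idx = rng.choice(len(cands), p=triplet_negative_weights(anchor, cands))
```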
Optimal Transport for Mask Tube Relation Quantification
There is one remaining problem, i.e., how to define the similarity function sim for the representations H_i and H_j of two mask tubes. In this work, we consider the two mask tubes as two discrete distributions µ and ν whose supports are H_i and H_j, respectively. Formally, µ = Σ_{k=1}^{T_i} a_k δ_{h_{i,k}} and ν = Σ_{l=1}^{T_j} b_l δ_{h_{j,l}}, where δ_{h_{i,k}} and δ_{h_{j,l}} denote the Dirac functions centered upon h_{i,k} and h_{j,l}, respectively. The weights of the supports are a = 1_{T_i}/T_i and b = 1_{T_j}/T_j.
After defining the distribution scheme, we propose the tube alignment optimization problem, which is to find the transport plan that achieves the minimum distance between µ and ν:

    d_OT = D_OT(µ, ν) = min_{T ∈ Π(a,b)} Σ_{k=1}^{T_i} Σ_{l=1}^{T_j} T_{k,l} · c(h_{i,k}, h_{j,l}),   (7)
    s.t. Π(a, b) = { T ∈ R_+^{T_i×T_j} | T·1_{T_j} ≤ a, T^⊤·1_{T_i} ≤ b, 1_{T_i}^⊤·T·1_{T_j} = s, 0 ≤ s ≤ min(T_i, T_j) },   (8)

where c denotes a pre-defined distance between two vectors. We implement the cost c(h_{i,k}, h_{j,l}) = 1 − (h_{i,k} · h_{j,l}) / (‖h_{i,k}‖_2 ‖h_{j,l}‖_2), i.e., the cosine distance. As the exact optimization over the transport plan T is intractable, we adopt a Sinkhorn-based algorithm to estimate T and delineate the procedure in Algorithm 1. To turn the distance into a similarity value, we take its negative and add it to a pre-defined margin α:

    sim(h^a, h^p) = α − d_OT.   (9)

Algorithm 1: Computing the optimal transport distance
Require: cost matrix C ∈ R^{T_i×T_j} with C_{l,k} = c(h_{i,l}, h_{j,k}), marginals a ∈ R^{T_i}, b ∈ R^{T_j}, temperature τ, iteration count N_iter
  d_OT ← ∞
  for s = 1 to min(T_i, T_j) do
    T ← exp(−C/τ)
    T ← s·T / (1_{T_i}^⊤·T·1_{T_j})
    for i = 1 to N_iter do
      p_a ← min(a / (T·1_{T_j}), 1_{T_i});  T_a ← diag(p_a)·T
      p_b ← min(b / (T_a^⊤·1_{T_i}), 1_{T_j});  T_b ← T_a·diag(p_b)
      T ← s·T_b / (1_{T_i}^⊤·T_b·1_{T_j})
    end for
    d_OT ← min(d_OT, Σ_{k=1}^{T_i} Σ_{l=1}^{T_j} T_{k,l}·C_{k,l})
  end for
  return d_OT
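For concreteness, here is a minimal NumPy sketch of one way to implement the Sinkhorn-style estimate in Algorithm 1 together with the similarity of Eq. (9). It follows our reading of the pseudocode above and is not the authors' implementation: the margin α = 10.0 and N_iter = 1,000 match the reported settings, while the temperature τ and the small stabilizing constants are our own choices.

```python
# Sketch of the optimal transport distance between two tube representations.
import numpy as np

def cosine_cost(Hi, Hj, eps=1e-8):
    """C[k, l] = 1 - cos(h_{i,k}, h_{j,l}) for Hi of shape (Ti, D), Hj of shape (Tj, D)."""
    Hi = Hi / (np.linalg.norm(Hi, axis=1, keepdims=True) + eps)
    Hj = Hj / (np.linalg.norm(Hj, axis=1, keepdims=True) + eps)
    return 1.0 - Hi @ Hj.T

def ot_distance(Hi, Hj, tau=0.1, n_iter=1000):
    Ti, Tj = len(Hi), len(Hj)
    C = cosine_cost(Hi, Hj)
    a = np.ones(Ti) / Ti                       # uniform support weights
    b = np.ones(Tj) / Tj
    d_ot = np.inf
    for s in range(1, min(Ti, Tj) + 1):        # search over the transported mass s
        T = np.exp(-C / tau)
        T = s * T / T.sum()
        for _ in range(n_iter):
            # Cap row/column marginals at a and b (partial-transport projections).
            p_a = np.minimum(a / (T.sum(axis=1) + 1e-12), 1.0)
            T_a = p_a[:, None] * T
            p_b = np.minimum(b / (T_a.sum(axis=0) + 1e-12), 1.0)
            T_b = T_a * p_b[None, :]
            T = s * T_b / (T_b.sum() + 1e-12)
        d_ot = min(d_ot, float((T * C).sum()))
    return d_ot

def tube_similarity(Hi, Hj, alpha=10.0, **kw):
    """Eq. (9): sim = alpha - d_OT (alpha = 10.0 follows the reported margin)."""
    return alpha - ot_distance(Hi, Hj, **kw)

# Example with random features: two tubes of different lengths.
rng = np.random.default_rng(0)
print(tube_similarity(rng.normal(size=(5, 16)), rng.normal(size=(7, 16)), n_iter=50))
```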
Experiments
We conduct comprehensive experiments to evaluate the effectiveness of our motion-aware contrastive framework. We first describe the experiment settings, covering the evaluation datasets, evaluation metrics, baseline methods, and implementation details. Next, we present quantitative results of our method, then provide an ablation study and careful analysis to explore the properties of our motion-aware contrastive framework. Eventually, we conduct qualitative analysis to concretely examine its behavior.

Experiment Settings
Datasets. We assess the effectiveness of our method on natural and 4D video inputs. The corresponding dataset for each input type is as follows:
• Open-domain Panoptic video scene graph generation (OpenPVSG) (Yang et al. 2023): OpenPVSG consists of scene graphs and associated segmentation masks with respect to the subject and object nodes in the scene graph. The dataset comprises 400 videos, including 289 third-person videos from VidOR (Shang et al. 2019) and 111 egocentric videos from Epic-Kitchens (Damen et al. 2022) and Ego4D (Grauman et al. 2022).
• Panoptic scene graph generation for 4D (PSG4D) (Yang et al. 2024): The PSG4D dataset is divided into two groups, i.e., PSG4D-GTA and PSG4D-HOI. PSG4D-GTA comprises 67 third-view videos with an average length of 84 seconds, 35 object categories, and 43 relationship categories. On the contrary, PSG4D-HOI contains 2,973 videos from an egocentric perspective, whose average duration is 20 seconds. The PSG4D-HOI videos are mostly related to indoor scenes, covering 46 object categories and 15 relationship categories.
Evaluation metrics. We use the recall at K (R@K) and mean recall at K (mR@K) metrics, which are standard in scene graph generation tasks. Both R@K and mR@K consider the top-K triplets predicted by the panoptic scene graph generation model. A successful recall of a predicted triplet must satisfy the following criteria: 1) correct category labels for the subject, object, and predicate; 2) a volume Intersection over Union (vIoU) greater than or equal to 0.5 between the predicted mask tubes and the groundtruth tubes. For extensive comparison, we also report results with a vIoU threshold of 0.1.
Baseline methods. We compare our method with a comprehensive list of baseline approaches for temporal panoptic scene graph generation: (i) IPS+T - Vanilla (Yang et al. 2023) uses an image panoptic segmentation (IPS) model with a tracker for segmentation, and fully-connected layers to separately encode temporal states of entity mask tubes; (ii) IPS+T - Handcrafted filter (Yang et al. 2023) uses an IPS model with a tracker for segmentation, and a manually-designed kernel to encode entity mask tubes; (iii) IPS+T - Convolution (Yang et al. 2023) uses an IPS model with a tracker for segmentation, and learnable convolutional layers to encode entity mask tubes; (iv) IPS+T - Transformer (Yang et al. 2023) uses an IPS model with a tracker for segmentation, and a Transformer-based encoder with self-attention layers to encode entity mask tubes; (v) VPS - Vanilla (Yang et al. 2023) is similar to IPS+T - Vanilla, but uses a video panoptic segmentation (VPS) model for panoptic segmentation; (vi) VPS - Handcrafted filter (Yang et al. 2023) is similar to IPS+T - Handcrafted filter, but uses a VPS model for segmentation; (vii) VPS - Convolution (Yang et al. 2023) is similar to IPS+T - Convolution, but uses a VPS model for segmentation; (viii) VPS - Transformer (Yang et al. 2023) is similar to IPS+T - Transformer, but uses a VPS model for segmentation; (ix) 3D-SGG (Wald et al. 2020) is based on PointNet (Qi et al. 2017) and a graph convolutional network (Kipf and Welling 2016) but neglects the depth dimension and generates panoptic scene graphs for 4D video inputs; (x) PSG4DFormer (Yang et al. 2024) is a specialized model for 4D inputs, using Mask2Former (Cheng et al. 2022) for segmentation and a spatial-temporal Transformer to encode object mask tubes for relation classification.
Implementation details. For fair comparison, we experiment with our contrastive framework using both IPS+T and VPS as the segmentation module for panoptic video scene graph generation. In the former case, we leverage the UniTrack tracker (Wang et al. 2021) combined with a Mask2Former model (Cheng et al. 2022), which is initialized from the best-performing COCO-pretrained weights and fine-tuned for 8 epochs using the AdamW optimizer with a batch size of 32, a learning rate of 0.0001, weight decay of 0.05, and gradient clipping with a max L2 norm of 0.01. In the latter case, we utilize Video K-Net (Li et al. 2022), also initialized from COCO-pretrained weights and fine-tuned with the same strategy as IPS+T. In the relation classification step, we conduct fine-tuning with a batch size of 32, employing the Adam optimizer with a learning rate of 0.001. For 4D panoptic scene graph generation, we adopt the PSG4DFormer baseline. To work with RGB-D and point cloud videos, we use an ImageNet-pretrained ResNet-101 (Russakovsky et al. 2015) and DKNet (Wu et al. 2022) as the visual encoder, respectively. We fine-tune the segmentation module for RGB-D and point cloud videos for 12 and 200 epochs, respectively, and use an additional 100 epochs to train the relation classification module. Based on validation, we adopt a threshold γ = 9.0 and a margin α = 10.0. We set the maximum number of iterations N_iter to 1,000.
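As a concrete reading of the vIoU-based recall criterion described in the evaluation metrics above, the following small NumPy sketch computes the volume IoU between a predicted and a groundtruth mask tube and applies an R@K matching rule. It is illustrative only; the exact matching protocol (e.g., tie-breaking, requiring both subject and object tubes to pass the threshold) follows common practice rather than the paper's released evaluation code.

```python
# Illustrative evaluation sketch: volume IoU between mask tubes and a simple R@K.
import numpy as np

def volume_iou(pred_tube, gt_tube):
    """pred_tube, gt_tube: boolean arrays of shape (T, H, W)."""
    inter = np.logical_and(pred_tube, gt_tube).sum()
    union = np.logical_or(pred_tube, gt_tube).sum()
    return inter / union if union > 0 else 0.0

def recall_at_k(pred_triplets, gt_triplets, k=20, viou_thr=0.5):
    """pred_triplets: list of dicts sorted by confidence, each with
    'labels' = (subject, predicate, object) and 'subj_tube'/'obj_tube' masks.
    gt_triplets: same structure. Returns the fraction of groundtruth triplets
    recalled by the top-k predictions (mR@K averages this per predicate class)."""
    matched = np.zeros(len(gt_triplets), dtype=bool)
    for pred in pred_triplets[:k]:
        for g, gt in enumerate(gt_triplets):
            if matched[g] or pred["labels"] != gt["labels"]:
                continue
            if (volume_iou(pred["subj_tube"], gt["subj_tube"]) >= viou_thr and
                    volume_iou(pred["obj_tube"], gt["obj_tube"]) >= viou_thr):
                matched[g] = True
                break
    return matched.mean() if len(gt_triplets) else 0.0
```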
Method | R/mR@20 (vIoU 0.5) | R/mR@50 (vIoU 0.5) | R/mR@100 (vIoU 0.5) | R/mR@20 (vIoU 0.1) | R/mR@50 (vIoU 0.1) | R/mR@100 (vIoU 0.1)
IPS+T - Vanilla | 3.04 / 1.35 | 4.61 / 2.94 | 5.56 / 3.33 | 8.28 / 5.68 | 14.47 / 9.92 | 18.24 / 11.84
IPS+T - Handcrafted filter | 2.52 / 1.72 | 3.77 / 2.36 | 4.72 / 2.79 | 8.07 / 5.61 | 13.42 / 8.27 | 16.46 / 10.11
IPS+T - Transformer | 3.88 / 2.81 | 5.66 / 4.12 | 6.18 / 4.44 | 9.01 / 6.69 | 14.88 / 11.28 | 17.51 / 13.20
IPS+T - Convolution | 3.88 / 2.55 | 5.24 / 3.29 | 6.71 / 5.36 | 10.06 / 8.98 | 14.99 / 12.21 | 18.13 / 15.47
Ours - Transformer | 3.98 / 2.98 | 5.97 / 4.20 | 7.44 / 5.15 | 10.59 / 9.56 | 16.98 / 12.39 | 22.33 / 17.47
Ours - Convolution | 4.51 / 3.56 | 6.08 / 4.38 | 7.76 / 5.86 | 11.43 / 9.57 | 17.30 / 13.13 | 22.85 / 17.48
VPS - Vanilla | 0.21 / 0.10 | 0.21 / 0.10 | 0.31 / 0.18 | 6.29 / 3.04 | 9.64 / 6.74 | 12.89 / 9.60
VPS - Handcrafted filter | 0.42 / 0.13 | 0.52 / 0.50 | 0.94 / 0.92 | 5.24 / 2.84 | 7.65 / 7.14 | 9.64 / 8.22
VPS - Transformer | 0.42 / 0.61 | 0.73 / 0.76 | 1.05 / 0.92 | 6.50 / 5.75 | 9.64 / 8.25 | 12.26 / 9.51
VPS - Convolution | 0.42 / 0.25 | 0.63 / 0.67 | 0.63 / 0.67 | 8.07 / 7.84 | 11.01 / 9.78 | 12.89 / 10.77
Ours - Transformer | 0.63 / 0.83 | 1.05 / 0.76 | 1.05 / 0.76 | 6.71 / 6.94 | 10.27 / 8.68 | 13.42 / 12.09
Ours - Convolution | 0.84 / 0.98 | 1.26 / 1.22 | 1.26 / 1.22 | 8.18 / 8.00 | 12.90 / 11.47 | 14.22 / 13.59
Table 1: Experimental results on the OpenPVSG dataset.

Input type | Method | R/mR@20 (PSG4D-GTA) | R/mR@50 (PSG4D-GTA) | R/mR@100 (PSG4D-GTA) | R/mR@20 (PSG4D-HOI) | R/mR@50 (PSG4D-HOI) | R/mR@100 (PSG4D-HOI)
Point cloud videos | 3DSGG | 1.48 / 0.73 | 2.16 / 0.79 | 2.92 / 0.85 | 3.46 / 2.19 | 3.15 / 2.47 | 4.96 / 2.84
Point cloud videos | PSG4DFormer | 4.33 / 2.10 | 4.83 / 2.93 | 5.22 / 3.13 | 5.36 / 3.10 | 5.61 / 3.95 | 6.76 / 4.17
Point cloud videos | Ours | 5.88 / 3.45 | 6.31 / 3.70 | 7.31 / 4.70 | 7.28 / 5.09 | 7.62 / 6.49 | 9.18 / 6.85
RGB-D videos | 3DSGG | 2.29 / 0.92 | 2.46 / 1.01 | 3.81 / 1.45 | 4.23 / 2.19 | 4.47 / 2.31 | 4.86 / 2.41
RGB-D videos | PSG4DFormer | 6.68 / 3.31 | 7.17 / 3.85 | 7.22 / 4.02 | 5.62 / 3.65 | 6.16 / 4.16 | 6.28 / 4.97
RGB-D videos | Ours | 9.07 / 5.52 | 9.73 / 6.32 | 9.73 / 6.32 | 7.63 / 6.09 | 8.36 / 6.94 | 8.53 / 8.29
Table 2: Experimental results on both PSG4D-GTA and PSG4D-HOI groups of the PSG4D dataset.
Table 3: Ablation results for contrastive learning approaches on the OpenPVSG dataset. We adopt the vIoU threshold of 0.5.

Tube relation quantification | R/mR@20 | R/mR@50 | R/mR@100
Pooling - Cosine similarity | 4.44 / 3.40 | 6.04 / 4.36 | 7.36 / 5.84
Pooling - L2 | 4.36 / 3.37 | 5.95 / 3.77 | 7.29 / 5.80
Optimal transport | 4.51 / 3.56 | 6.08 / 4.38 | 7.44 / 5.86
Table 4: Ablation results for the mask tube relation quantification method on the OpenPVSG dataset.

[Figure 5 contents: R@50 (approximately 5.925–6.000) as a function of the threshold γ (6–10).]
Figure 5: Ablation results on threshold γ.
Main Results
Results on OpenPVSG. As shown in Table 1, we substantially outperform both IPS+T - Convolution and IPS+T - Transformer when we use IPS+T for segmentation. In particular, using the higher vIoU threshold to filter out inaccurate segmentation, we surpass IPS+T - Transformer by 1.3/0.7 points of R/mR@100, while surpassing IPS+T - Convolution by 0.8/1.1 points of R/mR@50. In addition, for the less strict vIoU threshold, we outperform IPS+T - Transformer by 1.6/2.9 points of R/mR@20, and IPS+T - Convolution by 2.3/0.9 points of R/mR@50. These results demonstrate that our method contributes positively to temporal panoptic scene graph generation, not only for popular but also for unpopular relation classes.
Input type | Method | R/mR@20 (PSG4D-GTA) | R/mR@50 (PSG4D-GTA) | R/mR@100 (PSG4D-GTA) | R/mR@20 (PSG4D-HOI) | R/mR@50 (PSG4D-HOI) | R/mR@100 (PSG4D-HOI)
Point cloud videos | w/o shuffle-based | 5.56 / 2.92 | 5.57 / 2.98 | 6.51 / 4.36 | 6.56 / 4.29 | 6.98 / 6.25 | 8.76 / 6.43
Point cloud videos | w/o triplet-based | 5.77 / 2.93 | 5.59 / 3.26 | 6.53 / 4.39 | 6.67 / 4.85 | 7.52 / 6.31 | 8.84 / 6.43
Point cloud videos | Ours | 5.88 / 3.45 | 6.31 / 3.70 | 7.31 / 4.70 | 7.28 / 5.09 | 7.62 / 6.49 | 9.18 / 6.85
RGB-D videos | w/o shuffle-based | 8.35 / 5.34 | 8.76 / 5.68 | 8.88 / 5.53 | 7.00 / 5.53 | 7.51 / 6.02 | 7.56 / 7.42
RGB-D videos | w/o triplet-based | 9.00 / 5.46 | 9.71 / 5.95 | 9.63 / 5.82 | 7.12 / 6.03 | 8.31 / 6.51 | 8.24 / 7.95
RGB-D videos | Ours | 9.07 / 5.52 | 9.73 / 6.32 | 9.73 / 6.32 | 7.63 / 6.09 | 8.36 / 6.94 | 8.53 / 8.29
Table 5: Ablation results for contrastive learning approaches on the PSG4D dataset.

Input type | Tube relation quantification | R/mR@20 (PSG4D-GTA) | R/mR@50 (PSG4D-GTA) | R/mR@100 (PSG4D-GTA) | R/mR@20 (PSG4D-HOI) | R/mR@50 (PSG4D-HOI) | R/mR@100 (PSG4D-HOI)
Point cloud videos | Pooling - Cosine similarity | 5.76 / 2.87 | 6.02 / 3.62 | 6.84 / 4.11 | 7.24 / 4.45 | 7.44 / 6.27 | 8.20 / 6.64
Point cloud videos | Pooling - L2 | 5.46 / 2.78 | 5.38 / 3.39 | 6.51 / 3.86 | 6.72 / 4.11 | 6.74 / 6.05 | 7.96 / 6.11
Point cloud videos | Optimal transport | 5.88 / 3.45 | 6.31 / 3.70 | 7.31 / 4.70 | 7.28 / 5.09 | 7.62 / 6.49 | 9.18 / 6.85
RGB-D videos | Pooling - Cosine similarity | 9.03 / 5.37 | 9.47 / 5.86 | 9.70 / 6.02 | 7.36 / 5.43 | 7.93 / 6.70 | 8.06 / 7.42
RGB-D videos | Pooling - L2 | 8.89 / 4.70 | 8.90 / 5.41 | 9.08 / 5.78 | 6.65 / 5.26 | 7.74 / 6.29 | 7.95 / 7.39
RGB-D videos | Optimal transport | 9.07 / 5.52 | 9.73 / 6.32 | 9.73 / 6.32 | 7.63 / 6.09 | 8.36 / 6.94 | 8.53 / 8.29
Table 6: Ablation results for the mask tube relation quantification method on the PSG4D dataset.
Results on PSG4D. Table 2 shows that our method also achieves significantly higher performance than the PSG4DFormer model. Specifically, when working with point cloud videos on PSG4D-GTA, we outperform the baseline method by 1.6/1.4 points. Analogously, on PSG4D-HOI, we outperform PSG4DFormer by 2.0/2.5 points of R/mR@50. These results indicate that our framework bears a valuable impact on both egocentric and third-view videos. We hypothesize that both video types consist of dynamic actions among objects whose mask tube representations should be polished. In addition, when working with RGB-D videos on PSG4D-GTA, we enhance the baseline method by 2.4/2.2 points of R/mR@20. Furthermore, on PSG4D-HOI, our motion-aware contrastive learning also considerably refines PSG4DFormer by 2.0/2.4 points of R/mR@20. Such results verify the generalizability of our motion-aware contrastive framework over natural, point cloud, and RGB-D videos.

Ablation Study
Effect of the contrastive components. We evaluate our framework without the assistance of either the shuffle-based or the triplet-based contrastive objective. As shown in Tables 3 and 5, the performance degrades when we remove either the shuffle-based or the triplet-based contrastive approach. In addition, triplet-based contrastive learning plays a more fundamental role than the shuffle-based one. We hypothesize that shuffle-based contrastive learning is better at focusing on motion semantics than the triplet-based one.
Effect of selecting strong-motion tubes. We evaluate the impact of our strategy for filtering out weak-motion tubes. In Figure 5, we observe a performance boost when we increase the threshold used to select mask tubes with strong motion. However, further elevating the threshold results in performance degradation, since more mask tubes are eliminated, thus limiting the effect of our motion-aware contrastive framework.
Effect of optimal transport distance. In this ablation, we compare various strategies for calculating the similarity between two mask tubes. Results in Tables 4 and 6 show that the proposed optimal transport achieves much higher performance for both natural and 4D video inputs. We conjecture that the other methods, such as pooling followed by cosine similarity or L2 distance, neglect the temporal dimension and flatten the motion nature of the entity mask tubes, thus reducing their effectiveness.

Qualitative Analysis
We visualize examples processed by the state-of-the-art models and ours in Figure 2. As can be observed, our model successfully produces mask tubes overlapping with the groundtruth and, importantly, predicts the correct relations of the subject-object pairs. On the other hand, baseline models tend to prefer more static relations, since during training they do not explicitly focus on motion-sensitive features. Statistics in Figure 1 also substantiate our proposition, in that we achieve considerably higher recalls for dynamic relations than baseline approaches.

Conclusion
In this paper, we propose a motion-aware contrastive learning framework for temporal panoptic scene graph generation. In our framework, we learn close representations for temporal masks of similar entities that exhibit common relations. Moreover, we separate temporal masks from their shuffled versions, and also separate temporal masks of different subject-relation-object triplets. To quantify the relationship among temporal masks in the proposed contrastive framework, we utilize optimal transport to preserve the temporal nature of temporal entity masks. Extensive experiments substantiate the effectiveness of our framework for both natural and 4D videos.
Acknowledgements
This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG3-PhD-2023-08-051T). Thong Nguyen is supported by a Google Ph.D. Fellowship in Natural Language Processing.

References
Chen, Y.; Ma, G.; Yuan, C.; Li, B.; Zhang, H.; Wang, F.; and Hu, W. 2020. Graph convolutional network with structure pooling and joint-wise channel attention for action recognition. Pattern Recognition, 103: 107321.
Chen, Z.; Zheng, T.; and Song, M. 2024. Curriculum Negative Mining For Temporal Networks. arXiv preprint arXiv:2407.17070.
Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1290–1299.
Damen, D.; Doughty, H.; Farinella, G. M.; Furnari, A.; Kazakos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2022. Epic-kitchens-100. International Journal of Computer Vision, 130: 33–55.
Davtyan, A.; Sameni, S.; and Favaro, P. 2023. Efficient video prediction via sparsely conditioned flow matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 23263–23274.
Dong, Q.; and Fu, Y. 2024. MemFlow: Optical Flow Estimation and Prediction with Memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19068–19078.
Driess, D.; Xia, F.; Sajjadi, M. S.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995–19012.
Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; and Loy, C. C. 2023a. Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854.
Li, X.; Yuan, H.; Zhang, W.; Cheng, G.; Pang, J.; and Loy, C. C. 2023b. Tube-Link: A flexible cross tube framework for universal video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13923–13933.
Li, X.; Zhang, W.; Pang, J.; Chen, K.; Cheng, G.; Tong, Y.; and Loy, C. C. 2022. Video k-net: A simple, strong, and unified baseline for video segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18847–18857.
Li, Y.; Yang, X.; and Xu, C. 2022. Dynamic scene graph generation via anticipatory pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13874–13883.
Liu, H.; Min, K.; Valdez, H. A.; and Tripathi, S. 2024. Contrastive Language Video Time Pre-training. arXiv preprint arXiv:2406.02631.
Liu, K.; Li, Y.; Xu, Y.; Liu, S.; and Liu, S. 2022. Spatial focus attention for fine-grained skeleton-based action tasks. IEEE Signal Processing Letters, 29: 1883–1887.
Ma, X.; Yong, S.; Zheng, Z.; Li, Q.; Liang, Y.; Zhu, S.-C.; and Huang, S. 2022. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474.
Nag, S.; Min, K.; Tripathi, S.; and Roy-Chowdhury, A. K. 2023. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22803–22813.
Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227.
Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486.
Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024b. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. arXiv preprint arXiv:2407.03788.
Nguyen, T.; Bin, Y.; Xiao, J.; Qu, L.; Li, Y.; Wu, J. Z.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024c. Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. arXiv preprint arXiv:2406.05615.
Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in Neural Information Processing Systems, 34: 11974–11986.
Nguyen, T.; Wu, X.; Dong, X.; Le, K. M.; Hu, Z.; Nguyen, C.-D.; Ng, S.-K.; and Luu, A. T. 2024d. READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 18824–18832.
Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549.
Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024e. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577.
Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524.
Nguyen, T. T.; Hu, Z.; Wu, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024f. Encoding and Controlling Global Semantics for Long-form Video Question Answering. arXiv preprint arXiv:2405.19723.
Pu, T.; Chen, T.; Wu, H.; Lu, Y.; and Lin, L. 2023. Spatial-temporal knowledge-embedded transformer for video scene graph generation. IEEE Transactions on Image Processing.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
Raychaudhuri, S.; Campari, T.; Jain, U.; Savva, M.; and Chang, A. X. 2023. Reduce, reuse, recycle: Modular multi-object navigation. arXiv preprint arXiv:2304.03696, 2.
Ren, S.; Zhu, H.; Wei, C.; Li, Y.; Yuille, A.; and Xie, C. 2024. ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning. arXiv preprint arXiv:2405.15160.
Rodin, I.; Furnari, A.; Min, K.; Tripathi, S.; and Farinella, G. M. 2024. Action Scene Graphs for Long-Form Understanding of Egocentric Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18622–18632.
Rosa, K. D. 2024. Video Enriched Retrieval Augmented Generation Using Aligned Video Captions. arXiv preprint arXiv:2405.17706.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115: 211–252.
Shang, X.; Di, D.; Xiao, J.; Cao, Y.; Yang, X.; and Chua, T.-S. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 International Conference on Multimedia Retrieval, 279–287.
Shen, H.; Shi, L.; Xu, W.; Cen, Y.; Zhang, L.; and An, G. 2024. Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection. arXiv preprint arXiv:2403.19111.
Sobel, I.; Duda, R.; Hart, P.; and Wiley, J. 2022. Sobel-Feldman operator. Preprint at https://www.researchgate.net/profile/Irwin-Sobel/publication/285159837. Accessed, 20.
Song, X.; Li, Z.; Chen, S.; Cai, X.-Q.; and Demachi, K. 2024. An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video. arXiv preprint arXiv:2404.06741.
Sudhakaran, G.; Dhami, D. S.; Kersting, K.; and Roth, S. 2023. Vision relation transformer for unbiased scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21882–21893.
Wald, J.; Dhamo, H.; Navab, N.; and Tombari, F. 2020. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961–3970.
Wang, G.; Li, Z.; Chen, Q.; and Liu, Y. 2024a. OED: Towards One-stage End-to-End Dynamic Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 27938–27947.
Wang, W.; Luo, Y.; Chen, Z.; Jiang, T.; Yang, Y.; and Xiao, J. 2023. Taking a closer look at visual relation: Unbiased video scene graph generation with decoupled label learning. IEEE Transactions on Multimedia.
Wang, Y.; Yuan, S.; Jian, X.; Pang, W.; Wang, M.; and Yu, N. 2024b. HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models. arXiv preprint arXiv:2404.05083.
Wang, Z.; Zhao, H.; Li, Y.-L.; Wang, S.; Torr, P.; and Bertinetto, L. 2021. Do different tracking tasks require different appearance models? Advances in Neural Information Processing Systems, 34: 726–738.
Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. arXiv preprint arXiv:2405.17957.
Wu, Y.; Shi, M.; Du, S.; Lu, H.; Cao, Z.; and Zhong, W. 2022. 3d instances as 1d kernels. In European Conference on Computer Vision, 235–252. Springer.
Xiao, F.; Tighe, J.; and Modolo, D. 2021. Modist: Motion distillation for self-supervised video representation learning. arXiv preprint arXiv:2106.09703, 3.
Yang, J.; Ang, Y. Z.; Guo, Z.; Zhou, K.; Zhang, W.; and Liu, Z. 2022. Panoptic scene graph generation. In European Conference on Computer Vision, 178–196. Springer.
Yang, J.; Cen, J.; Peng, W.; Liu, S.; Hong, F.; Li, X.; Zhou, K.; Chen, Q.; and Liu, Z. 2024. 4d panoptic scene graph generation. Advances in Neural Information Processing Systems, 36.
Yang, J.; Peng, W.; Li, X.; Guo, Z.; Chen, L.; Li, B.; Ma, Z.; Zhou, K.; Zhang, W.; Loy, C. C.; et al. 2023. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18675–18685.
Zhao, C.; Shen, Y.; Chen, Z.; Ding, M.; and Gan, C. 2023. Textpsg: Panoptic scene graph generation from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2839–2850.
Zhou, H.; Liu, Q.; and Wang, Y. 2023. Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10608–10617.
Zhou, L.; Zhou, Y.; Lam, T. L.; and Xu, Y. 2022. Context-aware mixture-of-experts for unbiased scene graph generation. arXiv preprint arXiv:2208.07109.