scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations. Existing methods encode entity masks tracked across the temporal dimension (mask tubes) and then predict their relations with a temporal pooling operation, which does not fully utilize the motion indicative of the entities' relation. To overcome this limitation, we propose a motion-aware contrastive learning framework that operates directly on the mask tubes of segmented entities.

[Figure 1: R@50 per predicate (on, standing on, sitting on, kicking, running on, opening) for our method and baselines.]

[Figure 2 contents. Left video — groundtruth: child kicking ball_13, child running on grass; IPS+T - Convolution: child_6 next_to ball_13, child_6 on grass_1; Ours: child_6 kicking ball_13, child_6 running on grass_1. Right video — groundtruth: person opening door, person closing door; PSG4DFormer: person_2 next_to door_1, person_2 next_to door_1; Ours: person_2 opening door_1, person_2 closing door_1.]
Figure 2: Examples of temporal panoptic scene graph generation by state-of-the-art methods (Yang et al. 2023, 2024) and our method.
target entities into visual representations, which might not benefit relation classification in panoptic scene graph generation.

In this paper, to encourage representation learning to capture motion patterns for temporal panoptic scene graph generation, we propose a novel contrastive learning framework that focuses on the mask tubes of segmented entities. First, we force a mask tube and a tube of a similar subject-relation-object triplet from a different video to obtain close representations. Since positive mask tubes originate from distinct video clips, the model cannot rely upon visual semantics to optimize the contrastive objective, but instead depends on the evolution of the motion trajectory, which is our target component for representation learning. Second, we push away negative mask tubes generated by temporally shuffling the original tubes. Moreover, we also push apart representations of mask tubes from the same video but belonging to different triplets. Because mask tubes of different triplets from a common video share close visual features yet are pushed apart, we once again motivate the model to generate representations that rely less on visual semantics and more on motion-sensitive features. In addition, the visually similar negative mask tubes can play the role of hard negative samples, thus accelerating the contrastive learning process (Chen, Zheng, and Song 2024).

Moreover, in order to implement our motion-aware contrastive learning framework, there is a need to quantify the relationship between mask tubes. This quantification is a challenging problem, since mask tubes are sequences of segmentation masks that span the sequence of video frames. Furthermore, the mask tubes of two triplets might exhibit different lengths, since two events often occur at different speeds. Unfortunately, the popular pipeline of temporal pooling followed by similarity estimation flattens the temporal dimension of the mask tubes and neglects their motion features. To resolve this problem, we consider the mask tubes of two triplets as two distributions, seek the optimal transportation map between them, and utilize the transport distance as the distance between the two triplets' tubes. Such a transport scheme synchronizes the motion states of the two triplets and takes advantage of the mask tubes' evolutionary trajectory.

To sum up, our contributions are as follows:
• We propose a novel contrastive learning framework for temporal panoptic scene graph generation which pulls together entity mask tubes with similar motion patterns and pushes apart those with distinct motion patterns.
• We utilize the optimal transport distance to estimate the relationship between two events' mask tubes for the proposed contrastive framework.
• Comprehensive experiments demonstrate that our framework outperforms state-of-the-art methods on both natural and 4D video datasets, especially on recognizing dynamic subject-object relations.

Related Work
Temporal panoptic scene graph generation. Traditional research mainly focuses on generating scene graphs for natural videos, in which nodes represent objects and edges represent relations between objects. To localize object instances, the nodes are grounded by bounding boxes (Wang et al. 2024a; Nag et al. 2023; Pu et al. 2023). Despite this progress, traditional video scene graph generation is limited by noisy grounding from coarse bounding box annotations and by a trivial relation taxonomy. Recent works have addressed these issues by proposing panoptic video and 4D scene graph generation (Yang et al. 2023, 2024). Classic video scene graph generation methods have been dominated by the two-stage pipeline consisting of object detection and pairwise predicate classification (Rodin et al. 2024; Nag et al. 2023). This design has been generalized to panoptic approaches (Yang et al. 2022, 2023, 2024), in which the pipeline comprises panoptic segmentation followed by a predicate classification step.
Video representation learning. Video representation learning has gained popularity in recent years (Davtyan, Sameni, and Favaro 2023; Nguyen et al. 2024d; Shen et al. 2024; Nguyen et al. 2024f). Most approaches can be categorized into two groups: pretext-based and contrastive-based. Pretext-based methods mostly leverage pretext learning tasks such as optical flow prediction (Dong and Fu 2024; Davtyan, Sameni, and Favaro 2023) and temporal order prediction (Shen et al. 2024; Ren et al. 2024). However, these tasks are considerably influenced by low-level features and are incapable of delving into the high-level semantics of the video (Nguyen et al. 2024c). Contrastive-based methods primarily construct positive samples (Nguyen and Luu 2021; Nguyen et al. 2024a, 2022, 2024e, 2023a; Wu et al. 2024) by sampling video clips from the same video (Nguyen et al. 2024b; Liu et al. 2024) or by applying various frame-based data augmentation techniques (Wang et al. 2024b; Song et al. 2024; Rosa 2024), thereby increasing the similarity of positive pairs while simultaneously decreasing that of negative pairs. Nevertheless, these methods focus on whole video frames and are not suitable for object masks tracked across the temporal dimension.
Problem Formulation
Temporal panoptic scene graph generation (TPSGG) is a task to generate a dynamic scene graph given an input video. In the generated scene graph, each node corresponds to an entity and each edge corresponds to a spatial-temporal relation between two entities. Formally, the input of a TPSGG model is a video clip V, particularly V ∈ R^{T×H×W×3} for a natural video, V ∈ R^{T×H×W×4} for a 4D RGB-D video, and V ∈ R^{T×M×6} for a 4D point cloud video, where T denotes the number of frames, M the number of points of interest, and the frame size H × W remains consistent across the video. The output of the model is a dynamic scene graph G. The TPSGG task can be formulated as follows:

... masked cross-attention. Receiving a video V, the model produces a set of queries {q_i}_{i=1}^N, where each query q_i corresponds to one entity. Subsequently, every query is forwarded to two multi-layer perceptrons (MLPs) to project the queries into mask classification and mask regression outputs.
Training and inference. During training, each query is matched to a groundtruth mask through mask-based bipartite matching to calculate the segmentation loss. During inference, IPS+T generates panoptic segmentation masks for each frame and uses the tracker to obtain N tracked mask tubes. In contrast, VPS employs the query embeddings of the target and reference frames and performs query-wise similarity tracking to obtain N tracked mask tubes.
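To make the input shapes and outputs above concrete, the following is a minimal, hypothetical data-layout sketch in Python/NumPy. The class and function names are our own illustration rather than the paper's code; only the stated tensor shapes follow the formulation above, and the channel layout of the point cloud input is an assumption.

```python
# Illustrative data layout for TPSGG; names are ours, shapes follow the formulation.
from dataclasses import dataclass
import numpy as np

@dataclass
class MaskTube:
    entity_id: int           # e.g. 6 for "child_6"
    category: str            # e.g. "child"
    masks: np.ndarray        # boolean array of shape (T, H, W), one mask per frame

@dataclass
class RelationTriplet:
    subject: MaskTube
    obj: MaskTube
    predicate: str           # e.g. "kicking"
    frame_span: tuple        # (start_frame, end_frame) where the relation holds (illustrative field)

def make_natural_video(T: int, H: int, W: int) -> np.ndarray:
    """A natural video clip V with shape (T, H, W, 3)."""
    return np.zeros((T, H, W, 3), dtype=np.float32)

def make_rgbd_video(T: int, H: int, W: int) -> np.ndarray:
    """A 4D RGB-D video clip V with shape (T, H, W, 4)."""
    return np.zeros((T, H, W, 4), dtype=np.float32)

def make_point_cloud_video(T: int, M: int) -> np.ndarray:
    """A 4D point cloud video clip V with shape (T, M, 6);
    the 6 channels are commonly xyz + rgb, which is an assumption here."""
    return np.zeros((T, M, 6), dtype=np.float32)

# A dynamic scene graph G can then be represented as a list of RelationTriplet
# objects whose nodes (MaskTube) are grounded by per-frame segmentation masks.
```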
[Figure 3 contents: a panoptic segmentation model and a relation model applied to video 0020_10793023296, producing "adult - running on - grass" (frames #0–#23) and "adult - throwing - ball" (frames #45–#49); positive sampling of "adult - throwing - ball" across videos 0020_10793023296 and 0028_4021064662; tubes pooled and treated as distributions, with an optimal transport distance and pull/push contrastive objectives.]
Figure 3: Framework overview of contrastive learning for temporal scene graph generation.
where sim denotes the similarity function defined upon a pair of mask tube representations. The formulation shows that what the model learns is largely dependent upon how positive and negative samples are generated.
Positive sampling. To satisfy our motion-aware requirement for contrastive learning, we extract mask tube representations from entities of the same subject and object category that exhibit a similar groundtruth relation in another video. Since the two videos possess distinct visual features, the model must rely on the shared motion pattern of similar subject-relation-object triplets to associate the anchor and the positive sample.
Negative sampling. For negative sampling, we design two strategies, which result in two contrastive approaches, i.e., shuffle-based and triplet-based contrastive learning.

Shuffle-based contrastive learning
In our shuffle-based approach, we create negative samples by applying a series of temporal permutations π to the anchor tube, i.e., shuffling:

    H^n = π(H^a).   (6)

As such, the contrastive objective forces the model to push representations of the anchor tube, which is in the normal order, away from the shuffled tube, which exhibits a distorted motion due to the shuffled order. This makes the learned representation sensitive to frame ordering, i.e., motion-aware, as the anchor H^a and the negative tube H^n share visual semantics and can only be distinguished using motion information.
Selecting strong-motion mask tubes. However, there exists a potential risk: for static relations such as on, next to, and in, mask tubes might involve almost no motion. As a result, the shuffled tube would become identical to the anchor, and the model would not be able to differentiate them and learn reasonably. To address this problem, we propose a strategy to select strong-motion tubes for shuffling, which we illustrate in Figure 4. Given a video, our aim is to select mask tubes that carry strong motion for shuffling. To measure the motion of a mask tube, we utilize optical flow edges (Xiao, Tighe, and Modolo 2021). We estimate flow edges by applying a Sobel filter (Sobel et al. 2022) to the flow magnitude map and take the median over the flow edge pixels within the entity masks. Then, we select mask tubes whose maximum value across the optical flow surpasses a threshold γ.

[Figure 4 contents: motion amplitude per frame for two example tubes, "adult - throwing - ball" and "adult - touching - oven".]
Figure 4: Proposed strategy to select strong-motion tubes.

Triplet-based contrastive learning
To take advantage of motion-aware signals from triplets of similar subject-relation-object categories, we design a triplet-based approach to create negative samples. A naive approach would be to sample mask tubes of any subject-relation-object triplet distinct from the anchor sample. However, if we run into triplets with all distinct subject, relation, and object categories, the negative pair would be trivial for the model to distinguish, resulting in less effective learning. In order to create harder negative samples, we choose negative mask tubes from the same video as the anchor. We create a multinomial distribution in which triplets that share more subject, relation, or object categories with the anchor are more likely to be drawn. Hence, our negative samples hold close visual semantics with the anchor sample, increasing the likelihood that the model depends on motion semantics to push them apart. From a contrastive learning perspective, these samples form hard negative samples that accelerate the learning process (Chen, Zheng, and Song 2024).
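The sampling strategies and the strong-motion filter described above lend themselves to a short sketch. Below is a minimal NumPy/SciPy illustration, assuming per-frame optical-flow magnitude maps and boolean entity masks are already available. The function names, the triplet dictionary layout, and the simple "1 + shared" multinomial weighting are our own illustrative choices, not the paper's implementation; the default γ = 9.0 mirrors the reported threshold, although the score scale of this sketch need not match the paper's.

```python
# Illustrative sampling utilities for the motion-aware contrastive framework.
import numpy as np
from scipy.ndimage import sobel

def flow_edge_motion_scores(flow_mag, masks):
    """Per-frame motion score of one mask tube.
    flow_mag: (T, H, W) optical-flow magnitude maps.
    masks:    (T, H, W) boolean entity masks.
    Returns a length-T array of median flow-edge values inside the entity mask."""
    scores = np.zeros(len(flow_mag))
    for t, (mag, m) in enumerate(zip(flow_mag, masks)):
        edges = np.hypot(sobel(mag, axis=0), sobel(mag, axis=1))  # Sobel flow edges
        vals = edges[m]
        scores[t] = np.median(vals) if vals.size else 0.0
    return scores

def is_strong_motion(flow_mag, masks, gamma=9.0):
    """Keep a tube for shuffling only if its peak motion score exceeds gamma."""
    return flow_edge_motion_scores(flow_mag, masks).max() > gamma

def shuffle_negative(tube_repr, rng):
    """Shuffle-based negative, H^n = pi(H^a): a random temporal permutation of the
    anchor's per-frame representations, shaped (T, D)."""
    return tube_repr[rng.permutation(len(tube_repr))]

def sample_positive(anchor, bank, rng):
    """Positive sampling: a tube with the same subject-relation-object categories
    drawn from a different video than the anchor."""
    cands = [t for t in bank
             if (t["subject"], t["relation"], t["object"])
             == (anchor["subject"], anchor["relation"], anchor["object"])
             and t["video_id"] != anchor["video_id"]]
    return cands[rng.integers(len(cands))] if cands else None

def triplet_negative_weights(anchor, candidates):
    """Triplet-based negatives: candidates from the anchor's own video are weighted
    by how many of (subject, relation, object) they share with the anchor, so that
    visually/semantically closer triplets are drawn more often."""
    shared = np.array([(c["subject"] == anchor["subject"])
                       + (c["relation"] == anchor["relation"])
                       + (c["object"] == anchor["object"]) for c in candidates], float)
    weights = 1.0 + shared  # one simple choice; the exact weighting is not specified here
    return weights / weights.sum()

# Example: draw one triplet-based negative from a multinomial over candidates.
rng = np.random.default_rng(0)
anchor = {"video_id": "0020_10793023296", "subject": "adult",
          "relation": "throwing", "object": "ball"}
cands = [{"video_id": "0020_10793023296", "subject": "adult",
          "relation": "touching", "object": "oven"},
         {"video_id": "0020_10793023296", "subject": "adult",
          "relation": "holding", "object": "ball"}]
neg_idx = rng.choice(len(cands), p=triplet_negative_weights(anchor, cands))
```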
Optimal Transport for Mask Tube Relation Quantification
There is one remaining problem, i.e., how to define the similarity function sim for the representations H_i and H_j of two mask tubes. In this work, we consider the two mask tubes as two discrete distributions µ and ν whose supports are H_i and H_j, respectively. Formally, µ = Σ_{k=1}^{T_i} a_k δ_{h_{i,k}} and ν = Σ_{l=1}^{T_j} b_l δ_{h_{j,l}}, where δ_{h_{i,k}} and δ_{h_{j,l}} denote the Dirac functions centered upon h_{i,k} and h_{j,l}, respectively. The weights of the supports are a = 1_{T_i}/T_i and b = 1_{T_j}/T_j.
After defining the distribution scheme, we propose the tube alignment optimization problem, which is to find the transport plan that achieves the minimum distance between µ and ν:

    d_OT = D_OT(µ, ν) = min_{T ∈ Π(a,b)} Σ_{k=1}^{T_i} Σ_{l=1}^{T_j} T_{k,l} · c(h_{i,k}, h_{j,l}),   (7)
    s.t. Π(a, b) = { T ∈ R_+^{T_i×T_j} | T·1_{T_j} ≤ a, T^⊤·1_{T_i} ≤ b, 1_{T_i}^⊤·T·1_{T_j} = s, 0 ≤ s ≤ min(T_i, T_j) },   (8)

where c denotes a pre-defined distance between two vectors. We implement the cost c(h_{i,k}, h_{j,l}) = 1 − (h_{i,k} · h_{j,l}) / (‖h_{i,k}‖_2 ‖h_{j,l}‖_2), i.e., the cosine distance. As the exact optimization over the transport plan T is intractable, we adopt a Sinkhorn-based algorithm to estimate T and delineate the procedure in Algorithm 1. To turn the distance into a similarity value, we take its negative and add it to a pre-defined margin α:

    sim(h^a, h^p) = α − d_OT.   (9)

Algorithm 1: Computing the optimal transport distance
Require: cost matrix C ∈ R^{T_i×T_j} with C_{l,k} = c(h_{i,l}, h_{j,k}), marginals a ∈ R^{T_i}, b ∈ R^{T_j}, temperature τ, iteration count N_iter
  d_OT ← ∞
  for s = 1 to min(T_i, T_j) do
    T ← exp(−C/τ)
    T ← s·T / (1_{T_i}^⊤·T·1_{T_j})
    for i = 1 to N_iter do
      p_a ← min(a / (T·1_{T_j}), 1_{T_i});  T_a ← diag(p_a)·T
      p_b ← min(b / (T_a^⊤·1_{T_i}), 1_{T_j});  T_b ← T_a·diag(p_b)
      T ← s·T_b / (1_{T_i}^⊤·T_b·1_{T_j})
    end for
    d_OT ← min(d_OT, Σ_{k=1}^{T_i} Σ_{l=1}^{T_j} T_{k,l}·C_{k,l})
  end for
  return d_OT
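For concreteness, here is a minimal NumPy sketch of one way to implement the Sinkhorn-style estimate in Algorithm 1 together with the similarity of Eq. (9). It follows our reading of the pseudocode above and is not the authors' implementation: the margin α = 10.0 and N_iter = 1,000 match the reported settings, while the temperature τ and the small stabilizing constants are our own choices.

```python
# Sketch of the optimal transport distance between two tube representations.
import numpy as np

def cosine_cost(Hi, Hj, eps=1e-8):
    """C[k, l] = 1 - cos(h_{i,k}, h_{j,l}) for Hi of shape (Ti, D), Hj of shape (Tj, D)."""
    Hi = Hi / (np.linalg.norm(Hi, axis=1, keepdims=True) + eps)
    Hj = Hj / (np.linalg.norm(Hj, axis=1, keepdims=True) + eps)
    return 1.0 - Hi @ Hj.T

def ot_distance(Hi, Hj, tau=0.1, n_iter=1000):
    Ti, Tj = len(Hi), len(Hj)
    C = cosine_cost(Hi, Hj)
    a = np.ones(Ti) / Ti                       # uniform support weights
    b = np.ones(Tj) / Tj
    d_ot = np.inf
    for s in range(1, min(Ti, Tj) + 1):        # search over the transported mass s
        T = np.exp(-C / tau)
        T = s * T / T.sum()
        for _ in range(n_iter):
            # Cap row/column marginals at a and b (partial-transport projections).
            p_a = np.minimum(a / (T.sum(axis=1) + 1e-12), 1.0)
            T_a = p_a[:, None] * T
            p_b = np.minimum(b / (T_a.sum(axis=0) + 1e-12), 1.0)
            T_b = T_a * p_b[None, :]
            T = s * T_b / (T_b.sum() + 1e-12)
        d_ot = min(d_ot, float((T * C).sum()))
    return d_ot

def tube_similarity(Hi, Hj, alpha=10.0, **kw):
    """Eq. (9): sim = alpha - d_OT (alpha = 10.0 follows the reported margin)."""
    return alpha - ot_distance(Hi, Hj, **kw)

# Example with random features: two tubes of different lengths.
rng = np.random.default_rng(0)
print(tube_similarity(rng.normal(size=(5, 16)), rng.normal(size=(7, 16)), n_iter=50))
```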
Experiments
We conduct comprehensive experiments to evaluate the effectiveness of our motion-aware contrastive framework. We first describe the experiment settings, covering the evaluation datasets, evaluation metrics, baseline methods, and implementation details. Next, we present quantitative results of our method, then provide an ablation study and careful analysis to explore the properties of our motion-aware contrastive framework. Eventually, we conduct qualitative analysis to concretely examine its behavior.

Experiment Settings
Datasets. We assess the effectiveness of our method on natural and 4D video inputs. The corresponding dataset for each input type is as follows:
• Open-domain Panoptic video scene graph generation (OpenPVSG) (Yang et al. 2023): OpenPVSG consists of scene graphs and associated segmentation masks with respect to the subject and object nodes in the scene graph. The dataset comprises 400 videos, including 289 third-person videos from VidOR (Shang et al. 2019) and 111 egocentric videos from Epic-Kitchens (Damen et al. 2022) and Ego4D (Grauman et al. 2022).
• Panoptic scene graph generation for 4D (PSG4D) (Yang et al. 2024): The PSG4D dataset is divided into two groups, i.e., PSG4D-GTA and PSG4D-HOI. PSG4D-GTA comprises 67 third-view videos with an average length of 84 seconds, 35 object categories, and 43 relationship categories. On the contrary, PSG4D-HOI contains 2,973 videos from an egocentric perspective, whose average duration is 20 seconds. The PSG4D-HOI videos are mostly related to indoor scenes, covering 46 object categories and 15 relationship categories.
Evaluation metrics. We use the recall at K (R@K) and mean recall at K (mR@K) metrics, which are standard in scene graph generation tasks. Both R@K and mR@K consider the top-K triplets predicted by the panoptic scene graph generation model. A successful recall of a predicted triplet must satisfy the following criteria: 1) correct category labels for the subject, object, and predicate; 2) a volume Intersection over Union (vIoU) greater than or equal to 0.5 between the predicted mask tubes and the groundtruth tubes. For extensive comparison, we also report results with a vIoU threshold of 0.1.
Baseline methods. We compare our method with a comprehensive list of baseline approaches for temporal panoptic scene graph generation: (i) IPS+T - Vanilla (Yang et al. 2023) uses an image panoptic segmentation (IPS) model with a tracker for segmentation, and fully-connected layers to separately encode temporal states of entity mask tubes; (ii) IPS+T - Handcrafted filter (Yang et al. 2023) uses an IPS model with a tracker for segmentation, and a manually-designed kernel to encode entity mask tubes; (iii) IPS+T - Convolution (Yang et al. 2023) uses an IPS model with a tracker for segmentation, and learnable convolutional layers to encode entity mask tubes; (iv) IPS+T - Transformer (Yang et al. 2023) uses an IPS model with a tracker for segmentation, and a Transformer-based encoder with self-attention layers to encode entity mask tubes; (v) VPS - Vanilla (Yang et al. 2023) is similar to IPS+T - Vanilla, but uses a video panoptic segmentation (VPS) model for panoptic segmentation; (vi) VPS - Handcrafted filter (Yang et al. 2023) is similar to IPS+T - Handcrafted filter, but uses a VPS model for segmentation; (vii) VPS - Convolution (Yang et al. 2023) is similar to IPS+T - Convolution, but uses a VPS model for segmentation; (viii) VPS - Transformer (Yang et al. 2023) is similar to IPS+T - Transformer, but uses a VPS model for segmentation; (ix) 3D-SGG (Wald et al. 2020) is based on PointNet (Qi et al. 2017) and a graph convolutional network (Kipf and Welling 2016) but neglects the depth dimension and generates panoptic scene graphs for 4D video inputs; (x) PSG4DFormer (Yang et al. 2024) is a specialized model for 4D inputs, using Mask2Former (Cheng et al. 2022) for segmentation and a spatial-temporal Transformer to encode object mask tubes for relation classification.
Implementation details. For fair comparison, we experiment with our contrastive framework using both IPS+T and VPS as the segmentation module for panoptic video scene graph generation. In the former case, we leverage the UniTrack tracker (Wang et al. 2021) combined with a Mask2Former model (Cheng et al. 2022), which is initialized from the best-performing COCO-pretrained weights and fine-tuned for 8 epochs using the AdamW optimizer with a batch size of 32, a learning rate of 0.0001, weight decay of 0.05, and gradient clipping with a max L2 norm of 0.01. In the latter case, we utilize Video K-Net (Li et al. 2022), also initialized from COCO-pretrained weights and fine-tuned with the same strategy as IPS+T. In the relation classification step, we conduct fine-tuning with a batch size of 32, employing the Adam optimizer with a learning rate of 0.001. For 4D panoptic scene graph generation, we adopt the PSG4DFormer baseline. To work with RGB-D and point cloud videos, we use an ImageNet-pretrained ResNet-101 (Russakovsky et al. 2015) and DKNet (Wu et al. 2022) as the visual encoder, respectively. We fine-tune the segmentation module for RGB-D and point cloud videos for 12 and 200 epochs, respectively, and use an additional 100 epochs to train the relation classification module. Based on validation, we adopt a threshold γ = 9.0 and a margin α = 10.0. We set the maximum number of iterations N_iter to 1,000.
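As a concrete reading of the vIoU-based recall criterion described in the evaluation metrics above, the following small NumPy sketch computes the volume IoU between a predicted and a groundtruth mask tube and applies an R@K matching rule. It is illustrative only; the exact matching protocol (e.g., tie-breaking, requiring both subject and object tubes to pass the threshold) follows common practice rather than the paper's released evaluation code.

```python
# Illustrative evaluation sketch: volume IoU between mask tubes and a simple R@K.
import numpy as np

def volume_iou(pred_tube, gt_tube):
    """pred_tube, gt_tube: boolean arrays of shape (T, H, W)."""
    inter = np.logical_and(pred_tube, gt_tube).sum()
    union = np.logical_or(pred_tube, gt_tube).sum()
    return inter / union if union > 0 else 0.0

def recall_at_k(pred_triplets, gt_triplets, k=20, viou_thr=0.5):
    """pred_triplets: list of dicts sorted by confidence, each with
    'labels' = (subject, predicate, object) and 'subj_tube'/'obj_tube' masks.
    gt_triplets: same structure. Returns the fraction of groundtruth triplets
    recalled by the top-k predictions (mR@K averages this per predicate class)."""
    matched = np.zeros(len(gt_triplets), dtype=bool)
    for pred in pred_triplets[:k]:
        for g, gt in enumerate(gt_triplets):
            if matched[g] or pred["labels"] != gt["labels"]:
                continue
            if (volume_iou(pred["subj_tube"], gt["subj_tube"]) >= viou_thr and
                    volume_iou(pred["obj_tube"], gt["obj_tube"]) >= viou_thr):
                matched[g] = True
                break
    return matched.mean() if len(gt_triplets) else 0.0
```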
Method | R/mR@20 (vIoU 0.5) | R/mR@50 (vIoU 0.5) | R/mR@100 (vIoU 0.5) | R/mR@20 (vIoU 0.1) | R/mR@50 (vIoU 0.1) | R/mR@100 (vIoU 0.1)
IPS+T - Vanilla | 3.04 / 1.35 | 4.61 / 2.94 | 5.56 / 3.33 | 8.28 / 5.68 | 14.47 / 9.92 | 18.24 / 11.84
IPS+T - Handcrafted filter | 2.52 / 1.72 | 3.77 / 2.36 | 4.72 / 2.79 | 8.07 / 5.61 | 13.42 / 8.27 | 16.46 / 10.11
IPS+T - Transformer | 3.88 / 2.81 | 5.66 / 4.12 | 6.18 / 4.44 | 9.01 / 6.69 | 14.88 / 11.28 | 17.51 / 13.20
IPS+T - Convolution | 3.88 / 2.55 | 5.24 / 3.29 | 6.71 / 5.36 | 10.06 / 8.98 | 14.99 / 12.21 | 18.13 / 15.47
Ours - Transformer | 3.98 / 2.98 | 5.97 / 4.20 | 7.44 / 5.15 | 10.59 / 9.56 | 16.98 / 12.39 | 22.33 / 17.47
Ours - Convolution | 4.51 / 3.56 | 6.08 / 4.38 | 7.76 / 5.86 | 11.43 / 9.57 | 17.30 / 13.13 | 22.85 / 17.48
VPS - Vanilla | 0.21 / 0.10 | 0.21 / 0.10 | 0.31 / 0.18 | 6.29 / 3.04 | 9.64 / 6.74 | 12.89 / 9.60
VPS - Handcrafted filter | 0.42 / 0.13 | 0.52 / 0.50 | 0.94 / 0.92 | 5.24 / 2.84 | 7.65 / 7.14 | 9.64 / 8.22
VPS - Transformer | 0.42 / 0.61 | 0.73 / 0.76 | 1.05 / 0.92 | 6.50 / 5.75 | 9.64 / 8.25 | 12.26 / 9.51
VPS - Convolution | 0.42 / 0.25 | 0.63 / 0.67 | 0.63 / 0.67 | 8.07 / 7.84 | 11.01 / 9.78 | 12.89 / 10.77
Ours - Transformer | 0.63 / 0.83 | 1.05 / 0.76 | 1.05 / 0.76 | 6.71 / 6.94 | 10.27 / 8.68 | 13.42 / 12.09
Ours - Convolution | 0.84 / 0.98 | 1.26 / 1.22 | 1.26 / 1.22 | 8.18 / 8.00 | 12.90 / 11.47 | 14.22 / 13.59
Table 1: Experimental results on the OpenPVSG dataset.

Input type | Method | R/mR@20 (PSG4D-GTA) | R/mR@50 (PSG4D-GTA) | R/mR@100 (PSG4D-GTA) | R/mR@20 (PSG4D-HOI) | R/mR@50 (PSG4D-HOI) | R/mR@100 (PSG4D-HOI)
Point cloud videos | 3DSGG | 1.48 / 0.73 | 2.16 / 0.79 | 2.92 / 0.85 | 3.46 / 2.19 | 3.15 / 2.47 | 4.96 / 2.84
Point cloud videos | PSG4DFormer | 4.33 / 2.10 | 4.83 / 2.93 | 5.22 / 3.13 | 5.36 / 3.10 | 5.61 / 3.95 | 6.76 / 4.17
Point cloud videos | Ours | 5.88 / 3.45 | 6.31 / 3.70 | 7.31 / 4.70 | 7.28 / 5.09 | 7.62 / 6.49 | 9.18 / 6.85
RGB-D videos | 3DSGG | 2.29 / 0.92 | 2.46 / 1.01 | 3.81 / 1.45 | 4.23 / 2.19 | 4.47 / 2.31 | 4.86 / 2.41
RGB-D videos | PSG4DFormer | 6.68 / 3.31 | 7.17 / 3.85 | 7.22 / 4.02 | 5.62 / 3.65 | 6.16 / 4.16 | 6.28 / 4.97
RGB-D videos | Ours | 9.07 / 5.52 | 9.73 / 6.32 | 9.73 / 6.32 | 7.63 / 6.09 | 8.36 / 6.94 | 8.53 / 8.29
Table 2: Experimental results on both PSG4D-GTA and PSG4D-HOI groups of the PSG4D dataset.
Table 3: Ablation results for contrastive learning approaches on the OpenPVSG dataset. We adopt the vIoU threshold of 0.5.

Tube relation quantification | R/mR@20 | R/mR@50 | R/mR@100
Pooling - Cosine similarity | 4.44 / 3.40 | 6.04 / 4.36 | 7.36 / 5.84
Pooling - L2 | 4.36 / 3.37 | 5.95 / 3.77 | 7.29 / 5.80
Optimal transport | 4.51 / 3.56 | 6.08 / 4.38 | 7.44 / 5.86
Table 4: Ablation results for the mask tube relation quantification method on the OpenPVSG dataset.

[Figure 5 contents: R@50 (approximately 5.925–6.000) as a function of the threshold γ (6–10).]
Figure 5: Ablation results on threshold γ.
Main Results
Results on OpenPVSG. As shown in Table 1, we substantially outperform both IPS+T - Convolution and IPS+T - Transformer when we use IPS+T for segmentation. In particular, using the higher vIoU threshold to filter out inaccurate segmentation, we surpass IPS+T - Transformer by 1.3/0.7 points of R/mR@100, while surpassing IPS+T - Convolution by 0.8/1.1 points of R/mR@50. In addition, for the less strict vIoU threshold, we outperform IPS+T - Transformer by 1.6/2.9 points of R/mR@20, and IPS+T - Convolution by 2.3/0.9 points of R/mR@50. These results demonstrate that our method contributes positively to temporal panoptic scene graph generation, not only for popular but also for unpopular relation classes.
Input type | Method | R/mR@20 (PSG4D-GTA) | R/mR@50 (PSG4D-GTA) | R/mR@100 (PSG4D-GTA) | R/mR@20 (PSG4D-HOI) | R/mR@50 (PSG4D-HOI) | R/mR@100 (PSG4D-HOI)
Point cloud videos | w/o shuffle-based | 5.56 / 2.92 | 5.57 / 2.98 | 6.51 / 4.36 | 6.56 / 4.29 | 6.98 / 6.25 | 8.76 / 6.43
Point cloud videos | w/o triplet-based | 5.77 / 2.93 | 5.59 / 3.26 | 6.53 / 4.39 | 6.67 / 4.85 | 7.52 / 6.31 | 8.84 / 6.43
Point cloud videos | Ours | 5.88 / 3.45 | 6.31 / 3.70 | 7.31 / 4.70 | 7.28 / 5.09 | 7.62 / 6.49 | 9.18 / 6.85
RGB-D videos | w/o shuffle-based | 8.35 / 5.34 | 8.76 / 5.68 | 8.88 / 5.53 | 7.00 / 5.53 | 7.51 / 6.02 | 7.56 / 7.42
RGB-D videos | w/o triplet-based | 9.00 / 5.46 | 9.71 / 5.95 | 9.63 / 5.82 | 7.12 / 6.03 | 8.31 / 6.51 | 8.24 / 7.95
RGB-D videos | Ours | 9.07 / 5.52 | 9.73 / 6.32 | 9.73 / 6.32 | 7.63 / 6.09 | 8.36 / 6.94 | 8.53 / 8.29
Table 5: Ablation results for contrastive learning approaches on the PSG4D dataset.

Input type | Tube relation quantification | R/mR@20 (PSG4D-GTA) | R/mR@50 (PSG4D-GTA) | R/mR@100 (PSG4D-GTA) | R/mR@20 (PSG4D-HOI) | R/mR@50 (PSG4D-HOI) | R/mR@100 (PSG4D-HOI)
Point cloud videos | Pooling - Cosine similarity | 5.76 / 2.87 | 6.02 / 3.62 | 6.84 / 4.11 | 7.24 / 4.45 | 7.44 / 6.27 | 8.20 / 6.64
Point cloud videos | Pooling - L2 | 5.46 / 2.78 | 5.38 / 3.39 | 6.51 / 3.86 | 6.72 / 4.11 | 6.74 / 6.05 | 7.96 / 6.11
Point cloud videos | Optimal transport | 5.88 / 3.45 | 6.31 / 3.70 | 7.31 / 4.70 | 7.28 / 5.09 | 7.62 / 6.49 | 9.18 / 6.85
RGB-D videos | Pooling - Cosine similarity | 9.03 / 5.37 | 9.47 / 5.86 | 9.70 / 6.02 | 7.36 / 5.43 | 7.93 / 6.70 | 8.06 / 7.42
RGB-D videos | Pooling - L2 | 8.89 / 4.70 | 8.90 / 5.41 | 9.08 / 5.78 | 6.65 / 5.26 | 7.74 / 6.29 | 7.95 / 7.39
RGB-D videos | Optimal transport | 9.07 / 5.52 | 9.73 / 6.32 | 9.73 / 6.32 | 7.63 / 6.09 | 8.36 / 6.94 | 8.53 / 8.29
Table 6: Ablation results for the mask tube relation quantification method on the PSG4D dataset.
Results on PSG4D. Table 2 shows that our method also achieves significantly higher performance than the PSG4DFormer model. Specifically, when working with point cloud videos on PSG4D-GTA, we outperform the baseline method by 1.6/1.4 points. Analogously, on PSG4D-HOI, we outperform PSG4DFormer by 2.0/2.5 points of R/mR@50. These results indicate that our framework bears a valuable impact on both egocentric and third-view videos. We hypothesize that both video types consist of dynamic actions among objects whose mask tube representations should be polished. In addition, when working with RGB-D videos on PSG4D-GTA, we enhance the baseline method by 2.4/2.2 points of R/mR@20. Furthermore, on PSG4D-HOI, our motion-aware contrastive learning also considerably refines PSG4DFormer by 2.0/2.4 points of R/mR@20. Such results verify the generalizability of our motion-aware contrastive framework over natural, point cloud, and RGB-D videos.

Ablation Study
Effect of the contrastive components. We evaluate our framework without the assistance of either the shuffle-based or the triplet-based contrastive objective. As shown in Tables 3 and 5, the performance degrades when we remove either the shuffle-based or the triplet-based contrastive approach. In addition, triplet-based contrastive learning plays a more fundamental role than the shuffle-based one. We hypothesize that shuffle-based contrastive learning is better at focusing on motion semantics than the triplet-based one.
Effect of selecting strong-motion tubes. We evaluate the impact of our strategy for filtering out weak-motion tubes. In Figure 5, we observe a performance boost when we increase the threshold used to select mask tubes with strong motion. However, further elevating the threshold results in performance degradation, since more mask tubes are eliminated, thus limiting the effect of our motion-aware contrastive framework.
Effect of optimal transport distance. In this ablation, we compare various strategies for calculating the similarity between two mask tubes. Results in Tables 4 and 6 show that the proposed optimal transport achieves much higher performance for both natural and 4D video inputs. We conjecture that the other methods, such as pooling followed by cosine similarity or L2 distance, neglect the temporal dimension and flatten the motion nature of the entity mask tubes, thus reducing their effectiveness.

Qualitative Analysis
We visualize examples processed by the state-of-the-art models and ours in Figure 2. As can be observed, our model successfully produces mask tubes overlapping with the groundtruth and, importantly, predicts the correct relations of the subject-object pairs. On the other hand, baseline models tend to prefer more static relations, since during training they do not explicitly focus on motion-sensitive features. Statistics in Figure 1 also substantiate our proposition, in that we achieve considerably higher recalls for dynamic relations than baseline approaches.

Conclusion
In this paper, we propose a motion-aware contrastive learning framework for temporal panoptic scene graph generation. In our framework, we learn close representations for temporal masks of similar entities that exhibit common relations. Moreover, we separate temporal masks from their shuffled versions, and also separate temporal masks of different subject-relation-object triplets. To quantify the relationship among temporal masks in the proposed contrastive framework, we utilize optimal transport to preserve the temporal nature of temporal entity masks. Extensive experiments substantiate the effectiveness of our framework for both natural and 4D videos.
Acknowledgements
This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG3-PhD-2023-08-051T). Thong Nguyen is supported by a Google Ph.D. Fellowship in Natural Language Processing.

References
Chen, Y.; Ma, G.; Yuan, C.; Li, B.; Zhang, H.; Wang, F.; and Hu, W. 2020. Graph convolutional network with structure pooling and joint-wise channel attention for action recognition. Pattern Recognition, 103: 107321.
Chen, Z.; Zheng, T.; and Song, M. 2024. Curriculum Negative Mining For Temporal Networks. arXiv preprint arXiv:2407.17070.
Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1290–1299.
Damen, D.; Doughty, H.; Farinella, G. M.; Furnari, A.; Kazakos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2022. Epic-kitchens-100. International Journal of Computer Vision, 130: 33–55.
Davtyan, A.; Sameni, S.; and Favaro, P. 2023. Efficient video prediction via sparsely conditioned flow matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 23263–23274.
Dong, Q.; and Fu, Y. 2024. MemFlow: Optical Flow Estimation and Prediction with Memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19068–19078.
Driess, D.; Xia, F.; Sajjadi, M. S.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995–19012.
Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; and Loy, C. C. 2023a. Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854.
Li, X.; Yuan, H.; Zhang, W.; Cheng, G.; Pang, J.; and Loy, C. C. 2023b. Tube-Link: A flexible cross tube framework for universal video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13923–13933.
Li, X.; Zhang, W.; Pang, J.; Chen, K.; Cheng, G.; Tong, Y.; and Loy, C. C. 2022. Video k-net: A simple, strong, and unified baseline for video segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18847–18857.
Li, Y.; Yang, X.; and Xu, C. 2022. Dynamic scene graph generation via anticipatory pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13874–13883.
Liu, H.; Min, K.; Valdez, H. A.; and Tripathi, S. 2024. Contrastive Language Video Time Pre-training. arXiv preprint arXiv:2406.02631.
Liu, K.; Li, Y.; Xu, Y.; Liu, S.; and Liu, S. 2022. Spatial focus attention for fine-grained skeleton-based action tasks. IEEE Signal Processing Letters, 29: 1883–1887.
Ma, X.; Yong, S.; Zheng, Z.; Li, Q.; Liang, Y.; Zhu, S.-C.; and Huang, S. 2022. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474.
Nag, S.; Min, K.; Tripathi, S.; and Roy-Chowdhury, A. K. 2023. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22803–22813.
Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227.
Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486.
Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024b. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. arXiv preprint arXiv:2407.03788.
Nguyen, T.; Bin, Y.; Xiao, J.; Qu, L.; Li, Y.; Wu, J. Z.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024c. Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. arXiv preprint arXiv:2406.05615.
Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in Neural Information Processing Systems, 34: 11974–11986.
Nguyen, T.; Wu, X.; Dong, X.; Le, K. M.; Hu, Z.; Nguyen, C.-D.; Ng, S.-K.; and Luu, A. T. 2024d. READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 18824–18832.
Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549.
Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024e. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577.
Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524.
Nguyen, T. T.; Hu, Z.; Wu, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024f. Encoding and Controlling Global Semantics for Long-form Video Question Answering. arXiv preprint arXiv:2405.19723.
Pu, T.; Chen, T.; Wu, H.; Lu, Y.; and Lin, L. 2023. Spatial-temporal knowledge-embedded transformer for video scene graph generation. IEEE Transactions on Image Processing.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
Raychaudhuri, S.; Campari, T.; Jain, U.; Savva, M.; and Chang, A. X. 2023. Reduce, reuse, recycle: Modular multi-object navigation. arXiv preprint arXiv:2304.03696, 2.
Ren, S.; Zhu, H.; Wei, C.; Li, Y.; Yuille, A.; and Xie, C. 2024. ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning. arXiv preprint arXiv:2405.15160.
Rodin, I.; Furnari, A.; Min, K.; Tripathi, S.; and Farinella, G. M. 2024. Action Scene Graphs for Long-Form Understanding of Egocentric Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18622–18632.
Rosa, K. D. 2024. Video Enriched Retrieval Augmented Generation Using Aligned Video Captions. arXiv preprint arXiv:2405.17706.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115: 211–252.
Shang, X.; Di, D.; Xiao, J.; Cao, Y.; Yang, X.; and Chua, T.-S. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 International Conference on Multimedia Retrieval, 279–287.
Shen, H.; Shi, L.; Xu, W.; Cen, Y.; Zhang, L.; and An, G. 2024. Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection. arXiv preprint arXiv:2403.19111.
Sobel, I.; Duda, R.; Hart, P.; and Wiley, J. 2022. Sobel-Feldman operator. Preprint at https://www.researchgate.net/profile/Irwin-Sobel/publication/285159837. Accessed, 20.
Song, X.; Li, Z.; Chen, S.; Cai, X.-Q.; and Demachi, K. 2024. An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video. arXiv preprint arXiv:2404.06741.
Sudhakaran, G.; Dhami, D. S.; Kersting, K.; and Roth, S. 2023. Vision relation transformer for unbiased scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21882–21893.
Wald, J.; Dhamo, H.; Navab, N.; and Tombari, F. 2020. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961–3970.
Wang, G.; Li, Z.; Chen, Q.; and Liu, Y. 2024a. OED: Towards One-stage End-to-End Dynamic Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 27938–27947.
Wang, W.; Luo, Y.; Chen, Z.; Jiang, T.; Yang, Y.; and Xiao, J. 2023. Taking a closer look at visual relation: Unbiased video scene graph generation with decoupled label learning. IEEE Transactions on Multimedia.
Wang, Y.; Yuan, S.; Jian, X.; Pang, W.; Wang, M.; and Yu, N. 2024b. HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models. arXiv preprint arXiv:2404.05083.
Wang, Z.; Zhao, H.; Li, Y.-L.; Wang, S.; Torr, P.; and Bertinetto, L. 2021. Do different tracking tasks require different appearance models? Advances in Neural Information Processing Systems, 34: 726–738.
Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. arXiv preprint arXiv:2405.17957.
Wu, Y.; Shi, M.; Du, S.; Lu, H.; Cao, Z.; and Zhong, W. 2022. 3d instances as 1d kernels. In European Conference on Computer Vision, 235–252. Springer.
Xiao, F.; Tighe, J.; and Modolo, D. 2021. Modist: Motion distillation for self-supervised video representation learning. arXiv preprint arXiv:2106.09703, 3.
Yang, J.; Ang, Y. Z.; Guo, Z.; Zhou, K.; Zhang, W.; and Liu, Z. 2022. Panoptic scene graph generation. In European Conference on Computer Vision, 178–196. Springer.
Yang, J.; Cen, J.; Peng, W.; Liu, S.; Hong, F.; Li, X.; Zhou, K.; Chen, Q.; and Liu, Z. 2024. 4d panoptic scene graph generation. Advances in Neural Information Processing Systems, 36.
Yang, J.; Peng, W.; Li, X.; Guo, Z.; Chen, L.; Li, B.; Ma, Z.; Zhou, K.; Zhang, W.; Loy, C. C.; et al. 2023. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18675–18685.
Zhao, C.; Shen, Y.; Chen, Z.; Ding, M.; and Gan, C. 2023. Textpsg: Panoptic scene graph generation from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2839–2850.
Zhou, H.; Liu, Q.; and Wang, Y. 2023. Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10608–10617.
Zhou, L.; Zhou, Y.; Lam, T. L.; and Xu, Y. 2022. Context-aware mixture-of-experts for unbiased scene graph generation. arXiv preprint arXiv:2208.07109.