
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition

and Surface Reconstruction for Urban Driving Scenes

Chensheng Peng∗ Chengwei Zhang∗ Yixiao Wang Chenfeng Xu Yichen Xie


Wenzhao Zheng Kurt Keutzer Masayoshi Tomizuka Wei Zhan
UC Berkeley

Figure 1. DeSiRe-GS. We present a 4D street gaussian splatting representation for self-supervised static-dynamic decomposition and
high-fidelity surface reconstruction without the requirement for extra 3D annotations such as bounding boxes.

Abstract

We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline of dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting can inherently reconstruct only the static regions in dynamic environments. These extracted 2D motion priors are then mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians in the second stage. Combined with the introduced geometric regularizations, our method is able to address the over-fitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in the air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, surpassing prior self-supervised methods and achieving accuracy comparable to methods relying on external 3D bounding box annotations. Code is available at https://github.com/chengweialan/DeSiRe-GS

∗ Equal Contribution.

1. Introduction

Modeling driving scenes [11, 28] is essential for autonomous driving applications, as it facilitates real-world simulation and supports scene understanding [46]. An effective scene representation enables a system to efficiently perceive and reconstruct dynamic driving environments. Recent 3D Gaussian Splatting (3DGS) [16] has emerged as a prominent 3D representation that can be optimized through 2D supervision. It has gained popularity due to its explicit nature, high efficiency, and rendering speed.

While 3D Gaussian Splatting (3DGS) has demonstrated strong performance in static object-centric reconstructions, the original 3DGS struggles to handle dynamic objects in unbounded street views, which are common in real-world scenarios, particularly for autonomous driving applications. It is unable to effectively model dynamic regions, leading to blurring artifacts due to the Gaussian model's time-independent parameterization. As a result, 4D-GS [35] was proposed, modeling the dynamics with a Hexplane encoder. The Hexplane [1] works well on object-level datasets, but struggles with driving scenes because of the unbounded areas in outdoor environments. Instead, we choose to reformulate the original static Gaussian model as time-dependent variables with minor changes, ensuring the efficiency of handling large-scale driving scenes.

In this paper, we present DeSiRe-GS, a purely Gaussian Splatting-based representation, which facilitates self-supervised static-dynamic decomposition and high-quality surface reconstruction in driving scenarios. For static-dynamic decomposition, existing methods such as DrivingGaussian [47] and Street Gaussians [38] rely on explicit 3D bounding boxes, which significantly simplifies the decomposition problem, since dynamic Gaussians in a moving bounding box can be simply removed. Without the 3D annotations, some recent self-supervised methods like PVG [5] and S3Gaussian [15] have attempted to achieve decomposition but fall short in performance, as they treat all Gaussians as dynamic, relying on indirect supervision to learn motion patterns. However, our proposed method can achieve effective self-supervised decomposition, based on a simple observation that dynamic regions reconstructed from 3DGS are blurry—quite different from the ground truth images. Despite the absence of 3D annotations, DeSiRe-GS produces results comparable to, or better than, approaches that use explicit bounding boxes for decomposition.

Another challenge in applying 3D Gaussian Splatting (3DGS) to autonomous driving is the sparse nature of images, which is more pronounced compared to object-centric reconstruction tasks. This sparsity often leads 3DGS to overfit on the limited number of observations, resulting in inaccurate geometry learning. Inspired by 2D Gaussian Splatting (2DGS) [14], we aim to generate flatter, disk-shaped Gaussians to better align with the surfaces of objects like roads and walls. We also couple the normal and scale of each Gaussian, which can be optimized jointly to improve surface reconstruction quality.

To further address the overfitting issue, we propose temporal geometrical cross-view consistency, which significantly enhances the model's geometric awareness and accuracy by aggregating information from different views across time. These strategies allow us to achieve state-of-the-art reconstruction quality, surpassing other Gaussian splatting approaches in the field of autonomous driving.

Overall, DeSiRe-GS makes the following contributions:
• We propose to extract motion information easily from appearance differences, based on a simple observation that 3DGS cannot successfully model the dynamic regions.
• We then distill the extracted 2D motion priors in local frames into the global Gaussian space, using time-varying Gaussians in a differentiable manner.
• We introduce effective 3D regularizations and temporal cross-view consistency to generate physically reasonable Gaussian ellipsoids, further enhancing high-quality decomposition and reconstruction.
We demonstrate DeSiRe-GS's capability of effective static-dynamic decomposition and high-fidelity surface reconstruction across various challenging datasets [11, 28].

2. Related Work

Urban Scene Reconstruction. Recent advancements in novel view synthesis, such as Neural Radiance Fields (NeRF) [21] and 3D Gaussian Splatting (3DGS) [16], have significantly advanced urban scene reconstruction. Many studies [22, 24, 27, 29, 30, 36, 39] have integrated NeRF into workflows for autonomous driving. Urban Radiance Fields [27] combines lidar and RGB data, while Block-NeRF [29] and Mega-NeRF [30] partition large scenes for parallel training. However, dynamic environments pose challenges. NSG [24] uses neural scene graphs to decompose dynamic scenes, and SUDS [31] introduces a multi-branch hash table for 4D scene representation. Self-supervised approaches like EmerNeRF [39] and RoDUS [22] can effectively address dynamic scene challenges. EmerNeRF captures object correspondences via scene flow estimation, and RoDUS utilizes a robust kernel-based training strategy combined with semantic supervision.
In 3DGS-based urban reconstruction, recent works [5, 6, 15, 38, 46, 47] have gained attention. StreetGaussians [38] models static and dynamic scenes separately using spherical harmonics, while DrivingGaussian [47] introduces specific modules for static background and dynamic object reconstruction. OmniRe [6] unifies static and dynamic object reconstruction via dynamic Gaussian scene graphs. However, [6, 38, 46] all require additional 3D bounding boxes, which are sometimes difficult to obtain.

Figure 2. Pipeline of DeSiRe-GS, designed to tackle the challenges in self-supervised street scene decomposition. The entire pipeline is optimized without extra annotations in a self-supervised manner, leading to superior scene decomposition ability and rendering quality.

Static Dynamic Decomposition. Several approaches seek to model the deformation of dynamic and static components. D-NeRF [26], Nerfies [25], Deformable GS [40] and 4D-GS [35] extend the vanilla NeRF or 3DGS by incorporating a deformation field. They compute the canonical-to-observation transformation and separate static and dynamic components through the deformation network. However, applying such methods to large-scale driving scenarios is challenging due to the substantial computational resources needed for learning dense deformation parameters, and the inaccurate decomposition leads to suboptimal performance. For autonomous driving scenarios, NSG [24] models dynamic and static parts as nodes in neural scene graphs but requires additional 3D annotations. Other NeRF-based methods [22, 31, 39] leverage a multi-branch structure to train time-dependent and time-invariant features separately. 3DGS-based methods, such as [5, 15, 38, 47], also focus on static-dynamic separation but still face limitations. [15] utilizes a deformation network with a hexplane temporal-spatial encoder, requiring extensive computation. PVG [5] assigns attributes like velocity and lifespan to each Gaussian, distinguishing static from dynamic ones. Yet, the separation remains incomplete and lacks thoroughness.

Neural Surface Reconstruction. Traditional methods for neural surface reconstruction focus more on real geometry structures. With the rise of neural radiance field (NeRF) technologies, neural implicit representations have shown promise for high-fidelity surface reconstruction. Approaches like [19, 33, 41, 43] train neural signed distance functions (SDF) to represent scenes. StreetSurf [13] proposes disentangling close and distant views for better implicit surface reconstruction in urban settings, while [27] steps further using sparse lidar to enhance depth details.
3D GS has renewed interest in explicit geometric reconstruction, with recent works [2, 3, 9, 12, 14, 32, 42] focusing on geometric regularization techniques. SuGaR [12] aligns Gaussian ellipsoids to object surfaces by introducing an additional regularization term, while 2DGS [14] directly replaces 3D ellipsoids with 2D discs and utilizes the truncated signed distance function (TSDF) to fuse depth maps, enabling noise-free surface reconstruction. PGSR [2] introduces single- and multi-view regularization for multi-view consistency. GSDF [42] and NeuSG [3] combine 3D Gaussians with neural implicit SDFs to enhance surface details. TrimGS [9] refines surface structures by trimming inaccurate geometry, maintaining compatibility with earlier methods like 3DGS and 2DGS. While these approaches excel in small-scale reconstruction, newer works like [4, 7, 10] aim to address large-scale urban scenes. [4] adopts a large-scene partitioning strategy for reconstruction, while RoGS [10] proposes a 2D Gaussian surfel representation that aligns with the physical characteristics of road surfaces.
3. Preliminary

3D Gaussian Splatting: 3D Gaussian Splatting (3DGS) [16] employs a collection of colored ellipsoids, G = {g}, to explicitly represent 3D scenes. Each Gaussian g = {µ, s, r, o, c} is defined by the following learnable attributes: a position center µ ∈ R^3, a covariance matrix Σ ∈ R^{3×3}, an opacity scalar o, and a color vector c, which is modeled using spherical harmonics. The distribution of 3D Gaussians is mathematically described as:

G(x) = exp(−½ (x − µ)⊤ Σ⁻¹ (x − µ)). (1)

The covariance matrix Σ can be formulated as Σ = RSS⊤R⊤, where S denotes a diagonal scaling matrix and R is a rotation matrix, parameterized as a scaling vector s and a quaternion r ∈ R^4, respectively.
To generate images from a specific viewpoint, the 3D Gaussian ellipsoids are projected onto the 2D image plane to form 2D ellipses for rendering. For each pixel, a sequence of Gaussians N is sorted in ascending order based on depth. The color is then rendered through alpha blending:

C = Σ_{i∈N} c_i α_i ∏_{j=1}^{i−1} (1 − α_j), (2)

where α_i and c_i denote the density and color of the i-th Gaussian, respectively, derived from the learned opacity and SH coefficients of the corresponding Gaussian.
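To make the compositing in Eq. 2 concrete, here is a minimal PyTorch sketch of front-to-back alpha blending for a single pixel; it illustrates the formula rather than the official 3DGS rasterizer, and it assumes the per-Gaussian opacities and colors have already been evaluated and depth-sorted.

```python
import torch

def alpha_blend(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing for one pixel (Eq. 2).

    colors: (N, 3) colors c_i of the depth-sorted Gaussians covering the pixel.
    alphas: (N,)  effective opacities alpha_i of those Gaussians.
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), via an exclusive cumulative product.
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0
    )
    weights = alphas * transmittance                     # w_i = alpha_i * T_i
    return (weights.unsqueeze(-1) * colors).sum(dim=0)   # composited pixel color
```

The same weights can accumulate any per-Gaussian attribute, which is how depth, normal, and velocity maps are rendered later in Eq. 20.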
Periodic Vibration Gaussian (PVG): PVG [5] reshapes the original Gaussian model by introducing time-dependent adjustments to the position mean µ and the opacity o. The modified model is represented as follows:

µ̃(t) = µ + (l / 2π) · sin(2π(t − τ) / l) · v, (3)

õ(t) = o · exp(−½ (t − τ)² β⁻²), (4)

where µ̃(t) denotes the vibrating position centered at µ, occurring around the life peak τ, and õ(t) represents the time-dependent opacity, which undergoes exponential decay as time deviates from the life peak τ. β and v determine the decay rate and the instant velocity at the life peak τ, respectively, and are both learnable parameters. l, as a pre-defined parameter of the scene, represents the oscillation period. Thus, the PVG model is expressed as:

G(t) = {µ̃(t), s, r, õ(t), c, τ, β, v}. (5)

We adopt PVG as the dynamic representation for autonomous driving scenes, because the PVG model preserves the structure of the original 3DGS model at any given time t, enabling it to be rendered using the standard 3DGS pipeline to reconstruct the dynamic scene. For further details about PVG, we refer the readers to [5].
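The following sketch illustrates how the time-dependent quantities in Eqs. 3–5 can be evaluated; it reflects our reading of the PVG formulation above, not the released PVG code, and the tensor shapes are assumptions.

```python
import math
import torch

def pvg_attributes(mu, v, o, tau, beta, t, l):
    """Time-dependent position and opacity of Periodic Vibration Gaussians (Eqs. 3-5).

    mu: (N, 3) positions at the life peak; v: (N, 3) instant velocities.
    o:  (N,)   base opacities; tau: (N,) life peaks; beta: (N,) decay rates.
    t:  scalar timestamp; l: pre-defined oscillation period of the scene.
    """
    phase = 2.0 * math.pi * (t - tau) / l                                   # 2*pi*(t - tau)/l
    mu_t = mu + (l / (2.0 * math.pi)) * torch.sin(phase).unsqueeze(-1) * v  # Eq. 3
    o_t = o * torch.exp(-0.5 * (t - tau) ** 2 / beta ** 2)                  # Eq. 4
    # The remaining attributes {s, r, c} stay time-invariant (Eq. 5), so the
    # standard 3DGS rasterizer can be reused at every timestamp t.
    return mu_t, o_t
```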
4. DeSiRe-GS

As shown in Fig. 2, the training process is divided into two stages. We first extract 2D motion masks by calculating the feature difference between the rendered image and the GT image. In the second stage, we distill the 2D motion information into the Gaussian space using PVG [5], enabling the rectification of inaccurate attributes for each Gaussian in a differentiable manner.

4.1. Dynamic Mask Extraction (stage I)

During the first stage, we observe that 3D Gaussian Splatting (3DGS) performs effectively in reconstructing static elements, such as parked cars and buildings in a driving scene. However, it struggles to accurately reconstruct dynamic regions, as the original 3DGS does not incorporate temporal information. This limitation results in artifacts such as ghost-like floating points in the rendered images, as illustrated in Fig. 2 (stage 1). To address this issue, we leverage the significant differences between static and dynamic regions to develop an efficient method for extracting segmentation masks that encode motion information.
Initially, a pretrained foundation model is employed to extract features from both the rendered image and the ground truth (GT) image used for supervision. Let F̂ denote the features extracted from the rendered image Î, and F represent the features extracted from the GT image I. To distinguish dynamic and static regions, we compute the per-pixel dissimilarity D between the corresponding features. The dissimilarity metric D approaches 0 for similar features, indicating static regions, and nears 1 for dissimilar features, corresponding to dynamic regions.

D = (1 − cos(F̂, F)) / 2. (6)

As the pretrained model is frozen, the resulting dissimilarity score D ∈ R^{H×W} is computed without involving any learnable parameters. Rather than applying a simple threshold to D to generate a motion segmentation mask, we propose a multi-layer perceptron (MLP) decoder to predict the dynamicness δ ∈ R^{H×W}. This decoder leverages the extracted features, which contain rich semantic information, while the dissimilarity score is employed to guide and optimize the learning process of the decoder.

Ldyn = δ ⊙ D, (7)

where ⊙ refers to element-wise multiplication.
By employing the loss function Ldyn defined in Eq. 7, the decoder is optimized to predict lower values in regions where D is high, corresponding to dynamic regions, thereby minimizing the loss. We can then obtain the binary mask encoding motion information (ε is a fixed threshold):

M = I(δ > ε). (8)
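A minimal sketch of the stage-I mask extraction in Eqs. 6–8 follows. The feature extractor, the decoder architecture, which features feed the decoder, the loss reduction, and the threshold value are placeholders, not the released implementation.

```python
import torch
import torch.nn.functional as F

def stage1_mask(feat_render, feat_gt, decoder, eps=0.5):
    """Motion-mask extraction of stage I (Eqs. 6-8).

    feat_render, feat_gt: (C, H, W) features of the rendered / GT image from a
    frozen foundation model; decoder: a small head predicting dynamicness delta.
    """
    # Eq. 6: per-pixel cosine dissimilarity, ~0 for static and ~1 for dynamic pixels.
    D = 0.5 * (1.0 - F.cosine_similarity(feat_render, feat_gt, dim=0))  # (H, W)
    delta = decoder(feat_render)                                        # (H, W) dynamicness
    loss_dyn = (delta * D).mean()                                       # Eq. 7
    # Eq. 8: binary mask; since delta is pushed low where D is high, M is ~1 on
    # static pixels and is later used (Eq. 9) to exclude dynamic regions from supervision.
    M = (delta > eps).float()
    return loss_dyn, M
```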

During training, the joint optimization of image rendering and mask prediction benefits both tasks. By excluding dynamic regions during supervision, the differences between rendered images and GT images become more noticeable, facilitating the extraction of motion masks.

Lmasked−render = M ⊙ ∥Î − I∥. (9)

4.2. Static Dynamic Decomposition (stage II)

While stage I provides effective dynamic masks, these masks are confined to the image space rather than the 3D Gaussian space and depend on ground truth images. This reliance limits their applicability in novel view synthesis, where supervised images may not be available.
To bridge the 2D motion information from stage I to the 3D Gaussian space, we adopt PVG, a unified representation for dynamic scenes (Section 3). However, PVG's reliance on image and sparse depth map supervision introduces challenges, as accurate motion patterns are difficult to learn from indirect supervision signals. Consequently, the rendered velocity map V ∈ R^{H×W}, as shown in Fig. 2 (stage 2), often contains noisy outliers. For example, static regions such as roads and buildings, where the velocity should be zero, are not handled effectively. This results in unsatisfactory scene decomposition, with PVG frequently misclassifying regions where zero velocity is expected.
To mitigate this issue and generate more accurate Gaussian representations, we incorporate the segmentation masks obtained from stage I to regularize the 2D velocity map V, which is rendered from the Gaussians in 3D space.

Lv = V ⊙ M. (10)

Minimizing Lv penalizes regions where the velocity should be zero, effectively eliminating noisy outliers produced by the original PVG. This process propagates motion information from the 2D local frame to the global Gaussian space. With the refined velocity v for each Gaussian, dynamic and static Gaussians can be distinguished by applying a simple threshold. This approach achieves superior self-supervised decomposition compared to PVG [5] and S3Gaussian [15], without requiring additional 3D annotations such as the bounding boxes used in previous methods [6, 38, 46].
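The velocity regularization of Eq. 10 and the subsequent static/dynamic split can be sketched as follows; the threshold value and the mean reduction are illustrative assumptions.

```python
import torch

def velocity_regularization(vel_map, mask):
    """Eq. 10: suppress rendered velocity where the stage-I mask marks pixels as static.

    vel_map: (H, W) rendered per-pixel speed V; mask: (H, W) stage-I mask M.
    """
    return (vel_map * mask).mean()

def split_static_dynamic(velocities, v_thresh=0.05):
    """Label each Gaussian by the norm of its refined velocity v.

    velocities: (N, 3); v_thresh is a hand-picked threshold, not a published value.
    """
    speed = velocities.norm(dim=-1)
    return speed <= v_thresh, speed > v_thresh  # boolean masks: static, dynamic
```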
4.3. Surface Reconstruction

4.3.1. Geometric Regularization

Flattening 3D Gaussians: Inspired by 2D Gaussian Splatting (2DGS) [14], we aim to flatten 3D ellipsoids into 2D disks, allowing the optimized Gaussians to better conform to object surfaces and enabling high-quality surface reconstruction. The scale s = (s1, s2, s3) of 3DGS defines the ellipsoid's size along three orthogonal axes. Minimizing the scale along the shortest axis effectively transforms 3D ellipsoids into 2D disks. The scaling regularization loss is:

Ls = ∥min(s1, s2, s3)∥. (11)

Figure 3. Gaussian Scale Regularization.

Normal Derivation: Surface normals are critical for surface reconstruction. Previous methods incorporate normals by appending a normal vector ni ∈ R^3 to each Gaussian, which is then used to render a normal map N ∈ R^{H×W}. The ground truth normal map is employed to supervise the optimization of the Gaussian normals. However, these approaches often fail to achieve accurate surface reconstruction, as they overlook the inherent relationship between the scale and the normal. Instead of appending a separate normal vector, we derive the normal n directly from the scale vector s. The normal direction naturally aligns with the axis corresponding to the smallest scale component, as the Gaussians are shaped like disks after the flattening regularization.

n = R · arg min(s1, s2, s3). (12)

With such a formulation of the normal, the gradient can be back-propagated to the scale vector, rather than an appended normal vector, thereby facilitating better optimization of the Gaussian parameters. The normal loss is:

Ln = ∥N − N̂∥2. (13)

Giant Gaussian Regularization: We observed that both 3DGS and PVG can produce oversized Gaussian ellipsoids without additional regularization, particularly in unbounded driving scenarios, as illustrated in Fig. 3 (a).
Our primary objective is to fit appropriately scaled Gaussians that support accurate image rendering and surface reconstruction. While oversized Gaussian ellipsoids with low opacity may have minimal impact on the rendered image, they can significantly impair surface reconstruction. This is a limitation often overlooked in existing methods focused solely on 2D image rendering. To address this issue, we introduce a penalty term for each Gaussian:

sg = max(s1, s2, s3); Lg = sg · I(sg > ϵ), (14)

where sg is the scale along the largest axis and ϵ is a predefined threshold for huge Gaussians.
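A sketch of the three geometric regularizers (Eqs. 11, 12, and 14); the reductions, the threshold ϵ, and the way the smallest-axis index is turned into a normal direction are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def geometric_regularizers(scales, rotations, eps=0.5):
    """Scale/normal regularizers of Eqs. 11, 12 and 14.

    scales:    (N, 3) per-Gaussian scale vectors s = (s1, s2, s3).
    rotations: (N, 3, 3) rotation matrices R built from the quaternions r.
    eps:       illustrative threshold for oversized Gaussians.
    """
    # Eq. 11: flatten ellipsoids into disks by shrinking the smallest axis.
    min_scale, min_idx = scales.min(dim=-1)
    loss_flat = min_scale.abs().mean()

    # Eq. 12: the normal is the rotated axis of the smallest scale component, so
    # gradients flow into the scale and rotation instead of a free normal vector.
    axis = F.one_hot(min_idx, num_classes=3).to(scales.dtype)        # (N, 3)
    normals = torch.bmm(rotations, axis.unsqueeze(-1)).squeeze(-1)   # (N, 3)

    # Eq. 14: penalize giant Gaussians whose largest extent exceeds eps.
    max_scale = scales.max(dim=-1).values
    loss_giant = (max_scale * (max_scale > eps).to(scales.dtype)).mean()

    return loss_flat, loss_giant, normals
```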

Figure 4. Cross-view consistency.

4.3.2. Temporal Spatial Consistency

In driving scenarios, the sparse nature of views often leads to overfitting to the training views during the optimization of Gaussians. Single-view image loss is particularly susceptible to challenges in texture-less regions at far distances. As a result, relying on photometric supervision from images and sparse depth maps is not reliable. To address this, we propose enhancing geometric consistency by leveraging temporal cross-view information.
Under the assumption that the depth of static regions remains consistent over time across varying views, we introduce a cross-view temporal-spatial consistency module. For a static pixel (ur, vr) in the reference frame with a depth value dr, we project it to the nearest neighboring view—the view with the largest overlap. Using the camera intrinsics K and extrinsics Tr, Tn, the corresponding pixel location in the neighboring view is calculated as:

[un, vn, 1]⊤ = K Tn Tr⁻¹ dr · K⁻¹ [ur, vr, 1]⊤. (15)

We then query the depth value dn at (un, vn) in the neighboring view. Projecting this back into 3D space, the resulting position should align with the position obtained by back-projecting (ur, vr, dr) to the reference frame:

[unr, vnr, 1]⊤ = K Tr Tn⁻¹ dn · K⁻¹ [un, vn, 1]⊤. (16)

To enforce cross-view depth consistency, we apply a geometric loss to optimize the Gaussians, defined as:

Luv = ∥(ur, vr) − (unr, vnr)∥2. (17)

This loss encourages the Gaussians to produce geometrically consistent depth across views over time.
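The round-trip projection of Eqs. 15–17 can be sketched as follows, assuming world-to-camera extrinsics as in Eq. 15 and a simple nearest-neighbor depth lookup in the neighboring view (the actual implementation may sample the depth map differently).

```python
import torch

def cross_view_loss(uv_r, d_r, depth_n, K, T_r, T_n):
    """Round-trip reprojection of Eqs. 15-17 for a batch of static reference pixels.

    uv_r: (P, 2) pixels (u_r, v_r); d_r: (P,) their rendered depths.
    depth_n: (H, W) rendered depth of the neighboring view.
    K: (3, 3) intrinsics; T_r, T_n: (4, 4) world-to-camera extrinsics (as in Eq. 15).
    """
    def warp(uv, d, T_src, T_dst):
        pix = torch.cat([uv, torch.ones_like(uv[:, :1])], dim=-1) * d.unsqueeze(-1)
        cam = torch.linalg.inv(K) @ pix.T                            # back-project to src camera
        cam_h = torch.cat([cam, torch.ones_like(cam[:1])], dim=0)    # homogeneous coordinates
        dst = T_dst @ torch.linalg.inv(T_src) @ cam_h                # src camera -> dst camera
        img = K @ dst[:3]
        return (img[:2] / img[2:3]).T, img[2]                        # (P, 2) pixels, (P,) depths

    uv_n, _ = warp(uv_r, d_r, T_r, T_n)                              # Eq. 15
    cols = uv_n[:, 0].round().long().clamp(0, depth_n.shape[1] - 1)  # nearest-neighbor lookup
    rows = uv_n[:, 1].round().long().clamp(0, depth_n.shape[0] - 1)
    d_n = depth_n[rows, cols]
    uv_nr, _ = warp(uv_n, d_n, T_n, T_r)                             # Eq. 16
    return (uv_r - uv_nr).norm(dim=-1).mean()                        # Eq. 17
```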
4.4. Optimization

Stage I: During Stage I, our objective is to leverage the joint optimization of motion masks and rendered images to effectively learn the motion masks. Therefore, we only use the masked image loss LI,

LI = (1 − λssim) ∥I − Ĩ∥1 + λssim SSIM(I, Ĩ). (18)

Combined with the motion loss from Eq. 7, we can get:

Lstage1 = M ⊙ LI + Ldyn. (19)
Stage II: We use alpha blending to render the depth map, normal map and velocity map as follows:

{D, N, V} = Σ_{i∈N} α_i ∏_{j=1}^{i−1} (1 − α_j) · {d_i, n_i, v_i}. (20)

For stage II, we use the projected sparse depth map Dgt from LiDAR as the supervision label.

LD = ∥D − Dgt∥1. (21)

Together with the static velocity regularization (Eq. 10), the flattening regularization (Eq. 11), normal supervision (Eq. 13), giant Gaussian regularization (Eq. 14), the geometric consistency loss (Eq. 17), etc., the loss for stage II is:

Lstage2 = LI + LD + Ln + Lv + Ls + Lg + Luv. (22)
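Eq. 20 reuses the alpha-blending weights of Eq. 2 to accumulate per-Gaussian attributes instead of colors; a per-pixel sketch follows (per-term loss weights for Eq. 22 are not listed here, so the sketch simply sums the terms).

```python
def blend_attributes(weights, depths, normals, speeds):
    """Eq. 20: accumulate per-Gaussian attributes for one pixel with the same
    weights w_i = alpha_i * prod_{j<i}(1 - alpha_j) used for color in Eq. 2.

    weights: (N,); depths: (N,); normals: (N, 3); speeds: (N,).
    """
    D = (weights * depths).sum(0)
    N = (weights.unsqueeze(-1) * normals).sum(0)
    V = (weights * speeds).sum(0)
    return D, N, V

def stage2_loss(L_I, L_D, L_n, L_v, L_s, L_g, L_uv):
    """Eq. 22: the stage-II objective, written here as an unweighted sum."""
    return L_I + L_D + L_n + L_v + L_s + L_g + L_uv
```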
5. Experiments

5.1. Experimental Setups

Datasets. We conduct our experiments on the Waymo Open Dataset [28] and the KITTI Dataset [11], both consisting of real-world autonomous driving scenarios. For the Waymo Open Dataset, we use the subset from PVG [5]. For a more complete comparison with non-self-supervised methods, we also conduct experiments on the subset provided by OmniRe [6], which contains a large number of highly dynamic scenes. We use the frontal three cameras (FRONT LEFT, FRONT, FRONT RIGHT) for the Waymo Open Dataset, and the left and right cameras for the KITTI dataset.

Evaluation Metrics. We adopt PSNR, SSIM [34] and LPIPS [45] as metrics for the evaluation of image reconstruction and novel view synthesis. Following [15, 38, 39], we also include DPSNR and DSSIM to assess the rendering quality at dynamic regions. Additionally, we introduce depth L1, which measures the L1 error between the rendered depth map and the ground truth depth map obtained from LiDAR point clouds, as an evaluation metric for the quality of geometric reconstruction.

Baselines. We benchmark DeSiRe-GS against the following approaches: 3DGS [16], StreetSurf [13], Mars [36], SUDS [31], EmerNeRF [39], S3Gaussian [15], PVG [5], OmniRe [6], StreetGS [38] and HUGS [46]. Among these methods, SUDS and EmerNeRF are NeRF-based self-supervised approaches. S3Gaussian and PVG are both 3DGS-based self-supervised methods, the closest to our approach. To further highlight the superiority of DeSiRe-GS, we also compare it with OmniRe, StreetGS, and HUGS, all of which require additional bounding box information.
Method | Waymo Open Dataset: Image reconstruction (PSNR ↑ SSIM ↑ LPIPS ↓), Novel view synthesis (PSNR ↑ SSIM ↑ LPIPS ↓), FPS | KITTI: Image reconstruction (PSNR ↑ SSIM ↑ LPIPS ↓), Novel view synthesis (PSNR ↑ SSIM ↑ LPIPS ↓), FPS
S-NeRF [37] 19.67 0.528 0.387 19.22 0.515 0.400 0.0014 19.23 0.664 0.193 18.71 0.606 0.352 0.0075
StreetSurf [13] 26.70 0.846 0.3717 23.78 0.822 0.401 0.097 24.14 0.819 0.257 22.48 0.763 0.304 0.37
3DGS [16] 27.99 0.866 0.293 25.08 0.822 0.319 63 21.02 0.811 0.202 19.54 0.776 0.224 125
NSG [24] 24.08 0.656 0.441 21.01 0.571 0.487 0.032 19.19 0.683 0.189 17.78 0.645 0.312 0.19
Mars [36] 21.81 0.681 0.430 20.69 0.636 0.453 0.030 27.96 0.900 0.185 24.31 0.845 0.160 0.31
SUDS [31] 28.83 0.805 0.317 25.36 0.783 0.384 0.008 28.83 0.917 0.147 26.07 0.797 0.131 0.29
EmerNeRF [39] 28.11 0.786 0.373 25.92 0.763 0.384 0.053 26.95 0.828 0.218 25.24 0.801 0.237 0.28
PVG [5] 32.46 0.910 0.229 28.11 0.849 0.279 50 32.83 0.937 0.070 27.43 0.896 0.114 59
Ours 33.61 0.919 0.204 29.75 0.878 0.213 36 33.94 0.949 0.04 28.87 0.901 0.106 41

Table 1. Comparison of methods on the Waymo Open Dataset and KITTI dataset. FPS refers to frames per second.

Figure 5. Qualitative comparison with self-supervised S3Gaussian [15] and PVG [5]

Methods Box PSNR (reconst) ↑ PSNR (nvs) ↑
EmerNeRF [39] 31.93 29.67
3DGS [39] 26.00 25.57
DeformGS [40] 28.40 27.72
PVG [5] 32.37 30.19
HUGS [46] ✓ 28.26 27.65
StreetGS [38] ✓ 29.08 28.54
OmniRe [6] ✓ 34.25 32.57
Ours 33.82 31.49

Table 2. Comparison of rendering quality against recent SOTA methods with or without 3D bbox annotations. ‘reconst’ refers to reconstruction and ‘nvs’ refers to novel view synthesis.

Implementation Details. All experiments are conducted on NVIDIA RTX A6000. We sample a total of 1 × 10^6 points for initialization, with 6 × 10^5 from the LiDAR point cloud and 4 × 10^5 randomly sampled points. In the first stage, we train for a total of 30,000 iterations. We start to train the motion decoder after 6,000 iterations. For the second stage, we train the model for 50,000 iterations. Multi-view temporal consistency regularization begins after 20,000 iterations. The motion masks, obtained from stage I, are employed after 30,000 iterations to supervise the optimization of velocity v. We use Adam [17] as our optimizer with β1 = 0.9, β2 = 0.999.

5.2. Quantitative Results

Following PVG [5], we evaluate our method on two tasks: image reconstruction and novel view synthesis, using the Waymo Open Dataset [28] and the KITTI dataset [11]. As shown in Tab. 1, our approach achieves state-of-the-art performance across all rendering metrics for both reconstruction and synthesis tasks. In terms of rendering speed, our method reaches approximately 40 FPS. While slightly slower than the 3DGS [16] and PVG [5] baselines due to rendering additional attributes such as normals, our approach delivers over a 1.1 PSNR improvement.

Setting PSNR ↑ SSIM ↑ LPIPS ↓ DPSNR ↑ DSSIM ↑ Depth L1↓
(a) w/o stage I motion mask 34.7063 0.9570 0.1098 34.7183 0.9570 0.1017
(b) w/o FiT3D model (w DINOv2) 34.9551 0.9559 0.1027 34.9734 0.9602 0.0977
(c) w/o gt normal supervision 35.4469 0.9625 0.0967 35.4876 0.9626 0.0913
(d) w/o gt normal (w normal from depth) 35.2357 0.9509 0.1436 35.5312 0.9512 0.0847
(e) w/o min scale regularization 35.2863 0.9616 0.0989 35.3275 0.9617 0.0935
(f) w/o max scale regularization 35.6911 0.9622 0.0970 35.7306 0.9623 0.0802
(g) w/o multi-view consistency 35.3325 0.9618 0.0983 35.3731 0.9619 0.1154
Full model 35.7598 0.9631 0.0956 35.7820 0.9632 0.0713

Table 3. Ablation studies.

In addition to comparisons with self-supervised methods, we evaluate against approaches that rely on 3D annotations. The results, detailed in Table 2, show that our method achieves comparable, if not superior, performance to baselines like HUGS [46] and StreetGS [38], while outperforming self-supervised baselines [5, 39].

5.3. Qualitative Analysis

We provide visualizations of static-dynamic decomposition and depth prediction in Fig. 5. S3Gaussian [15] fails to generate satisfactory depth maps or achieve effective static-dynamic decomposition. Similarly, PVG [5] produces only blurry and suboptimal decomposition results.

5.4. Ablation Studies

To verify the effectiveness of our proposed components, we conduct ablation studies on the Waymo Open Dataset. The results are listed in Tab. 3.

Motion mask. We train our model from scratch at stage II without using the motion masks obtained from stage I for regularization. As shown in Tab. 3 (a), the motion masks from Stage I improve both the rendering and reconstruction quality by a large margin.
We also ablate on different foundation models as the feature extractor. As shown in Tab. 3 (b), the FiT3D model [44] outperforms DINOv2, producing much better results.

Normal Supervision. For normal supervision, we explored two approaches. The first approach utilizes powerful pretrained models, such as OmniData [8], to predict pseudo-normal maps N̂ from the input monocular images. The second approach employs the depth gradient as the pseudo-normal map N̂D for supervision [20].
As shown in Tab. 3 (c)(d), we found that pseudo-normals predicted from Omnidata produce the best overall results. While using N̂D slightly improves depth accuracy, it leads to suboptimal rendering quality. We attribute this to the reliance on sparse depth maps projected from LiDAR point clouds for supervision. Although our rendered depth maps and the corresponding normal maps (derived from depth gradients) are dense, the supervision remains incomplete due to the inherent sparsity of the LiDAR points.
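For reference, a common way to build the depth-gradient pseudo-normal N̂D mentioned above is to back-project the depth map into a point map and take the cross product of its finite differences; this is a generic sketch with an assumed pinhole intrinsic matrix, not necessarily the exact recipe of [20].

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth, K):
    """Pseudo-normal map from a (dense) rendered depth map.

    depth: (H, W) depth values; K: (3, 3) pinhole intrinsics.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u.float(), v.float(), torch.ones(H, W)], dim=-1)          # (H, W, 3)
    points = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T).T.reshape(H, W, 3)
    points = points * depth.unsqueeze(-1)                                        # back-projected point map

    # Central differences of the point map along x and y, then their cross product.
    dx = points[:, 2:, :] - points[:, :-2, :]       # (H, W-2, 3)
    dy = points[2:, :, :] - points[:-2, :, :]       # (H-2, W, 3)
    n = torch.cross(dx[1:-1], dy[:, 1:-1], dim=-1)  # (H-2, W-2, 3)
    n = F.normalize(n, dim=-1)
    # Zero-pad the one-pixel border back to (H, W, 3).
    return F.pad(n.permute(2, 0, 1), (1, 1, 1, 1)).permute(1, 2, 0)
```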
Scale Regularization. We also impose constraints on the size of the Gaussians, ensuring their scale remains within a reasonable range. As shown in Tab. 3 (e)(f), the improvements in rendering metrics are not particularly significant. We attribute this to the strong overfitting capability of Gaussians. Despite the presence of some oversized Gaussian outliers, the final rendering results remain visually satisfactory. However, as illustrated in Figure 3, the Gaussians produced with regularization exhibit improved 3D structure, enabling more effective decomposition.

Figure 6. Multi-view consistency depth (better viewed zoomed in).

Cross-view Consistency. As shown in Tab. 3 (g), the proposed cross-view consistency significantly enhances the geometric metrics. Fig. 6 demonstrates that, without our method, Gaussians tend to overfit to textured areas in the image, such as the slogan and white line, resulting in unexpectedly large depth variations. With our multi-view consistency module, this overfitting issue is effectively mitigated.

6. Conclusion

In this paper, we propose DeSiRe-GS, a self-supervised approach for static-dynamic decomposition and high-quality surface reconstruction in driving scenes. By introducing a motion mask module and leveraging temporal geometrical consistency, DeSiRe-GS addresses key challenges such as dynamic object modeling and data sparsity.

References

[1] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023. 2
[2] Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction. arXiv preprint arXiv:2406.06521, 2024. 3
[3] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2023. 3
[4] Junyi Chen, Weicai Ye, Yifan Wang, Danpeng Chen, Di Huang, Wanli Ouyang, Guofeng Zhang, Yu Qiao, and Tong He. Gigags: Scaling up planar-based 3d gaussians for large scene surface reconstruction. arXiv preprint arXiv:2409.06685, 2024. 3
[5] Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and Li Zhang. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering. arXiv preprint arXiv:2311.18561, 2023. 2, 3, 4, 5, 6, 7, 8, 1
[6] Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction. arXiv preprint arXiv:2408.16760, 2024. 2, 5, 6, 7, 4
[7] Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen. Gaussianpro: 3d gaussian splatting with progressive propagation. In Forty-first International Conference on Machine Learning, 2024. 3
[8] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021. 8
[9] Lue Fan, Yuxue Yang, Minxing Li, Hongsheng Li, and Zhaoxiang Zhang. Trim 3d gaussian splatting for accurate geometry representation, 2024. 3
[10] Zhiheng Feng, Wenhua Wu, and Hesheng Wang. Rogs: Large scale road surface reconstruction based on 2d gaussian splatting, 2024. 3
[11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2, 6, 7, 4
[12] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5354–5363, 2024. 3
[13] Jianfei Guo, Nianchen Deng, Xinyang Li, Yeqi Bai, Botian Shi, Chiyu Wang, Chenjing Ding, Dongliang Wang, and Yikang Li. Streetsurf: Extending multi-view implicit surface reconstruction to street views. arXiv preprint arXiv:2306.04988, 2023. 3, 6, 7, 4
[14] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 2, 3, 5
[15] Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. S3gaussian: Self-supervised street gaussians for autonomous driving. CoRR, 2024. 2, 3, 5, 6, 7, 8, 1, 4
[16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023. 2, 4, 6, 7, 5
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. 7, 1
[18] Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. WildGaussians: 3D gaussian splatting in the wild. arXiv, 2024. 1
[19] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction, 2023. 3
[20] Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, and Kui Jia. Gs-ir: 3d gaussian splatting for inverse rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21644–21653, 2024. 8
[21] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis, 2020. 2
[22] Thang-Anh-Quan Nguyen, Luis Roldão, Nathan Piasco, Moussab Bennehar, and Dzmitry Tsishkou. Rodus: Robust decomposition of static and dynamic elements in urban scenes. arXiv preprint arXiv:2403.09419, 2024. 2, 3
[23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 2
[24] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2856–2865, 2021. 2, 3, 7
[25] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields, 2021. 3
[26] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes, 2020. 3
[27] Konstantinos Rematas, Andrew Liu, Pratul P Srinivasan, Jonathan T Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. Urban radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12932–12942, 2022. 2, 3
[28] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020. 2, 6, 7, 4
[29] Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis, 2022. 2
[30] Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs, 2022. 2
[31] Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12375–12385, 2023. 2, 3, 6, 7, 4
[32] Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, and Juho Kannala. Dn-splatter: Depth and normal priors for gaussian splatting and meshing, 2024. 3
[33] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021. 3
[34] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 6, 1
[35] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering, 2024. 2, 3
[36] Zirui Wu, Tianyu Liu, Liyi Luo, Zhide Zhong, Jianteng Chen, Hongmin Xiao, Chao Hou, Haozhe Lou, Yuantao Chen, Runyi Yang, et al. Mars: An instance-aware, modular and realistic simulator for autonomous driving. In CAAI International Conference on Artificial Intelligence, pages 3–15. Springer, 2023. 2, 6, 7, 4, 5
[37] Ziyang Xie, Junge Zhang, Wenye Li, Feihu Zhang, and Li Zhang. S-nerf: Neural radiance fields for street views. arXiv preprint arXiv:2303.00749, 2023. 7
[38] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians for modeling dynamic urban scenes. arXiv preprint arXiv:2401.01339, 2024. 2, 3, 5, 6, 7, 8, 1, 4
[39] Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, et al. Emernerf: Emergent spatial-temporal scene decomposition via self-supervision. arXiv preprint arXiv:2311.02077, 2023. 2, 3, 6, 7, 8, 1, 4, 5
[40] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024. 3, 7
[41] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces, 2021. 3
[42] Mulin Yu, Tao Lu, Linning Xu, Lihan Jiang, Yuanbo Xiangli, and Bo Dai. Gsdf: 3dgs meets sdf for improved rendering and reconstruction, 2024. 3
[43] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022. 3
[44] Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2d feature representations by 3d-aware fine-tuning. In ECCV, 2024. 8, 1, 2
[45] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018. 6, 1
[46] Hongyu Zhou, Jiahao Shao, Lu Xu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger, and Yiyi Liao. Hugs: Holistic urban 3d scene understanding via gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21336–21345, 2024. 2, 5, 6, 7, 8, 3, 4
[47] Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21634–21643, 2024. 2, 3
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition
and Surface Reconstruction for Urban Driving Scenes
Supplementary Material

Figure 7. DeSiRe-GS. We present a 4D street gaussian splatting representation for self-supervised static-dynamic decomposition and
high-fidelity surface reconstruction without the requirement for extra 3D annotations such as bounding boxes.

7. Implementation Details

All experiments are conducted on NVIDIA RTX A6000. We sample a total of 1 × 10^6 points for initialization, with 6 × 10^5 from the LiDAR point cloud, 2 × 10^5 near points and 2 × 10^5 far points depending on their distance to the LiDAR origin. In the first stage, we train for a total of 30,000 iterations. We do not train the uncertainty model during the initial 6,000 iterations. After that, the uncertainty model gradually increases its weight over a 1,800-iteration warm-up process. In the second stage, we train for a total of 50,000 iterations. Cross-view consistency regularization begins after 20,000 iterations, with 102,400 pixels sampled each time. Motion masks, obtained from the trained dynamic model, are employed after 30,000 iterations to supervise the training of the velocity v and the time scale β. We use Adam [17] as our optimizer with β1 = 0.9, β2 = 0.999 and maintain optimization settings similar to those of [5]. For the dynamic model we employ a learning rate of 0.001 and a dropout rate of 0.1.

Evaluation Metrics. We adopt Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [34] and Learned Perceptual Image Patch Similarity (LPIPS) [45] as metrics for the assessment of both image reconstruction and novel view synthesis. Following [15, 38, 39], we also include DPSNR and DSSIM to assess the rendering quality of dynamic objects. Specifically, these values are calculated by projecting the 3D bounding boxes of moving objects onto the camera plane and computing the PSNR and SSIM within the projected boxes. Additionally, we introduce depth L1, which measures the L1 error between the rendered depth map and the ground truth depth map, as an evaluation metric related to surface reconstruction.
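A sketch of how the dynamic-object metric and the depth L1 metric described above can be computed; the mask construction (projecting the 3D boxes) and the value ranges are assumptions.

```python
import torch

def masked_psnr(img_pred, img_gt, mask):
    """PSNR restricted to a binary mask, e.g. the projected 3D boxes of moving
    objects (the DPSNR above). img_*: (H, W, 3) in [0, 1]; mask: (H, W)."""
    diff2 = (img_pred - img_gt) ** 2 * mask.unsqueeze(-1)
    mse = diff2.sum() / (mask.sum() * img_pred.shape[-1] + 1e-8)
    return -10.0 * torch.log10(mse + 1e-8)

def depth_l1(depth_pred, depth_gt):
    """Depth L1 on pixels that actually carry a ground-truth depth value
    (the projected LiDAR map is sparse)."""
    valid = depth_gt > 0
    return (depth_pred[valid] - depth_gt[valid]).abs().mean()
```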
Feature Extractor. DINOv2 [23] is widely used as a foundation model for feature extraction and has demonstrated potential in previous novel view synthesis works, such as WildGaussians [18]. However, we found through experiments that the features extracted from DINOv2 are usually noisy, especially on the road and in the sky, as shown in Fig. 8. The DINOv2 features sometimes cannot produce accurate motion masks. On the other hand, FiT3D [44] fine-tunes DINOv2 with Gaussian splatting to improve the 3D awareness of the extracted features, which is a perfect match for our setting in the driving world. Therefore, we turn to FiT3D as the feature extractor, producing cleaner and more robust features to measure the similarity of two images.
Figure 8. Comparison between features extracted from DINOv2 [23] and FiT3D [44].

Figure 9. Segmentation Mask Extraction of DeSiRe-GS. We utilize the differences between the rendered image and ground truth to train a
dynamic mask decoder.

Motion Mask Extractor. By introducing the learnable decoder, we are not limited to the viewpoints where GT images are available. Instead, given a rendered image, we can first extract the FiT3D features, and then use our decoder to extract the motion mask, without the requirement for ground truth images.
During training, the joint optimization of image rendering and mask prediction benefits both tasks, by using the obtained mask M to mask out the dynamic regions. The rendering loss is as follows:

Lmasked−render = M ⊙ ∥Î − I∥ (23)

As we mask out the dynamic regions, the reconstruction in those regions will not be supervised. As a result, the difference between the rendered images and the ground truth images will become more significant, which benefits the extraction of the desired motion masks.
We provide a few samples in Fig. 10. It can be observed that our model can handle dynamic objects well, even for far-away pedestrians.

Temporal Geometric Constraints. Due to the sparse nature of views in driving scenarios, Gaussian splatting tends to overfit to the training views during optimization. Single-view image loss often suffers from texture-less areas at far distances. As a result, relying on photometric consistency is not reliable. Instead, we propose to enhance geometric consistency by aggregating temporal information.
Based on the assumption that the depth of static regions remains consistent across time from varying views, we designate a cross-view temporal-spatial consistency module. For a static pixel (ur, vr) in the reference frame with a depth value of dr, we can project it to the nearest neighboring view, which has the largest amount of overlap.

Figure 10. Extracted motion masks using FiT3D features

Given the camera intrinsics K and extrinsics Tr, Tn, we can obtain the corresponding pixel location in the neighboring view:

[un, vn, 1]⊤ = K Tn Tr⁻¹ dr · K⁻¹ [ur, vr, 1]⊤ (24)

Again, we can query the depth value dn at the position (un, vn). When we project it back to 3D space, the position should be consistent with the one obtained from back-projecting (ur, vr, dr) to the reference frame.

[unr, vnr, 1]⊤ = K Tr Tn⁻¹ dn · K⁻¹ [un, vn, 1]⊤ (25)

We apply a geometric loss to optimize the Gaussians to produce cross-view consistent depth as follows:

Luv = ∥(ur, vr) − (unr, vnr)∥2 (26)

8. Baselines

• StreetSurf [13] is an implicit neural rendering method for both geometry and appearance reconstruction in street views. The whole scene is divided into close-range, distant-view and sky parts according to the distance of objects. A cuboid close-range hash grid and a hyper-cuboid distant-view model are employed to tackle the long and narrow observation space in most street scenes, showcasing good performance in unbounded scenes captured by long camera trajectories.
• NSG [24] enables efficient rendering of novel arrangements and views by encoding object transformations and radiance within a learnable scene graph representation. It contains a background node approximating all static parts, several dynamic nodes representing rigidly moving individuals, and edges representing transformations. NSG [24] also combines implicitly encoded scenes with a jointly learned latent representation to describe objects in a single implicit function.
• SUDS [31] is a NeRF-based method for dynamic large urban scene reconstruction. It proposes using 2D optical flow to model scene dynamics, avoiding additional bounding box annotations. SUDS develops a three-branch hash table representation for 4D scene representation, enabling a variety of downstream tasks.
• StreetGS [38] models dynamic driving scenes using 3D Gaussian splatting. It represents the components in the scene separately, with a background model for the static part and an object model for foreground moving objects. To capture dynamic features, the position and rotation of the Gaussians are defined in an object-local coordinate system, which relies on bounding boxes predicted by an off-the-shelf model.
• HUGS [46] is a 3DGS-based method addressing the problem of urban scene reconstruction and understanding. It assumes that the scene is composed of static regions and moving vehicles with rigid motions, using a unicycle model to model vehicles' states. HUGS also extends the original 3DGS to model additional modalities, including optical flow and semantic information, achieving good performance in both scene reconstruction and semantic reconstruction. Bounding boxes are also required in this process.
• EmerNeRF [39] is a NeRF-based method for constructing 4D neural scene representations in urban driving scenes. It decomposes dynamic scenes into a static field and a dynamic field, both parameterized by hash grids. Then an emergent scene flow field is introduced to represent explicit correspondences between moving objects and aggregate temporally-displaced features. Remarkably, EmerNeRF finishes these tasks all through self-supervision.

Figure 11. Qualitative Comparison.

• S3Gaussian [15] is a self-supervised approach that decomposes static and dynamic 3D Gaussians in driving scenes. It aggregates 4D Gaussian representations in a spatial-temporal field network with a multi-resolution hexplane encoder, where dynamic objects are visible only within the spatial-temporal plane while static objects are within the spatial-only plane. Then S3Gaussian utilizes a multi-head decoder to capture the deformation of 3D Gaussians in a canonical space for decomposition.
• OmniRe [6] successfully models urban dynamic scenes using Gaussian scene graphs, with different types of nodes tackling sky, background, rigidly moving objects and non-rigidly moving objects. It introduces rigid nodes for vehicles, where the Gaussians will not change over time, and non-rigid nodes for human-related dynamics, where local deformations will be taken into consideration. OmniRe additionally employs a Skinned Multi-Person Linear (SMPL) model to parameterize the human body, showcasing good results in reconstructing in-the-wild humans. Notably, OmniRe also requires accurate instance bounding boxes for dynamic modeling.
• PVG [5] is a self-supervised Gaussian splatting approach that reconstructs dynamic urban scenes and isolates dynamic parts from the static background. Refer to Sec. 2 for more details about PVG.
In the approaches mentioned above, StreetSurf [13], Mars [36], SUDS [31] and EmerNeRF [39] are based upon NeRF, while the others are based upon 3DGS. Notably, among the 3DGS-based approaches, HUGS [46], StreetGS [38] and OmniRe [6] all rely on instance-level bounding boxes for moving objects, which are sometimes difficult to obtain. PVG [5] and S3Gaussian [15] are most closely related to our work, both of which are self-supervised Gaussian Splatting methods without reliance on extra annotations.

9. Data

We conduct our experiments on the Waymo Open Dataset [28] and the KITTI Dataset [11], both consisting of real-world autonomous driving scenarios.

9.1. Waymo Open Dataset

NOTR from EmerNeRF. NOTR is a subset consisting of diverse and balanced sequences derived from the Waymo Open Dataset, introduced by [39]. It includes 120 distinct driving sequences, categorized into 32 static, 32 dynamic, and 56 diverse scenes covering various challenging driving conditions. Following [15], we incorporate the 32 dynamic scenes and 32 static scenes from NOTR into our testing set. Refer to EmerNeRF [39] for NOTR dataset details.

Dynamic-32 Split Static-32 Split
Method Image reconstruction Novel view synthesis Image reconstruction Novel view synthesis
PSNR ↑ DPSNR ↑ L1↓ PSNR ↑ DPSNR ↑ L1↓ PSNR ↑ L1↓ PSNR ↑ L1↓
3DGS [16] 28.47 23.26 - 25.14 20.48 - 29.42 - 26.82 -
Mars [36] 28.24 23.37 - 26.61 22.21 - 28.31 - 27.63 -
EmerNeRF [39] 28.16 24.32 3.12 25.14 23.49 4.33 30.00 2.84 28.89 3.89
S3Gaussian [15] 31.35 26.02 5.31 27.44 22.92 6.18 30.73 5.84 27.05 6.53
PVG [5] 33.14 31.79 3.33 29.77 27.19 4.84 32.84 3.75 29.12 5.07
Ours 34.56 32.63 2.96 30.45 28.66 4.17 34.57 2.89 31.78 3.93

Table 4. Comparison of methods on the Waymo NOTR Dataset from EmerNeRF.

Segment Name seg104554 seg125050 seg169514 seg584622 seg776165 seg138251 seg448767 seg965324
Scene Index 23 114 327 621 703 172 552 788

Table 5. Segment Names and Scene IDs of 8 scenes used in OmniRe[6].

Segment Name seg102319 seg103913 seg109636 seg117188
Scene Index 17 22 50 81

Table 6. Segment Names and Scene IDs of 4 scenes used in PVG [5].

OmniRe subset. OmniRe [6] selects eight highly complex dynamic driving sequences from the Waymo Open Dataset, each including dynamic classes such as vehicles and pedestrians. The Segment IDs of the selected scenes are shown in Tab. 5.

PVG subset. PVG [5] provides four sequences randomly selected from the Waymo Open Dataset, which are also included in our experiments. The Segment IDs are shown in Tab. 6.

For the sequences in the Waymo dataset, we follow the same setup as [39]. Camera images are captured from three frontal cameras—FRONT LEFT, FRONT, and FRONT RIGHT—and then resized to a resolution of 640×960. Only the first return of the LiDAR point cloud data is considered. We select the first 50 frames from each dataset for our experiments, and then scale the time range to [0,1].

9.2. KITTI Dataset

For the KITTI Dataset, we only test DeSiRe-GS on the subset provided by PVG [5]. Different from the Waymo dataset, we only use the left and right cameras for evaluation on the KITTI dataset. The resolution of images from each camera is 375 × 1242. Similar to the Waymo Dataset preprocessing, we randomly choose 50 frames from the whole sequence of each KITTI dataset and rescale the time duration to [0,1] with a frame interval of 0.02 seconds.
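As a small illustration of the timestamp handling described above (the released preprocessing code may differ), the frame times of a selected clip can be rescaled to [0, 1] as follows.

```python
import torch

def normalize_timestamps(num_frames: int = 50, frame_interval: float = 0.02) -> torch.Tensor:
    """Rescale the timestamps of a selected clip to [0, 1].

    Defaults mirror the KITTI setting above (50 frames, 0.02 s apart); Waymo
    clips are normalized the same way after selecting their first 50 frames.
    """
    t = torch.arange(num_frames, dtype=torch.float32) * frame_interval  # seconds
    return (t - t.min()) / (t.max() - t.min())                          # values in [0, 1]
```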
9.3. Data Source

For Tab. 1, the results of the baselines are taken from PVG [5], since we are using the same dataset and evaluated on the same devices. The results of Tab. 2 are sourced from OmniRe [6].

10. Additional Results

10.1. Additional quantitative results

We conducted experiments on the NOTR dataset, and the results are listed in Tab. 4. Following PVG [5], we scaled the camera poses and point clouds during pre-processing. For a fair comparison of depth errors with other methods, we re-map the depth back to the original scale and calculate the depth L1 error. The results in Tab. 3 for the ablation studies are without the rescaling.

10.2. Analysis

We compare the rendered depth of various methods in Fig. 11. S3Gaussian [15] fails to predict accurate depth maps, because they use only LiDAR point clouds for initialization, where there are no points at the upper part. Other than the LiDAR points, PVG [5] and DeSiRe-GS randomly sample points, enabling us to render much better depth maps.
GS-based methods, such as PVG [5] and S3Gaussian [15], generally outperform NeRF-based methods like EmerNeRF [39] in terms of image rendering. However, the explicit GS methods tend to overfit to the images, thereby performing poorly on depth rendering. With the proposed cross-view consistency, our model can successfully solve the over-fitting problem, achieving satisfactory rendering quality both in image and depth.
