DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
Figure 1. DeSiRe-GS. We present a 4D street Gaussian splatting representation for self-supervised static-dynamic decomposition and high-fidelity surface reconstruction, without requiring extra 3D annotations such as bounding boxes.
1. Introduction

Modeling driving scenes [11, 28] is essential for autonomous driving applications, as it facilitates real-world simulation and supports scene understanding [46]. An effective scene representation enables a system to efficiently perceive and reconstruct dynamic driving environments. Recently, 3D Gaussian Splatting (3DGS) [16] has emerged as a prominent 3D representation that can be optimized through 2D supervision. It has gained popularity due to its explicit nature, high efficiency, and rendering speed.

While 3DGS has demonstrated strong performance in static, object-centric reconstruction, the original formulation struggles with dynamic objects in unbounded street views, which are common in real-world scenarios, particularly in autonomous driving. Because the Gaussian parameterization is time-independent, dynamic regions cannot be modeled effectively, which leads to blurring artifacts. 4D-GS [35] addresses this by modeling dynamics with a Hexplane encoder. Hexplane [1] works well on object-level datasets but struggles with driving scenes because of the unbounded areas in outdoor environments. Instead, we reformulate the original static Gaussian model with time-dependent variables through minor changes, retaining the efficiency needed for large-scale driving scenes.

In this paper, we present DeSiRe-GS, a purely Gaussian Splatting-based representation that facilitates self-supervised static-dynamic decomposition and high-quality surface reconstruction in driving scenarios. For static-dynamic decomposition, existing methods such as DrivingGaussian [47] and Street Gaussians [38] rely on explicit 3D bounding boxes, which significantly simplifies the decomposition problem, since dynamic Gaussians inside a moving bounding box can simply be removed. Without 3D annotations, recent self-supervised methods such as PVG [5] and S3Gaussian [15] have attempted decomposition but fall short in performance, as they treat all Gaussians as dynamic and rely on indirect supervision to learn motion patterns. Our method instead achieves effective self-supervised decomposition based on a simple observation: dynamic regions reconstructed by 3DGS are blurry and therefore differ markedly from the ground truth images. Despite the absence of 3D annotations, DeSiRe-GS produces results comparable to, or better than, approaches that use explicit bounding boxes for decomposition.

Another challenge in applying 3DGS to autonomous driving is the sparsity of images, which is more pronounced than in object-centric reconstruction tasks. This sparsity often leads 3DGS to overfit the limited observations, resulting in inaccurate geometry. Inspired by 2D Gaussian Splatting (2DGS) [14], we aim to generate flatter, disk-shaped Gaussians that better align with the surfaces of objects such as roads and walls. We also couple the normal and scale of each Gaussian so that they can be optimized jointly to improve surface reconstruction quality. To further address overfitting, we propose temporal geometric cross-view consistency, which significantly enhances the model's geometric awareness and accuracy by aggregating information from different views across time. These strategies allow us to achieve state-of-the-art reconstruction quality, surpassing other Gaussian splatting approaches in the field of autonomous driving.

Overall, DeSiRe-GS makes the following contributions:
• We propose to extract motion information from appearance differences, based on the simple observation that 3DGS cannot successfully model dynamic regions.
• We distill the extracted 2D motion priors in local frames into the global Gaussian space, using time-varying Gaussians in a differentiable manner.
• We introduce effective 3D regularizations and temporal cross-view consistency to generate physically reasonable Gaussian ellipsoids, further enhancing high-quality decomposition and reconstruction.
We demonstrate DeSiRe-GS's capability for effective static-dynamic decomposition and high-fidelity surface reconstruction across various challenging datasets [11, 28].

2. Related Work

Urban Scene Reconstruction. Recent advancements in novel view synthesis, such as Neural Radiance Fields (NeRF) [21] and 3D Gaussian Splatting (3DGS) [16], have significantly advanced urban scene reconstruction. Many studies [22, 24, 27, 29, 30, 36, 39] have integrated NeRF into workflows for autonomous driving. Urban Radiance Fields [27] combines lidar and RGB data, while Block-NeRF [29] and Mega-NeRF [30] partition large scenes for parallel training. However, dynamic environments pose challenges. NSG [24] uses neural scene graphs to decompose dynamic scenes, and SUDS [31] introduces a multi-branch hash table for 4D scene representation. Self-supervised approaches like EmerNeRF [39] and RoDUS [22] can effectively address dynamic scene challenges: EmerNeRF captures object correspondences via scene flow estimation, and RoDUS utilizes a robust kernel-based training strategy combined with semantic supervision.

In 3DGS-based urban reconstruction, recent works [5, 6, 15, 38, 46, 47] have gained attention. Street Gaussians [38] models static and dynamic scenes separately using spherical harmonics, while DrivingGaussian [47] introduces specific modules for static background and dynamic object reconstruction. OmniRe [6] unifies static and dynamic object reconstruction via dynamic Gaussian scene graphs. However, [6, 38, 46] all require additional 3D bounding boxes, which are sometimes difficult to obtain.
Figure 2. Pipeline of DeSiRe-GS. The entire pipeline tackles the challenges of self-supervised street scene decomposition and is optimized without extra annotations in a self-supervised manner, leading to superior scene decomposition ability and rendering quality.
Static-Dynamic Decomposition. Several approaches seek to model the deformation of dynamic and static components. D-NeRF [26], Nerfies [25], Deformable GS [40], and 4D-GS [35] extend the vanilla NeRF or 3DGS by incorporating a deformation field. They compute the canonical-to-observation transformation and separate static and dynamic components through the deformation network. However, applying such methods to large-scale driving scenarios is challenging due to the substantial computational resources needed to learn dense deformation parameters, and the inaccurate decomposition leads to suboptimal performance. For autonomous driving scenarios, NSG [24] models dynamic and static parts as nodes in neural scene graphs but requires additional 3D annotations. Other NeRF-based methods [22, 31, 39] leverage a multi-branch structure to train time-dependent and time-invariant features separately. 3DGS-based methods, such as [5, 15, 38, 47], also focus on static-dynamic separation but still face limitations. [15] utilizes a deformation network with a hexplane temporal-spatial encoder, requiring extensive computation. PVG [5] assigns attributes like velocity and lifespan to each Gaussian to distinguish static from dynamic ones, yet the separation remains incomplete.

Neural Surface Reconstruction. Traditional methods for neural surface reconstruction focus more on real geometric structure. With the rise of neural radiance field (NeRF) technologies, neural implicit representations have shown promise for high-fidelity surface reconstruction. Approaches like [19, 33, 41, 43] train neural signed distance functions (SDFs) to represent scenes. StreetSurf [13] proposes disentangling close and distant views for better implicit surface reconstruction in urban settings, while [27] goes further by using sparse lidar to enhance depth details.

3DGS has renewed interest in explicit geometric reconstruction, with recent works [2, 3, 9, 12, 14, 32, 42] focusing on geometric regularization techniques. SuGaR [12] aligns Gaussian ellipsoids to object surfaces by introducing an additional regularization term, while 2DGS [14] directly replaces 3D ellipsoids with 2D discs and utilizes the truncated signed distance function (TSDF) to fuse depth maps, enabling noise-free surface reconstruction. PGSR [2] introduces single- and multi-view regularization for multi-view consistency. GSDF [42] and NeuSG [3] combine 3D Gaussians with neural implicit SDFs to enhance surface details. TrimGS [9] refines surface structures by trimming inaccurate geometry, maintaining compatibility with earlier methods like 3DGS and 2DGS. While these approaches excel in small-scale reconstruction, newer works like [4, 7, 10] aim to address large-scale urban scenes. [4] adopts a large-scene partitioning strategy for reconstruction, while RoGS [10] proposes a 2D Gaussian surfel representation that aligns with the physical characteristics of road surfaces.
3. Preliminary

3D Gaussian Splatting: 3D Gaussian Splatting (3DGS) [16] employs a collection of colored ellipsoids, $\mathcal{G} = \{g\}$, to explicitly represent 3D scenes. Each Gaussian $g = \{\mu, s, r, o, c\}$ is defined by the following learnable attributes: a position center $\mu \in \mathbb{R}^3$, a covariance matrix $\Sigma \in \mathbb{R}^{3\times 3}$, an opacity scalar $o$, and a color vector $c$, which is modeled using spherical harmonics. The distribution of a 3D Gaussian is mathematically described as:

$$G(x) = \exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right). \quad (1)$$

The covariance matrix can be formulated as $\Sigma = R S S^\top R^\top$, where $S$ denotes a diagonal scaling matrix and $R$ is a rotation matrix, parameterized by a scaling vector $s$ and a quaternion $r \in \mathbb{R}^4$, respectively.

To generate images from a specific viewpoint, the 3D Gaussian ellipsoids are projected onto the 2D image plane to form 2D ellipses for rendering. For each pixel, the sequence of Gaussians $N$ is sorted in ascending order of depth, and the color is rendered through alpha blending:

$$C = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1}(1-\alpha_j), \quad (2)$$

where $\alpha_i$ and $c_i$ denote the density and color of the $i$-th Gaussian, respectively, derived from the learned opacity and SH coefficients of the corresponding Gaussian.

Periodic Vibration Gaussian (PVG): PVG [5] reshapes the original Gaussian model by introducing time-dependent adjustments to the position mean $\mu$ and opacity $o$. The modified model is represented as follows:

$$\tilde{\mu}(t) = \mu + \frac{l}{2\pi} \cdot \sin\!\left(\frac{2\pi(t-\tau)}{l}\right) \cdot v, \quad (3)$$

$$\tilde{o}(t) = o \cdot e^{-\frac{1}{2}(t-\tau)^2 \beta^{-2}}, \quad (4)$$

where $\tilde{\mu}(t)$ denotes the vibrating position centered at $\mu$, occurring around the life peak $\tau$, and $\tilde{o}(t)$ represents the time-dependent opacity, which decays exponentially as time deviates from the life peak $\tau$. $\beta$ and $v$ determine the decay rate and the instant velocity at the life peak $\tau$, respectively, and are both learnable parameters. $l$, a pre-defined parameter of the scene, represents the oscillation period. Thus, the PVG model is expressed as:

$$\mathcal{G}(t) = \{\tilde{\mu}(t), s, r, \tilde{o}(t), c, \tau, \beta, v\}. \quad (5)$$

We adopt PVG as the dynamic representation for autonomous driving scenes because the PVG model preserves the structure of the original 3DGS model at any given time $t$, enabling it to be rendered with the standard 3DGS pipeline to reconstruct the dynamic scene. For further details about PVG, we refer the readers to [5].

4. DeSiRe-GS

As shown in Fig. 2, the training process is divided into two stages. We first extract 2D motion masks by calculating the feature difference between the rendered image and the GT image. In the second stage, we distill the 2D motion information into Gaussian space using PVG [5], enabling the rectification of inaccurate attributes for each Gaussian in a differentiable manner.

4.1. Dynamic Mask Extraction (stage I)

During the first stage, we observe that 3D Gaussian Splatting (3DGS) performs effectively in reconstructing static elements, such as parked cars and buildings in a driving scene. However, it struggles to accurately reconstruct dynamic regions, as the original 3DGS does not incorporate temporal information. This limitation results in artifacts such as ghost-like floating points in the rendered images, as illustrated in Fig. 2 (stage 1). To address this issue, we leverage the significant differences between static and dynamic regions to develop an efficient method for extracting segmentation masks that encode motion information.

Initially, a pretrained foundation model is employed to extract features from both the rendered image and the ground truth (GT) image used for supervision. Let $\hat{F}$ denote the features extracted from the rendered image $\hat{I}$, and $F$ the features extracted from the GT image $I$. To distinguish dynamic and static regions, we compute the per-pixel dissimilarity $D$ between the corresponding features. The dissimilarity metric $D$ approaches 0 for similar features, indicating static regions, and nears 1 for dissimilar features, corresponding to dynamic regions:

$$D = \left(1 - \cos(\hat{F}, F)\right)/2. \quad (6)$$

As the pretrained model is frozen, the resulting dissimilarity score $D \in \mathbb{R}^{H\times W}$ is computed without involving any learnable parameters. Rather than applying a simple threshold to $D$ to generate a motion segmentation mask, we propose a multi-layer perceptron (MLP) decoder to predict the dynamicness $\delta \in \mathbb{R}^{H\times W}$. This decoder leverages the extracted features, which contain rich semantic information, while the dissimilarity score is employed to guide and optimize the learning process of the decoder:

$$\mathcal{L}_{dyn} = \delta \odot D, \quad (7)$$

where $\odot$ refers to element-wise multiplication. By employing the loss function $\mathcal{L}_{dyn}$ defined in Eq. 7, the decoder is optimized to predict lower values in regions where $D$ is high, corresponding to dynamic regions, thereby minimizing the loss. We can then obtain the binary mask encoding motion information ($\varepsilon$ is a fixed threshold):

$$M = \mathbb{I}(\delta > \varepsilon). \quad (8)$$
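To make Eqs. (6)-(8) concrete, the snippet below sketches one stage-I step in PyTorch. The `feature_extractor` and `mask_decoder` modules, the tensor shapes, and the threshold value are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_step(feature_extractor, mask_decoder, rendered, gt, eps=0.5):
    """One stage-I step: per-pixel dissimilarity (Eq. 6), decoder loss (Eq. 7),
    and binary mask (Eq. 8). Images are assumed to be (B, 3, H, W) tensors."""
    with torch.no_grad():                    # the pretrained foundation model is frozen
        f_hat = feature_extractor(rendered)  # (B, C, H, W) features of the rendering
        f_gt = feature_extractor(gt)         # (B, C, H, W) features of the GT image

    # Eq. 6: D = (1 - cos(F_hat, F)) / 2, in [0, 1]; high where features disagree
    d = (1.0 - F.cosine_similarity(f_hat, f_gt, dim=1)) / 2.0   # (B, H, W)

    # The MLP decoder predicts per-pixel dynamicness from the rendered-image features
    delta = mask_decoder(f_hat)              # (B, H, W), e.g. a sigmoid output

    # Eq. 7: element-wise product with the fixed dissimilarity, reduced to a scalar
    loss_dyn = (delta * d).mean()

    # Eq. 8: binary mask encoding motion information, with a fixed threshold eps
    mask = (delta > eps).float()
    return loss_dyn, mask, d
```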
During training, the joint optimization of image rendering and mask prediction is mutually beneficial. By excluding dynamic regions from supervision, the differences between rendered images and GT images become more noticeable, facilitating the extraction of motion masks.
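The second stage builds on the PVG parameterization summarized in Sec. 3. As a reference, here is a minimal sketch of the time-dependent position and opacity of Eqs. (3)-(4); the vectorized tensor layout is an assumption for illustration.

```python
import torch

def pvg_position_opacity(mu, v, tau, beta, opacity, t, cycle_l):
    """Time-dependent PVG attributes (Eqs. 3-4) for N Gaussians.
    mu: (N, 3) centers, v: (N, 3) instant velocities at the life peak,
    tau: (N,) life peaks, beta: (N,) decay rates, opacity: (N,) base opacities,
    t: scalar timestamp, cycle_l: scalar oscillation period of the scene."""
    phase = 2.0 * torch.pi * (t - tau) / cycle_l                        # (N,)
    # Eq. 3: the position vibrates around mu along the velocity direction
    mu_t = mu + (cycle_l / (2.0 * torch.pi)) * torch.sin(phase)[:, None] * v
    # Eq. 4: the opacity decays exponentially as t moves away from the life peak
    o_t = opacity * torch.exp(-0.5 * (t - tau) ** 2 / beta ** 2)
    return mu_t, o_t                                                    # (N, 3), (N,)
```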
4.2. Static Dynamic Decomposition (stage II)

While stage I provides effective dynamic masks, these masks are confined to the image space rather than the 3D Gaussian space and depend on ground truth images. This reliance limits their applicability in novel view synthesis, where supervised images may not be available.

To bridge the 2D motion information from stage I to the 3D Gaussian space, we adopt PVG, a unified representation for dynamic scenes (Section 3). However, PVG's reliance on image and sparse depth map supervision introduces challenges, as accurate motion patterns are difficult to learn from indirect supervision signals. Consequently, the rendered velocity map $V \in \mathbb{R}^{H\times W}$, as shown in Fig. 2 (stage 2), often contains noisy outliers. For example, static regions such as roads and buildings, where the velocity should be zero, are not handled effectively. This results in unsatisfactory scene decomposition, with PVG frequently misclassifying regions where zero velocity is expected.

To mitigate this issue and generate more accurate Gaussian representations, we incorporate the segmentation masks obtained from stage I to regularize the 2D velocity map $V$, which is rendered from the Gaussians in 3D space:

$$\mathcal{L}_v = V \odot M. \quad (10)$$

Minimizing $\mathcal{L}_v$ penalizes regions where the velocity should be zero, effectively eliminating noisy outliers produced by the original PVG. This process propagates motion information from the 2D local frame to the global Gaussian space. With the refined velocity $v$ for each Gaussian, dynamic and static Gaussians can be distinguished by applying a simple threshold. This approach achieves superior self-supervised decomposition compared to PVG [5] and S3Gaussian [15], without requiring additional 3D annotations such as the bounding boxes used in previous methods [6, 38, 46].

4.3. Surface Reconstruction

4.3.1. Geometric Regularization

Flattening 3D Gaussians: Inspired by 2D Gaussian Splatting (2DGS) [14], we aim to flatten 3D ellipsoids into 2D disks, allowing the optimized Gaussians to better conform to object surfaces and enabling high-quality surface reconstruction. The scale $s = (s_1, s_2, s_3)$ of 3DGS defines the ellipsoid's size along three orthogonal axes. Minimizing the scale along the shortest axis effectively transforms 3D ellipsoids into 2D disks. The scaling regularization loss is:

$$\mathcal{L}_s = \|\min(s_1, s_2, s_3)\|. \quad (11)$$

Normal Derivation: Surface normals are critical for surface reconstruction. Previous methods incorporate normals by appending a normal vector $n_i \in \mathbb{R}^3$ to each Gaussian, which is then used to render a normal map $N \in \mathbb{R}^{H\times W}$. The ground truth normal map is employed to supervise the optimization of the Gaussian normals. However, these approaches often fail to achieve accurate surface reconstruction, as they overlook the inherent relationship between the scale and the normal. Instead of appending a separate normal vector, we derive the normal $n$ directly from the scale vector $s$. The normal direction naturally aligns with the axis corresponding to the smallest scale component, since the Gaussians are shaped like disks after the flattening regularization:

$$n = R \cdot \arg\min(s_1, s_2, s_3). \quad (12)$$

With this formulation of the normal, the gradient can be back-propagated to the scale vector rather than to an appended normal vector, thereby facilitating better optimization of the Gaussian parameters. The normal loss is:

$$\mathcal{L}_n = \|N - \hat{N}\|_2. \quad (13)$$

Figure 3. Gaussian Scale Regularization.

Giant Gaussian Regularization: We observed that both 3DGS and PVG can produce oversized Gaussian ellipsoids without additional regularization, particularly in unbounded driving scenarios, as illustrated in Fig. 3 (a). Our primary objective is to fit appropriately scaled Gaussians that support accurate image rendering and surface reconstruction. While oversized Gaussian ellipsoids with low opacity may have minimal impact on the rendered image, they can significantly impair surface reconstruction; this limitation is often overlooked by existing methods focused solely on 2D image rendering. To address this issue, we introduce a penalty term for each Gaussian:

$$s_g = \max(s_1, s_2, s_3); \quad \mathcal{L}_g = s_g \cdot \mathbb{I}(s_g > \epsilon), \quad (14)$$

where $s_g$ is the largest scale component and $\epsilon$ is a predefined threshold for oversized Gaussians.
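A compact sketch of the regularizers in Eqs. (10)-(14). The per-Gaussian tensor shapes, the mean reductions, and the threshold value are assumptions for illustration; only the formulas themselves follow the equations above.

```python
import torch

def geometric_regularizers(scales, rotations, velocity_map, mask_m, eps_big=5.0):
    """scales: (N, 3) per-Gaussian scale vectors, rotations: (N, 3, 3) rotation
    matrices, velocity_map: (H, W) rendered velocity magnitudes, mask_m: (H, W)
    stage-I mask M; per Eq. 10 the velocity is suppressed wherever M = 1."""
    # Eq. 10: penalize rendered velocity in regions where it should be zero
    loss_v = (velocity_map * mask_m).mean()

    # Eq. 11: flatten ellipsoids into disks by shrinking the smallest scale axis
    min_scale, min_idx = scales.min(dim=1)                        # (N,), (N,)
    loss_s = min_scale.abs().mean()

    # Eq. 12: the normal is the rotation column of the smallest-scale axis
    normals = rotations.gather(
        2, min_idx.view(-1, 1, 1).expand(-1, 3, 1)).squeeze(-1)   # (N, 3)

    # Eq. 14: penalize giant Gaussians whose largest scale exceeds a threshold
    max_scale = scales.max(dim=1).values                          # (N,)
    loss_g = (max_scale * (max_scale > eps_big).float()).mean()

    return loss_v, loss_s, loss_g, normals
```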
Figure 4. Cross-view consistency.

4.3.2. Temporal Spatial Consistency

In driving scenarios, the sparse nature of views often leads to overfitting to the training views during the optimization of Gaussians. Single-view image loss is particularly susceptible to challenges in texture-less regions at far distances. As a result, relying on photometric supervision from images and sparse depth maps is not reliable. To address this, we propose enhancing geometric consistency by leveraging temporal cross-view information.

Under the assumption that the depth of static regions remains consistent over time across varying views, we introduce a cross-view temporal-spatial consistency module. For a static pixel $(u_r, v_r)$ in the reference frame with a depth value $d_r$, we project it to the nearest neighboring view, i.e., the view with the largest overlap. Using the camera intrinsics $K$ and extrinsics $T_r, T_n$, the corresponding pixel location in the neighboring view is calculated as:

$$[u_n, v_n, 1]^T = K T_n T_r^{-1} d_r \cdot K^{-1} [u_r, v_r, 1]^T. \quad (15)$$

We then query the depth value $d_n$ at $(u_n, v_n)$ in the neighboring view. Projecting this back into 3D space, the resulting position should align with the position obtained by back-projecting $(u_r, v_r, d_r)$ from the reference frame:

$$[u_{nr}, v_{nr}, 1]^T = K T_r T_n^{-1} d_n \cdot K^{-1} [u_n, v_n, 1]^T. \quad (16)$$

To enforce cross-view depth consistency, we apply a geometric loss to optimize the Gaussians, defined as:

$$\mathcal{L}_{uv} = \|(u_r, v_r) - (u_{nr}, v_{nr})\|_2. \quad (17)$$

This loss encourages the Gaussians to produce geometrically consistent depth across views over time (a code sketch of this check is given after Sec. 5.1).

4.4. Optimization

Stage I: During stage I, our objective is to leverage the joint optimization of motion masks and rendered images to effectively learn the motion masks. Therefore, we only use the masked image loss $\mathcal{L}_I$,

$$\mathcal{L}_I = (1-\lambda_{ssim})\|I - \tilde{I}\|_1 + \lambda_{ssim}\,\mathrm{SSIM}(I, \tilde{I}). \quad (18)$$

Combined with the motion loss from Eq. 7, the total loss for stage I is $\mathcal{L}_{stage1} = \mathcal{L}_I + \mathcal{L}_{dyn}$ (19).

Stage II: The depth, normal, and velocity maps are rendered via alpha blending in the same way as the color in Eq. 2:

$$\{D, N, V\} = \sum_{i \in N} \alpha_i \prod_{j=1}^{i-1}(1-\alpha_j)\,\{d_i, n_i, v_i\}. \quad (20)$$

For stage II, we use the projected sparse depth map $D_{gt}$ from LiDAR as the supervision label:

$$\mathcal{L}_D = \|D - D_{gt}\|_1. \quad (21)$$

Together with the static velocity regularization (Eq. 10), the flattening loss (Eq. 11), the normal supervision (Eq. 13), the giant Gaussian regularization (Eq. 14), and the geometric consistency loss (Eq. 17), the loss for stage II is:

$$\mathcal{L}_{stage2} = \mathcal{L}_I + \mathcal{L}_D + \mathcal{L}_n + \mathcal{L}_v + \mathcal{L}_s + \mathcal{L}_g + \mathcal{L}_{uv}. \quad (22)$$

5. Experiments

5.1. Experimental Setups

Datasets. We conduct our experiments on the Waymo Open Dataset [28] and the KITTI dataset [11], both consisting of real-world autonomous driving scenarios. For the Waymo Open Dataset, we use the subset from PVG [5]. For a more complete comparison with non-self-supervised methods, we also conduct experiments on the subset provided by OmniRe [6], which contains a large number of highly dynamic scenes. We use the three frontal cameras (FRONT LEFT, FRONT, FRONT RIGHT) for the Waymo Open Dataset, and the left and right cameras for the KITTI dataset.

Evaluation Metrics. We adopt PSNR, SSIM [34], and LPIPS [45] as metrics for the evaluation of image reconstruction and novel view synthesis. Following [15, 38, 39], we also include DPSNR and DSSIM to assess the rendering quality in dynamic regions. Additionally, we introduce depth L1, which measures the L1 error between the rendered depth map and the ground truth depth map obtained from LiDAR point clouds, as an evaluation metric for the quality of geometric reconstruction.

Baselines. We benchmark DeSiRe-GS against the following approaches: 3DGS [16], StreetSurf [13], Mars [36], SUDS [31], EmerNeRF [39], S3Gaussian [15], PVG [5], OmniRe [6], StreetGS [38], and HUGS [46]. Among these methods, SUDS and EmerNeRF are NeRF-based self-supervised approaches. S3Gaussian and PVG are both 3DGS-based self-supervised methods, the closest to our approach. To further highlight the superiority of DeSiRe-GS, we also compare it with OmniRe, StreetGS, and HUGS, all of which require additional bounding box information.
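As mentioned in Sec. 4.3.2, the sketch below walks through the reprojection check of Eqs. (15)-(17). The world-to-camera extrinsics convention, the bilinear lookup of the neighboring depth, and the restriction to a static-pixel mask are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def cross_view_consistency_loss(depth_r, depth_n, K, T_r, T_n, static_mask):
    """depth_r, depth_n: (H, W) rendered depths of the reference / neighboring view,
    K: (3, 3) intrinsics, T_r, T_n: (4, 4) world-to-camera extrinsics,
    static_mask: (H, W) boolean mask of pixels treated as static."""
    H, W = depth_r.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()        # (H, W, 3)

    def warp(pix_h, depth, T_src, T_dst):
        # Eqs. 15/16: x_dst = K T_dst T_src^{-1} (depth * K^{-1} x_src)
        cam = depth[..., None] * (pix_h @ torch.linalg.inv(K).T)         # src camera
        cam_h = F.pad(cam, (0, 1), value=1.0)                            # homogeneous
        dst = (cam_h @ (T_dst @ torch.linalg.inv(T_src)).T)[..., :3]     # dst camera
        proj = dst @ K.T
        return proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)

    # Eq. 15: project reference pixels into the neighboring view
    uv_n = warp(pix, depth_r, T_r, T_n)

    # Query d_n at (u_n, v_n) by bilinearly sampling the neighboring depth map
    grid = torch.stack([uv_n[..., 0] / (W - 1) * 2 - 1,
                        uv_n[..., 1] / (H - 1) * 2 - 1], dim=-1)[None]
    d_n = F.grid_sample(depth_n[None, None], grid, align_corners=True)[0, 0]

    # Eq. 16: warp those pixels back to the reference view using the queried depth
    uv_nr = warp(F.pad(uv_n, (0, 1), value=1.0), d_n, T_n, T_r)

    # Eq. 17: pixel-space round-trip error, accumulated over static pixels only
    err = (uv_nr - pix[..., :2]).norm(dim=-1)
    return err[static_mask].mean()
```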
Waymo Open Dataset:

| Method | Recon. PSNR ↑ | Recon. SSIM ↑ | Recon. LPIPS ↓ | NVS PSNR ↑ | NVS SSIM ↑ | NVS LPIPS ↓ | FPS |
|---|---|---|---|---|---|---|---|
| S-NeRF [37] | 19.67 | 0.528 | 0.387 | 19.22 | 0.515 | 0.400 | 0.0014 |
| StreetSurf [13] | 26.70 | 0.846 | 0.3717 | 23.78 | 0.822 | 0.401 | 0.097 |
| 3DGS [16] | 27.99 | 0.866 | 0.293 | 25.08 | 0.822 | 0.319 | 63 |
| NSG [24] | 24.08 | 0.656 | 0.441 | 21.01 | 0.571 | 0.487 | 0.032 |
| Mars [36] | 21.81 | 0.681 | 0.430 | 20.69 | 0.636 | 0.453 | 0.030 |
| SUDS [31] | 28.83 | 0.805 | 0.317 | 25.36 | 0.783 | 0.384 | 0.008 |
| EmerNeRF [39] | 28.11 | 0.786 | 0.373 | 25.92 | 0.763 | 0.384 | 0.053 |
| PVG [5] | 32.46 | 0.910 | 0.229 | 28.11 | 0.849 | 0.279 | 50 |
| Ours | 33.61 | 0.919 | 0.204 | 29.75 | 0.878 | 0.213 | 36 |

KITTI:

| Method | Recon. PSNR ↑ | Recon. SSIM ↑ | Recon. LPIPS ↓ | NVS PSNR ↑ | NVS SSIM ↑ | NVS LPIPS ↓ | FPS |
|---|---|---|---|---|---|---|---|
| S-NeRF [37] | 19.23 | 0.664 | 0.193 | 18.71 | 0.606 | 0.352 | 0.0075 |
| StreetSurf [13] | 24.14 | 0.819 | 0.257 | 22.48 | 0.763 | 0.304 | 0.37 |
| 3DGS [16] | 21.02 | 0.811 | 0.202 | 19.54 | 0.776 | 0.224 | 125 |
| NSG [24] | 19.19 | 0.683 | 0.189 | 17.78 | 0.645 | 0.312 | 0.19 |
| Mars [36] | 27.96 | 0.900 | 0.185 | 24.31 | 0.845 | 0.160 | 0.31 |
| SUDS [31] | 28.83 | 0.917 | 0.147 | 26.07 | 0.797 | 0.131 | 0.29 |
| EmerNeRF [39] | 26.95 | 0.828 | 0.218 | 25.24 | 0.801 | 0.237 | 0.28 |
| PVG [5] | 32.83 | 0.937 | 0.070 | 27.43 | 0.896 | 0.114 | 59 |
| Ours | 33.94 | 0.949 | 0.04 | 28.87 | 0.901 | 0.106 | 41 |
Table 1. Comparison of methods on the Waymo Open Dataset and KITTI dataset. FPS refers to frames per second.
Figure 5. Qualitative comparison with the self-supervised S3Gaussian [15] and PVG [5].
| Methods | Box | PSNR (reconst) ↑ | PSNR (nvs) ↑ |
|---|---|---|---|
| EmerNeRF [39] | | 31.93 | 29.67 |
| 3DGS [16] | | 26.00 | 25.57 |
| DeformGS [40] | | 28.40 | 27.72 |
| PVG [5] | | 32.37 | 30.19 |
| HUGS [46] | ✓ | 28.26 | 27.65 |
| StreetGS [38] | ✓ | 29.08 | 28.54 |
| OmniRe [6] | ✓ | 34.25 | 32.57 |
| Ours | | 33.82 | 31.49 |

Table 2. Comparison of rendering quality against recent SOTA methods with or without 3D bounding box annotations. 'reconst' refers to reconstruction and 'nvs' refers to novel view synthesis.

Implementation Details. All experiments are conducted on an NVIDIA RTX A6000. We sample a total of $1\times 10^6$ points for initialization, with $6\times 10^5$ from the LiDAR point cloud and $4\times 10^5$ randomly sampled. In the first stage, we train for a total of 30,000 iterations and start training the motion decoder after 6,000 iterations. For the second stage, we train the model for 50,000 iterations. Multi-view temporal consistency regularization begins after 20,000 iterations, and the motion masks obtained from stage I are employed after 30,000 iterations to supervise the optimization of the velocity $v$. We use Adam [17] as our optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$.

5.2. Quantitative Results

Following PVG [5], we evaluate our method on two tasks, image reconstruction and novel view synthesis, using the Waymo Open Dataset [28] and the KITTI dataset [11]. As shown in Tab. 1, our approach achieves state-of-the-art performance across all rendering metrics for both reconstruction and synthesis tasks. In terms of rendering speed, our method reaches approximately 40 FPS, slightly slower than the 3DGS [16] and PVG [5] baselines due to rendering additional attributes.
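A schematic of the initialization and optimization setup described above. The helper names, the scene-bounds sampling, and the learning rate are placeholders for illustration, not the actual training code.

```python
import torch

# Training schedule mirroring the implementation details above (iteration counts only):
# stage I: 30k iterations, motion decoder joins at 6k; stage II: 50k iterations,
# multi-view consistency from 20k, stage-I motion masks from 30k.
STAGE1_ITERS, DECODER_START = 30_000, 6_000
STAGE2_ITERS, CONSISTENCY_START, MASK_START = 50_000, 20_000, 30_000

def init_points(lidar_points, scene_bounds, n_lidar=600_000, n_random=400_000):
    """1e6 initialization points: 6e5 subsampled from LiDAR plus 4e5 random points
    drawn uniformly inside an assumed axis-aligned scene bounding box."""
    idx = torch.randperm(lidar_points.shape[0])[:n_lidar]
    low, high = scene_bounds                      # each a (3,) tensor
    random_pts = low + (high - low) * torch.rand(n_random, 3)
    return torch.cat([lidar_points[idx], random_pts], dim=0)

def make_optimizer(params, lr=1e-3):
    # Adam with beta1 = 0.9 and beta2 = 0.999, as stated above (lr is a placeholder)
    return torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999))
```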
| Setting | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DPSNR ↑ | DSSIM ↑ | Depth L1 ↓ |
|---|---|---|---|---|---|---|
| (a) w/o stage I motion mask | 34.7063 | 0.9570 | 0.1098 | 34.7183 | 0.9570 | 0.1017 |
| (b) w/o FiT3D model (w/ DINOv2) | 34.9551 | 0.9559 | 0.1027 | 34.9734 | 0.9602 | 0.0977 |
| (c) w/o GT normal supervision | 35.4469 | 0.9625 | 0.0967 | 35.4876 | 0.9626 | 0.0913 |
| (d) w/o GT normal (w/ normal from depth) | 35.2357 | 0.9509 | 0.1436 | 35.5312 | 0.9512 | 0.0847 |
| (e) w/o min scale regularization | 35.2863 | 0.9616 | 0.0989 | 35.3275 | 0.9617 | 0.0935 |
| (f) w/o max scale regularization | 35.6911 | 0.9622 | 0.0970 | 35.7306 | 0.9623 | 0.0802 |
| (g) w/o multi-view consistency | 35.3325 | 0.9618 | 0.0983 | 35.3731 | 0.9619 | 0.1154 |
| Full model | 35.7598 | 0.9631 | 0.0956 | 35.7820 | 0.9632 | 0.0713 |
References

[1] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
[2] Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction. arXiv preprint arXiv:2406.06521, 2024.
[3] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2023.
[4] Junyi Chen, Weicai Ye, Yifan Wang, Danpeng Chen, Di Huang, Wanli Ouyang, Guofeng Zhang, Yu Qiao, and Tong He. Gigags: Scaling up planar-based 3d gaussians for large scene surface reconstruction. arXiv preprint arXiv:2409.06685, 2024.
[5] Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and Li Zhang. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering. arXiv preprint arXiv:2311.18561, 2023.
[6] Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction. arXiv preprint arXiv:2408.16760, 2024.
[7] Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen. Gaussianpro: 3d gaussian splatting with progressive propagation. In Forty-first International Conference on Machine Learning, 2024.
[8] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.
[9] Lue Fan, Yuxue Yang, Minxing Li, Hongsheng Li, and Zhaoxiang Zhang. Trim 3d gaussian splatting for accurate geometry representation, 2024.
[10] Zhiheng Feng, Wenhua Wu, and Hesheng Wang. Rogs: Large scale road surface reconstruction based on 2d gaussian splatting, 2024.
[11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[12] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5354–5363, 2024.
[13] Jianfei Guo, Nianchen Deng, Xinyang Li, Yeqi Bai, Botian Shi, Chiyu Wang, Chenjing Ding, Dongliang Wang, and Yikang Li. Streetsurf: Extending multi-view implicit surface reconstruction to street views. arXiv preprint arXiv:2306.04988, 2023.
[14] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
[15] Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. S3gaussian: Self-supervised street gaussians for autonomous driving. CoRR, 2024.
[16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[18] Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. WildGaussians: 3D gaussian splatting in the wild. arXiv, 2024.
[19] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction, 2023.
[20] Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, and Kui Jia. Gs-ir: 3d gaussian splatting for inverse rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21644–21653, 2024.
[21] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis, 2020.
[22] Thang-Anh-Quan Nguyen, Luis Roldão, Nathan Piasco, Moussab Bennehar, and Dzmitry Tsishkou. Rodus: Robust decomposition of static and dynamic elements in urban scenes. arXiv preprint arXiv:2403.09419, 2024.
[23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[24] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2856–2865, 2021.
[25] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields, 2021.
[26] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes, 2020.
[27] Konstantinos Rematas, Andrew Liu, Pratul P Srinivasan, Jonathan T Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. Urban radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12932–12942, 2022.
[28] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
[29] Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis, 2022.
[30] Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs, 2022.
[31] Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12375–12385, 2023.
[32] Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, and Juho Kannala. Dn-splatter: Depth and normal priors for gaussian splatting and meshing, 2024.
[33] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
[34] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[35] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering, 2024.
[36] Zirui Wu, Tianyu Liu, Liyi Luo, Zhide Zhong, Jianteng Chen, Hongmin Xiao, Chao Hou, Haozhe Lou, Yuantao Chen, Runyi Yang, et al. Mars: An instance-aware, modular and realistic simulator for autonomous driving. In CAAI International Conference on Artificial Intelligence, pages 3–15. Springer, 2023.
[37] Ziyang Xie, Junge Zhang, Wenye Li, Feihu Zhang, and Li Zhang. S-nerf: Neural radiance fields for street views. arXiv preprint arXiv:2303.00749, 2023.
[38] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians for modeling dynamic urban scenes. arXiv preprint arXiv:2401.01339, 2024.
[39] Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, et al. Emernerf: Emergent spatial-temporal scene decomposition via self-supervision. arXiv preprint arXiv:2311.02077, 2023.
[40] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024.
[41] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces, 2021.
[42] Mulin Yu, Tao Lu, Linning Xu, Lihan Jiang, Yuanbo Xiangli, and Bo Dai. Gsdf: 3dgs meets sdf for improved rendering and reconstruction, 2024.
[43] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022.
[44] Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2d feature representations by 3d-aware fine-tuning. In ECCV, 2024.
[45] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
[46] Hongyu Zhou, Jiahao Shao, Lu Xu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger, and Yiyi Liao. Hugs: Holistic urban 3d scene understanding via gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21336–21345, 2024.
[47] Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21634–21643, 2024.
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition
and Surface Reconstruction for Urban Driving Scenes
Supplementary Material
Figure 7. DeSiRe-GS. We present a 4D street Gaussian splatting representation for self-supervised static-dynamic decomposition and high-fidelity surface reconstruction, without requiring extra 3D annotations such as bounding boxes.
Figure 8. Comparison between features extracted from DINOv2 [23] and FiT3D [44].
Figure 9. Segmentation Mask Extraction of DeSiRe-GS. We utilize the differences between the rendered image and ground truth to train a
dynamic mask decoder.
FiT3D [44] fine-tunes DINOv2 with Gaussian splatting to improve the 3D awareness of the extracted features, which is a perfect match for our setting in the driving world. Therefore, we turn to FiT3D as the feature extractor to measure the similarity of two images, producing cleaner and more robust features.

Motion Mask Extractor. By introducing the learnable decoder, we are not limited to the viewpoints where GT images are available. Instead, given a rendered image, we can first extract the FiT3D features and then use our decoder to extract the motion mask, without requiring ground truth images.

During training, the joint optimization of image rendering and mask prediction is mutually beneficial: the obtained mask $M$ is used to mask out the dynamic regions. The rendering loss is:

$$\mathcal{L}_{masked\text{-}render} = M \odot \|\hat{I} - I\|. \quad (23)$$

As we mask out the dynamic regions, the reconstruction in those regions is not supervised. As a result, the difference between the rendered images and the ground truth images becomes more significant, which benefits the extraction of the desired motion masks. We provide a few samples in Fig. 10; it can be observed that our model handles dynamic objects well, even for far-away pedestrians.

Temporal Geometric Constraints. Due to the sparse nature of views in driving scenarios, Gaussian splatting tends to overfit to the training views during optimization. Single-view image loss often suffers in texture-less areas at far distances, so relying on photometric consistency alone is not reliable. Instead, we propose to enhance geometric consistency by aggregating temporal information.
Figure 10. Extracted motion masks using FiT3D features
Based on the assumption that the depth of static regions remains consistent across time from varying views, we designate a cross-view temporal-spatial consistency module. For a static pixel $(u_r, v_r)$ in the reference frame, with depth value $d_r$, we can project it to the nearest neighboring view, which has the largest amount of overlap. Given the camera intrinsics $K$ and extrinsics $T_r, T_n$, we can obtain the corresponding pixel location in the neighboring view:

$$[u_n, v_n, 1]^T = K T_n T_r^{-1} d_r \cdot K^{-1} [u_r, v_r, 1]^T. \quad (24)$$

Again, we can query the depth value $d_n$ at the position $(u_n, v_n)$. When we project it back to 3D space, the position should be consistent with the one obtained from back-projecting $(u_r, v_r, d_r)$ to the reference frame:

$$[u_{nr}, v_{nr}, 1]^T = K T_r T_n^{-1} d_n \cdot K^{-1} [u_n, v_n, 1]^T. \quad (25)$$

We apply a geometric loss to optimize the Gaussians to produce cross-view consistent depth:

$$\mathcal{L}_{uv} = \|(u_r, v_r) - (u_{nr}, v_{nr})\|_2. \quad (26)$$

8. Baselines

• StreetSurf [13] is an implicit neural rendering method for both geometry and appearance reconstruction in street views. The whole scene is divided into close-range, distant-view, and sky parts according to the distance of objects. A cuboid close-range hash grid and a hyper-cuboid distant-view model are employed to handle the long and narrow observation space of most street scenes, showcasing good performance in unbounded scenes captured by long camera trajectories.
• NSG [24] enables efficient rendering of novel arrangements and views by encoding object transformations and radiance within a learnable scene graph representation. It contains a background node approximating all static parts, several dynamic nodes representing rigidly moving individuals, and edges representing transformations. NSG [24] also combines implicitly encoded scenes with a jointly learned latent representation to describe objects in a single implicit function.
• SUDS [31] is a NeRF-based method for dynamic large urban scene reconstruction. It proposes using 2D optical flow to model scene dynamics, avoiding additional bounding box annotations. SUDS develops a three-branch hash table representation for 4D scenes, enabling a variety of downstream tasks.
• StreetGS [38] models dynamic driving scenes using 3D Gaussian splatting. It represents the components of the scene separately, with a background model for the static part and an object model for foreground moving objects. To capture dynamic features, the position and rotation of the Gaussians are defined in an object-local coordinate system, which relies on bounding boxes predicted by an off-the-shelf model.
• HUGS [46] is a 3DGS-based method addressing urban scene reconstruction and understanding. It assumes that the scene is composed of static regions and moving vehicles with rigid motions, using a unicycle model to represent vehicle states. HUGS also extends the original 3DGS to model additional modalities, including optical flow and semantic information, achieving good performance in both scene reconstruction and semantic reconstruction. Bounding boxes are also required in this process.
• EmerNeRF [39] is a NeRF-based method for constructing 4D neural scene representations in urban driving scenes. It decomposes dynamic scenes into a static field and a dynamic field, both parameterized by hash grids. An emergent scene flow field is then introduced to represent explicit correspondences between moving objects and to aggregate temporally-displaced features. Remarkably, EmerNeRF accomplishes these tasks entirely through self-supervision.
• S3Gaussian [15] is a self-supervised approach that decomposes static and dynamic 3D Gaussians in driving scenes. It aggregates 4D Gaussian representations in a spatial-temporal field network with a multi-resolution hexplane encoder, where dynamic objects are visible only within the spatial-temporal plane while static objects lie within the spatial-only plane. S3Gaussian then utilizes a multi-head decoder to capture the deformation of 3D Gaussians in a canonical space for decomposition.
• OmniRe [6] models urban dynamic scenes using Gaussian scene graphs, with different types of nodes handling the sky, background, rigidly moving objects, and non-rigidly moving objects. It introduces rigid nodes for vehicles, whose Gaussians do not change over time, and non-rigid nodes for human-related dynamics, where local deformations are taken into consideration. OmniRe additionally employs a Skinned Multi-Person Linear (SMPL) model to parameterize the human body, showing good results in reconstructing in-the-wild humans. Notably, OmniRe also requires accurate instance bounding boxes for dynamic modeling.
• PVG [5] is a self-supervised Gaussian splatting approach that reconstructs dynamic urban scenes and isolates dynamic parts from the static background. Refer to Sec. 2 for more details about PVG.

Among the approaches above, StreetSurf [13], Mars [36], SUDS [31], and EmerNeRF [39] are based on NeRF, while the others are based on 3DGS. Notably, among the 3DGS-based approaches, HUGS [46], StreetGS [38], and OmniRe [6] all rely on instance-level bounding boxes for moving objects, which are sometimes difficult to obtain. PVG [5] and S3Gaussian [15] are most closely related to our work; both are self-supervised Gaussian splatting methods without reliance on extra annotations.

Figure 11. Qualitative Comparison.

9. Data

We conduct our experiments on the Waymo Open Dataset [28] and the KITTI Dataset [11], both consisting of real-world autonomous driving scenarios.

9.1. Waymo Open Dataset

NOTR from EmerNeRF. NOTR is a subset of diverse and balanced sequences derived from the Waymo Open Dataset, introduced by [39]. It includes 120 distinct driving sequences, categorized into 32 static, 32 dynamic, and 56 diverse scenes covering various challenging driving conditions.
Dynamic-32 Split:

| Method | Recon. PSNR ↑ | Recon. DPSNR ↑ | Recon. L1 ↓ | NVS PSNR ↑ | NVS DPSNR ↑ | NVS L1 ↓ |
|---|---|---|---|---|---|---|
| 3DGS [16] | 28.47 | 23.26 | - | 25.14 | 20.48 | - |
| Mars [36] | 28.24 | 23.37 | - | 26.61 | 22.21 | - |
| EmerNeRF [39] | 28.16 | 24.32 | 3.12 | 25.14 | 23.49 | 4.33 |
| S3Gaussian [15] | 31.35 | 26.02 | 5.31 | 27.44 | 22.92 | 6.18 |
| PVG [5] | 33.14 | 31.79 | 3.33 | 29.77 | 27.19 | 4.84 |
| Ours | 34.56 | 32.63 | 2.96 | 30.45 | 28.66 | 4.17 |

Static-32 Split:

| Method | Recon. PSNR ↑ | Recon. L1 ↓ | NVS PSNR ↑ | NVS L1 ↓ |
|---|---|---|---|---|
| 3DGS [16] | 29.42 | - | 26.82 | - |
| Mars [36] | 28.31 | - | 27.63 | - |
| EmerNeRF [39] | 30.00 | 2.84 | 28.89 | 3.89 |
| S3Gaussian [15] | 30.73 | 5.84 | 27.05 | 6.53 |
| PVG [5] | 32.84 | 3.75 | 29.12 | 5.07 |
| Ours | 34.57 | 2.89 | 31.78 | 3.93 |
| Segment Name | Scene Index |
|---|---|
| seg104554 | 23 |
| seg125050 | 114 |
| seg169514 | 327 |
| seg584622 | 621 |
| seg776165 | 703 |
| seg138251 | 172 |
| seg448767 | 552 |
| seg965324 | 788 |