
3DSAM: SEGMENT ANYTHING IN NERF

Shangjie Wang, Yan Zhang

State Key Lab for Novel Software Technology, Nanjing University, Nanjing, China

Shangjie Wang and Yan Zhang contributed equally to this paper. Email: wangshangjiew3@gmail.com

ABSTRACT

Object segmentation within Neural Radiance Fields (NeRF) plays a pivotal role, holding potential to enrich a myriad of downstream applications like NeRF editing. Most existing methods rely heavily on feature similarity in 3D space, which makes them non-trivial to manipulate. Instead of intricate 3D interfaces, segmenting multiview images rendered from NeRF proves to be more intuitive, enhancing both visibility and interactivity. However, annotating multiple images places a heavy demand on users. To address this, we propose an interactive NeRF segmentation framework that leverages user input from just one rendered view, automatically generating consistent prompts across all other views. Delving deeper, we propose the Semantic Prompt Generator (SPG), which employs a pre-trained SAM image encoder to extract image features. Cosine similarities between these features are then utilized to form positive-negative location pair prompts. Moreover, we propose the Position Prompt Generator (PPG) to capture geometric relationships across different views, generating consistent bounding box prompts. Our method seamlessly extends SAM's impressive segmentation capabilities to 3D scenarios without additional network training. Extensive evaluations confirm that our algorithm not only surpasses previous works in segmentation quality but also spends less time.

Index Terms— Neural Radiance Fields, Segment Anything Model, Interactive Segmentation.

Fig. 1. Our model takes in the user's prompts (red point) from just one rendered view and accurately segments the object in the other views. Finally, these masks are utilized for reconstruction.

1. INTRODUCTION

Recently, Neural Radiance Fields (NeRF) [1] have emerged as a new modality for representing and reconstructing scenes, demonstrating impressive results in reconstructing 3D scenes. The spectrum of research in NeRF is expansive. Object segmentation within NeRF is one of the most important areas, as it holds potential to enrich a myriad of downstream applications like NeRF editing [2, 3, 4] and object removal in NeRF.

There have been a few efforts at segmenting radiance fields, but they remain unsatisfactory. Most existing methods rely heavily on feature similarity in 3D space, which makes them non-trivial to manipulate. N3F [5] uses user-provided patches, while DFF [6] uses textual prompts or patches as the segmentation cues. Both of them use distillation for feature matching between user-provided prompts and the learnt 3D feature volume. SPIn-NeRF [7] treats the image sequence rendered from NeRF [1] as a video. It first estimates an initial mask using one-shot segmentation, then segments the other images using a video segmentation method [8, 9], and finally refines the masks with a semantic NeRF [10, 11, 12]. The above approaches are limited by several factors: 1) training one additional feature field for each scene is inefficient; 2) they require network training, which consumes considerable resources; 3) keeping high-dimensional 3D representations requires a large memory footprint.

A NeRF scene is usually implicitly embedded inside the neural mapping weights, resulting in an entangled and uninterpretable representation that is difficult to alter. Instead of intricate 3D interfaces, segmenting multiview images rendered from NeRF proves to be more intuitive, enhancing both visibility and interactivity. However, annotating multiple images while keeping the views consistent places a heavy demand on users. An intriguing alternative is to expect only a small number of annotations for a single view. This inspires the development of a method for generating a view-consistent 3D segmentation mask of an object from a single-view sparse annotation.

Fig. 2. Workflow Overview of 3DSAM: Initially, the user designates a source view and supplies prompts corresponding
to the object targeted for segmentation within the image rendered from NeRF [1]. Following this, SAM [13] is invoked to
derive the mask for the source view. This mask, in conjunction with images rendered from other views, serves as the input for
3DSAM. Leveraging our specially designed Position Prompt Generator and Semantic Prompt Generator, 3DSAM is capable of
autonomously generating prompts for images from other views. These prompts are subsequently ingested by SAM to acquire
the corresponding masks. In the culmination of this process, these masks facilitate the reconstruction of NeRF [1].
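In plain pseudocode, the workflow of Fig. 2 amounts to the loop below. This is only an illustrative sketch: the renderer, SAM wrapper, and the two prompt generators are placeholder callables we name ourselves, not the paper's actual API.

def run_3dsam(nerf, views, source_view, user_points, sam, ppg, spg):
    """High-level sketch of the Fig. 2 workflow. `nerf`, `sam`, `ppg`, and
    `spg` stand in for the NeRF renderer, SAM, and the two prompt
    generators of Section 2; none of these names come from the paper's code."""
    src_img = nerf.render(source_view)
    src_mask = sam.segment(src_img, points=user_points)       # mask for the source view
    masks = {source_view: src_mask}
    for view in views:
        if view == source_view:
            continue
        img = nerf.render(view)
        box = ppg(src_mask, source_view, view)                 # consistent box prompt (Sec. 2.1)
        pos_pt, neg_pt = spg(src_img, img, src_mask)           # positive-negative point pair (Sec. 2.2)
        masks[view] = sam.segment(img, box=box,
                                  points=[pos_pt, neg_pt], labels=[1, 0])
    return masks  # finally used to reconstruct / edit the NeRF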

In this paper, we propose a novel interactive NeRF segmentation method named 3DSAM. As can be seen in Figure 2, 3DSAM leverages user input from just one rendered view, automatically generating consistent prompts across all other views. Delving deeper, we propose the Semantic Prompt Generator (SPG), as shown in Figure 3, which employs a pre-trained SAM [13] image encoder to extract image features. Cosine similarities between these features are then utilized to form positive-negative location pair prompts. Moreover, we propose the Position Prompt Generator (PPG) to capture geometric relationships across different views, generating consistent bounding box prompts. Our method seamlessly extends SAM's [13] impressive segmentation capabilities to 3D scenarios without additional network training.

We summarize the contributions of our paper as follows: 1) we enhance the visibility and interactivity of segmentation in NeRF [1]; 2) compared to previous models, we not only achieve superior segmentation results but also significantly reduce computational time and memory consumption; 3) we extend SAM's [13] remarkable ability to 3D scenes without extra network training.

2. METHOD

The workflow of our interactive segmentation pipeline is demonstrated in Fig. 2. In this section, we introduce the important modules of 3DSAM: the proposed Position Prompt Generator (PPG) and the Semantic Prompt Generator (SPG), in Section 2.1 and Section 2.2, respectively.

2.1. Position Prompt Generator

The Position Prompt Generator is engineered to deduce object positional information across different views by harnessing geometric relationships. In a given source view, the object's bounding box can be precisely determined by querying the source view mask M_S. Our method then propagates this bounding box to the other views. It is noteworthy that a 2D point within a specific view can be reprojected into 3D space. To ascertain the positions of the corresponding points in different views, it suffices to project these 3D points onto the 2D image planes, contingent on the designated camera poses. This procedure is efficiently executed using a sparse point cloud reconstruction method. Specifically, we employ COLMAP [14], which provides a bijective mapping structure, enabling point localization in 3D space using mere 2D coordinates as queries. However, given the inherent sparsity of the reconstruction, a direct mapping for arbitrary user inputs is not always feasible. By querying proximate points in the existing discrete set, we can resolve such ambiguities. Furthermore, COLMAP [14] operates in real time, eschewing neural network training in favor of matrix computations.
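For concreteness, a minimal sketch of the projection step is given below. It assumes a standard pinhole model with per-view intrinsics K and world-to-camera extrinsics (R, t); the 3D points would come from COLMAP's sparse reconstruction, queried by pixels inside the source-view mask (with the nearest-point fallback described above). Function names are our own illustration, not the paper's implementation.

import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world-space points into a view with intrinsics K (3x3)
    and world-to-camera pose (R, t); returns Nx2 pixel coordinates."""
    cam = points_3d @ R.T + t        # world -> camera coordinates
    uv = cam @ K.T                   # camera -> homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]    # perspective divide

def box_prompt_for_view(points_3d, K, R, t):
    """Turn the reprojected sparse points into a SAM-style box prompt
    [x0, y0, x1, y1] for the target view."""
    uv = project_points(points_3d, K, R, t)
    x0, y0 = uv.min(axis=0)
    x1, y1 = uv.max(axis=0)
    return np.array([x0, y0, x1, y1])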
2.2. Semantic Prompt Generator

The architecture of the Semantic Prompt Generator is illustrated in Fig. 3.

First of all, the pre-trained image encoder Enc_I of SAM [13] is used to extract the features of both the source view image I_S and another view image I as

F_S = Enc_I(I_S),  F_I = Enc_I(I),    (1)

where F_S, F_I ∈ R^{h×w×c}. The features of the pixels within the target visual concept are then derived from F_S using the reference mask M_S ∈ R^{h×w×1}, and the global visual embedding T_S ∈ R^{1×c} is aggregated using average pooling:

T_S = AvgPooling(M_S ⊙ F_S),    (2)

where ⊙ denotes spatial-wise multiplication. After that, we create a position confidence map from the target embedding T_S by computing the cosine similarity S between T_S and the other view's image feature F_I as

S = F_I T_S^T ∈ R^{h×w},    (3)

where F_I and T_S are pixel-wise L2-normalized. In order to provide SAM [13] with a location prior on the other view image, we choose the two pixel coordinates in S with the highest and lowest similarity values, designated as P_h and P_l respectively. While the latter identifies the background, the former indicates where the target object is most likely to be in the foreground. They are then treated as the positive-negative point pair and supplied to the prompt encoder as

T_P = Enc_P(P_l, P_h),    (4)

where T_P ∈ R^{2×c} acts as the decoder's prompt tokens. By doing so, SAM [13] tends to segment the area immediately surrounding the positive point in the other view image while ignoring the negative one.

Fig. 3. Semantic Prompt Generator. To obtain a location prior on the other view image, we employ SAM's [13] image encoder to extract visual features and calculate a similarity map for positive-negative point selection, which provides 3DSAM with foreground and background cues without human prompting.
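The point selection just described (and depicted in Fig. 3) can be written down compactly. The sketch below assumes SAM's encoder features have already been computed for both views and reads Eq. (2) as an average over the masked region; the function name and exact pooling are our assumptions rather than the released code.

import torch
import torch.nn.functional as F

def select_point_prompts(feat_src, feat_tgt, mask_src):
    """feat_src, feat_tgt: (h, w, c) SAM image-encoder features of the source
    and target views; mask_src: (h, w) float mask (1 inside the object).
    Returns ((x_pos, y_pos), (x_neg, y_neg)) on the target feature grid."""
    h, w, c = feat_tgt.shape
    # Target embedding T_S: average of source features inside the mask (Eq. 2).
    t_s = (feat_src * mask_src.unsqueeze(-1)).sum(dim=(0, 1)) / mask_src.sum()
    # Cosine-similarity map S between T_S and every target-view feature (Eq. 3).
    sim = F.cosine_similarity(feat_tgt.reshape(-1, c), t_s.unsqueeze(0), dim=1)
    p_h = int(sim.argmax())          # highest similarity -> positive (foreground) point
    p_l = int(sim.argmin())          # lowest similarity  -> negative (background) point
    return (p_h % w, p_h // w), (p_l % w, p_l // w)

The returned coordinates, rescaled from the feature grid to image resolution, can then be fed to SAM's prompt encoder with labels 1 (positive) and 0 (negative), alongside the PPG's box prompt.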

3. EXPERIMENTS

3.1. Experimental Setup

Datasets: We adopt the SPIn-NeRF [7] dataset to evaluate our model. The dataset consists of 10 real-world scenes from LLFF [16], NeRF-360 [1], NeRF-Supervision [17], and Shiny [18], and includes human-annotated object masks. Additionally, another 12 scenes, different from those in the SPIn-NeRF [7] dataset, are chosen from a variety of regularly used 3D reconstruction datasets.

Metrics: To evaluate the performance of our segmentation algorithm, we use evaluation metrics commonly used in segmentation tasks, namely pixel-wise accuracy of the predictions and the intersection over union (IoU). We use the average accuracy and IoU over the different views as the evaluation result for each scene.
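Concretely, the per-scene scores can be computed as follows (a small sketch assuming binary NumPy masks; this is our illustration, not the authors' evaluation script).

import numpy as np

def accuracy_and_iou(pred, gt):
    """Pixel-wise accuracy and IoU for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    acc = (pred == gt).mean()
    union = np.logical_or(pred, gt).sum()
    inter = np.logical_and(pred, gt).sum()
    iou = inter / union if union > 0 else 1.0
    return acc, iou

def scene_score(pred_masks, gt_masks):
    """Average accuracy and IoU over all evaluated views of a scene."""
    accs, ious = zip(*(accuracy_and_iou(p, g) for p, g in zip(pred_masks, gt_masks)))
    return float(np.mean(accs)), float(np.mean(ious))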
3.2. Performance

Owing to the limited scholarly exploration of interactive NeRF segmentation, we compare our model with established models and self-constructed baselines. Projection-based methods form one family of self-constructed baselines: the source mask is projected into the other views using the scene geometry from a NeRF, and the resulting incomplete masks are then completed by interactive segmentation algorithms to produce full object masks, giving Proj. + GrabCut [19] and Proj. + EdgeFlow [20]. Furthermore, we explore Proj. + EdgeFlow + Semantic NeRF [10], where an extra Semantic NeRF [10] is fitted to make the outputs 3D consistent. The last baseline is a video segmentation method [8].

Quantitative Evaluation: As can be seen in Table 2, our model outperforms all of the baselines and the state-of-the-art model SPIn-NeRF [7]. Table 1 compares the time consumption of our method and SPIn-NeRF [7]. SPIn-NeRF [7] constructs an extremely complex pipeline that consumes substantial time and hardware resources. In contrast, our approach uses SAM [13] efficiently and requires only minimal matrix computation. As a result, our model produces higher-quality results in less time.

Qualitative Evaluation: The proposed approach is visually compared to N3F [5] and NVOS [15] in Figure 4. As demonstrated, our method not only preserves object completeness but also restores fine edge details, producing overall better prediction results.

Table 1. Comparison of time consumption between our method and SPIn-NeRF [7] under the same experimental setting.

Method      One-shot Segmentation       Video Segmentation   Semantic NeRF Train   Semantic NeRF Predict   Total
SPIn-NeRF   <1 second                   <1 minute            3-5 minutes           1 minute                5-7 minutes
Ours        1 minute (whole pipeline)                                                                      1 minute

Fig. 4. (a) Our 3DSAM vs. N3F [5]. (b) Our 3DSAM vs. NVOS [15]. Our method preserves object completeness and restores fine edge details, producing overall better prediction results than N3F [5] and NVOS [15]. The red point indicates the user's prompt.

Table 2. Quantitative comparison on the SPIn-NeRF dataset.

Method                                        Accuracy (%)   IoU (%)
Proj. + GrabCut [19]                          91.08          46.61
Proj. + EdgeFlow [20]                         96.84          81.63
Semantic NeRF [10] (only source mask)         94.63          75.13
Proj. + EdgeFlow [20] + Semantic NeRF [10]    97.26          83.95
Video Segmentation [8]                        98.43          88.34
SPIn-NeRF [7]                                 98.91          91.96
Ours                                          99.65          94.78

Table 3. Ablation study of 3DSAM.

Variant                IoU (%)   Gain
Only Positive Point    78.19     -
+ Negative Point       84.23     +6.04
+ Position Prompt      94.78     +10.53

Fig. 5. Qualitative comparisons with different model settings.

3.3. Ablation Analysis

In this part, we analyze the contribution of the different model components. As shown in Table 3, we begin with a baseline model with 78.19 IoU, in which only the SPG's positive location prior is used to automatically prompt SAM [13]. Then, we add the negative location prior, improving the segmentation IoU by 6.04%. On top of that, we introduce the PPG into our model to provide positional information for SAM, further improving the segmentation IoU by 10.53%. Qualitative comparisons with different model settings are shown in Fig. 5. All these results fully demonstrate the significance of our designs.

4. CONCLUSION

In this paper, we propose a novel interactive NeRF segmentation method named 3DSAM. 3DSAM leverages user input from just one rendered view, automatically generating consistent prompts across all other views, which greatly reduces the burden on the user. We design the Semantic Prompt Generator (SPG), which employs a pre-trained SAM [13] image encoder to extract image features. Cosine similarities between these features are then utilized to form positive-negative location pair prompts. Moreover, we propose the Position Prompt Generator (PPG) to capture geometric relationships across different views, generating consistent bounding box prompts. Extensive evaluations confirm that our algorithm not only surpasses previous works in segmentation quality but also spends less time.

5. REFERENCES

[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "Nerf: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.

[2] Y. Peng, Y. Yan, S. Liu, Y. Cheng, S. Guan, B. Pan, G. Zhai, and X. Yang, "Cagenerf: Cage-based neural radiance field for generalized 3d deformation and animation," Advances in Neural Information Processing Systems, vol. 35, pp. 31402–31415, 2022.

[3] T. Xu and T. Harada, "Deforming radiance fields with cages," in European Conference on Computer Vision, pp. 159–175, Springer, 2022.

[4] Y.-J. Yuan, Y.-T. Sun, Y.-K. Lai, Y. Ma, R. Jia, and L. Gao, "Nerf-editing: Geometry editing of neural radiance fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18353–18364, 2022.

[5] V. Tschernezki, I. Laina, D. Larlus, and A. Vedaldi, "Neural feature fusion fields: 3d distillation of self-supervised 2d image representations," in 2022 International Conference on 3D Vision (3DV), pp. 443–453, IEEE, 2022.

[6] S. Kobayashi, E. Matsumoto, and V. Sitzmann, "Decomposing nerf for editing via feature field distillation," Advances in Neural Information Processing Systems, vol. 35, pp. 23311–23330, 2022.

[7] A. Mirzaei, T. Aumentado-Armstrong, K. G. Derpanis, J. Kelly, M. A. Brubaker, I. Gilitschenski, and A. Levinshtein, "Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20669–20679, 2023.

[8] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.

[9] W. Wang, T. Zhou, F. Porikli, D. Crandall, and L. Van Gool, "A survey on deep learning technique for video segmentation," arXiv e-prints, 2021.

[10] S. Zhi, T. Laidlow, S. Leutenegger, and A. J. Davison, "In-place scene labelling and understanding with implicit scene representation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15838–15847, 2021.

[11] A. Mirzaei, Y. Kant, J. Kelly, and I. Gilitschenski, "Laterf: Label and text driven object radiance fields," in European Conference on Computer Vision, pp. 20–36, Springer, 2022.

[12] S. Zhi, E. Sucar, A. Mouton, I. Haughton, T. Laidlow, and A. J. Davison, "ilabel: Interactive neural scene labelling," arXiv preprint arXiv:2111.14637, 2021.

[13] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.

[14] J. L. Schönberger, T. Price, T. Sattler, J.-M. Frahm, and M. Pollefeys, "A vote-and-verify strategy for fast spatial verification in image retrieval," in Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13, pp. 321–337, Springer, 2017.

[15] Z. Ren, A. Agarwala, B. Russell, A. G. Schwing, and O. Wang, "Neural volumetric object selection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6133–6142, 2022.

[16] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines," ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–14, 2019.

[17] L. Yen-Chen, P. Florence, J. T. Barron, T.-Y. Lin, A. Rodriguez, and P. Isola, "Nerf-supervision: Learning dense object descriptors from neural radiance fields," in 2022 International Conference on Robotics and Automation (ICRA), pp. 6496–6503, IEEE, 2022.

[18] S. Wizadwongsa, P. Phongthawee, J. Yenphraphai, and S. Suwajanakorn, "Nex: Real-time view synthesis with neural basis expansion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8534–8543, 2021.

[19] C. Rother, V. Kolmogorov, and A. Blake, ""Grabcut": Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 309–314, 2004.

[20] Y. Hao, Y. Liu, Z. Wu, L. Han, Y. Chen, G. Chen, L. Chu, S. Tang, Z. Yu, Z. Chen, et al., "Edgeflow: Achieving practical interactive segmentation with edge-guided flow," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1551–1560, 2021.

