3DSAM: Segment Anything in NeRF
State Key Lab for Novel Software Technology, Nanjing University, Nanjing, China
Shangjie Wang and Yan Zhang contributed equally to this paper. Email: wangshangjiew3@gmail.com

ABSTRACT

Object segmentation within Neural Radiance Fields (NeRF) plays a pivotal role, holding the potential to enrich a myriad of downstream applications such as NeRF editing. Most existing methods rely heavily on feature similarity in 3D space, which makes them non-trivial to manipulate. Instead of intricate 3D interfaces, segmenting multiview images rendered from NeRF proves to be more intuitive, enhancing both visibility and interactivity. However, annotating multiple images places a heavy demand on users. To address this, we propose an interactive NeRF segmentation framework that leverages user input from just one rendered view and automatically generates consistent prompts across all other views. Delving deeper, we propose the Semantic Prompt Generator (SPG), which employs a pre-trained SAM image encoder to extract image features. Cosine similarities between these features are then utilized to form positive-negative location pair prompts. Moreover, we propose the Position Prompt Generator (PPG) to capture geometric relationships across different views, generating consistent bounding box prompts. Our method seamlessly extends SAM's impressive segmentation capabilities to 3D scenarios without additional network training. Extensive evaluations confirm that our algorithm not only surpasses previous works in segmentation quality but also requires less time.

Index Terms— Neural Radiance Fields, Segment Anything Model, Interactive Segmentation.

Fig. 1. Our model takes the user's prompt (red point) from just one rendered view and accurately segments the object in all other views. Finally, these masks are utilized for reconstruction.

1. INTRODUCTION

Recently, Neural Radiance Fields (NeRF) [1] have emerged as a new modality for representing and reconstructing scenes, demonstrating impressive results in reconstructing 3D scenes. The spectrum of research in NeRF is expansive. Object segmentation within NeRF is one of the most important areas, as it holds the potential to enrich a myriad of downstream applications such as NeRF editing [2, 3, 4] and object removal in NeRF.

There have been a few efforts at segmenting radiance fields, but the results remain unsatisfactory. Most existing methods rely heavily on feature similarity in 3D space, which makes them non-trivial to manipulate. N3F [5] uses user-provided patches, while DFF [6] uses textual prompts or patches as the segmentation cues. Both use distillation to match user-provided prompts against a learnt 3D feature volume. SPIn-NeRF [7] treats the image sequence rendered from NeRF [1] as a video: it first estimates an initial mask using one-shot segmentation, then segments the remaining images with a video segmentation method [8, 9], and finally refines the masks with a semantic NeRF [10, 11, 12]. These approaches are limited by several factors: 1) training an additional feature field for each scene is inefficient; 2) they require network training, which consumes considerable resources; 3) keeping high-dimensional 3D representations requires a large memory footprint.

A NeRF scene is typically embedded implicitly inside the neural mapping weights, resulting in an entangled and uninterpretable representation that is difficult to alter. Instead of intricate 3D interfaces, segmenting multiview images rendered from NeRF proves to be more intuitive, enhancing both visibility and interactivity. However, annotating multiple images while keeping the views consistent places a heavy demand on users. An intriguing alternative is to require only a small number of annotations for a single view. This inspires the development of a method for generating a view-consistent 3D segmentation mask of an object from a single-view sparse annotation.

In this paper, we propose a novel interactive NeRF segmentation method named 3DSAM, as can be seen in Figure 2. 3DSAM leverages user input from just one rendered view and automatically generates consistent prompts across all other views. Delving deeper, we propose the Semantic Prompt Generator (SPG), as shown in Figure 3, which employs a pre-trained SAM [13] image encoder to extract image features. Cosine similarities between these features are then utilized to form positive-negative location pair prompts. Moreover, we propose the Position Prompt Generator (PPG) to capture geometric relationships across different views, generating consistent bounding box prompts. Our method seamlessly extends SAM's [13] impressive segmentation capabilities to 3D scenarios without additional network training.

We summarize the contributions of our paper as follows: 1) we enhance the visibility and interactivity of segmentation in NeRF [1]; 2) compared to previous models, we not only achieve superior segmentation results but also significantly reduce computational time and memory consumption; 3) we extend SAM's [13] remarkable ability to 3D scenes without extra network training.
Fig. 2. Workflow Overview of 3DSAM: Initially, the user designates a source view and supplies prompts corresponding
to the object targeted for segmentation within the image rendered from NeRF [1]. Following this, SAM [13] is invoked to
derive the mask for the source view. This mask, in conjunction with images rendered from other views, serves as the input for
3DSAM. Leveraging our specially designed Position Prompt Generator and Semantic Prompt Generator, 3DSAM is capable of
autonomously generating prompts for images from other views. These prompts are subsequently ingested by SAM to acquire
the corresponding masks. In the culmination of this process, these masks facilitate the reconstruction of NeRF [1].
2. METHOD

The workflow of our interactive segmentation pipeline is demonstrated in Fig. 1. In this section, we introduce the two key modules of 3DSAM, the proposed Position Prompt Generator (PPG) and Semantic Prompt Generator (SPG), in Section 2.1 and Section 2.2, respectively.
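To make the workflow concrete, the loop below sketches how the pipeline could be driven with the publicly released segment_anything package. The helpers generate_position_prompt and generate_semantic_prompt are illustrative placeholders for the PPG (Section 2.1) and SPG (Section 2.2), and the checkpoint path is an assumption; only the SamPredictor calls are real API.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM once; checkpoint filename is an assumption, not part of the paper.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def segment_all_views(views, src_idx, src_points, src_labels,
                      generate_position_prompt, generate_semantic_prompt):
    """views: list of HxWx3 uint8 images rendered from NeRF."""
    # 1) Run SAM on the user-annotated source view.
    predictor.set_image(views[src_idx])
    src_mask, _, _ = predictor.predict(point_coords=src_points,
                                       point_labels=src_labels,
                                       multimask_output=False)
    masks = {src_idx: src_mask[0]}

    # 2) Propagate prompts to every other view and run SAM again.
    for i, img in enumerate(views):
        if i == src_idx:
            continue
        box = generate_position_prompt(src_idx, i, masks[src_idx])      # PPG
        pos, neg = generate_semantic_prompt(views[src_idx], img,
                                            masks[src_idx])             # SPG
        predictor.set_image(img)
        m, _, _ = predictor.predict(
            point_coords=np.array([pos, neg], dtype=np.float32),
            point_labels=np.array([1, 0]),   # positive-negative location pair
            box=np.asarray(box, dtype=np.float32),
            multimask_output=False)
        masks[i] = m[0]
    return masks   # per-view masks, later used to reconstruct NeRF
```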
2.1. Position Prompt Generator

The Position Prompt Generator is engineered to deduce object positional information across different views by harnessing geometric relationships. In a given source view, the object's bounding box can be precisely determined by querying the source view mask M_S. Our method then spreads this bounding box position across the other views. It is noteworthy that a 2D point within a specific view can be reprojected into 3D space. To ascertain the positions of the corresponding points in different views, it suffices to project these 3D points onto 2D planes, contingent on the designated camera poses. This procedure is efficiently executed using a sparse point cloud reconstruction method. Specifically, we employ COLMAP [14], which provides a distinctive bijective mapping structure, enabling point localization in 3D space using mere 2D coordinates as queries. However, given the inherent sparsity of the reconstruction, direct mapping for arbitrary user inputs is not invariably feasible. By querying proximate points in the existing discrete set, we can resolve such ambiguities. Furthermore, COLMAP [14] operates in real time, eschewing neural network training in favor of matrix computations.
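The reprojection step admits a compact NumPy sketch, shown below under the assumption that the COLMAP sparse model has been exported as source-view keypoints kp2d_src, their associated 3D points pts3d (the bijective 2D-3D mapping above), and the target view's intrinsics and world-to-camera pose (K_tgt, R_tgt, t_tgt); all function and variable names here are illustrative.

```python
import numpy as np

def project(K, R, t, X):
    """Project Nx3 world points into pixel coordinates for one view."""
    x = (K @ (R @ X.T + t[:, None])).T        # Nx3 homogeneous coordinates
    return x[:, :2] / x[:, 2:3]

def position_prompt(mask_src, kp2d_src, pts3d, K_tgt, R_tgt, t_tgt, pad=5):
    """PPG sketch: carry the source-view bounding box into a target view.

    mask_src : HxW bool object mask in the source view (from SAM)
    kp2d_src : Mx2 COLMAP keypoints observed in the source view
    pts3d    : Mx3 sparse 3D points, bijective with kp2d_src
    """
    # Keep sparse points whose source-view observation lies on the object.
    u = np.round(kp2d_src).astype(int)
    inside = mask_src[u[:, 1].clip(0, mask_src.shape[0] - 1),
                      u[:, 0].clip(0, mask_src.shape[1] - 1)]
    if not inside.any():
        # Sparsity fallback: no keypoint hits the mask, so take the
        # keypoints closest to the mask centroid instead.
        inside = np.zeros(len(u), bool)
        cy, cx = np.argwhere(mask_src).mean(0)
        inside[np.linalg.norm(kp2d_src - [cx, cy], axis=1).argsort()[:10]] = True

    # Reproject the selected 3D points into the target view and box them.
    uv = project(K_tgt, R_tgt, t_tgt, pts3d[inside])
    x0, y0 = uv.min(0) - pad
    x1, y1 = uv.max(0) + pad
    return np.array([x0, y0, x1, y1])          # bounding-box prompt for SAM
```

The padded bounding box of the reprojected points then serves as the box prompt supplied to SAM for that target view.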
2.2. Semantic Prompt Generator

The architecture of the Semantic Prompt Generator is illustrated in Fig. 3. First of all, the pre-trained image encoder Enc_I of
point pair and supplied into the prompt encoder as

T_P = Enc_P(P_l, P_h),    (4)

where T_P ∈ R^{2×c} acts as the decoder's prompt tokens. In this way, SAM [13] tends to segment the area immediately surrounding the positive point in the other-view image while ignoring the negative one.
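One simple instantiation of this positive-negative pair selection is sketched below: the source-view object features are averaged into a single descriptor and compared, by cosine similarity, against every location of the target-view feature map returned by SamPredictor.get_image_embedding() (a 1×256×64×64 tensor). The grid-to-pixel mapping is a proportional approximation, and the helper name is illustrative rather than the exact implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def semantic_prompt(predictor, img_src, mask_src, img_tgt):
    """SPG sketch: pick a positive/negative point pair in the target view
    via cosine similarity of SAM image-encoder features."""
    # Feature map of the source view and an averaged object descriptor.
    predictor.set_image(img_src)
    f_src = predictor.get_image_embedding()[0]            # 256 x 64 x 64
    m = F.interpolate(torch.from_numpy(mask_src)[None, None].float(),
                      size=f_src.shape[-2:], mode="nearest")[0, 0].bool()
    obj_feat = f_src[:, m.to(f_src.device)].mean(dim=1)   # 256-d object feature

    # Cosine similarity against every location of the target feature map.
    predictor.set_image(img_tgt)
    f_tgt = predictor.get_image_embedding()[0]             # 256 x 64 x 64
    sim = F.cosine_similarity(f_tgt, obj_feat[:, None, None], dim=0)  # 64 x 64

    # Most / least similar grid cells become the positive / negative points.
    pos = np.unravel_index(sim.argmax().item(), sim.shape)
    neg = np.unravel_index(sim.argmin().item(), sim.shape)

    # Map 64x64 grid cells back to image pixels (proportional approximation;
    # SAM resizes the longest side to 1024 and pads before encoding).
    scale = max(img_tgt.shape[:2]) / sim.shape[0]
    to_xy = lambda yx: ((yx[1] + 0.5) * scale, (yx[0] + 0.5) * scale)
    return to_xy(pos), to_xy(neg)                          # (x, y) pairs for SAM
```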
3. EXPERIMENTS
Table 1. Comparison of time consumption between our method and SPIn-NeRF [7] under the same experimental setting.

Method       One-shot Segmentation    Video Segmentation    Semantic NeRF Train    Semantic NeRF Predict    Total
SPIn-NeRF    <1 second                <1 minute             3-5 minutes            1 minute                 5-7 minutes
Ours         1 minute (entire pipeline)                                                                     1 minute
Fig. 4. (a) Our 3DSAM vs. N3F [5]. (b) Our 3DSAM vs. NVOS [15]. Our method preserves object completeness and recovers fine edge details, producing overall better predictions than N3F [5] and NVOS [15]. The red point indicates the user's prompt.
5. REFERENCES

[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.

[2] Y. Peng, Y. Yan, S. Liu, Y. Cheng, S. Guan, B. Pan, G. Zhai, and X. Yang, "CageNeRF: Cage-based neural radiance field for generalized 3D deformation and animation," Advances in Neural Information Processing Systems, vol. 35, pp. 31402–31415, 2022.

[3] T. Xu and T. Harada, "Deforming radiance fields with cages," in European Conference on Computer Vision, pp. 159–175, Springer, 2022.

[4] Y.-J. Yuan, Y.-T. Sun, Y.-K. Lai, Y. Ma, R. Jia, and L. Gao, "NeRF-Editing: Geometry editing of neural radiance fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18353–18364, 2022.

[5] V. Tschernezki, I. Laina, D. Larlus, and A. Vedaldi, "Neural feature fusion fields: 3D distillation of self-supervised 2D image representations," in 2022 International Conference on 3D Vision (3DV), pp. 443–453, IEEE, 2022.

[6] S. Kobayashi, E. Matsumoto, and V. Sitzmann, "Decomposing NeRF for editing via feature field distillation," Advances in Neural Information Processing Systems, vol. 35, pp. 23311–23330, 2022.

[7] A. Mirzaei, T. Aumentado-Armstrong, K. G. Derpanis, J. Kelly, M. A. Brubaker, I. Gilitschenski, and A. Levinshtein, "SPIn-NeRF: Multiview segmentation and perceptual inpainting with neural radiance fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20669–20679, 2023.

[8] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.

[9] W. Wang, T. Zhou, F. Porikli, D. Crandall, and L. Van Gool, "A survey on deep learning technique for video segmentation," arXiv e-prints, pp. arXiv–2107, 2021.

[10] S. Zhi, T. Laidlow, S. Leutenegger, and A. J. Davison, "In-place scene labelling and understanding with implicit scene representation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15838–15847, 2021.

[11] A. Mirzaei, Y. Kant, J. Kelly, and I. Gilitschenski, "LaTeRF: Label and text driven object radiance fields," in European Conference on Computer Vision, pp. 20–36, Springer, 2022.

[12] S. Zhi, E. Sucar, A. Mouton, I. Haughton, T. Laidlow, and A. J. Davison, "iLabel: Interactive neural scene labelling," arXiv preprint arXiv:2111.14637, 2021.

[13] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.

[14] J. L. Schönberger, T. Price, T. Sattler, J.-M. Frahm, and M. Pollefeys, "A vote-and-verify strategy for fast spatial verification in image retrieval," in Computer Vision – ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part I, pp. 321–337, Springer, 2017.

[15] Z. Ren, A. Agarwala, B. Russell, A. G. Schwing, and O. Wang, "Neural volumetric object selection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6133–6142, 2022.

[16] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines," ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–14, 2019.

[17] L. Yen-Chen, P. Florence, J. T. Barron, T.-Y. Lin, A. Rodriguez, and P. Isola, "NeRF-Supervision: Learning dense object descriptors from neural radiance fields," in 2022 International Conference on Robotics and Automation (ICRA), pp. 6496–6503, IEEE, 2022.

[18] S. Wizadwongsa, P. Phongthawee, J. Yenphraphai, and S. Suwajanakorn, "NeX: Real-time view synthesis with neural basis expansion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8534–8543, 2021.

[19] C. Rother, V. Kolmogorov, and A. Blake, "'GrabCut': Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 309–314, 2004.

[20] Y. Hao, Y. Liu, Z. Wu, L. Han, Y. Chen, G. Chen, L. Chu, S. Tang, Z. Yu, Z. Chen, et al., "EdgeFlow: Achieving practical interactive segmentation with edge-guided flow," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1551–1560, 2021.