NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review
Abstract—Neural Radiance Field (NeRF), a novel view synthesis method using an implicit scene representation, has taken the field of
Computer Vision by storm. As a novel view synthesis and 3D reconstruction method, NeRF models find applications in robotics, urban
mapping, autonomous navigation, virtual reality/augmented reality, and more. Since the original paper by Mildenhall et al., more than
250 preprints have been published, with more than 100 eventually being accepted at tier-one Computer Vision conferences. Given
NeRF's popularity and the current interest in this research area, we believe it necessary to compile a comprehensive survey of NeRF papers
from the past two years, which we organize into both architecture-based and application-based taxonomies. We also provide an introduction
to the theory of NeRF-based novel view synthesis, and a benchmark comparison of the performance and speed of key NeRF models.
By creating this survey, we hope to introduce new researchers to NeRF, provide a helpful reference for influential works in this field, as
well as motivate future research directions with our discussion section.
Index Terms—Neural Radiance Field, NeRF, Computer Vision Survey, Novel View Synthesis, Neural Rendering, 3D Reconstruction
1 INTRODUCTION
Fig. 1. The NeRF volume rendering and training process. Image sourced from [1]. (a) illustrates the selection of sampling points for individual pixels
in a to-be-synthesized image. (b) illustrates the generation of densities and colors at the sampling points using NeRF MLP(s). (c) and (d) illustrate
the generation of individual pixel color(s) using in-scene colors and densities along the associated camera ray(s) via volume rendering, and the
comparison to ground truth pixel color(s), respectively.
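To make the training process in Fig. 1 concrete, the following minimal NumPy sketch traces one training step for a batch of rays: frequency positional encoding, stratified sampling of points along each ray, a toy stand-in for the NeRF MLP(s), discrete volume rendering, and the squared-error photometric loss. All names (positional_encoding, toy_field, render_rays) and constants are illustrative assumptions for this survey, not the authors' implementation.

```python
import numpy as np

def positional_encoding(v, n_freqs):
    """Frequency encoding applied per component of a position or direction."""
    feats = []
    for k in range(n_freqs):
        feats.append(np.sin(2.0**k * np.pi * v))
        feats.append(np.cos(2.0**k * np.pi * v))
    return np.concatenate(feats, axis=-1)

def toy_field(x):
    """Toy stand-in for the NeRF MLP(s): density and RGB at each sample point.
    A real NeRF would consume the positional encoding and the view direction."""
    sigma = np.exp(-np.sum(x**2, axis=-1))      # a blob of density around the origin
    rgb = 0.5 * (1.0 + np.tanh(x))              # smooth placeholder colors in [0, 1]
    return sigma, rgb

def render_rays(origins, dirs, near=0.0, far=4.0, n_samples=64, rng=None):
    """Discrete volume rendering: transmittance-weighted sum of colors along each ray."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_rays = origins.shape[0]
    # stratified sampling: one jittered depth per bin between near and far
    edges = np.linspace(near, far, n_samples + 1)
    t = edges[:-1] + (edges[1:] - edges[:-1]) * rng.random((n_rays, n_samples))
    pts = origins[:, None, :] + t[..., None] * dirs[:, None, :]

    enc = positional_encoding(pts, n_freqs=10)  # what the real MLP would consume
    sigma, rgb = toy_field(pts)                 # the toy field ignores the encoding
    delta = np.diff(t, axis=-1, append=t[:, -1:] + 1e10)
    alpha = 1.0 - np.exp(-sigma * delta)        # per-interval opacity
    trans = np.cumprod(np.concatenate(
        [np.ones((n_rays, 1)), 1.0 - alpha[:, :-1] + 1e-10], axis=-1), axis=-1)
    weights = trans * alpha                     # contribution of each sample to the pixel
    return np.sum(weights[..., None] * rgb, axis=1)

# one "training" step on a random batch of rays against dummy ground-truth pixels
rng = np.random.default_rng(0)
origins = np.tile(np.array([0.0, 0.0, -2.0]), (1024, 1))
dirs = rng.normal(size=(1024, 3))
dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
gt_colors = rng.random((1024, 3))
pred_colors = render_rays(origins, dirs, rng=rng)
loss = np.sum((pred_colors - gt_colors) ** 2)   # photometric loss over the ray batch
print("photometric loss:", loss)
```

In an actual NeRF model, the stand-in field would be replaced by the trainable MLP(s) and the loss would be backpropagated through the rendering weights; the sketch only illustrates the data flow of panels (a)-(d).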
Certain depth regularization methods [21] [22] [23] [24] use the expected depth to restrict densities to delta-like functions at scene surfaces, or to enforce depth smoothness. For each pixel, a squared-error photometric loss is used to optimize the MLP parameters. Over the entire image, this is given by

L = Σ_{r∈R} ||Ĉ(r) − C_gt(r)||₂²        (8)

where C_gt(r) is the ground truth color of the training image's pixel associated to r, and R is the batch of rays associated to the to-be-synthesized image.

NeRF models often employ positional encoding, which was shown by Mildenhall et al. [1] to greatly improve fine detail reconstruction in the rendered views. This was also shown in more detail, with corroborating theory using Neural Tangent Kernels, in [25]. In the original implementation, the following positional encoding γ was applied to each component of the scene coordinate x (normalized to [-1,1]) and viewing direction unit vector d:

γ(v) = (sin(2⁰πv), cos(2⁰πv), sin(2¹πv), cos(2¹πv), ..., sin(2^{N−1}πv), cos(2^{N−1}πv)),        (9)

where N is a user-determined encoding dimensionality parameter, set to N = 10 for x and N = 4 for d in the original paper. However, recent research has experimented with, and achieved strong results using, alternate forms of positional encoding, including trainable parametric, integral, and hierarchical variants (see Section 3).

2.3 Datasets

NeRF models are trained per-scene. Although some NeRF models are designed to be trained from sparse input views or unposed images, typical NeRF models require relatively dense images with relatively varied poses. The COLMAP [2] library is often used to extract camera poses prior to training when necessary.

The original NeRF paper [1] presented a synthetic dataset created from Blender (referred to as Realistic Synthetic 360° in [1]). The virtual cameras have the same focal length and are placed at the same distance from the object. The dataset is composed of eight scenes with eight different objects. For six of these, viewpoints are sampled from the upper hemisphere; for the two others, viewpoints are sampled from the entire sphere. These objects are "hotdog", "materials", "ficus", "lego", "mic", "drums", "chair", and "ship". The images are rendered at 800×800 pixels, with 100 views for training and 200 views for testing. The "lego" scene was often used for visualization in subsequent NeRF papers.

The LLFF dataset [5] consists of 24 real-life scenes captured with handheld cellphone cameras. The views are forward facing towards the central object. Each scene consists of 20-30 images. The COLMAP package was used to compute the poses of the images.

The DTU dataset [26] is a multi-view stereo dataset captured using a 6-axis industrial robot mounted with both a camera and a structured light scanner. The robot provided precise camera positioning. Both the camera intrinsics and poses are carefully calibrated using the MATLAB calibration toolbox [27]. The light scanner provides dense point clouds which serve as reference 3D geometry. Nonetheless, due to self-occlusion, the scans of certain areas in certain scenes are not complete. The original paper's dataset consists of 80 scenes, each containing 49 views sampled on a sphere of radius 50 cm around the central object. For 21 of these scenes, an additional 15 camera positions are sampled at a radius of 65 cm, for a total of 64 views. The dataset also includes 44 additional scenes that have been rotated and scanned four times at 90-degree intervals. The illumination of the scenes is varied using 16 LEDs, giving seven different lighting conditions. The image resolution is 1600 × 1200.

The ScanNet dataset [28] is a large-scale real-life RGB-D multi-modal dataset containing more than 2.5 million views of indoor scenes, with annotated camera poses, reference 3D surfaces, semantic labels, and CAD models. The depth frames are captured at 640 × 480 pixels, and the RGB images are captured at 1296 × 968 pixels. The scans were performed using RGB-D sensors attached to handheld devices such as an iPhone/iPad. The poses were estimated from BundleFusion [29] and geometric alignment of the resulting mesh.
The Tanks and Temples dataset [30] is a 3D-reconstruction-from-video dataset. It consists of 14 scenes, including individual objects such as "Tank" and "Train", and large scale indoor scenes such as "Auditorium" and "Museum". Ground truth 3D data was captured using a high quality industrial laser scanner. The ground truth point cloud was used to estimate camera poses using least squares optimization of correspondence points.

The ShapeNet dataset [31] is a simplistic large scale synthetic 3D dataset, consisting of 3D CAD models classified into 3,135 classes. The most used subset is the 12 common object categories. This dataset is sometimes used when object-based semantic labels are an important part of a particular NeRF model. From ShapeNet CAD models, software such as Blender is often used to render training views with known poses.

LPIPS [33] is a full reference quality assessment metric which uses learned convolutional features. The score is given by a weighted pixel-wise MSE of feature maps over multiple layers:

LPIPS(x, y) = Σ_l (1 / (H_l W_l)) Σ_{h,w} ||w_l (x^l_{hw} − y^l_{hw})||₂²        (15)

where x^l_{hw}, y^l_{hw} are the reference and assessed images' features at pixel width w, pixel height h, and layer l, and H_l and W_l are the feature map height and width at the corresponding layer. The original LPIPS paper used SqueezeNet [34], VGG [35] and AlexNet [36] as feature extraction backbones. Five layers were used in the original paper. The original authors offered fine-tuned and from-scratch configurations, but in practice, the pretrained networks are used as is.
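To illustrate the structure of (15), the sketch below computes an LPIPS-style distance from per-layer feature maps. The backbone feature extraction (SqueezeNet/VGG/AlexNet in the original paper) is replaced here by random arrays, and the names (lpips_like, the channel weights) are hypothetical; this is a structural sketch, not the released LPIPS implementation.

```python
import numpy as np

def lpips_like(layer_feats_x, layer_feats_y, layer_weights):
    """Weighted per-pixel MSE of unit-normalized feature maps, averaged per layer
    and summed over layers, following the structure of eq. (15)."""
    score = 0.0
    for fx, fy, w in zip(layer_feats_x, layer_feats_y, layer_weights):
        # fx, fy: (H_l, W_l, C_l) activations; w: (C_l,) learned channel weights
        fx = fx / (np.linalg.norm(fx, axis=-1, keepdims=True) + 1e-10)
        fy = fy / (np.linalg.norm(fy, axis=-1, keepdims=True) + 1e-10)
        diff = w * (fx - fy)                        # channel-wise weighting
        score += np.mean(np.sum(diff**2, axis=-1))  # (1 / H_l W_l) * sum over pixels
    return score

# stand-in "backbone features" for a reference image and an assessed image
rng = np.random.default_rng(0)
shapes = [(64, 64, 16), (32, 32, 32), (16, 16, 64)]      # three toy layers
feats_x = [rng.normal(size=s) for s in shapes]
feats_y = [f + 0.1 * rng.normal(size=f.shape) for f in feats_x]
weights = [np.abs(rng.normal(size=s[-1])) for s in shapes]
print("LPIPS-like distance:", lpips_like(feats_x, feats_y, weights))
```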
Fig. 2. Taxonomy of selected key NeRF innovation papers. The papers are selected using a combination of citations and GitHub star rating. We note that the MLP-less speed-based models are not, strictly speaking, NeRF models. Nonetheless, we decided to include them in this taxonomy tree due to their recent popularity and their similarity to speed-based NeRF models. The branches of the tree, rooted at NeRF [1], are:
- Fundamentals (Sec. 3.1), mip-NeRF [37] based: mip-NeRF [37] (2021), Ref-NeRF [38] (2021), RawNeRF [39] (2021)
- Fundamentals (Sec. 3.1), Deformation-based: Nerfies [40] (2020), HyperNeRF [41] (2021), CodeNeRF [42] (2021)
- Non-Baked: NSVF [47] (2020), AutoInt [48] (2020), Instant-NGP [49] (2022)
- MLP-less (Sec. 3.7.1): Plenoxels [54] (2021), DVGO [55] (2021), TensorRF [56] (2021)
- Sparse View (Sec. 3.3), Cost Volume: MVSNeRF [57] (2021), PixelNeRF [58] (2020), NeuRay [59] (2021)
- Sparse View (Sec. 3.3), Others: DietNeRF [60] (2021), DS-NeRF [22] (2021)
- Composition (Sec. 3.5), Foreground/Background: NeRF-W [66] (2020), NeRF++ [68] (2020), GIRAFFE [61] (2020)
- Composition (Sec. 3.5), Semantic/Object Composition: Fig-NeRF [69] (2021), Yang et al. [70] (2021), Semantic-NeRF [71] (2021), Panoptic Neural Fields [72] (2022)
final color. Additionally, they parameterized the directional vector using the spherical harmonics of a vector sampled from a spherical Gaussian parameterized by the roughness. Ref-NeRF outperformed benchmarked methods, including mip-NeRF [37], NSVF [47], baseline NeRF [1], and non-NeRF models, on the Shiny Blender dataset (created by the authors), the original NeRF dataset [1], and Real Captured Scenes from [79]. Ref-NeRF performed particularly well on reflective surfaces, and is able to accurately model specular reflections and highlights.

RegNeRF [21] (December 2021) aimed to solve the problem of NeRF training with sparse input views. Unlike most other methods, which approached this task by using image features from pretrained networks as a prior conditioning for NeRF volume rendering, RegNeRF employed additional depth and color regularization. The model was tested on the DTU [26] and LLFF [5] datasets, outperforming models such as PixelNeRF [58], SRF [80], and MVSNeRF [57]. RegNeRF, which did not require pretraining, achieved comparable performance to these models, which were pre-trained on DTU and fine-tuned per scene. It also outperformed mip-NeRF and DietNeRF [60].

Ray Prior NeRF (RapNeRF) [81] (May 2022) explored a NeRF model better suited for view extrapolation, whereas standard NeRF models were better suited for interpolation. RapNeRF performed Random Ray Casting (RRC) whereby, given a training ray hitting a surface point v = o + t_z d, a backward ray was cast starting from v towards a new origin o′ using uniformly sampled perturbations in angles.
TABLE 1
Comparison of select NeRF models on the synthetic NeRF dataset [1]
Method Positional Encoding Sampling Strategy PSNR (dB) SSIM LPIPS Training Iteration Training Time Inference Speed¹
Baseline NeRF (2020) [1] Fourier [1] H [1]² 31.01 0.947 0.081 100-300k >12h 1
Speed Improvement
JaxNeRF (2020)[78] Fourier H 31.65 0.952 0.051 250k >12h ∼1.3
NSVF (2020) [47] Fourier Occupancy [47] 31.74 0.953 0.047 100-150k - ∼10
SNeRG (2021) [50] Fourier Occupancy [50] 30.38 0.950 0.050 250k >12h ∼9000
PlenOctree (2021) [51] Fourier H 31.71 0.958 0.053 2000k >12h ∼3000
FastNeRF (2021) [52] Fourier H 29.97 0.941 0.053 300k >12h ∼4000
KiloNeRF (2021) [53] Fourier Occupancy [53] 31.00 0.95 0.03 600k+150k+1000k >12h ∼2000
Instant-NGP (2022) [49] Hash [49] Occupancy [49] 33.18 - - 256k ∼5m ”orders of magnitude”
Quality Improvement
mip-NeRF (2021)[37] IPE [37] H*(single MLP) [37] 33.09 0.961 0.043 1000k ∼3h ∼1
ref-NeRF (2021)[38] IPE + IDE [38] H*(single MLP) 35.96 0.967 0.058 250k - ∼1
Sparse View/Few Shots
MVSNeRF (2021)[57] Fourier Uniform 27.07 0.931 0.163 10k (*3 views) ∼15m* ∼1
DietNeRF (2021)[60] Fourier H 23.15 0.866 0.109 200k (*8 views) - ∼1
DS-NeRF (2021)[22] Fourier H 24.9 0.72 0.34 150-200k (*10 views) - ∼1
Speed-based and single-object quality improvement models were selected for benchmarking on the Synthetic NeRF dataset.
For Positional Encoding and Sampling Strategy, unless indicated otherwise with a citation, identical entries denote that the
particular strategy is from the previously cited work.
1. Inference speeds are given as speedup factors over the baseline NeRF.
2. H and H* denote coarse-to-fine hierarchical sampling strategies (a minimal sketch of this procedure follows the table).
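The coarse-to-fine hierarchical sampling denoted H and H* above (footnote 2) can be read as inverse-transform sampling: the coarse pass's per-interval weights along a ray define a piecewise-constant PDF from which the fine pass's sample depths are drawn, concentrating fine samples near surfaces. The single-ray sketch below assumes that standard formulation; function and variable names are illustrative only and are not taken from any particular implementation.

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_fine, rng):
    """Draw fine sample depths from the piecewise-constant PDF defined by
    the coarse pass's per-interval weights (inverse-transform sampling)."""
    pdf = weights / (np.sum(weights) + 1e-10)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_fine)                            # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1   # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    denom = np.maximum(cdf[idx + 1] - cdf[idx], 1e-10)
    frac = (u - cdf[idx]) / denom                     # position within the bin
    lo, hi = bin_edges[idx], bin_edges[idx + 1]
    return lo + frac * (hi - lo)

rng = np.random.default_rng(0)
n_coarse = 64
edges = np.linspace(2.0, 6.0, n_coarse + 1)           # coarse bins along the ray
centers = 0.5 * (edges[:-1] + edges[1:])
coarse_weights = np.exp(-((centers - 4.0) ** 2) / 0.05)   # pretend a surface sits near t = 4
t_fine = np.sort(sample_pdf(edges, coarse_weights, n_fine=128, rng=rng))
# t_fine now clusters around t = 4; the fine network would be evaluated at these depths
print(t_fine.min(), np.median(t_fine), t_fine.max())
```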
RapNeRF also made use of a Ray Atlas (RA), by first extracting a rough 3D mesh from a pretrained NeRF, and mapping training ray directions onto the 3D vertices. During training, a baseline NeRF was first trained to recover the rough 3D mesh; then RRC and RA were used to augment the training rays with a predetermined probability. The authors evaluated their methods on the Synthetic NeRF dataset [1] and their own MobileObject dataset, showing that their RRC and RA augmentations can be adapted to other NeRF frameworks, and that they resulted in better view synthesis quality.

3.1.1 Deformation Fields
Park et al. introduced Nerfies [40] (November 2020), a NeRF model built using a deformation field, which strongly improved the performance of their model in the presence of non-rigid transformations in the scene (e.g., a dynamic scene). By introducing an additional MLP which mapped input observation frame coordinates to deformed canonical coordinates, and by adding elastic regularization, background regularization, and coarse-to-fine deformation regularization via adaptive masking of the positional encoding, they were able to accurately reconstruct certain non-static scenes which the baseline NeRF completely failed to do. An interesting application the authors found was the creation of multi-view "selfies"². Concurrent to Nerfies was NerFace [82] (December 2020), which also used per-frame learned latent codes, and added facial expression as a 76-dimensional coefficient of a morphable model constructed from Face2Face [83].

2. Popular self-portraits shared on social media.

Park et al. introduced HyperNeRF [41] (June 2021), which built on Nerfies by extending the canonical space to a higher dimension, and adding an additional slicing MLP which describes how to return to the 3D representation using ambient space coordinates. The canonical coordinates and ambient space coordinates were then used to condition the usual density and color MLPs of baseline NeRF models. HyperNeRF achieved great results in synthesizing views of scenes with topological changes, with examples such as a human opening their mouth, or a banana being peeled.

CoNeRF [84] (December 2021) was built on HyperNeRF, but allowed for easily controllable photo editing via sliders, whose values are provided to a per-attribute Hypermap deformation field, parameterized by an MLP. This is done via sparse supervised annotation of slider values and image patch masks, with an L2 loss term for the slider attribute values, and a cross entropy loss for mask supervision. CoNeRF achieved good results, using sliders to adjust facial expressions in their example dataset, which could have broad commercial applications for virtual human avatars.

3.1.2 Depth Supervision and Point Cloud Methods
By supervising expected depth (6) with point clouds acquired from LiDAR or SfM, these models converge faster, converge to higher final quality, and require fewer training views than the baseline NeRF model. Many of these models were also built as few-shot/sparse-view NeRFs.

Deng et al. [22] (July 2021) used depth supervision from point clouds with a method named Depth-Supervised NeRF (DS-NeRF). In addition to color supervision via volume rendering and photometric loss, DS-NeRF also performs depth supervision using sparse point clouds extracted from the training images using COLMAP [2]. Depth is modelled as a normal distribution around the depth recorded by the sparse point cloud. A KL divergence term is added to minimize the divergence between the ray's distribution and this noisy depth distribution (see [22] for details). DS-NeRF was extensively tested on the DTU dataset [26], the NeRF dataset [1], and the Redwood-3dscan dataset [85], outperforming benchmark methods such as baseline NeRF [1], pixelNeRF [58] and MVSNeRF [57].

Concurrent to DS-NeRF is a work by Roessle et al. [43] (April 2021). In this work, the authors used COLMAP to extract a sparse point cloud, which was processed by a Depth Completion Network [86] to produce depth and uncertainty maps. In addition to the standard volumetric loss, the authors introduced a depth loss based on predicted
depth and uncertainty. The model was trained on RGB-D data from ScanNet [28] and Matterport3D [87] by introducing Gaussian noise to the depth. The model outperformed DS-NeRF [22] marginally, and significantly outperformed baseline NeRF [1] and NerfingMVS [44].

NerfingMVS [44] (September 2021) used multi-view images in a NeRF model focused on depth reconstruction. In NerfingMVS, COLMAP was used to extract sparse depth priors in the form of a point cloud. This was then fed into a pretrained (fine-tuned on the scene) monocular depth network [88] to extract a depth map prior. This depth map prior was used to supervise volume sampling by only allowing sampling points at the appropriate depth. During volume rendering, the ray was divided into N equal bins, with the ray bounds clamped using the depth priors. The depth value D of a pixel was approximated by a modified version of (4). NerfingMVS outperformed previous methods on the ScanNet [28] dataset for depth estimation.

PointNeRF [46] (January 2022) used feature point clouds as an intermediate step to volume rendering. A pretrained 3D CNN [89] was used to generate depth and surface probability γ from a cost volume created from training views, and produced a dense point cloud. A pretrained 2D CNN [35] was used to extract image features from training views. These were used to populate the point cloud features with image features, and the probability γ_i of point p_i lying on a surface. Given the input position and view direction, a PointNet [90]-like network was used to regress local density and color, which were then used for volume rendering. Using point cloud features also allowed the model to skip empty spaces, resulting in a speed-up by a factor of 3 over baseline NeRF. PointNeRF outperformed methods such as PixelNeRF [58], MVSNeRF [57], and IBRNet [91] after per-scene optimization in the form of point growing and point pruning on the DTU dataset [26]. Point clouds acquired by other methods such as COLMAP can also be used in place of the 3D depth network-based point cloud, whereby per-scene optimization could be used to improve point cloud quality.

3.2 Improvements to Training and Inference Speed

In the original implementation by Mildenhall et al. [1], a hierarchical rendering scheme was used to improve computation efficiency. A naive rendering would require densely evaluating MLPs at all query points along each camera ray during the numerical integration (2). In their proposed method, they used two networks to represent the scene, one coarse and one fine. The output of the coarse network was used to pick sampling points for the fine network, which prevented dense sampling at a fine scale. In subsequent works, most attempts to improve NeRF training and inference speed can be broadly classified into the two following categories.

1) The first category trains, precomputes, and stores NeRF MLP evaluation results in more easily accessible data structures. This only improves inference speed, albeit by a large factor. We refer to these models as baked models.
2) The second category are the non-baked models. These include multiple types of innovations. These models commonly (but not always) attempt to learn separate scene features from the learned MLPs' parameters, which in turn allows for smaller MLPs (e.g., learning and storing features in a voxel grid, which are then fed into MLPs which produce color and density), and which can improve both training and inference speed at the cost of memory.

Other techniques, such as ray termination (stopping further sampling when the accumulated transmittance approaches zero), empty space skipping, and/or hierarchical sampling (the coarse+fine MLPs used in the original NeRF paper), are also often used in conjunction to further improve training and inference speed.

A popular early re-implementation of the original NeRF in JAX [92], called JaxNeRF [78] (December 2020), was often used as a benchmark comparison. This model was slightly faster and more suited for distributed computing than the original TensorFlow implementation.

In addition, a recent trend (CVPR 2022, 2022 preprints) introduced multiple NeRF-adjacent methods which are based on category 2), using learned voxel/tree features. However, these methods skip the MLPs entirely and perform volume rendering directly on the learned features. These are introduced in a later section (3.7.1) since they are, not strictly speaking, NeRF models.

3.2.1 Non-Baked
In Neural Sparse Voxel Fields (NSVF) (July 2020), Liu et al. [47] developed a voxel-based NeRF model which models the scene as a set of radiance fields bounded by voxels. Feature representations were obtained by interpolating learnable features stored at voxel vertices, which were then processed by a shared MLP which computed σ and c. NSVF used a sparse voxel intersection-based point sampling for rays, which was much more efficient than dense sampling or the hierarchical two-step approach of Mildenhall et al. [1]. However, this approach was more memory intensive due to storing feature vectors on a potentially dense voxel grid.

AutoInt [48] (December 2020) approximates the volume rendering step. The discrete volume rendering equation (4) is separated piecewise; the authors' newly developed AutoInt framework then trains the MLP Φ_θ by training its gradient (grad) networks Ψ^i_θ, which share internal parameters with, and are used to reassemble, the integral network Φ_θ. This allowed the rendering step to use far fewer samples, resulting in a ten-fold speed-up over baseline NeRF with a slight quality decrease.

Deterministic Integration for Volume Rendering (DIVeR) [93] (November 2021) took inspiration from NSVF [47], also jointly optimizing a feature voxel grid and a decoder MLP while performing sparsity regularization and voxel culling. However, it innovated on the volume rendering, using a technique unlike other NeRF methods: DIVeR performed deterministic ray sampling on the voxel grid, which produced an integrated feature for each ray interval (defined by the intersection of the ray with a particular voxel), which was then decoded by an MLP to produce the density and color of the ray interval. This essentially reversed the usual order of volume sampling and MLP evaluation. The method was evaluated on the NeRF Synthetic [1], BlendedMVS [94] and Tanks and Temples [30] datasets, outperforming methods
such as PlenOctrees [51], FastNeRF [52] and KiloNeRF [53] in terms of quality, at a comparable rendering speed.

A recent innovation by Müller et al., dubbed Instant Neural Graphics Primitives (Instant-NGP) [49] (January 2022), greatly improved NeRF model training and inference speed. The authors proposed a learned parametric multi-resolution hash encoding that was trained simultaneously with the NeRF model MLPs. They also employed advanced ray marching techniques including exponential stepping, empty space skipping, and sample compaction. This new positional encoding and the associated optimized MLP implementation greatly improved training and inference speed, as well as the scene reconstruction accuracy of the resulting NeRF model. Within seconds of training, they achieved results similar to hours of training in previous NeRF models.

3.2.2 Baked
A model by Hedman et al. [50] (July 2020) stored a precomputed NeRF on a sparse voxel grid. The method, called Sparse Neural Voxel Grid (SNeRG), stored precomputed diffuse color, density, and feature vectors on a sparse voxel grid, in a process sometimes referred to as "baking". At evaluation time, an MLP was used to produce the specular color, which, combined with the diffuse colors alpha-composited along the ray, produced the final pixel color. The method was 3000 times faster than the original implementation, with speed comparable to PlenOctree.

Concurrent to SNeRG, the PlenOctree [51] (March 2021) approach of Yu et al. achieved an inference time that was 3000 times faster than the original implementation. The authors trained a spherical harmonic NeRF (NeRF-SH), which instead of predicting the color function, predicted its spherical harmonic coefficients. The authors built an octree of precomputed spherical harmonic coefficients of the color MLP. During the building of the octree, the scene was first voxelized, with low-transmissivity voxels eliminated. This procedure could also be applied to standard NeRFs (non NeRF-SH models) by performing Monte Carlo estimation of the spherical harmonic components of the NeRF. PlenOctrees could be further optimized using the initial training images. This fine-tuning procedure was fast relative to the NeRF training.

In FastNeRF [52] (March 2021), Garbin et al. factorized the color function c into the inner product of the output of a position-dependent MLP (which also produces the density σ) and the output of a direction-dependent MLP. This allowed FastNeRF to easily cache color and density evaluations in a dense grid over the scene, which greatly improved inference time, by a factor of 3000+. They also included hardware-accelerated ray tracing [95], which skipped empty spaces and stopped when the ray's transmittance was saturated.

Reiser et al. [53] (May 2021) improved on the baseline NeRF by introducing KiloNeRF, which separated the scene into thousands of cells and trained independent MLPs for color and density prediction on each cell. These thousands of small MLPs were trained using knowledge distillation from a large pretrained teacher MLP, which we find closely related to "baking". They also employed early ray termination and empty space skipping. These two methods alone improved on the baseline NeRF's render time by a factor of 71. Separating the baseline NeRF's MLP into thousands of smaller MLPs further improved render time by a factor of 36, resulting in a total 2000-fold speed-up in render time.

The Fourier PlenOctree [96] (February 2022) approach was proposed by Wang et al. It was built for human silhouette rendering, since it used the domain-specific technique of Shape-From-Silhouette. The approach also takes inspiration from generalizable image-conditioned NeRFs such as [57] and [58]. It first constructed a coarse visual hull using sparse views predicted from a generalizable NeRF and Shape-From-Silhouette. Then colors and densities were densely sampled inside this hull and stored in a coarse PlenOctree. Dense views were sampled from the PlenOctree, with transmissivity thresholding used to eliminate most empty points. For the remaining points, new leaf densities and SH color coefficients were generated and the PlenOctree was updated. Then a Fourier Transform MLP was used to extract Fourier coefficients of the density and SH color coefficients, which were fed into an inverse discrete Fourier transform to restore the SH coefficients and density. According to the authors, using the frequency domain helped the model encode time-dependent information for dynamic scenes, such as the moving silhouettes modelled in the paper. The SH coefficients were then used to restore color. The Fourier PlenOctree can be fine-tuned on a per-scene basis using the standard photometric loss (8).

A recent preprint (June 2022) proposed a lightweight method, MobileNeRF [97]. During training, MobileNeRF trains a NeRF-like model based on a polygonal mesh with color, feature, and opacity MLPs attached to each mesh point. Alpha values were then discretized, and features were super-sampled for anti-aliasing. During rendering, the mesh with associated features and opacities is rasterized based on viewing position, and a small MLP is used to shade each pixel. The method was shown to be around 10 times faster than SNeRG [50].

EfficientNeRF [98] (July 2022) was based on PlenOctree [51], choosing to use spherical harmonics and to cache the trained scene in a tree. However, it made several improvements. Most importantly, EfficientNeRF improved training speed by using a momentum density voxel grid, which stores predicted densities via an exponentially weighted average update. During the coarse sampling stage, the grid was used to discard sampling points with zero density. During the fine sampling stage, a pivot system was also used to speed up volume rendering. Pivot points were defined as points x_i for which T_i α_i > ε, where ε is a predefined threshold, and T_i and α_i are the transmittance and alpha values as defined in (4) and (5). During fine sampling, only points near the pivot points are considered. These two improvements sped up training by a factor of 8 over the baseline NeRF [1]. The authors then cached the trained scene into a NeRF tree. This resulted in rendering speed comparable to FastNeRF [52], and exceeding that of baseline NeRF by a factor of several thousand.

3.3 Few Shot/Sparse Training View NeRF
In pixelNeRF [58] (December 2020), Yu et al. used the pretrained layers of a convolutional neural network (and bilinear interpolation) to extract image features. Camera
rays used in NeRF were then projected onto the image plane, and the image features were extracted for each query point. The features, view direction, and query points were then passed to the NeRF network, which produced density and color. General Radiance Field (GRF) [99] (October 2020) by Trevithick et al. took a similar approach, with the key difference being that GRF operated in canonical space, as opposed to view space for pixelNeRF.

MVSNeRF [57] (March 2021) used a slightly different approach. They also extracted 2D image features using a pretrained CNN. These 2D features were then mapped to a 3D voxelized cost volume using plane sweeping and a variance-based cost. A pretrained 3D CNN was used to extract a 3D neural encoding volume, which was used to generate per-point latent codes via interpolation. When performing point sampling for volume rendering, the NeRF MLP then generated point density and color using these latent features, the point coordinate, and the viewing direction as input. The training procedure involves the joint optimization of the 3D feature volume and the NeRF MLP. When evaluating on the DTU dataset, within 15 minutes of training, MVSNeRF could achieve results similar to hours of baseline NeRF training.

DietNeRF [60] (June 2021) introduced the semantic consistency loss L_sc based on image features extracted from CLIP-ViT [100], in addition to the standard photometric loss:

L_sc = (λ/2) ||φ(I) − φ(Î)||₂²        (19)

where φ performs the CLIP-ViT feature extraction on training image I and rendered image Î. This reduces to a cosine similarity loss for normalized feature vectors (eq. 5 in [60]). DietNeRF was benchmarked on a subsampled NeRF synthetic dataset [1] and the DTU dataset [26]. The best performing method for single-view novel view synthesis was a pixelNeRF [58] model fine-tuned using the semantic consistency loss of DietNeRF.

The Neural Rays (NeuRay) approach by Liu et al. [59] (July 2021) also used a cost volume approach. From all input views, the authors estimated cost volumes (or depth maps) using multi-view stereo algorithms. From these, a CNN is used to create feature maps G. During volume rendering, both visibility and local features are extracted from these features and processed using MLPs to extract color and alpha. The visibility is computed as a cumulative density function written as a weighted sum of sigmoid functions Φ:

v(z) = 1 − t(z),  where  t(z) = Σ_{i=1}^N w_i Φ((z − µ_i)/σ_i)        (20)

where w_i, µ_i, σ_i are decoded from G using an MLP. NeuRay also used an alpha-based sampling strategy, by computing a hitting probability and only sampling around points with a high hitting probability (see Sec. 3.6 in [59] for details). Like other NeRF models conditioned on image features extracted from pre-trained neural networks, NeuRay generalizes well to new scenes, and can be further fine-tuned to exceed the performance of the baseline NeRF model. NeuRay outperformed MVSNeRF on the NeRF synthetic dataset after fine-tuning both models for 30 minutes.

GeoNeRF [101] (November 2021) extracted 2D image features from every view using a pretrained Feature Pyramid Network. This method then constructed a cascading 3D cost volume using plane sweeping. From these two feature representations, for each of the N query points along a ray, one view-independent and multiple view-dependent feature tokens were extracted. These were refined using a Transformer [102]. Then, the N view-independent tokens were refined through an autoencoder, which returned the N densities along the ray. The N sets of view-dependent tokens were each fed into an MLP which extracted color. These networks can all be pretrained and generalize well to new scenes, as shown by the authors. Moreover, they can be fine-tuned per scene, achieving great results on the DTU [26], NeRF synthetic [1], and LLFF forward-facing [5] datasets, outperforming methods such as pixelNeRF [58] and MVSNeRF [57].

Concurrent to GeoNeRF is LOLNeRF (November 2021) [103], which is capable of single-shot view synthesis of human faces, and is built similarly to π-GAN [63], but uses Generative Latent Optimization [104] instead of adversarial training [105].

NeRFusion (March 2022) also extracted a 3D cost volume from 2D image features extracted by a CNN, which was then processed by a sparse 3D CNN into a local feature volume. However, this method performs this step for each frame, and then used a GRU [106] to fuse these local feature volumes into a global feature volume, which was used to condition the density and color MLPs. NeRFusion outperformed IBRNet, baseline NeRF [1], NerfingMVS [44], and MVSNeRF [57] on the ScanNet [28], DTU [26], and NeRF Synthetic [1] datasets.

AutoRF [107] (April 2022) focused on novel view synthesis of objects without background. Given 2D multi-view images, a 3D object detection algorithm was used with panoptic segmentation to extract 3D bounding boxes and object masks. The bounding boxes were used to define Normalized Object Coordinate Spaces, which were used for per-object volume rendering. An encoder CNN was used to extract appearance and shape codes, which were used in the same way as in GRAF [62]. In addition to the standard photometric loss (8), an additional occupancy loss was defined as

L_occ = (1/|W_occ|) Σ_{u∈W_occ} log(Y_u (1/2 − α) + 1/2)        (21)

where Y is the object mask, and W_occ is either the set of foreground pixels or the set of background pixels. During test time, the shape codes, appearance code, and bounding boxes were further refined using the same loss function. The method outperformed pixelNeRF [58] and CodeNeRF [42] for object-based novel view synthesis on the nuScenes [108], SRN-Cars [6] and KITTI [109] datasets.

SinNeRF [24] attempted NeRF scene reconstruction from single images by integrating multiple techniques. They used image warping and known camera intrinsics and poses to create reference depth for depth supervision of unseen views. They used adversarial training with a CNN discriminator to provide patch-wise texture supervision. They also use a pretrained ViT to extract global image features from the reference patch and unseen patches, comparing them with an
L2 loss term as a global structure prior. SinNeRF outperformed DS-NeRF [22], PixelNeRF [58], and DietNeRF [60] on the NeRF synthetic dataset [1], the DTU dataset [26] and the LLFF dataset [5].

Methods such as [22] and [43] from Section 3.1.2 approach the sparse view problem by using point clouds for depth supervision.

3.4 (Latent) Conditional NeRF

Latent conditioning of NeRF models refers to using latent vector(s)/code(s) to control various aspects of NeRF view synthesis. These latent vectors can be input at various points along the pipeline to control scene composition, shape, and appearance. They allow for an additional set of parameters to control aspects of the scene which change image-to-image, while allowing other parts to account for the permanent aspects of the scene, such as scene geometry. A fairly simple way to train image generators conditioned on latent codes is to use Variational Auto-Encoder (VAE) [110] methods. These models use an Encoder, which turns images into latent codes following a particular probability distribution defined by the user, and a Decoder, which turns sampled latent codes back into images. These methods are not as often used in NeRF models as the two following methods; as such, we do not introduce a separate subsection for VAEs.

In NeRF-VAE [65] (January 2021), Kosiorek et al. proposed a generative NeRF model which generalized well to out-of-distribution scenes, and removed the need to train on each scene from scratch. The NeRF renderer in NeRF-VAE was conditioned on a latent code which was trained using Iterative Amortized Inference [111][112] and a ResNet [113] based encoder. The authors also introduced an attention-based scene function (as opposed to the typical MLP). NeRF-VAE consistently outperformed the baseline NeRF with a low number (5-20) of scene views, but due to lower scene expressivity, was outperformed by baseline NeRF when a large number of views was available (100+).

Adversarial training is often used for generative and/or latent-conditioned NeRF models. First developed in 2014, Generative Adversarial Networks (GANs) [105] are generative models which employ a generator G, which synthesizes images from "latent code/noise", and a discriminator D, which classifies images as "synthesized" or "real". The generator seeks to "trick" the discriminator and make its images indistinguishable from "real" training images. The discriminator seeks to maximize its classification accuracy. These two networks are trained adversarially, which is the optimization of the following minimax loss/value function:

min_G max_D  E_{x∼data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]        (22)

where the generator generates images based on latent code z sampled from some distribution p(z), which the discriminator compares to training image x. In GAN-based NeRF models, the generator G encompasses all novel-view synthesis steps, and is traditionally thought of as the NeRF model. The generator in this case also requires an input pose, in addition to a latent code. The discriminator D is usually an image classification CNN.

Another form of latent conditioning used by NeRF models is Generative Latent Optimization (GLO) [104]. In GLO, a set of randomly sampled latent codes {z_1, ..., z_n}, usually normally distributed, is paired to a set of images {I_1, ..., I_n}. These latent codes are input to a generator G whose parameters are jointly optimized with the latent codes using some reconstruction loss L such as L2. That is, the optimization is formulated as

min_{G, z_1, ..., z_n} Σ_{i=1}^n L(G(z_i, u_i), I_i)        (23)

where u_i represents the other inputs not optimized over (needed in NeRF but not necessarily for other models). According to the GLO authors, this method can be thought of as a discriminator-less GAN. (A minimal code sketch of this joint optimization is given just before Section 3.4.2.)

3.4.1 Adversarially Trained Models
GRAF [62] (July 2020) was the first latent-conditioned NeRF model trained adversarially. It paved the way for many later works. A NeRF-based generator was conditioned on a latent appearance code z_a and shape code z_s, and is given by

G(γ(x), γ(d), z_s, z_a) → (σ, c).        (24)

In practice, the shape code, conditioning scene density, was concatenated with the embedded position, which was input to the direction-independent MLP. The appearance code, conditioning scene radiance, was concatenated with the embedded viewing direction, which was input to the direction-dependent MLP. As per baseline NeRF, images were generated via volume sampling. These were then compared using a discriminator CNN for adversarial training.

Within three months of GRAF, Chan et al. developed π-GAN [63] (December 2020), which also used a GAN approach to train a conditional NeRF model. The generator was a SIREN-based [114] NeRF volumetric renderer, with sinusoidal activations replacing the standard ReLU activations in the density and color MLPs. π-GAN outperformed GRAF [62] and HoloGAN on standard GAN datasets such as CelebA [115], CARLA [116] and Cats [117].

Pix2NeRF [118] (February 2022) was proposed as an adversarially trained NeRF model which could generate NeRF-rendered images given randomly sampled latent codes and poses. Built from π-GAN [63], it is composed of a NeRF-based generator G : d, z → I, a CNN discriminator D : I → d, l, and an Encoder E : I → d, z. Here z is a latent code sampled from a multi-variate distribution, d is a pose, I is a generated image, and l is a binary label for real vs. synthesized images. In addition to the π-GAN loss, on which the adversarial architecture is based, the Pix2NeRF loss function also includes the following: 1) a reconstruction loss comparing z_predicted and z_sampled to ensure consistency of the latent space; 2) a reconstruction loss ensuring image reconstruction quality, between I_real and I_reconstructed, where I_reconstructed is created by the Generator from a (z_pred, d_pred) pair produced by the Encoder; and 3) a conditional adversarial objective which prevents mode collapse towards trivial poses (see the original paper for the exact expressions). The model achieved good results on the CelebA dataset [115], the CARLA dataset [116], and the ShapeNet subclasses from SRN [119], but is outperformed by its backbone π-GAN for conditional image synthesis.
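Before turning to the jointly optimized latent models below, the GLO objective in (23) can be made concrete with a toy sketch: a shared "generator" (here just a linear map) and one latent code per image are optimized together by gradient descent on an L2 reconstruction loss. The linear generator, the shapes, and the plain gradient updates are simplifying assumptions for illustration; NeRF models apply the same idea to MLP weights, latent codes, and a rendering loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n_imgs, latent_dim, img_dim = 8, 4, 32

# toy "dataset": each flattened image is generated from an unknown linear map
true_W = rng.normal(size=(img_dim, latent_dim))
images = (true_W @ rng.normal(size=(latent_dim, n_imgs))).T

# jointly optimized quantities: generator parameters W and one code z_i per image
W = 0.1 * rng.normal(size=(img_dim, latent_dim))
Z = 0.1 * rng.normal(size=(n_imgs, latent_dim))
lr = 1e-2

for step in range(2000):
    recon = Z @ W.T                     # G(z_i) for every image
    err = recon - images                # residual of 0.5 * ||G(z_i) - I_i||^2
    grad_W = err.T @ Z / n_imgs         # gradient w.r.t. shared generator parameters
    grad_Z = err @ W / n_imgs           # gradient w.r.t. per-image latent codes
    W -= lr * grad_W
    Z -= lr * grad_Z

loss = 0.5 * np.mean(np.sum((Z @ W.T - images) ** 2, axis=1))
print("final reconstruction loss:", loss)
```

The key design point is that no discriminator or encoder is needed: each image "finds" its own code through the same gradient descent that trains the generator.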
3.4.2 Jointly Optimized Latent Models
Edit-NeRF [67] (June 2021) allowed for scene editing using image conditioning from user input. Edit-NeRF's shape representation was composed of a category-specific shared shape network F_shared, and an instance-specific shape network F_inst. F_inst was conditioned on z_s, whereas F_shared was not. In theory, F_shared behaved as a deformation field, not unlike [40]. The NeRF editing was formulated as a joint optimization problem of both the NeRF network parameters and the latent codes z_s, z_a, using GLO. They then conducted NeRF photometric loss optimization on the latent codes, then on the MLP weights, and finally optimized both latent codes and weights jointly.

Innovating on Edit-NeRF, CLIP-NeRF's [13] (December 2021) neural radiance field was based on the standard latent-conditioned NeRF, i.e., NeRF models conditioned on shape and appearance latent codes. However, by using Contrastive Language Image Pre-training (CLIP), CLIP-NeRF could extract, from user input text or images, the induced latent space displacements using shape and appearance mapper networks. These displacements could then be used to modify the scene's NeRF representation based on these input text or images. This step allowed for skipping the per-edit latent code optimization used in Edit-NeRF, resulting in a speed-up by a factor of ∼8-60, depending on the task. They also used a deformation network similar to deformable NeRFs [40] (called the instance-specific network in Edit-NeRF [67]) to help with modifying the scene based on latent space displacements. The model was trained in two stages. In the first, CLIP-NeRF's conditional latent NeRF model was trained adversarially. In the second, the CLIP shape and appearance mappers were trained using self-supervision. When applying CLIP-NeRF to images with unknown pose and latent codes, the authors used an Expectation-Maximization algorithm, which then allowed them to use CLIP-NeRF latent code editing. CLIP-NeRF outperformed Edit-NeRF in terms of inference speed and post-edit metrics, especially for global edits, which were a weakness of Edit-NeRF.

3.5 Unbound Scene and Scene Composition

In NeRF in the Wild (NeRF-W) [66] (August 2020), Martin-Brualla et al. addressed two key issues of baseline NeRF models. Real-life photographs of the same scene can contain per-image appearance variations due to lighting conditions, as well as transient objects which are different in each image. The density MLP was kept fixed for all images in a scene. However, NeRF-W conditioned its color MLP on a per-image appearance embedding. Moreover, another MLP conditioned on a per-image transient embedding predicted the color and density functions of transient objects. These latent embeddings were constructed using Generative Latent Optimization. NeRF-W did not improve on NeRF in terms of rendering speed, but achieved much better results on the crowded Phototourism dataset [120].

Zhang et al. developed the NeRF++ [68] (October 2020) model, which was adapted to generate novel views for unbound scenes by separating the scene using a sphere. The inside of the sphere contained all foreground objects and all fictitious camera views, whereas the background was outside the sphere. The outside of the sphere was then reparameterized as an inverted sphere space. Two separate NeRF models were trained, one for the inside of the sphere and one for the outside. The camera ray integral was also evaluated in two parts. Using this approach, they outperformed the baseline NeRF on Tanks and Temples [30] scenes as well as scenes from Yucer et al. [121].

GIRAFFE [61] (November 2020) was also built with an approach similar to NeRF-W, using generative latent codes and separating background and foreground MLPs for scene composition. GIRAFFE was based on GRAF. It assigned an MLP to each object in the scene, which produced a scalar density and a deep feature vector (replacing color). These MLPs (with shared architecture and weights) took as input shape and appearance latent vectors, as well as an input pose. The background was treated like all other objects, except with its own MLP and weights. The scene was then composed using a density-weighted sum of features. A small 2D feature map was then created from this 3D volume feature field using volume rendering, which was fed into an upsampling CNN to produce an image. GIRAFFE performed adversarial training using this synthesized image and a 2D CNN discriminator. The resulting model had a disentangled latent space, allowing for fine control over the scene generation.

Fig-NeRF [69] (April 2021) also took on scene composition, but focused on object interpolation and amodal segmentation. They used two separate NeRF models, one for the foreground and one for the background. Their foreground model was the deformable Nerfies model [40]. Their background model was an appearance latent code conditioned NeRF. They used two photometric losses, one for the foreground and one for the background. Fig-NeRF achieved good results for amodal segmentation and object interpolation on datasets such as ShapeNet [31], Gelato [122], and Objectron [123].

Yang et al. [70] (September 2021) created a composition model which can edit objects within the scene. They used a voxel-based approach [47], creating a voxel grid of features which is jointly optimized with the MLP parameters. They used two different NeRFs, one for objects and one for the scene, both of which were conditioned on interpolated voxel features. The object NeRF was further conditioned on a set of object activation latent codes. Their method was trained and evaluated on ScanNet [28] as well as an in-house ToyDesk dataset with instance segmentation labels. They incorporated segmentation labels with a mask loss term given by

w(r)_k ||Ô(r)_k − M(r)||₂²        (25)

where Ô(r)_k = Σ_{i=1}^N T_i α_i is the 2D object opacity, M(r) is the kth object mask of 0s and 1s, and w(r)_k is the balance weight between 0s and 1s in the mask. The authors edited objects within the scene by first obtaining the background colors and densities from the scene NeRF branch, pruning sample points at the object's location. Then the object's colors and densities are obtained from the object NeRF, and transformed according to user-defined manipulations. Finally, all colors and densities are aggregated according to the discrete volume rendering equation (4). The authors' method outperformed baseline NeRF [1] as well as NSVF [47] on both their in-house dataset and ScanNet.
NeRFReN [23] (November 2021) addressed the problem of reflective surfaces in NeRF view synthesis. The authors separated the radiance field into two components, transmitted (σ^t, c^t) and reflected (σ^r, c^r), with the final pixel value given by

I = I_t + β I_r        (26)

where β is the reflection fraction given by the geometry of the transmitted radiance field as

β = Σ_i T_{σ_i^t} (1 − exp(−σ_i^t δ_i)) α_i.        (27)

T_{σ_i^t} is given by (3), and α_i by (5). In addition to the standard photometric loss, the authors used a depth smoothness loss L_d (eq. 8 in [23]) to encourage the transmitted radiance field to have the correct geometry, and likewise, a bidirectional depth consistency loss L_bdc (eq. 10 in [23]) for the reflected radiance field. NeRFReN was able to render reflective surfaces on the authors' RFFR dataset, outperforming benchmark methods such as baseline NeRF [1] and NerfingMVS [44], as well as ablation models. The method was shown to support scene editing via reflection removal and reflection substitution.

Panoptic Neural Fields [72] separates the scene into background and objects. The background is represented by an MLP which outputs color, density, and semantic label. Each object's color and density are represented by its own MLP with a dynamic bounding box, foregoing the traditional approach of using a shared MLP with object-specific latent codes. The method is capable of a wide variety of computer vision tasks, such as 2D panoptic segmentation, 2D depth prediction, and scene editing, as well as the standard view synthesis and 3D reconstruction.

3.6 Pose Estimation

iNeRF [124] (December 2020) formulated pose reconstruction as an inverse problem. Given a pre-trained NeRF and using the photometric loss (8), Yen-Chen et al. optimized the pose instead of the network parameters. The authors used an interest-point detector and performed interest region-based sampling. The authors also performed semi-supervision experiments, where they used iNeRF pose estimation on unposed training images to augment the NeRF training set, and further trained the forward NeRF. This semi-supervision was shown by the authors to reduce the requirement of posed photos for the forward NeRF by 25%.

NeRF– [75] (February 2021), in contrast, jointly estimated NeRF model parameters and camera parameters. This allowed the model to construct radiance fields and synthesize novel views from images alone, in an end-to-end manner. NeRF– overall achieved results comparable to using COLMAP with the 2020 NeRF model in terms of view synthesis. However, due to limitations with pose initialization, NeRF– was most suited for front-facing scenes, and struggled with rotational motion and object tracking movements.

Concurrent to NeRF– was the Bundle-Adjusted Neural Radiance Field (BARF) [76] (April 2021), which also jointly estimated poses alongside the training of the neural radiance field. BARF also used a coarse-to-fine registration by adaptively masking the positional encoding, similar to the technique used in Nerfies [40]. Overall, BARF's results exceeded those of NeRF– on the LLFF forward-facing scenes dataset with unknown camera poses by 1.49 PSNR averaged over the eight scenes, and outperformed COLMAP-registered baseline NeRF by 0.45 PSNR. Both BARF and NeRF– used naive dense ray sampling for simplicity.

Jeong et al. introduced a self-calibrating joint optimization model for NeRF (SCNeRF) [77] (August 2021). Their camera calibration model can not only optimize unknown poses, but also camera intrinsics for non-linear camera models such as fish-eye lenses. By using curriculum learning, they gradually introduce the nonlinear camera/noise parameters to the joint optimization. This camera optimization model was also modular and could be easily used with different NeRF models. The method outperformed BARF [76] on LLFF scenes [5].

Sucar et al. introduced the first NeRF-based dense online SLAM model, named iMAP [73] (March 2021). The model jointly optimizes the camera pose and the implicit scene representation in the form of a NeRF model, making use of continual online learning. They used an iterative two-step approach of tracking (pose optimization with respect to the NeRF) and mapping (bundle-adjustment joint optimization of pose and NeRF model parameters). iMAP achieved a pose tracking speed close to the camera framerate by running the much faster tracking step in parallel to the mapping process. iMAP also used keyframe selection by optimizing the scene on a sparse and incrementally selected set of images.

GNeRF, a different type of approach by Meng et al. [64] (March 2021), used pose as a generative latent code. GNeRF first obtains coarse camera poses and a radiance field with adversarial training. This is done by using a generator which takes a randomly sampled pose and synthesizes a view using NeRF-style rendering. A discriminator then compared the rendered view with the training image. An inversion network then took the generated image and output a pose, which was compared to the sampled pose. This resulted in a coarse image-pose pairing. The images and poses were then jointly refined via a photometric loss in a hybrid optimization scheme. GNeRF was slightly outperformed by COLMAP-based NeRF on the Synthetic-NeRF dataset, and outperformed COLMAP-based NeRF on the DTU dataset.

Building on iMAP, NICE-SLAM [74] (December 2021) improved on various aspects such as keyframe selection and NeRF architecture. Specifically, they used a hierarchical grid-based representation of the scene geometry, which was able to fill in gaps in the iMAP reconstruction for large-scale unobserved scene features (walls, floors, etc.) in certain scenes. NICE-SLAM achieved lower pose estimation errors and better scene reconstruction results than iMAP. NICE-SLAM also used only ∼1/4 of the FLOPs of iMAP, ∼1/3 the tracking time and ∼1/2 the mapping time.

3.7 Adjacent Methods

3.7.1 Fast MLP-less Volume Rendering
Plenoxels [54] (December 2021) followed in PlenOctree's footsteps in voxelizing the scene and storing a scalar density and spherical harmonic coefficients for direction-dependent
However, surprisingly, Plenoxel skipped MLP training entirely, and instead fit these features directly on the voxel grid. The authors obtained results comparable to NeRF++ and JaxNeRF, with training times faster by a factor of a few hundred. These results suggest that a primary contribution of NeRF models is the volumetric rendering of new views given densities and colors, and not the density and color MLPs themselves.
A concurrent paper by Sun et al. [55] (November 2021) also explored this topic. The authors likewise directly optimized a voxel grid of scalar densities. However, instead of using spherical harmonic coefficients, they used 12- and 24-dimensional color features together with a small, shallow decoding MLP. They used a sampling strategy analogous to the coarse-fine sampling of the original NeRF paper, by training a coarse voxel grid first and then a fine voxel grid based on the geometry of the coarse grid. The model, named Direct Voxel Grid Optimization (DVGO), outperformed the baseline NeRF (which required 1–2 days of training) with only 15 minutes of training on the Synthetic-NeRF dataset. The authors obtained a PSNR of 31.95 at a voxel resolution of 160³ after 11 minutes of training, and a PSNR of 32.80 at a voxel resolution of 256³ after 22 minutes of training on a 2080 Ti. This outperformed Plenoxel's 512³-resolution model, which reached a PSNR of 31.71 after 11 minutes of training on an RTX Titan.
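As a concrete illustration of this MLP-free paradigm, the following is a minimal sketch (not the authors' implementation) of optimizing a density voxel grid directly: densities are stored in a small 3D tensor, queried by trilinear interpolation, and fit by gradient descent. The grid resolution, the softplus activation, the learning rate, and the toy regression target are all assumptions made purely for illustration.

```python
# Minimal sketch of direct voxel-grid density optimization (DVGO-style),
# fitting a toy target; not the authors' implementation.
import torch
import torch.nn.functional as F

res = 64                                        # assumed grid resolution
density_grid = torch.zeros(1, 1, res, res, res, requires_grad=True)

def query_density(pts):
    """Trilinearly interpolate raw densities at points in [-1, 1]^3."""
    # grid_sample expects coordinates ordered (x, y, z) in the last dimension
    grid = pts.view(1, -1, 1, 1, 3)
    raw = F.grid_sample(density_grid, grid, align_corners=True)
    return F.softplus(raw.view(-1))             # keep densities non-negative

optimizer = torch.optim.Adam([density_grid], lr=1e-1)
pts = torch.rand(4096, 3) * 2 - 1               # random query points
target = torch.exp(-4.0 * (pts ** 2).sum(-1))   # toy "scene": a soft blob

for step in range(200):
    optimizer.zero_grad()
    loss = ((query_density(pts) - target) ** 2).mean()
    loss.backward()
    optimizer.step()
```

In a full method the loss would of course be the photometric error of volume-rendered pixels rather than a direct regression target, and a color grid or feature grid would be optimized alongside the density grid.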
Along the same lines of MLP-less radiance fields, TensoRF [56] (March 2022) also used neural-network-free volume rendering. TensoRF stored a scalar density and a vector color feature (spherical harmonics coefficients, or a color feature to be input into a small MLP) in a 3D voxel grid, which was then represented as a rank-3 tensor T_σ ∈ R^{H×W×D} and a rank-4 tensor T_c ∈ R^{H×W×D×C}, where H, W, D are the height, width, and depth resolutions of the voxel grid, and C is the channel dimension. The authors then used two factorization schemes: CANDECOMP/PARAFAC (CP), which factorized the tensors as pure vector outer products, and Vector-Matrix (VM), which factorized the tensors as vector/matrix outer products. These factorizations decreased the memory requirement relative to Plenoxels by a factor of 200 when using CP. The VM factorization performed better in terms of visual quality, albeit at a memory tradeoff. The training speed was comparable to Plenoxels and much faster than the benchmarked NeRF models.
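The memory saving of the CP scheme can be illustrated with a toy sketch: a dense H×W×D density grid is replaced by R triplets of 1D factors, and any voxel value is recovered as a sum of products. The resolutions, rank, and random factor values below are arbitrary assumptions; TensoRF additionally factorizes the appearance tensor and learns the factors by gradient descent.

```python
# Simplified CP-style factorization of a density volume (TensoRF flavour):
# the H x W x D grid is represented as a sum of R vector outer products.
import numpy as np

H, W, D, R = 128, 128, 128, 16               # assumed resolutions and rank
rng = np.random.default_rng(0)
v_h = rng.standard_normal((R, H)) * 0.01     # per-rank factor along height
v_w = rng.standard_normal((R, W)) * 0.01     # per-rank factor along width
v_d = rng.standard_normal((R, D)) * 0.01     # per-rank factor along depth

def density_at(i, j, k):
    """Density at integer voxel (i, j, k) without materializing the grid."""
    return np.sum(v_h[:, i] * v_w[:, j] * v_d[:, k])

# Storage: 3 * R * 128 floats instead of 128^3 floats for the dense grid.
dense_equivalent = np.einsum('rh,rw,rd->hwd', v_h, v_w, v_d)
assert np.isclose(dense_equivalent[3, 5, 7], density_at(3, 5, 7))
```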
3.7.2 Others
IBRNet [91] (February 2021) is a NeRF-adjacent method for view synthesis that is widely used in benchmarks. For a target view, IBRNet selected the N training views whose viewing directions are most similar to the target. A CNN was used to extract features from these images. For a single query point and each of the i input views, the known camera matrix was used to project the point onto the corresponding image and extract a color c_i and a feature f_i. An MLP was then used to refine these features into multi-view aware features f′_i and to produce pooling weights w_i. For density prediction, these features were summed using the weights; this is done for each query point, and the results of all query points along the ray were concatenated together and fed into a ray Transformer [102], which predicted the density. For color prediction, the f′_i and w_i were fed into an MLP alongside the relative viewing directions ∆d_i to produce per-view color blending weights. The final color was simply the sum of the per-image colors c_i weighted by the blending weights. IBRNet in general outperformed baseline NeRF [1], and used 1/6 the number of FLOPs in the experimental setup of the paper.
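A toy sketch of the colour blending step is given below: per-source-view colours are combined with weights predicted from per-view features and relative viewing directions. The tiny random linear map standing in for the blending MLP, the softmax over views, and all array sizes are assumptions made for illustration only.

```python
# Toy sketch of IBRNet-style colour blending for one query point.
import numpy as np

rng = np.random.default_rng(0)
n_views, feat_dim = 8, 32
colors = rng.uniform(size=(n_views, 3))           # c_i from each source view
feats = rng.standard_normal((n_views, feat_dim))  # refined features f'_i
rel_dirs = rng.standard_normal((n_views, 3))      # assumed relative directions
rel_dirs /= np.linalg.norm(rel_dirs, axis=1, keepdims=True)

w_mlp = rng.standard_normal((feat_dim + 3, 1)) * 0.1   # stand-in for the MLP

logits = np.concatenate([feats, rel_dirs], axis=1) @ w_mlp   # (n_views, 1)
weights = np.exp(logits) / np.exp(logits).sum()              # softmax over views
blended_color = (weights * colors).sum(axis=0)               # final pixel colour
print(blended_color)
```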
Compared to NeRF models, the Scene Representation Transformer (SRT) [125] (November 2021) took a different approach to volume rendering. A CNN extracted feature patches from the scene images, which were then fed into an encoder-decoder Transformer [102] along with camera ray and viewpoint coordinates {o, d}, which then produced the output color. The entire ray was queried at once, unlike with NeRF models. SRT is geometry-free: it did not produce the scene's density function, nor did it rely on geometric inductive biases. The NeRFormer [126] (September 2021) is a comparable concurrent model which also uses Transformers as part of the volume rendering process. The NeRFormer paper also introduced the Common Objects in 3D dataset, which could gain popularity in the near future.

3.8 Applications

3.8.1 Urban
Urban Radiance Fields [45] (November 2021) aimed at applying NeRF based view synthesis and 3D reconstruction to urban environments using sparse multi-view images supplemented by LiDAR data. In addition to the standard photometric loss, the method uses a LiDAR based depth loss L_depth and sight loss L_sight, as well as a skybox based segmentation loss L_seg. These are given by
L_depth = E[(z − ẑ)²],   (28)
L_sight = E[ ∫_{t1}^{t2} (w(t) − δ(z))² dt ],   (29)
L_seg = E[ S_i(r) ∫_{t1}^{t2} (w(t) − δ(z))² dt ].   (30)
Here w(t) is defined as T(t)σ(t), as in (3); z and ẑ are the LiDAR-measured depth and the estimated depth (6), respectively; and δ(z) is the Dirac delta function. S_i(r) = 1 if the ray passes through a sky pixel in the i-th image, where sky pixels are segmented using a pretrained model [145], and 0 otherwise. The depth loss forces the estimated depth ẑ to match the LiDAR-acquired depth. The sight loss forces the radiance to be concentrated at the surface at the measured depth. The segmentation loss forces point samples along rays through sky pixels to have zero density. 3D reconstruction was performed by extracting point clouds from the NeRF model during volumetric rendering: a ray was cast for each pixel of the virtual camera, and the estimated depth was used to place the resulting point in the 3D scene. Poisson Surface Reconstruction was then used to construct a 3D mesh from the generated point cloud. The authors used Google Street View data, on which Urban Radiance Fields outperformed NeRF [1], NeRF-W [66], mip-NeRF [37], and DS-NeRF [22] for both view synthesis and 3D reconstruction.
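A discretized reading of the depth and sight losses (28) and (29) for a single ray might look like the sketch below, assuming per-sample weights w_i = T_i α_i and approximating the Dirac delta by a one-hot target on the sample bin containing the LiDAR depth; the toy density profile and the sample spacing are assumptions.

```python
# Discretized sketch of the depth (28) and sight (29) losses for one ray,
# assuming volume-rendering weights w_i = T_i * alpha_i and a LiDAR depth z.
import numpy as np

t = np.linspace(0.5, 10.0, 128)                      # sample depths along the ray
dt = np.gradient(t)
sigma = np.exp(-0.5 * ((t - 6.0) / 0.3) ** 2) * 5.0  # toy density profile
alpha = 1.0 - np.exp(-sigma * dt)
T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
w = T * alpha                                        # volume-rendering weights

z_lidar = 6.0                                        # measured depth for this ray
z_hat = np.sum(w * t)                                # expected depth
loss_depth = (z_lidar - z_hat) ** 2                  # eq. (28)

# eq. (29): push the weight distribution toward a spike at the measured depth,
# here approximated by a one-hot target on the sample bin containing z_lidar.
target = np.zeros_like(w)
target[np.argmin(np.abs(t - z_lidar))] = 1.0
loss_sight = np.sum((w - target) ** 2 * dt)
print(loss_depth, loss_sight)
```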
Fig. 3. Application of NeRF models. Papers are selected based on application as well as citation numbers and GitHub star rating. (Categories shown: Urban: Mega-NeRF [128], BungeeNeRF [129]; Remote Sensing/Aerial: S-NeRF [130]; Fundamental Operations, Denoising/Deblurring/Super-Resolution: RawNeRF [39], DeblurNeRF [133], NaN [134], NeRF-SR [135]; Human NeRF, Body: Neural Body [140], HumanNeRF [141], Zheng et al. [142], DoubleField [143], Animatable NeRF [144].)
Mega-NeRF [128] (December 2021) performed large scale urban reconstruction from aerial drone images. Mega-NeRF used a NeRF++ [68] inverse sphere parameterization to separate foreground from background; however, the authors extended the method by using an ellipsoid, which better fits the aerial point of view. They also incorporated the per-image appearance embedding code of NeRF-W [66] into their model. They partitioned the large urban scenes into cells, each represented by its own NeRF module, and trained each module only on the images with potentially relevant pixels. For rendering, the method also cached a coarse rendering of densities and colors into an octree. During rendering for dynamic fly-throughs, a coarse initial view was quickly produced and then dynamically refined via additional rounds of model sampling. The model outperformed benchmarked baselines such as NeRF++ [68] and COLMAP based MVS reconstruction [2], and produced impressive fly-through videos.
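The cell partitioning idea can be sketched as a simple routing rule: every 3D sample (or training pixel) is assigned to the submodel whose cell centroid is closest. The 2×2 grid of centroids and the nearest-centroid rule below are illustrative assumptions; Mega-NeRF's actual assignment is based on which pixels are potentially relevant to each cell.

```python
# Minimal sketch of partitioning a large scene into cells, each handled by its
# own small model: every 3D sample is routed to the nearest cell centroid.
import numpy as np

centroids = np.array([[-50.0, -50.0], [-50.0, 50.0],
                      [50.0, -50.0], [50.0, 50.0]])   # assumed cell centres (x, y)

def cell_index(xyz):
    """Index of the cell whose centroid is closest in the ground plane."""
    d = np.linalg.norm(centroids - xyz[:2], axis=1)
    return int(np.argmin(d))

samples = np.random.default_rng(0).uniform(-100, 100, size=(10, 3))
for p in samples:
    idx = cell_index(p)
    # in Mega-NeRF each cell would have its own NeRF queried here
    print(f"sample {np.round(p, 1)} -> cell {idx}")
```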
Block-NeRF [127] (February 2022) performed city-scale NeRF based reconstruction from 2.8 million street-level images. Such a large scale outdoor dataset poses problems such as transient appearance and transient objects. Each individual Block-NeRF was built on mip-NeRF [37], using its integrated positional encoding (IPE), and on NeRF-W [66], using its appearance latent code optimization. Moreover, the authors used semantic segmentation to mask out transient objects such as pedestrians and cars during NeRF training. A visibility MLP was trained in parallel, supervised using the transmittance function (3) and the density values generated by the NeRF MLP; it was used to discard low visibility Block-NeRFs. Neighbourhoods were divided into blocks, on each of which a Block-NeRF was trained. These blocks were assigned with overlap, and images were sampled from overlapping Block-NeRFs and composited using inverse distance weighting after an appearance code matching optimization.
Other methods published in first class conferences, such as S-NeRF [130] (April 2021) and BungeeNeRF [129] (December 2021), also deal with urban 3D reconstruction and view synthesis, albeit from remote sensing images.

3.8.2 Human Body
Neural Body [140] applied NeRF volume rendering to rendering humans in moving poses from videos. The authors first used the input video to anchor a vertex based deformable human body model (SMPL [146]). Onto each vertex, the authors attached a 16-dimensional latent code Z. Human pose parameters S (initially estimated from the video during training; they can be supplied as input during inference) were then used to deform the human body model. This model was voxelized in a grid and then processed by a 3D CNN, which extracted a 128-dimensional latent code (feature) at each occupied voxel. Any 3D point x was first transformed to the SMPL coordinate system, then a 128-dimensional latent code/feature ψ was extracted via interpolation. This was passed to the density MLP. The color MLP additionally took
the positional encoding of the 3D coordinate γ_x(x), the viewing direction γ_d(d), and an appearance latent code l_t (accounting for per-frame differences in the video):
σ(x) = M_σ(ψ(x | Z, S)),   (31)
c(x) = M_c(ψ(x | Z, S), γ_x(x), γ_d(d), l_t).   (32)
Standard NeRF approaches struggled with the moving bodies, whereas the mesh deformation approach of Neural Body was able to interpolate between frames and between poses. State-of-the-art models from top tier conferences in 2021/2022, such as Animatable NeRF [144] (May 2021), DoubleField [143] (Jun 2021), HumanNeRF [141] (Jan 2022), and Zheng et al. [142] (Mar 2022), also innovated on this topic.
3.8.3 Image Processing
Mildenhall et al. created RawNeRF [39] (Nov 2021), adapting mip-NeRF [37] to High Dynamic Range (HDR) view synthesis and denoising. RawNeRF renders in a linear color space and uses raw linear images as training data. This allows for varying the exposure and tone-mapping curves after training, essentially applying the camera post-processing after NeRF rendering instead of training directly on post-processed images. It is trained using the relative MSE loss designed for noisy HDR path tracing in Noise2Noise [147], given by
L = Σ_{i=1}^{N} ( (ŷ_i − y_i) / (sg(ŷ_i) + ε) )²,   (33)
where sg(·) indicates a stop-gradient (treating its argument as a constant with zero gradient) and ε is a small constant. RawNeRF is supervised with variable exposure images, with the NeRF model's "exposure" scaled by the training image's shutter speed as well as a per-channel learned scaling factor. It achieved impressive results in night-time and low-light scene rendering and denoising, and is particularly suited to scenes with low lighting. On the topic of NeRF based denoising, NaN [134] (April 2022) also explored this emerging research area.
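The relative MSE objective of (33) is straightforward to express with a stop-gradient; the sketch below uses PyTorch's detach() for sg(·). The value of ε and the random tensors are assumptions made for illustration.

```python
# Sketch of the relative MSE loss of eq. (33) with a stop-gradient on the
# prediction in the denominator; epsilon is an assumed small constant.
import torch

def raw_nerf_loss(y_hat, y, eps=1e-3):
    # detach() plays the role of sg(.): the denominator is treated as a constant
    denom = y_hat.detach() + eps
    return (((y_hat - y) / denom) ** 2).sum()

y_hat = torch.rand(1024, 3, requires_grad=True)   # linear HDR predictions
y = torch.rand(1024, 3)                           # noisy linear supervision
loss = raw_nerf_loss(y_hat, y)
loss.backward()
```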
Concurrent to RawNeRF, HDR-NeRF [132] (Nov 2021) from Huang et al. also worked on HDR view synthesis. However, HDR-NeRF approached HDR view synthesis using Low Dynamic Range (LDR) training images with variable exposure times, as opposed to the raw linear images of RawNeRF. HDR-NeRF modelled an HDR radiance e(r) ∈ [0, ∞) which replaced the standard c(r) in (1). This radiance was mapped to a color c using three MLP camera response functions f (one for each color channel), which represent the typical camera manufacturer dependent linear and non-linear post-processing. HDR-NeRF strongly outperformed baseline NeRF and NeRF-W [66] on LDR reconstruction, and achieved high visual assessment scores on HDR reconstruction.
Other methods, such as DeblurNeRF [133] (November 2021), NeRF-SR [135] (December 2021), and NaN [134] (April 2022), focus on fundamental image processing tasks such as denoising, deblurring, and super-resolution, allowing for high quality view synthesis from low quality training images.
Semantic-NeRF [71] (March 2021) was a NeRF model capable of synthesizing semantic labels for novel views. This was done using an additional direction independent MLP branch which took as input the position and the density MLP features, and produced a point-wise semantic label s. The semantic labels were also rendered via volume rendering, via
S(r) = Σ_{i=1}^{N} T_i α_i s_i.   (34)
The semantic labels were supervised via a categorical cross entropy loss. The method was able to train with sparse semantic label training data (10% labelled), as well as to recover semantic labels from pixel-wise noise and region/instance-wise noise. The method also achieved good label super-resolution and label propagation results (from sparse point-wise annotations), and can be used for multi-view semantic fusion, outperforming non-deep-learning methods. NeSF [131] (November 2021) also tackled generalizable semantic segmentation of 3D scenes, and the previously introduced Fig-NeRF [69] approached this issue as well.
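Rendering a semantic label for a pixel reuses the same volume rendering weights as colour, as in (34). The sketch below is a minimal single-ray illustration; the densities, class count, sample spacing, and random logits are assumptions.

```python
# Sketch of rendering per-pixel semantic logits along one ray (eq. 34):
# the transmittance/alpha weights used for colour also accumulate labels.
import numpy as np

n_samples, n_classes = 64, 5                          # assumed sizes
rng = np.random.default_rng(0)
sigma = rng.uniform(0.0, 2.0, n_samples)              # densities at the samples
dt = 0.05                                             # assumed uniform spacing
logits = rng.standard_normal((n_samples, n_classes))  # per-point labels s_i

alpha = 1.0 - np.exp(-sigma * dt)
T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
w = T * alpha                                         # T_i * alpha_i

pixel_logits = (w[:, None] * logits).sum(axis=0)      # S(r) = sum_i T_i alpha_i s_i
pixel_label = int(np.argmax(pixel_logits))
```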
3.8.4 Surface Reconstruction
The scene geometry of a NeRF model is implicit, hidden inside the neural networks. However, for certain applications, more explicit representations such as 3D meshes are desired. For the baseline NeRF, it is possible to extract a rough geometry by evaluating and thresholding the density MLP. The methods introduced in this subsection used innovative scene representation strategies, changing the fundamental behaviour of the density MLP.
UNISURF [138] (April 2021) reconstructed scene surfaces by replacing the alpha value a_i at the i-th sample point used in the discretized volume rendering equation (5) with a discrete occupancy function, o(x) = 1 in occupied space and o(x) = 0 in free space. This occupancy function was also computed by an MLP, and essentially replaced the volume density. Surfaces were then retrieved via root-finding along rays. UNISURF outperformed benchmark methods, including density thresholding in baseline NeRF models as well as IDR [148]. The occupancy MLP can be used to define an explicit surface geometry for the scene. A recent workshop presentation by Tesla [149] showed that the 3D understanding of their autonomous driving module is driven by one such NeRF-like occupancy network.
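Surface retrieval via root-finding along a ray can be sketched as follows: coarse samples bracket the first free-to-occupied transition, which is then refined by bisection. The analytic sphere occupancy below stands in for UNISURF's occupancy MLP and is purely an assumption.

```python
# Sketch of retrieving a surface point by root finding along a ray, assuming
# a toy occupancy field (a sphere of radius 2) in place of an occupancy MLP.
import numpy as np

def occupancy(x):
    return 1.0 if np.linalg.norm(x) < 2.0 else 0.0

o = np.array([0.0, 0.0, -5.0])        # ray origin
d = np.array([0.0, 0.0, 1.0])         # ray direction (unit length)

# coarse march to bracket the first free-to-occupied transition
ts = np.linspace(0.0, 10.0, 256)
occ = np.array([occupancy(o + t * d) for t in ts])
i = int(np.argmax(occ > 0.5))         # first occupied sample
lo, hi = ts[i - 1], ts[i]

# bisection refinement of the 0.5-crossing
for _ in range(32):
    mid = 0.5 * (lo + hi)
    if occupancy(o + mid * d) > 0.5:
        hi = mid
    else:
        lo = mid
surface_point = o + hi * d            # approximately (0, 0, -2)
```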
Signed distance functions (SDFs) give the signed distance of a 3D point to the surface(s) they define (i.e., negative inside an object, positive outside). They are often used in computer graphics to define surfaces, which are the zero level set of the function. SDFs can be used for surface reconstruction via root-finding, and can be used to define entire scene geometries.
The Neural Surface (NeuS) [136] (June 2021) model performed volume rendering like the baseline NeRF model; however, it used signed distance functions to define the scene geometry. It replaced the density-outputting portion of the MLP with an MLP that outputs the signed distance function value. The density ρ(t), which replaced σ(t) in the volume rendering equation (2), was then constructed as
ρ(t) = max( −(dΦ/dt)(f(r(t))) / Φ(f(r(t))), 0 ),   (35)
where Φ(·) is the sigmoid function and its derivative dΦ/dt is the logistic density distribution. The authors showed their model to outperform baseline NeRF, and provided both theoretical and experimental justification for their method and their implementation of SDF based scene density. This method was concurrent to UNISURF and outperformed it on the DTU dataset [26]. As with UNISURF, performing root-finding on the SDF can be used to define an explicit surface geometry for the scene.
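A discretized sketch in the spirit of (35) is shown below: the opacity contributed between consecutive samples is driven by the decrease of Φ applied to the SDF along the ray. The analytic sphere SDF and the fixed sharpness s are assumptions; NeuS learns the sharpness and uses a more careful discretization.

```python
# Discretized sketch in the spirit of eq. (35): opacity along a ray is driven
# by the decrease of Phi(s * sdf) between consecutive samples.
import numpy as np

def sdf(x):
    return np.linalg.norm(x) - 1.0            # signed distance to a unit sphere

def Phi(x, s=16.0):
    return 1.0 / (1.0 + np.exp(-s * x))       # sigmoid with assumed sharpness s

o = np.array([0.0, 0.0, -3.0])
d = np.array([0.0, 0.0, 1.0])
ts = np.linspace(0.0, 6.0, 256)
f = np.array([sdf(o + t * d) for t in ts])    # SDF values along the ray

phi = Phi(f)
alpha = np.clip((phi[:-1] - phi[1:]) / np.clip(phi[:-1], 1e-6, None), 0.0, 1.0)
T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
weights = T * alpha                           # peaks near the first surface (z = -1)
```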
A concurrent work by Azinovic et al. [137] (April 2021) also replaced the density MLP with a truncated SDF MLP. They instead computed the pixel color as a weighted sum of sampled colors,
C(r) = ( Σ_{i=1}^{N} w_i c_i ) / ( Σ_{i=1}^{N} w_i ),   (36)
where w_i is given by a product of sigmoid functions,
w_i = Φ(D_i / tr) · Φ(−D_i / tr),   (37)
and tr is the truncation distance, which cuts off any SDF value too far from individual surfaces. To account for possible multiple ray-surface intersections, subsequent truncation regions are weighted to zero and do not contribute to the pixel color. The authors also used a per-frame appearance latent code, as in NeRF-W [66], to account for white-balance and exposure changes. The authors reconstructed the triangular mesh of the scene by applying Marching Cubes [150] to their truncated SDF MLP, and achieved clean reconstruction results on ScanNet [28] and a private synthetic dataset (these are not directly comparable to UNISURF and NeuS, since DTU results were not provided).
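Equations (36) and (37) can be sketched directly: the truncated-SDF weights peak where the signed distance crosses zero, and the pixel colour is their normalized weighted sum. The toy SDF profile, colours, and truncation distance below are assumptions, and the zeroing of weights after later surface crossings is omitted.

```python
# Sketch of the truncated-SDF colour weights of eqs. (36)-(37): w_i peaks where
# the SDF crosses zero; the final colour is the normalized weighted sum.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

ts = np.linspace(0.0, 4.0, 64)
D = 2.0 - ts                          # toy signed distances: surface at t = 2
colors = np.stack([np.full(64, 0.2), np.full(64, 0.5), ts / 4.0], axis=1)

tr = 0.3                              # assumed truncation distance
w = sigmoid(D / tr) * sigmoid(-D / tr)                  # eq. (37)
pixel_color = (w[:, None] * colors).sum(0) / w.sum()    # eq. (36)
```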
4 DISCUSSION

4.1 Concerning Speed
The baseline NeRF models had both slow training and slow inference. Currently, speed oriented NeRF and NeRF-adjacent models use three main paradigms. They either 1) are baked (evaluating an already trained NeRF model and caching/baking its results), 2) separate learned scene features from the color and density MLPs by using additional learned voxel/spatial-tree features, or 3) perform volume rendering directly on learned voxel features without the use of MLPs. Additional speed-ups can be achieved by using more advanced methods during volume rendering, such as empty space skipping and early ray termination. However, method 1) does not improve training time by design. Methods 2) and 3) require additional memory, since voxel/spatial-tree based scene features have a larger memory footprint than small NeRF MLPs. Currently, Instant-NGP [49] shows the most promise: by making use of a multi-resolution hashed positional encoding as additional learned features, the model can represent scenes accurately with tiny and efficient MLPs, allowing extremely fast training and inference. The Instant-NGP model also finds applications in image compression and neural SDF scene representation. Alternatively, for method 3), the factorized tensor representation [56] of the learned voxel grid reduced the memory requirement of the learned voxel features by two orders of magnitude and is also a viable research area. We expect future speed-based methods to follow methods 2) and 3), for which a highly impactful research direction would be to improve the data structure and design of these additional learned scene features.

4.2 Concerning Quality
For quality improvement, we found NeRF-W's [66] implementation of per-image transient latent codes and appearance codes to be influential. A similar idea was also found in the concurrent GRAF [62]. These latent codes allowed the NeRF model to account for per-image lighting/coloration changes, as well as small changes in scene content. On NeRF fundamentals, we found mip-NeRF [37] to be the most influential, as the cone tracing IPE was unlike any previous NeRF implementation. Ref-NeRF [38], built upon mip-NeRF, then further improved view-dependent appearance. Ref-NeRF is an excellent modern baseline for quality oriented NeRF research. Specific to image processing, innovations from RawNeRF [39] and DeblurNeRF [133] can be combined with Ref-NeRF as a foundation to build extremely high quality denoising/deblurring NeRF models.

4.3 Concerning Pose Estimation and Sparse View
Given the current state of NeRF research, we believe non-SLAM pose estimation is a solved problem. SfM via the COLMAP [2] package is used by most NeRF datasets to provide approximate poses, which is sufficient for most NeRF research. Bundle adjustment can also be used to jointly optimize NeRF models and poses during training. NeRF based SLAM is a relatively under-explored area of research; iMAP [73] and NICE-SLAM [74] offer excellent NeRF based SLAM frameworks which could integrate faster and better quality NeRF models.
Sparse view/few-shot NeRF models use 2D/3D feature extraction from multi-view images using pretrained CNNs. Some also use point clouds from SfM for additional supervision. We believe many of these models have already achieved the goal of few-shot view synthesis (2-10 views). Further small improvements can be achieved by using more advanced feature extraction backbones. We believe a key area of research would be combining sparse view methods and fast methods to achieve real-time NeRF models deployable to mobile devices.

4.4 Concerning Applications
We believe the immediate applications of NeRF are novel view synthesis and 3D reconstruction of urban environments and of human avatars. Further improvements can be made by facilitating the extraction of 3D meshes, point clouds, or SDFs from density MLPs and by integrating faster NeRF models. Urban environments specifically require the division of the environment into separate small scenes, each represented by a small scene specific NeRF model. The baking or learning of separate scene features for speed oriented, city-scale NeRF models is an interesting research direction. For human avatars, integrating a model which can separate view-specific lighting, such as Ref-NeRF [38], would be highly beneficial to applications such as virtual reality and 3D graphics. NeRF is also finding applications in fundamental image processing tasks such as denoising, deblurring, upsampling, compression, and image editing, and we expect more innovations in these areas in the near future as more computer vision practitioners adopt NeRF models.
5 CONCLUSION
Since the original paper by Mildenhall et al., NeRF models have made tremendous progress in speed, quality, and training view requirements, improving on all the weaknesses of the original model. NeRF models have found applications in urban mapping/modelling/photogrammetry, image editing/labelling, image processing, and 3D reconstruction and view synthesis of human avatars and urban environments. Both the technical improvements and the applications were discussed in detail in this survey, during the completion of which we have noticed an ever-growing interest in NeRF models, and an ever-growing number of preprints and publications.
NeRF is an exciting new paradigm for novel view synthesis, 3D reconstruction, and neural rendering. By providing this survey, we hope to introduce more Computer Vision practitioners to this field, to provide a helpful reference of existing NeRF models, and to motivate future research with our discussions. We are excited to see future technical innovations and applications of Neural Radiance Fields.

REFERENCES
[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "Nerf: Representing scenes as neural radiance fields for view synthesis," in European conference on computer vision. Springer, 2020, pp. 405-421.
[2] J. L. Schonberger and J.-M. Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104-4113.
[3] M. Levoy and P. Hanrahan, "Light field rendering," in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, 1996, pp. 31-42.
[4] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, "The lumigraph," in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, 1996, pp. 43-54.
[5] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines," ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1-14, 2019.
[6] V. Sitzmann, M. Zollhöfer, and G. Wetzstein, "Scene representation networks: Continuous 3d-structure-aware neural scene representations," Advances in Neural Information Processing Systems, vol. 32, 2019.
[7] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh, "Neural volumes: Learning dynamic renderable volumes from images," arXiv preprint arXiv:1906.07751, 2019.
[8] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, "Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3504-3515.
[9] K. Genova, F. Cole, A. Sud, A. Sarna, and T. Funkhouser, "Local deep implicit functions for 3d shape," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4857-4866.
[10] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, "Deepsdf: Learning continuous signed distance functions for shape representation," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 165-174.
[11] F. Dellaert and L. Yen-Chen, "Neural volume rendering: Nerf and beyond," arXiv preprint arXiv:2101.05204, 2020.
[12] F. Zhan, Y. Yu, R. Wu, J. Zhang, and S. Lu, "Multimodal image synthesis and editing: A survey," arXiv preprint arXiv:2112.13592, 2021.
[13] C. Wang, M. Chai, M. He, D. Chen, and J. Liao, "Clip-nerf: Text-and-image driven manipulation of neural radiance fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3835-3844.
[14] Y. Guo, K. Chen, S. Liang, Y.-J. Liu, H. Bao, and J. Zhang, "Ad-nerf: Audio driven neural radiance fields for talking head synthesis," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5784-5794.
[15] A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, "Zero-shot text-guided object generation with dream fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 867-876.
[16] K. Jo, G. Shim, S. Jung, S. Yang, and J. Choo, "Cg-nerf: Conditional generative neural radiance fields," arXiv preprint arXiv:2112.03517, 2021.
[17] J. Sun, X. Wang, Y. Shi, L. Wang, J. Wang, and Y. Liu, "Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis," arXiv preprint arXiv:2205.15517, 2022.
[18] Y. Chen, Q. Wu, C. Zheng, T.-J. Cham, and J. Cai, "Sem2nerf: Converting single-view semantic masks to neural radiance fields," arXiv preprint arXiv:2203.10821, 2022.
[19] A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, E. Tretschk, W. Yifan, C. Lassner, V. Sitzmann, R. Martin-Brualla, S. Lombardi et al., "Advances in neural rendering," in Computer Graphics Forum, vol. 41, no. 2. Wiley Online Library, 2022, pp. 703-735.
[20] J. T. Kajiya and B. P. Von Herzen, "Ray tracing volume densities," ACM SIGGRAPH computer graphics, vol. 18, no. 3, pp. 165-174, 1984.
[21] M. Niemeyer, J. T. Barron, B. Mildenhall, M. S. Sajjadi, A. Geiger, and N. Radwan, "Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5480-5490.
[22] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, "Depth-supervised nerf: Fewer views and faster training for free," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 882-12 891.
[23] Y.-C. Guo, D. Kang, L. Bao, Y. He, and S.-H. Zhang, "Nerfren: Neural radiance fields with reflections," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 409-18 418.
[24] D. Xu, Y. Jiang, P. Wang, Z. Fan, H. Shi, and Z. Wang, "Sinnerf: Training neural radiance fields on complex scenes from a single image," 2022.
[25] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, "Fourier features let networks learn high frequency functions in low dimensional domains," Advances in Neural Information Processing Systems, vol. 33, pp. 7537-7547, 2020.
[26] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs, "Large scale multi-view stereopsis evaluation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 406-413.
[27] J.-Y. Bouguet, Camera Calibration Toolbox for Matlab. CaltechDATA, May 2022. [Online]. Available: https://data.caltech.edu/records/20164
[28] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "Scannet: Richly-annotated 3d reconstructions of indoor scenes," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828-5839.
[29] A. Dai, M. Nießner, M. Zollöfer, S. Izadi, and C. Theobalt, "Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface re-integration," ACM Transactions on Graphics 2017 (TOG), 2017.
[30] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, "Tanks and temples: Benchmarking large-scale scene reconstruction," ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1-13, 2017.
[31] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., "Shapenet: An information-rich 3d model repository," arXiv preprint arXiv:1512.03012, 2015.
[32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE transactions on image processing, vol. 13, no. 4, pp. 600-612, 2004.
[33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586-595.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 335-14 345.
[34] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. [54] A. Yu, S. Fridovich-Keil, M. Tancik, Q. Chen, B. Recht, and
Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with A. Kanazawa, “Plenoxels: Radiance fields without neural net-
50x fewer parameters and¡ 0.5 mb model size,” arXiv preprint works,” arXiv preprint arXiv:2112.05131, 2021.
arXiv:1602.07360, 2016. [55] C. Sun, M. Sun, and H.-T. Chen, “Direct voxel grid optimization:
[35] K. Simonyan and A. Zisserman, “Very deep convolutional Super-fast convergence for radiance fields reconstruction,” in
networks for large-scale image recognition,” arXiv preprint Proceedings of the IEEE/CVF Conference on Computer Vision and
arXiv:1409.1556, 2014. Pattern Recognition, 2022, pp. 5459–5469.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi- [56] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial
fication with deep convolutional neural networks,” Advances in radiance fields,” arXiv preprint arXiv:2203.09517, 2022.
neural information processing systems, vol. 25, 2012. [57] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su,
[37] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin- “Mvsnerf: Fast generalizable radiance field reconstruction from
Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representa- multi-view stereo,” in Proceedings of the IEEE/CVF International
tion for anti-aliasing neural radiance fields,” in Proceedings of the Conference on Computer Vision, 2021, pp. 14 124–14 133.
IEEE/CVF International Conference on Computer Vision, 2021, pp. [58] A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “pixelNeRF: Neural
5855–5864. radiance fields from one or few images,” in CVPR, 2021.
[38] D. Verbin, P. Hedman, B. Mildenhall, T. Zickler, J. T. Barron, and [59] Y. Liu, S. Peng, L. Liu, Q. Wang, P. Wang, C. Theobalt, X. Zhou,
P. P. Srinivasan, “Ref-nerf: Structured view-dependent appear- and W. Wang, “Neural rays for occlusion-aware image-based
ance for neural radiance fields,” in Proceedings of the IEEE/CVF rendering,” in Proceedings of the IEEE/CVF Conference on Computer
Conference on Computer Vision and Pattern Recognition, 2022, pp. Vision and Pattern Recognition, 2022, pp. 7824–7833.
5491–5500. [60] A. Jain, M. Tancik, and P. Abbeel, “Putting nerf on a diet:
[39] B. Mildenhall, P. Hedman, R. Martin-Brualla, P. P. Srinivasan, Semantically consistent few-shot view synthesis,” in Proceedings
and J. T. Barron, “Nerf in the dark: High dynamic range view of the IEEE/CVF International Conference on Computer Vision, 2021,
synthesis from noisy raw images,” in Proceedings of the IEEE/CVF pp. 5885–5894.
Conference on Computer Vision and Pattern Recognition, 2022, pp. [61] M. Niemeyer and A. Geiger, “Giraffe: Representing scenes as
16 190–16 199. compositional generative neural feature fields,” in Proceedings of
[40] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural tion, 2021, pp. 11 453–11 464.
radiance fields,” ICCV, 2021. [62] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger, “Graf: Gener-
[41] K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. ative radiance fields for 3d-aware image synthesis,” Advances in
Goldman, R. Martin-Brualla, and S. M. Seitz, “Hypernerf: A Neural Information Processing Systems, vol. 33, pp. 20 154–20 166,
higher-dimensional representation for topologically varying neu- 2020.
ral radiance fields,” ACM Trans. Graph., vol. 40, no. 6, dec 2021. [63] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein,
[42] W. Jang and L. Agapito, “Codenerf: Disentangled neural radiance “pi-gan: Periodic implicit generative adversarial networks for 3d-
fields for object categories,” in Proceedings of the IEEE/CVF Inter- aware image synthesis,” in Proceedings of the IEEE/CVF conference
national Conference on Computer Vision, 2021, pp. 12 949–12 958. on computer vision and pattern recognition, 2021, pp. 5799–5809.
[43] B. Roessle, J. T. Barron, B. Mildenhall, P. P. Srinivasan, and [64] Q. Meng, A. Chen, H. Luo, M. Wu, H. Su, L. Xu, X. He, and J. Yu,
M. Nießner, “Dense depth priors for neural radiance fields from “Gnerf: Gan-based neural radiance field without posed camera,”
sparse input views,” in Proceedings of the IEEE/CVF Conference on in Proceedings of the IEEE/CVF International Conference on Computer
Computer Vision and Pattern Recognition, 2022, pp. 12 892–12 901. Vision, 2021, pp. 6351–6361.
[44] Y. Wei, S. Liu, Y. Rao, W. Zhao, J. Lu, and J. Zhou, “Nerfing- [65] A. R. Kosiorek, H. Strathmann, D. Zoran, P. Moreno, R. Schneider,
mvs: Guided optimization of neural radiance fields for indoor S. Mokrá, and D. J. Rezende, “Nerf-vae: A geometry aware 3d
multi-view stereo,” in Proceedings of the IEEE/CVF International scene generative model,” in International Conference on Machine
Conference on Computer Vision, 2021, pp. 5610–5619. Learning. PMLR, 2021, pp. 5742–5752.
[45] K. Rematas, A. Liu, P. P. Srinivasan, J. T. Barron, A. Tagliasacchi, [66] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron,
T. Funkhouser, and V. Ferrari, “Urban radiance fields,” in Pro- A. Dosovitskiy, and D. Duckworth, “NeRF in the Wild: Neural
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Radiance Fields for Unconstrained Photo Collections,” in CVPR,
Recognition, 2022, pp. 12 932–12 942. 2021.
[46] Q. Xu, Z. Xu, J. Philip, S. Bi, Z. Shu, K. Sunkavalli, and U. Neu- [67] S. Liu, X. Zhang, Z. Zhang, R. Zhang, J.-Y. Zhu, and B. Rus-
mann, “Point-nerf: Point-based neural radiance fields,” in Pro- sell, “Editing conditional radiance fields,” in Proceedings of the
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern IEEE/CVF International Conference on Computer Vision, 2021, pp.
Recognition, 2022, pp. 5438–5448. 5773–5783.
[47] L. Liu, J. Gu, K. Z. Lin, T.-S. Chua, and C. Theobalt, “Neural [68] K. Zhang, G. Riegler, N. Snavely, and V. Koltun, “Nerf++: Ana-
sparse voxel fields,” NeurIPS, 2020. lyzing and improving neural radiance fields,” arXiv:2010.07492,
[48] D. B. Lindell, J. N. Martel, and G. Wetzstein, “Autoint: Automatic 2020.
integration for fast neural volume rendering,” in Proceedings of the [69] C. Xie, K. Park, R. Martin-Brualla, and M. Brown, “Fig-nerf:
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Figure-ground neural radiance fields for 3d object category mod-
2021, pp. 14 556–14 565. elling,” in 2021 International Conference on 3D Vision (3DV). IEEE,
[49] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural 2021, pp. 962–971.
graphics primitives with a multiresolution hash encoding,” ACM [70] B. Yang, Y. Zhang, Y. Xu, Y. Li, H. Zhou, H. Bao, G. Zhang, and
Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022. [Online]. Z. Cui, “Learning object-compositional neural radiance field for
Available: https://doi.org/10.1145/3528223.3530127 editable scene rendering,” in Proceedings of the IEEE/CVF Interna-
[50] P. Hedman, P. P. Srinivasan, B. Mildenhall, J. T. Barron, and tional Conference on Computer Vision, 2021, pp. 13 779–13 788.
P. Debevec, “Baking neural radiance fields for real-time view [71] S. Zhi, T. Laidlow, S. Leutenegger, and A. J. Davison, “In-place
synthesis,” in Proceedings of the IEEE/CVF International Conference scene labelling and understanding with implicit scene represen-
on Computer Vision, 2021, pp. 5875–5884. tation,” in Proceedings of the IEEE/CVF International Conference on
[51] A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, “PlenOc- Computer Vision, 2021, pp. 15 838–15 847.
trees for real-time rendering of neural radiance fields,” in ICCV, [72] A. Kundu, K. Genova, X. Yin, A. Fathi, C. Pantofaru, L. J. Guibas,
2021. A. Tagliasacchi, F. Dellaert, and T. Funkhouser, “Panoptic neural
[52] S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, and J. Valentin, fields: A semantic object-aware neural scene representation,” in
“Fastnerf: High-fidelity neural rendering at 200fps,” in Proceed- Proceedings of the IEEE/CVF Conference on Computer Vision and
ings of the IEEE/CVF International Conference on Computer Vision, Pattern Recognition, 2022, pp. 12 871–12 881.
2021, pp. 14 346–14 355. [73] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “imap: Implicit map-
[53] C. Reiser, S. Peng, Y. Liao, and A. Geiger, “Kilonerf: Speeding ping and positioning in real-time,” in Proceedings of the IEEE/CVF
up neural radiance fields with thousands of tiny mlps,” in International Conference on Computer Vision, 2021, pp. 6229–6238.
[74] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, integration for volume rendering,” in Proceedings of the IEEE/CVF
and M. Pollefeys, “Nice-slam: Neural implicit scalable encoding Conference on Computer Vision and Pattern Recognition, 2022, pp.
for slam,” in Proceedings of the IEEE/CVF Conference on Computer 16 200–16 209.
Vision and Pattern Recognition, 2022, pp. 12 786–12 796. [94] Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and
[75] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu, “NeRF−−: L. Quan, “Blendedmvs: A large-scale dataset for generalized
Neural radiance fields without known camera parameters,” arXiv multi-view stereo networks,” in Proceedings of the IEEE/CVF Con-
preprint arXiv:2102.07064, 2021. ference on Computer Vision and Pattern Recognition, 2020, pp. 1790–
[76] C.-H. Lin, W.-C. Ma, A. Torralba, and S. Lucey, “Barf: Bundle- 1799.
adjusting neural radiance fields,” in IEEE International Conference [95] S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock,
on Computer Vision (ICCV), 2021. D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison
[77] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho, and J. Park, et al., “Optix: a general purpose ray tracing engine,” Acm transac-
“Self-calibrating neural radiance fields,” in Proceedings of the tions on graphics (tog), vol. 29, no. 4, pp. 1–13, 2010.
IEEE/CVF International Conference on Computer Vision, 2021, pp. [96] L. Wang, J. Zhang, X. Liu, F. Zhao, Y. Zhang, Y. Zhang, M. Wu,
5846–5854. J. Yu, and L. Xu, “Fourier plenoctrees for dynamic radiance field
[78] B. Deng, J. T. Barron, and P. P. Srinivasan, rendering in real-time,” in Proceedings of the IEEE/CVF Conference
“JaxNeRF: an efficient JAX implementation of NeRF,” on Computer Vision and Pattern Recognition, 2022, pp. 13 524–
2020. [Online]. Available: https://github.com/google-research/ 13 534.
google-research/tree/master/jaxnerf [97] Z. Chen, T. Funkhouser, P. Hedman, and A. Tagliasacchi, “Mo-
[79] P. Hedman, P. P. Srinivasan, B. Mildenhall, J. T. Barron, and bilenerf: Exploiting the polygon rasterization pipeline for ef-
P. Debevec, “Baking neural radiance fields for real-time view ficient neural field rendering on mobile architectures,” arXiv
synthesis,” in Proceedings of the IEEE/CVF International Conference preprint arXiv:2208.00277, 2022.
on Computer Vision, 2021, pp. 5875–5884. [98] T. Hu, S. Liu, Y. Chen, T. Shen, and J. Jia, “Efficientnerf efficient
[80] J. Chibane, A. Bansal, V. Lazova, and G. Pons-Moll, “Stereo neural radiance fields,” in Proceedings of the IEEE/CVF Conference
radiance fields (srf): Learning view synthesis for sparse views on Computer Vision and Pattern Recognition, 2022, pp. 12 902–
of novel scenes,” in Proceedings of the IEEE/CVF Conference on 12 911.
Computer Vision and Pattern Recognition, 2021, pp. 7911–7920. [99] A. Trevithick and B. Yang, “Grf: Learning a general radiance
[81] J. Zhang, Y. Zhang, H. Fu, X. Zhou, B. Cai, J. Huang, R. Jia, field for 3d representation and rendering,” in Proceedings of the
B. Zhao, and X. Tang, “Ray priors through reprojection: Im- IEEE/CVF International Conference on Computer Vision, 2021, pp.
proving neural radiance fields for novel view extrapolation,” 15 182–15 192.
in Proceedings of the IEEE/CVF Conference on Computer Vision and [100] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar-
Pattern Recognition, 2022, pp. 18 376–18 386. wal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning
[82] G. Gafni, J. Thies, M. Zollhofer, and M. Nießner, “Dynamic neural transferable visual models from natural language supervision,”
radiance fields for monocular 4d facial avatar reconstruction,” in International Conference on Machine Learning. PMLR, 2021, pp.
in Proceedings of the IEEE/CVF Conference on Computer Vision and 8748–8763.
Pattern Recognition, 2021, pp. 8649–8658. [101] M. M. Johari, Y. Lepoittevin, and F. Fleuret, “Geonerf: General-
[83] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and izing nerf with geometry priors,” in Proceedings of the IEEE/CVF
M. Nießner, “Face2face: Real-time face capture and reenactment Conference on Computer Vision and Pattern Recognition, 2022, pp.
of rgb videos,” in Proceedings of the IEEE conference on computer 18 365–18 375.
vision and pattern recognition, 2016, pp. 2387–2395. [102] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
[84] K. Kania, K. M. Yi, M. Kowalski, T. Trzciński, and A. Tagliasacchi, Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
“Conerf: Controllable neural radiance fields,” in Proceedings of the Advances in neural information processing systems, vol. 30, 2017.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [103] D. Rebain, M. Matthews, K. M. Yi, D. Lagun, and A. Tagliasacchi,
2022, pp. 18 623–18 632. “Lolnerf: Learn from one look,” in Proceedings of the IEEE/CVF
[85] S. Choi, Q.-Y. Zhou, S. Miller, and V. Koltun, “A large dataset of Conference on Computer Vision and Pattern Recognition, 2022, pp.
object scans,” arXiv preprint arXiv:1602.02481, 2016. 1558–1567.
[86] X. Cheng, P. Wang, and R. Yang, “Learning depth with convolu- [104] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam, “Opti-
tional spatial propagation network,” IEEE transactions on pattern mizing the latent space of generative networks,” arXiv preprint
analysis and machine intelligence, vol. 42, no. 10, pp. 2361–2379, arXiv:1707.05776, 2017.
2019. [105] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
[87] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative ad-
M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: versarial nets,” Advances in neural information processing systems,
Learning from rgb-d data in indoor environments,” arXiv preprint vol. 27, 2014.
arXiv:1709.06158, 2017. [106] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau,
[88] Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu, and W. T. F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase rep-
Freeman, “Learning the depths of moving people by watching resentations using rnn encoder-decoder for statistical machine
frozen people,” in Proceedings of the IEEE/CVF conference on com- translation,” arXiv preprint arXiv:1406.1078, 2014.
puter vision and pattern recognition, 2019, pp. 4521–4530. [107] N. Müller, A. Simonelli, L. Porzi, S. R. Bulò, M. Nießner, and
[89] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth P. Kontschieder, “Autorf: Learning 3d object radiance fields from
inference for unstructured multi-view stereo,” in Proceedings of single view observations,” in Proceedings of the IEEE/CVF Confer-
the European conference on computer vision (ECCV), 2018, pp. 767– ence on Computer Vision and Pattern Recognition, 2022, pp. 3971–
783. 3980.
[90] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep [108] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu,
learning on point sets for 3d classification and segmentation,” A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A
in Proceedings of the IEEE conference on computer vision and pattern multimodal dataset for autonomous driving,” in Proceedings of
recognition, 2017, pp. 652–660. the IEEE/CVF conference on computer vision and pattern recognition,
[91] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Bar- 2020, pp. 11 621–11 631.
ron, R. Martin-Brualla, N. Snavely, and T. Funkhouser, “Ibrnet: [109] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets
Learning multi-view image-based rendering,” in Proceedings of the robotics: The kitti dataset,” The International Journal of Robotics
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Research, vol. 32, no. 11, pp. 1231–1237, 2013.
2021, pp. 4690–4699. [110] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”
[92] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, arXiv preprint arXiv:1312.6114, 2013.
C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, [111] Y. Kim, S. Wiseman, A. Miller, D. Sontag, and A. Rush, “Semi-
S. Wanderman-Milne, and Q. Zhang, “JAX: composable amortized variational autoencoders,” in International Conference
transformations of Python+NumPy programs,” 2018. [Online]. on Machine Learning. PMLR, 2018, pp. 2678–2687.
Available: http://github.com/google/jax [112] J. Marino, Y. Yue, and S. Mandt, “Iterative amortized inference,”
[93] L. Wu, J. Y. Lee, A. Bhattad, Y.-X. Wang, and D. Forsyth, “Diver: in International Conference on Machine Learning. PMLR, 2018, pp.
Real-time and accurate neural radiance fields with deterministic 3403–3412.
[113] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning [132] X. Huang, Q. Zhang, Y. Feng, H. Li, X. Wang, and Q. Wang, “Hdr-
for image recognition,” in Proceedings of the IEEE conference on nerf: High dynamic range neural radiance fields,” in Proceedings
computer vision and pattern recognition, 2016, pp. 770–778. of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
[114] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, nition, 2022, pp. 18 398–18 408.
“Implicit neural representations with periodic activation func- [133] L. Ma, X. Li, J. Liao, Q. Zhang, X. Wang, J. Wang, and P. V. Sander,
tions,” Advances in Neural Information Processing Systems, vol. 33, “Deblur-nerf: Neural radiance fields from blurry images,” in
pp. 7462–7473, 2020. Proceedings of the IEEE/CVF Conference on Computer Vision and
[115] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: Pattern Recognition, 2022, pp. 12 861–12 870.
A dataset and benchmark for large-scale face recognition,” in [134] N. Pearl, T. Treibitz, and S. Korman, “Nan: Noise-aware nerfs
European conference on computer vision. Springer, 2016, pp. 87– for burst-denoising,” in Proceedings of the IEEE/CVF Conference on
102. Computer Vision and Pattern Recognition, 2022, pp. 12 672–12 681.
[116] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, [135] C. Wang, X. Wu, Y.-C. Guo, S.-H. Zhang, Y.-W. Tai, and S.-M.
“Carla: An open urban driving simulator,” in Conference on robot Hu, “Nerf-sr: High quality neural radiance fields using super-
learning. PMLR, 2017, pp. 1–16. sampling,” in Proceedings of the 30th ACM International Conference
[117] W. Zhang, J. Sun, and X. Tang, “Cat head detection-how to effec- on Multimedia, 2022, pp. 6445–6454.
tively exploit shape and texture features,” in European conference [136] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang,
on computer vision. Springer, 2008, pp. 802–816. “Neus: Learning neural implicit surfaces by volume rendering
[118] S. Cai, A. Obukhov, D. Dai, and L. Van Gool, “Pix2nerf: Unsu- for multi-view reconstruction,” Advances in Neural Information
pervised conditional p-gan for single image to neural radiance Processing Systems, vol. 34, pp. 27 171–27 183, 2021.
fields translation,” in Proceedings of the IEEE/CVF Conference on [137] D. Azinović, R. Martin-Brualla, D. B. Goldman, M. Nießner, and
Computer Vision and Pattern Recognition, 2022, pp. 3981–3990. J. Thies, “Neural rgb-d surface reconstruction,” in Proceedings of
[119] V. Sitzmann, M. Zollhöfer, and G. Wetzstein, “Scene representa- the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion networks: Continuous 3d-structure-aware neural scene rep- tion, 2022, pp. 6290–6301.
resentations,” Advances in Neural Information Processing Systems, [138] M. Oechsle, S. Peng, and A. Geiger, “Unisurf: Unifying neural
vol. 32, 2019. implicit surfaces and radiance fields for multi-view reconstruc-
[120] Y. Jin, D. Mishkin, A. Mishchuk, J. Matas, P. Fua, K. M. Yi, and tion,” in Proceedings of the IEEE/CVF International Conference on
E. Trulls, “Image matching across wide baselines: From paper to Computer Vision, 2021, pp. 5589–5599.
practice,” International Journal of Computer Vision, 2020. [139] Y. Hong, B. Peng, H. Xiao, L. Liu, and J. Zhang, “Headnerf: A
[121] K. Yücer, A. Sorkine-Hornung, O. Wang, and O. Sorkine- real-time nerf-based parametric head model,” in Proceedings of the
Hornung, “Efficient 3d object segmentation from densely sam- IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pled light fields with applications to 3d reconstruction,” ACM 2022, pp. 20 374–20 384.
Transactions on Graphics (TOG), vol. 35, no. 3, pp. 1–15, 2016. [140] S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou,
“Neural body: Implicit neural representations with structured
[122] R. Martin-Brualla, R. Pandey, S. Bouaziz, M. Brown, and D. B.
latent codes for novel view synthesis of dynamic humans,” in
Goldman, “Gelato: Generative latent textured objects,” in Euro-
Proceedings of the IEEE/CVF Conference on Computer Vision and
pean Conference on Computer Vision. Springer, 2020, pp. 242–258.
Pattern Recognition, 2021, pp. 9054–9063.
[123] A. Ahmadyan, L. Zhang, A. Ablavatski, J. Wei, and M. Grund-
[141] F. Zhao, W. Yang, J. Zhang, P. Lin, Y. Zhang, J. Yu, and L. Xu, “Hu-
mann, “Objectron: A large scale dataset of object-centric videos
mannerf: Efficiently generated human radiance field from sparse
in the wild with pose annotations,” in Proceedings of the IEEE/CVF
inputs,” in Proceedings of the IEEE/CVF Conference on Computer
conference on computer vision and pattern recognition, 2021, pp.
Vision and Pattern Recognition, 2022, pp. 7743–7753.
7822–7831.
[142] Z. Zheng, H. Huang, T. Yu, H. Zhang, Y. Guo, and Y. Liu,
[124] L. Yen-Chen, P. Florence, J. T. Barron, A. Rodriguez, P. Isola, and “Structured local radiance fields for human avatar modeling,”
T.-Y. Lin, “inerf: Inverting neural radiance fields for pose esti- in Proceedings of the IEEE/CVF Conference on Computer Vision and
mation,” in 2021 IEEE/RSJ International Conference on Intelligent Pattern Recognition, 2022, pp. 15 893–15 903.
Robots and Systems (IROS). IEEE, 2021, pp. 1323–1330. [143] R. Shao, H. Zhang, H. Zhang, M. Chen, Y.-P. Cao, T. Yu, and
[125] M. S. Sajjadi, H. Meyer, E. Pot, U. Bergmann, K. Greff, N. Radwan, Y. Liu, “Doublefield: Bridging the neural surface and radiance
S. Vora, M. Lučić, D. Duckworth, A. Dosovitskiy et al., “Scene fields for high-fidelity human reconstruction and rendering,” in
representation transformer: Geometry-free novel view synthesis Proceedings of the IEEE/CVF Conference on Computer Vision and
through set-latent scene representations,” in Proceedings of the Pattern Recognition (CVPR), June 2022, pp. 15 872–15 882.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [144] S. Peng, J. Dong, Q. Wang, S. Zhang, Q. Shuai, X. Zhou, and
2022, pp. 6229–6238. H. Bao, “Animatable neural radiance fields for modeling dy-
[126] J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, namic human bodies,” in Proceedings of the IEEE/CVF International
and D. Novotny, “Common objects in 3d: Large-scale learning Conference on Computer Vision (ICCV), October 2021, pp. 14 314–
and evaluation of real-life 3d category reconstruction,” in Proceed- 14 323.
ings of the IEEE/CVF International Conference on Computer Vision, [145] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
2021, pp. 10 901–10 911. “Encoder-decoder with atrous separable convolution for seman-
[127] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. tic image segmentation,” in Proceedings of the European conference
Srinivasan, J. T. Barron, and H. Kretzschmar, “Block-nerf: Scalable on computer vision (ECCV), 2018, pp. 801–818.
large scene neural view synthesis,” in Proceedings of the IEEE/CVF [146] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J.
Conference on Computer Vision and Pattern Recognition, 2022, pp. Black, “SMPL: A skinned multi-person linear model,” ACM
8248–8258. Trans. Graphics (Proc. SIGGRAPH Asia), vol. 34, no. 6, pp. 248:1–
[128] H. Turki, D. Ramanan, and M. Satyanarayanan, “Mega-nerf: Scal- 248:16, Oct. 2015.
able construction of large-scale nerfs for virtual fly-throughs,” [147] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras,
in Proceedings of the IEEE/CVF Conference on Computer Vision and M. Aittala, and T. Aila, “Noise2noise: Learning image restoration
Pattern Recognition, 2022, pp. 12 922–12 931. without clean data,” arXiv preprint arXiv:1803.04189, 2018.
[129] Y. Xiangli, L. Xu, X. Pan, N. Zhao, A. Rao, C. Theobalt, B. Dai, [148] L. Yariv, Y. Kasten, D. Moran, M. Galun, M. Atzmon, B. Ronen,
and D. Lin, “Bungeenerf: Progressive neural radiance field for and Y. Lipman, “Multiview neural surface reconstruction by
extreme multi-scale scene rendering,” in The European Conference disentangling geometry and appearance,” Advances in Neural
on Computer Vision (ECCV), 2022. Information Processing Systems, vol. 33, pp. 2492–2502, 2020.
[130] D. Derksen and D. Izzo, “Shadow neural radiance fields [149] A. Elluswamy. Tesla, workshop on autonomous driving. CVPR
for multi-view satellite photogrammetry,” in Proceedings of the 2022. [Online]. Available: https://www.youtube.com/watch?v=
IEEE/CVF Conference on Computer Vision and Pattern Recognition, jPCV4GKX9Dw
2021, pp. 1152–1161. [150] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolu-
[131] S. Vora*, N. Radwan*, K. Greff, H. Meyer, K. Genova, M. S. M. tion 3d surface construction algorithm,” ACM siggraph computer
Sajjadi, E. Pot, A. Tagliasacchi, and D. Duckworth, “Neu- graphics, vol. 21, no. 4, pp. 163–169, 1987.
ral semantic fields for generalizable semantic segmentation of
3d scenes,” Transactions on Machine Learning Research, 2022,
https://openreview.net/forum?id=ggPhsYCsm9.