CNN-SLAM: Real-Time Dense Monocular SLAM With Learned Depth Prediction
… pose estimation and scene reconstruction are carried out accurately, the absolute scale of such reconstruction remains inherently ambiguous, limiting the use of monocular SLAM within most aforementioned applications in the field of augmented reality and robotics (an example is shown in Fig. 1, b). Some approaches suggest solving the issue via object detection, by matching the scene with a pre-defined set of 3D models so as to recover the initial scale based on the estimated object size [6], which nevertheless fails in the absence of known shapes in the scene. Another main limitation of monocular SLAM is represented by pose estimation under pure rotational camera motion, in which case stereo estimation cannot be applied due to the lack of a stereo baseline, resulting in tracking failures.

Recently, a new avenue of research has emerged that addresses depth prediction from a single image by means of learned approaches. In particular, the use of deep Convolutional Neural Networks (CNNs) [16, 2, 3] in an end-to-end fashion has demonstrated the potential of regressing depth maps at a relatively high resolution and with good absolute accuracy, even in the absence of monocular cues (texture, repetitive patterns) to drive the depth estimation task. One advantage of deep learning approaches is that the absolute scale can be learned from examples and thus predicted from a single image without the need for scene-based assumptions or geometric constraints, unlike [10, 18, 1]. A major limitation of such depth maps is the fact that, although globally accurate, depth borders tend to be locally blurred: hence, if such depths are fused together for scene reconstruction as in [16], the reconstructed scene will overall lack shape details.

Relevantly, despite the few methods proposed for single-view depth prediction, the application of depth prediction to higher-level computer vision tasks has been mostly overlooked so far, with just a few examples existing in the literature [16]. The main idea behind this work is to exploit the best from both worlds and propose a monocular SLAM approach that fuses together depth prediction via deep networks and direct monocular depth estimation, so as to yield a dense scene reconstruction that is at the same time unambiguous in terms of absolute scale and robust in terms of tracking. To recover blurred depth borders, the CNN-predicted depth map is used as initial guess for dense reconstruction and successively refined by means of a direct SLAM scheme relying on small-baseline stereo matching similar to the one in [4]. Importantly, small-baseline stereo matching holds the potential to refine edge regions of the predicted depth image, which is where they tend to be more blurred. At the same time, the initial guess obtained from the CNN-predicted depth map can provide absolute scale information to drive pose estimation, so that the estimated pose trajectory and scene reconstruction can be significantly more accurate in terms of absolute scale compared to the state of the art in monocular SLAM. Fig. 1, a) shows an example illustrating the usefulness of carrying out scene reconstruction with a precise absolute scale such as the one proposed in this work. Moreover, tracking can be made more robust, as the CNN-predicted depth does not suffer from the aforementioned problem of pure rotations, since it is estimated on each frame individually. Last but not least, this framework can run in real-time, since the two processes of depth prediction from CNNs and depth refinement can be simultaneously carried out on different computational resources of the same architecture - respectively, the GPU and the CPU.

Another relevant aspect of recent CNNs is that the same network architecture can be successfully employed for different high-dimensional regression tasks rather than just depth estimation: one typical example is semantic segmentation [3, 29]. We leverage this aspect to propose an extension of our framework that uses pixel-wise labels to coherently and efficiently fuse semantic labels with dense SLAM, so as to attain semantically coherent scene reconstruction from a single view: an example is shown in Fig. 1, c). Notably, to the best of our knowledge, semantic reconstruction has been shown only recently, and only based on stereo [28] or RGB-D data [15], i.e. never in the monocular case.

We validate our method with a comparison on two public SLAM benchmarks against the state of the art in monocular SLAM and depth estimation, focusing on the accuracy of pose estimation and reconstruction. Since the CNN-predicted depth relies on a training procedure, we show experiments where the training set is taken from a completely different environment and a different RGB sensor than those available in the evaluated benchmarks, so as to portray the capacity of our approach - particularly relevant for practical uses - to generalize to novel, unseen environments. We also show qualitative results of our joint scene reconstruction and semantic label fusion in a real environment.

2. Related work

In this Section, we review related work with respect to the two fields that we integrate within our framework, i.e. SLAM and depth prediction.

SLAM   There exists a vast literature on SLAM. From the point of view of the type of input data being processed, approaches can be classified as either depth camera-based [21, 30, 11] or monocular camera-based [22, 4, 20]. Instead, from a methodological viewpoint, they are classified as either feature-based [12, 13, 20] or direct [22, 5, 4]. Given the scope of this paper, we will focus here only on monocular SLAM approaches.
As for feature-based monocular SLAM, ORB-SLAM [20] is arguably the state of the art in terms of pose estimation accuracy. This method relies on the extraction of sparse ORB features from the input image to carry out a sparse reconstruction of the scene as well as to estimate the camera pose, also employing local bundle adjustment and pose graph optimization. As for direct monocular SLAM, the Dense Tracking and Mapping (DTAM) approach of [22] achieved dense reconstruction in real-time on a GPU by using short-baseline multiple-view stereo matching with a regularization scheme, so that depth estimation is smoother in low-textured regions of the color image. Moreover, the Large-Scale Direct SLAM (LSD-SLAM) algorithm [4] proposed the use of a semi-dense map representation which keeps track of depth values only on gradient areas of the input image, this allowing enough efficiency to enable direct SLAM in real-time on a CPU. An extension of LSD-SLAM is the recent Multi-level mapping (MLM) algorithm [7], which proposed the use of a dense approach on top of LSD-SLAM in order to increase its density and improve the reconstruction accuracy.

Depth prediction from single view   Depth prediction from a single view has gained increasing attention in the computer vision community thanks to the recent advances in deep learning. Classic depth prediction approaches employ hand-crafted features and probabilistic graphical models [10, 18] to yield regularized depth maps, usually making strong assumptions on the scene geometry. Recently developed deep convolutional architectures significantly outperformed previous methods in terms of depth estimation accuracy [16, 2, 3, 29, 19, 17]. Interestingly, the work of [16] reports qualitative results of employing depth predictions for dense SLAM as an application. In particular, the predicted depth map is used as input for Keller's Point-Based Fusion RGB-D SLAM algorithm [11], showing that SLAM-based scene reconstruction can be obtained using depth prediction, although it lacks shape details, mostly due to the aforementioned blurring artifacts that are associated with the loss of fine spatial information through the contractive part of a CNN.

3. Proposed Monocular Semantic SLAM

In this section, we illustrate the proposed framework for 3D reconstruction, where CNN-predicted dense depth maps are fused together with depth measurements obtained from direct monocular SLAM. Additionally, we show how CNN-predicted semantic segmentation can also be coherently fused with the global reconstruction model. The flow diagram in Fig. 2 sketches the pipeline of our framework. We employ a key-frame based SLAM paradigm [12, 4, 20]; in particular, we use as baseline the direct semi-dense approach in [4]. Within such approach, a subset of visually distinct frames is collected as key-frames, whose pose is subject to global refinement based on pose graph optimization. At the same time, camera pose estimation is carried out at each input frame, by estimating the transformation between the frame and its nearest key-frame.

To maintain a high frame-rate, we propose to predict a depth map via CNN only on key-frames. In particular, if the currently estimated pose is far from that of existing key-frames, a new key-frame is created out of the current frame and its depth estimated via CNN. Moreover, an uncertainty map is constructed by measuring the pixel-wise confidence of each depth prediction. Since in most cases the camera used for SLAM differs from the one used to acquire the dataset on which the CNN is trained, we propose a specific normalization procedure for the depth map, designed to gain robustness towards different intrinsic camera parameters. When additionally carrying out semantic label fusion, we employ a second convolutional network to predict a semantic segmentation of the input frame. Finally, a pose graph on key-frames is created so as to globally optimize their relative pose.

A particularly important stage of the framework, also representing one main contribution of our proposal, is the scheme employed to refine the CNN-predicted depth map associated to each key-frame via small-baseline stereo matching, by enforcing color consistency minimization between a key-frame and associated input frames. In particular, depth values will be mostly refined around image regions with gradients, i.e. where epipolar matching can provide improved accuracy. This will be outlined in Subsections 3.3 and 3.4. Relevantly, the way refined depths are propagated is driven by the uncertainty associated to each depth value, estimated according to a specifically proposed confidence measure (defined in Subsec. 3.3). Every stage of the framework is now detailed in the following Subsections.
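To make the overall control flow concrete, the following minimal, self-contained Python sketch mirrors the pipeline just described: CNN depth (and optionally semantic) prediction only on key-frames, per-frame tracking against a key-frame, and small-baseline refinement on ordinary frames. All function names, the key-frame creation criterion and its threshold are illustrative assumptions rather than the authors' implementation, and pose graph optimization over key-frames is omitted.

```python
import numpy as np

def predict_depth_cnn(image):                 # Subsec. 3.2 (stub)
    return np.ones(image.shape[:2])

def predict_semantics_cnn(image):             # Subsec. 3.2 (stub, optional)
    return np.zeros(image.shape[:2], dtype=int)

def track_pose(image, keyframe):              # Subsec. 3.1 (stub): returns T_t
    return np.eye(4)

def refine_depth(keyframe, image, pose):      # Subsec. 3.3 / 3.4 (stub)
    pass

def needs_new_keyframe(pose, keyframe, thresh=0.15):
    # "Far from existing key-frames": here a plain translation distance,
    # which is an assumption -- the paper does not specify the exact rule.
    return np.linalg.norm(pose[:3, 3] - keyframe["pose"][:3, 3]) > thresh

def process_frame(image, keyframes):
    # For brevity, the most recent key-frame plays the role of the
    # "nearest" key-frame used in the paper.
    if not keyframes:
        pose, create = np.eye(4), True
    else:
        pose = track_pose(image, keyframes[-1])          # per-frame tracking
        create = needs_new_keyframe(pose, keyframes[-1])
    if create:
        # The expensive CNN forward pass runs only on key-frames.
        keyframes.append({
            "image": image,
            "pose": pose,
            "depth": predict_depth_cnn(image),
            "labels": predict_semantics_cnn(image),
        })
    else:
        # Ordinary frame: small-baseline refinement of the key-frame depth.
        refine_depth(keyframes[-1], image, pose)
    return pose

keyframes = []
for _ in range(5):                                       # synthetic frames
    process_frame(np.zeros((48, 64, 3)), keyframes)
print(len(keyframes), "key-frame(s) created")
```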
3.1. Camera Pose Estimation

The camera pose estimation is inspired by the key-frame approach in [4]. In particular, the system holds a set of key-frames $k_1, \dots, k_n \in \mathcal{K}$ as structural elements on which to perform SLAM reconstruction. Each key-frame $k_i$ is associated to a key-frame pose $T_{k_i}$, a depth map $D_{k_i}$, and a depth uncertainty map $U_{k_i}$. In contrast to [4], our depth map is dense because it is generated via CNN-based depth prediction (see Subsec. 3.2). The uncertainty map measures the confidence of each depth value. As opposed to [4], which initializes the uncertainty to a large, constant value, our approach initializes it according to the measured confidence of the depth prediction (described in Subsec. 3.3). In the following, we will refer to a generic depth map element as $u = (x, y)$, which ranges in the image domain, i.e. $u \in \Omega \subset \mathbb{R}^2$, with $\dot{u}$ being its homogeneous representation.

At each frame $t$, we aim to estimate the current camera pose $T_t^{k_i} = [R_t, t_t] \in \mathbb{SE}(3)$, i.e. the transformation between the nearest key-frame $k_i$ and frame $t$, composed of a $3 \times 3$ rotation matrix $R_t \in \mathbb{SO}(3)$ and a 3D translation vector $t_t \in \mathbb{R}^3$. This transformation is estimated by minimizing the photometric residual between the intensity image $I_t$ of the current frame and the intensity image $I_{k_i}$ of the nearest key-frame $k_i$, via weighted Gauss-Newton optimization based on the objective function

$$E(T_t^{k_i}) = \sum_{u \in \Omega} \rho\!\left(\frac{r\!\left(u, T_t^{k_i}\right)}{\sigma\!\left(r\!\left(u, T_t^{k_i}\right)\right)}\right) \qquad (1)$$

where $\rho$ is the Huber norm and $\sigma$ is a function measuring the residual uncertainty [4]. Here, $r$ is the photometric residual defined as

$$r\!\left(u, T_t^{k_i}\right) = I_{k_i}(u) - I_t\!\left(\pi\!\left(K T_t^{k_i} V_{k_i}(u)\right)\right). \qquad (2)$$

Considering that our depth map is dense, for the sake of efficiency we limit the computation of the photometric residual to the subset of pixels lying within high color-gradient regions, defined by the image domain subset $\tilde{u} \in \tilde{\Omega} \subset \Omega$. Also, in (2), $\pi$ represents the perspective projection function mapping a 3D point to a 2D image coordinate

$$\pi\!\left([x \; y \; z]^T\right) = \left(x/z, \; y/z\right)^T \qquad (3)$$

while $V_{k_i}(u)$ represents a 3D element of the vertex map computed from the key-frame's depth map

$$V_{k_i}(u) = K^{-1} \dot{u} \, D_{k_i}(u) \qquad (4)$$

where $K$ is the camera intrinsic matrix. Once $T_t^{k_i}$ is obtained, the current camera pose in the world coordinate system is computed as $T_t = T_t^{k_i} T_{k_i}$.
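To make the warping of Eqs. (2)-(4) concrete, the following minimal NumPy sketch evaluates the photometric residual for a single pixel of a key-frame. It is an illustrative reimplementation under assumed names and synthetic data (intrinsics, images, depth, pose), not the authors' code; the Huber-weighted Gauss-Newton minimization of Eq. (1) and sub-pixel interpolation are omitted for brevity.

```python
import numpy as np

def vertex(u, depth, K):
    """V_{k_i}(u) = K^{-1} u_dot D_{k_i}(u): back-projected 3D point (Eq. 4)."""
    u_dot = np.array([u[0], u[1], 1.0])
    return np.linalg.inv(K) @ u_dot * depth[u[1], u[0]]

def project(p, K):
    """pi([x y z]^T) = (x/z, y/z)^T after applying the intrinsics K (Eq. 3)."""
    q = K @ p
    return np.array([q[0] / q[2], q[1] / q[2]])

def photometric_residual(u, I_ki, I_t, D_ki, K, T):
    """r(u, T) = I_{k_i}(u) - I_t(pi(K T V_{k_i}(u)))  (Eq. 2)."""
    V = vertex(u, D_ki, K)
    p = T[:3, :3] @ V + T[:3, 3]                  # rigid transform into frame t
    x, y = np.round(project(p, K)).astype(int)    # nearest-neighbour lookup
    return I_ki[u[1], u[0]] - I_t[y, x]

# Tiny worked example with synthetic data (all values are arbitrary).
K = np.array([[300.0, 0, 32], [0, 300.0, 24], [0, 0, 1]])
I_ki, I_t = np.random.rand(48, 64), np.random.rand(48, 64)
D_ki = np.full((48, 64), 2.0)
T = np.eye(4); T[0, 3] = 0.05                     # 5 cm translation along x
print(photometric_residual((40, 20), I_ki, I_t, D_ki, K, T))
```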
3.2. CNN-based Depth Prediction and Semantic Segmentation

Every time a new key-frame is created, an associated depth map is predicted via CNN. The depth prediction architecture that we employ is the state-of-the-art approach proposed in [16], based on the extension of the Residual Network (ResNet) architecture [9] to a Fully Convolutional network. In particular, the first part of the architecture is based on ResNet-50 [9] and initialized with weights pre-trained on ImageNet [24]. The second part of the architecture replaces the last pooling and fully connected layers originally present in ResNet-50 with a sequence of residual up-sampling blocks composed of a combination of unpooling and convolutional layers. After up-sampling, drop-out is applied before a final convolutional layer which outputs a 1-channel map representing the predicted depth. The loss function is based on the reverse Huber function [16].

Following the successful paradigm of other approaches that employed the same architecture for both depth prediction and semantic segmentation tasks [3, 29], we also retrained this network for predicting pixel-wise semantic labels from RGB images. To deal with this task, we modified the network so that it has as many output channels as the number of categories, and employed a soft-max layer and a cross-entropy loss function, minimized via back-propagation and Stochastic Gradient Descent (SGD). It is important to point out that, although in principle any semantic segmentation algorithm could be used, the primary objective of this work is to showcase how frame-wise segmentation maps can be successfully fused within our monocular SLAM framework (see Subsec. 3.5).
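The reverse Huber (berHu) loss mentioned above can be sketched as follows. The batch-dependent threshold c = 0.2 * max|residual| follows the convention reported in [16], but the exact constant should be treated as an assumption, and this is not the authors' training code; for the semantic variant, this term would simply be replaced by the soft-max cross-entropy described above.

```python
import numpy as np

def berhu_loss(pred, target):
    # Reverse Huber: L1 below the threshold c, scaled L2 above it.
    r = np.abs(pred - target)
    c = max(0.2 * r.max(), 1e-8)      # batch-dependent threshold (assumed)
    return np.where(r <= c, r, (r ** 2 + c ** 2) / (2 * c)).mean()

pred = np.array([[1.0, 2.1], [3.5, 0.7]])
target = np.array([[1.2, 2.0], [2.9, 0.8]])
print(berhu_loss(pred, target))       # small residuals stay in the L1 regime
```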
3.3. Key-frame Creation and Pose Graph Optimization

One limitation of using a pre-trained CNN for depth prediction is that, if the sensor used for SLAM has different intrinsic parameters from those used to capture the training set, the resulting absolute scale of the 3D reconstruction will be inaccurate. To ameliorate this issue, we propose to adjust the depth regressed via CNN by the ratio between the focal length of the current camera, $f_{cur}$, and that of the sensor used for training, $f_{tr}$, as

$$\tilde{D}_{k_i}(u) = \frac{f_{cur}}{f_{tr}} D_{k_i}(u) \qquad (5)$$

where $D_{k_i}$ is the depth map directly regressed by the CNN from the current key-frame image $I_i$.
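As a small worked example of Eq. (5) (the focal lengths below are made-up values chosen only for illustration, not those of any camera used in the paper):

```python
import numpy as np

f_tr = 520.0                    # focal length of the training camera (made up)
f_cur = 580.0                   # focal length of the SLAM camera (made up)
D_cnn = np.full((48, 64), 2.5)  # raw CNN depth for a key-frame, in metres
D_adj = (f_cur / f_tr) * D_cnn  # Eq. (5)
print(D_adj[0, 0])              # 2.5 * 580 / 520 ~= 2.79
```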
Fig. 3 shows the usefulness of the adjustment procedure defined in (5) on a sequence of the benchmark ICL-NUIM dataset [8] (compare (a) with (b)). As shown, the performance after the adjustment procedure is significantly improved.
[Figure 3(A): Comparison on pose trajectory accuracy - ground truth vs. (a) raw depth prediction vs. (b) with adjustment.]

… the nearest key-frame $k_j$ as

$$\tilde{U}_{k_j}(v) = \frac{D_{k_j}(v)}{D_{k_i}(u)} U_{k_j}(v) + \sigma_p^2 \qquad (7)$$

where $v = \pi\!\left(K T_{k_j}^{k_i} V_{k_i}(u)\right)$, while, following [4], $\sigma_p^2$ is the white noise variance used to increase the propagated uncertainty.
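A per-pixel NumPy sketch of the propagation in Eq. (7), as reconstructed above, is given below. The correspondence $v$ is assumed to have already been obtained via the warping of Subsec. 3.1; all values, including $\sigma_p^2$, are synthetic placeholders rather than settings from the paper.

```python
import numpy as np

def propagate_uncertainty(u, v, D_ki, D_kj, U_kj, sigma_p2=0.01):
    """U~_{k_j}(v) = (D_{k_j}(v) / D_{k_i}(u)) * U_{k_j}(v) + sigma_p^2."""
    ratio = D_kj[v[1], v[0]] / D_ki[u[1], u[0]]
    return ratio * U_kj[v[1], v[0]] + sigma_p2

D_ki = np.full((48, 64), 2.0)     # new key-frame depth (after Eq. 5)
D_kj = np.full((48, 64), 1.8)     # nearest key-frame (refined) depth
U_kj = np.full((48, 64), 0.05)    # nearest key-frame uncertainty
print(propagate_uncertainty((40, 20), (38, 20), D_ki, D_kj, U_kj))
```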
… square error between the estimated camera translation and the ground-truth camera translation for each evaluated sequence. In addition, we assess both reconstruction accuracy and density, by evaluating the percentage of depth values whose difference with the corresponding ground truth depth is less than 10%.
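Both measures can be sketched as follows; this is a minimal NumPy illustration on synthetic arrays, where the 10% criterion is interpreted as relative to the ground-truth depth (an assumption) and trajectory association and alignment are taken as already done.

```python
import numpy as np

def ate_rmse(est_t, gt_t):
    # est_t, gt_t: (N, 3) camera translations, already associated/aligned.
    return np.sqrt(np.mean(np.sum((est_t - gt_t) ** 2, axis=1)))

def depth_accuracy(est_d, gt_d, tol=0.10):
    # Percentage of depth values within `tol` (relative) of the ground truth.
    valid = gt_d > 0
    ok = np.abs(est_d[valid] - gt_d[valid]) < tol * gt_d[valid]
    return 100.0 * ok.mean()

est_t = np.random.rand(100, 3); gt_t = est_t + 0.01 * np.random.randn(100, 3)
est_d = np.full((48, 64), 2.0);  gt_d = np.full((48, 64), 2.1)
print(ate_rmse(est_t, gt_t), depth_accuracy(est_d, gt_d))
```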
Given the observations in the Table, our approach is able to always report a much higher pose trajectory accuracy with respect to monocular methods, due to their aforementioned absolute scale ambiguity. Interestingly, the pose accuracy of our technique is on average higher than that of LSD-SLAM even after applying bootstrapping, implying an inherent effectiveness of the proposed depth fusion approach rather than just estimating the correct scaling factor. The same benefits are present in terms of reconstruction, the estimated key-frames being not only dramatically more accurate, but also much denser than those reported by LSD-SLAM and ORB-SLAM. Moreover, our approach also reports a better performance in terms of both pose and reconstruction accuracy when comparing to the technique in [16], where CNN-predicted depths are used as input for SLAM without any refinement, this again demonstrating the effectiveness of the proposed scheme to refine the blurred edges and wrongly estimated depth values predicted by the CNN. Finally, we also clearly outperform REMODE in terms of depth map accuracy.

The increased accuracy with respect to the depth maps estimated by the CNN (as employed in [16]) and by REMODE, as well as the higher density with respect to LSD-SLAM, is also shown in Fig. 4. The figure compares the ground truth with a refined key-frame using our approach, the corresponding raw depth prediction from the CNN, the refined key-frame from LSD-SLAM [4] using bootstrapping, and the estimated dense depth map from REMODE on a sequence of the ICL-NUIM dataset. Not only does our approach demonstrate a much higher density with respect to LSD-SLAM, but the refinement procedure helps to drastically reduce the blurring artifacts of the CNN-based prediction, increasing the overall depth accuracy. Also, we can note that REMODE tends to fail along low-textured regions, as opposed to our method, which can estimate depth densely over such areas by leveraging the CNN-predicted depth values.

4.2. Accuracy under pure rotational motion

As mentioned, one of the advantages of our approach compared to standard monocular SLAM is that, under pure rotational motion, the reconstruction can still be obtained by relying on CNN-predicted depths, while other methods would fail given the absence of a stereo baseline between consecutive frames. To portray this benefit, we evaluate our method on the fr1/rpy sequence from the TUM dataset, mostly consisting of just rotational camera motion. The reconstructions obtained by, respectively, our approach and LSD-SLAM, compared to the ground truth, are shown in Figure 5.
Figure 5. Comparison, on a sequence that includes mostly pure rotational camera motion, between the reconstruction obtained by ground-truth depth (left), the proposed method (middle) and LSD-SLAM [4] (right).

Figure 6. Results of reconstruction and semantic label fusion on the office sequence (top, acquired with our own setup) and one sequence (kitchen 0046) from the NYU Depth V2 dataset [25] (bottom). The reconstruction is shown with colors (left) and with semantic labels (right).
As it can be seen, our method can reconstruct the scene structure even if the camera motion is purely rotational, while the result of LSD-SLAM is significantly noisy, since the stereo baseline required to estimate depth is for most frames not sufficient. We also tried ORB-SLAM on this sequence, but it completely fails, given the lack of the necessary baseline to initialize the algorithm.

4.3. Joint 3D and semantic reconstruction

Finally, we show some qualitative results of the joint 3D and semantic reconstruction achieved by our method. Three examples are shown in Fig. 6, which reports an office scene reconstructed from a sequence acquired with our own setup and two sequences from the test set of the NYU Depth V2 dataset [25]. Another example, from the sequence living0 of the ICL-NUIM dataset, is shown in Fig. 1, c). The Figures also report, in green, the estimated camera trajectory. To the best of our knowledge, this is the first demonstration of joint 3D and semantic reconstruction with a monocular camera. Additional qualitative results in terms of pose and reconstruction quality, as well as semantic label fusion, are included in the supplementary material.

5. Conclusion

We have shown how the integration of SLAM with depth prediction via a deep neural network is a promising direction to solve inherent limitations of traditional monocular reconstruction, especially with respect to estimating the absolute scale, obtaining dense depths along texture-less regions, and dealing with pure rotational motions. The proposed approach to refine CNN-predicted depth maps with small-baseline stereo matching naturally overcomes these issues, while retaining the robustness and accuracy of direct monocular SLAM in the presence of camera translations and high image gradients. The overall framework is capable of jointly reconstructing the scene while fusing semantic segmentation labels with the global 3D model, opening new perspectives towards scene understanding with a monocular camera. A future research avenue is represented by closing the loop with depth prediction, i.e. improving depth estimation by means of geometrically refined depth maps.
References

[1] E. Delage, H. Lee, and A. Y. Ng. A dynamic Bayesian network model for autonomous 3D reconstruction from a single indoor image. In Proc. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2006.
[2] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. Int. Conf. Computer Vision (ICCV), 2015.
[3] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proc. Conf. Neural Information Processing Systems (NIPS), 2014.
[4] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), 2014.
[5] J. Engel, J. Sturm, and D. Cremers. Semi-dense visual odometry for a monocular camera. In IEEE International Conference on Computer Vision (ICCV), December 2013.
[6] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel. Real-time monocular object SLAM. Robot. Auton. Syst., 75(PB), January 2016.
[7] W. N. Greene, K. Ok, P. Lommel, and N. Roy. Multi-level mapping: Real-time dense monocular SLAM. In 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016.
[8] A. Handa, T. Whelan, J. McDonald, and A. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation (ICRA), Hong Kong, China, May 2014.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR), 2016.
[10] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In Computer Vision and Pattern Recognition (CVPR), 2005.
[11] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In International Conference on 3D Vision (3DV), pages 1-8, 2013.
[12] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. International Symposium on Mixed and Augmented Reality (ISMAR), 2007.
[13] G. Klein and D. Murray. Improving the agility of keyframe-based SLAM. In European Conference on Computer Vision (ECCV), 2008.
[14] R. Kuemmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard. g2o: A general framework for graph optimization. In IEEE International Conference on Robotics and Automation (ICRA), 2011.
[15] K. Lai, L. Bo, and D. Fox. Unsupervised feature learning for 3D scene labeling. In Int. Conf. on Robotics and Automation (ICRA), 2014.
[16] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In IEEE International Conference on 3D Vision (3DV) (arXiv:1606.00373), October 2016.
[17] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR), pages 1119-1127, 2015.
[18] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In Computer Vision and Pattern Recognition (CVPR), 2010.
[19] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR), pages 5162-5170, 2015.
[20] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robotics, 31(5):1147-1163, 2015.
[21] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127-136, October 2011.
[22] R. A. Newcombe, S. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In IEEE International Conference on Computer Vision (ICCV), pages 2320-2327, 2011.
[23] M. Pizzoli, C. Forster, and D. Scaramuzza. REMODE: Probabilistic, monocular dense reconstruction in real time. In IEEE International Conference on Robotics and Automation (ICRA), 2014.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[25] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[26] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573-580, October 2012.
[27] K. Tateno, F. Tombari, and N. Navab. Real-time and scalable incremental segmentation on dense SLAM. 2015.
[28] V. Vineet, O. Miksik, M. Lidegaard, M. Nießner, S. Golodetz, V. A. Prisacariu, O. Kähler, D. W. Murray, S. Izadi, P. Pérez, and P. H. S. Torr. Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In IEEE International Conference on Robotics and Automation (ICRA), 2015.
[29] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR), pages 2800-2809, 2015.
[30] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald. Real-time large scale dense RGB-D SLAM with volumetric fusion. Intl. J. of Robotics Research, IJRR, 2014.