
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 26, NO. 12, DECEMBER 2020

Mobile3DRecon: Real-time Monocular 3D Reconstruction on a


Mobile Phone
Xingbin Yang, Liyang Zhou, Hanqing Jiang, Zhongliang Tang, Yuanbo Wang,
Hujun Bao, Member, IEEE, and Guofeng Zhang, Member, IEEE

• Xingbin Yang, Liyang Zhou, Hanqing Jiang, Zhongliang Tang, and Yuanbo Wang are with SenseTime Research. E-mail: {yangxingbin, zhouliyang, jianghanqing, tangzhongliang, wangyuanbo}@sensetime.com.
• Hujun Bao and Guofeng Zhang are with the State Key Lab of CAD&CG, Zhejiang University. E-mail: {bao, zhangguofeng}@cad.zju.edu.cn.
• Corresponding author: Guofeng Zhang.
• Xingbin Yang, Liyang Zhou, and Hanqing Jiang assert equal contribution and joint first authorship.

Manuscript received 18 May 2020; revised 26 July 2020; accepted 17 Aug. 2020. Date of publication 21 Sept. 2020; date of current version 3 Nov. 2020. Digital Object Identifier no. 10.1109/TVCG.2020.3023634.

Fig. 1: Exemplar case "Indoor office" reconstructed in real-time on MI8 using our Mobile3DRecon system. Note in the left column that the surface mesh is incrementally reconstructed online as we navigate through the environment. With the reconstructed 3D environment, realistic AR interactions with the real scene can be achieved, including virtual object placement and collision, as shown in the right column.

Abstract—We present a real-time monocular 3D reconstruction system on a mobile phone, called Mobile3DRecon. Using an embedded monocular camera, our system provides an online mesh generation capability on the back end together with real-time 6DoF pose tracking on the front end, allowing users to achieve realistic AR effects and interactions on mobile phones. Unlike most existing state-of-the-art systems, which produce only point cloud based 3D models online or surface mesh offline, we propose a novel online incremental mesh generation approach to achieve fast online dense surface mesh reconstruction that satisfies the demand of real-time AR applications. For each keyframe of 6DoF tracking, we perform a robust monocular depth estimation, with a multi-view semi-global matching method followed by a depth refinement post-processing. The proposed mesh generation module incrementally fuses each estimated keyframe depth map into an online dense surface mesh, which is useful for achieving realistic AR effects such as occlusions and collisions. We verify our real-time reconstruction results on two mid-range mobile platforms. The experiments with quantitative and qualitative evaluation demonstrate the effectiveness of the proposed monocular 3D reconstruction system, which can handle the occlusions and collisions between virtual objects and real scenes to achieve realistic AR effects.
Index Terms—real-time reconstruction, monocular depth estimation, incremental mesh generation

1 INTRODUCTION
Augmented reality (AR) is playing an important role in presenting virtual 3D information in the real world. A realistic fusion of a virtual 3D object with a real scene relies on the consistency in 3D space between virtuality and reality, including consistent localization, visibility, shadow and interaction like physical occlusion between the virtual object and the real scene. To achieve consistent localization in space, a SLAM system can be used for real-time alignment of virtual objects to the real environment. More and more visual-inertial 6DoF odometry systems like VINS-Mobile [19] have been studied to run on mobile phones, which greatly extends the applications of AR in the mobile industry. However, most of these works focus on sparse mapping without special consideration of how to reconstruct dense geometrical structures of the scene. A dense surface scene representation is the key to a full 3D perception of our environment, and an important prerequisite for more realistic AR effects including consistency of occlusion, shadow mapping and collision between virtual objects and the real environment. More recent works such as KinectFusion [24] and BundleFusion [1] tried to perform 6DoF odometry with dense mapping. However, these works rely on videos augmented with precise depths captured by an RGB-D camera as input, which is currently impossible for most mid-range mobile phones. Besides, due to the huge computation of the dense mapping process, most of these systems work on desktop PCs or high-end mobile devices. Some works [23, 26] also tried to reconstruct dense surfaces on mobile phones with a monocular camera, but are limited in reconstruction scale due to the complexity of both computation and memory. Moreover, these systems cannot generate surface mesh online, which certainly degrades their applications in real-time realistic AR.


To make seamless AR available to more mid-range mobile platforms, this paper introduces a new system for real-time dense surface reconstruction on mobile phones, which we name Mobile3DRecon. Our Mobile3DRecon system can perform real-time surface mesh reconstruction on mid-range mobile phones with the monocular camera we usually have in our pockets, without any extra hardware or depth sensor support. We focus on the 3D reconstruction requirement of realistic AR effects by proposing a keyframe-based real-time surface mesh generation approach, which is essential for AR applications on mobile phones. Our main contributions can be summarized as:

• We propose a multi-view keyframe depth estimation method, which can robustly estimate dense depths even in textureless regions with a certain pose error. We introduce a multi-view semi-global matching (SGM) with a confidence-based filtering to reliably estimate depths and remove unreliable depths caused by pose errors or textureless regions. Noisy depths are further optimized by a deep neural refinement network.

• We propose an efficient incremental mesh generation approach which fuses the estimated keyframe depth maps to reconstruct the surface mesh of the scene online, with local mesh triangles updated incrementally. This incremental meshing approach not only provides an online dense 3D surface for seamless AR effects on the front end, but also ensures real-time performance of mesh generation as a back-end CPU module on mid-range mobile platforms, which is difficult for previous online 3D reconstruction systems.

• We present a real-time dense surface mesh reconstruction pipeline with a monocular camera. On a mid-range mobile platform, our monocular keyframe depth estimation and incremental mesh updating are performed at no more than 125 ms/keyframe on the back end, which is fast enough to keep up with the front-end 6DoF tracking at more than 25 frames-per-second (FPS).

This paper is organized as follows: Section 2 briefly presents related work. Section 3 gives an overview of the proposed Mobile3DRecon system. The monocular depth estimation method and the incremental mesh generation approach are described in Sections 4 and 5 respectively. Finally, we evaluate the proposed solution in Section 6.

2 RELATED WORK

The development of consumer RGB-D cameras such as Microsoft Kinect and Intel RealSense gives rise to a large number of real-time dense reconstruction systems. With an RGB-D camera, a real-time dense reconstruction system simultaneously localizes the camera using iterative closest point (ICP), and fuses all the tracked camera depths into a global dense map represented by TSDF voxels or surfels. An impressive work is KinectFusion [24], which tracks an input Kinect to the ray-casted global model using ICP, and fuses depths into the global map using TSDF. KinectFusion is not able to work for large scale scenes, due to the computation and memory limitations of TSDF voxels. More recent dense mapping works such as BundleFusion [1] used voxel hashing [25] to break through the limitations of TSDF fusion. InfiniTAM [15] proposed a highly efficient implementation of voxel hashing on mobile devices. A more light-weight spatially hashed SDF strategy on CPU for mobile platforms was proposed in CHISEL [17]. ElasticFusion [37] uses a surfel representation for dense frame-to-model tracking and explicitly handles loop closures using non-rigid warping. The surfel map representation is more suitable for online refinement of both pose and the underlying 3D dense map, which is difficult to handle with TSDF voxel fusion. However, TSDF is more suitable for online 3D model visualization and ray intersection in AR applications, which is hard for surfels. An online TSDF refinement is performed in [1] by voxel deintegration.

Although impressive dense reconstruction quality can be achieved by an RGB-D camera, few mobile phones are equipped with one today. Besides, most consumer RGB-D cameras are not able to work well in outdoor environments. These limitations encouraged researchers to explore real-time dense reconstruction with monocular RGB cameras, which are smaller, cheaper, and widely equipped on mobile phones. Without input depths, a real-time dense reconstruction system should estimate depths of the input RGB frames, and fuse the estimated depths into a global 3D model by TSDF or surfels. MonoFusion [30] presented a real-time dense reconstruction with a single web camera and MobileFusion [26] proposed a real-time 3D model scanning tool on mobile devices with a monocular camera. Both works perform volume-based TSDF fusion without voxel hashing, and therefore can only reconstruct in a limited 3D space. For large-scale scenes, Tanskanen et al. [36] proposed a live reconstruction system on mobile phones, which can perform inertial metric scale estimation while producing dense surfels of the scanned scenes online. Kolev et al. [18] enhanced the pipeline of [36] by introducing a confidence-based depth map fusion method. Garro et al. [3] proposed a fast metric reconstruction algorithm on mobile devices, which solves the metric scale estimation problem with a novel RANSAC-based alignment approach in inertial measurement unit (IMU) acceleration space. Schöps et al. [32] estimate sparse depths via motion stereo with a monocular fisheye camera on the GPU of Google's Project Tango Tablets, and integrate the filtered depths by the Tango's volumetric fusion pipeline for large-scale outdoor environments. CHISEL [17] proposed a system for real-time dense 3D reconstruction on Google's Project Tango. Yang et al. [42] infer online depths by two-frame motion stereo with temporal cost aggregation and semi-global optimization, and fuse depths into an occupancy map with uncertainty-aware TSDF, to achieve a 10Hz online dense reconstruction system with the help of an Nvidia Jetson TX1 GPU on aerial robots. Very few works achieve real-time outdoor scene reconstruction on mid-range mobile phones with a monocular camera.

There are also some offline dense reconstruction works on mobile devices. 3DCapture [23] presented a dense textured model reconstruction system, which starts with an online RGB and IMU data capturing stage followed by an offline post-processing reconstruction. The main reconstruction steps, including pose tracking, depth estimation, TSDF depth fusion, mesh extraction and texture mapping, are all done as the post-processing stage on mobile devices. Poiesi et al. [28] described another cloud-based dense scene reconstruction system that performs Structure-from-Motion (SfM) and local bundle adjustment on monocular videos from smartphones to reconstruct a consistent point cloud map for each client, and runs periodic full bundle adjustments to align the maps of various clients on a cloud server. Some other works presented real-time dense scene reconstruction with GPU acceleration on desktop PC. For example, Merrell et al. [22] proposed a real-time 3D reconstruction pipeline on PC, which utilizes visibility-based and confidence-based fusion for merging multiple depth maps into an online large-scale 3D model. Pollefeys et al. [29] presented a complete system for real-time video-based 3D reconstruction, which captures large-scale urban scenes with multiple video cameras mounted on a driving vehicle.

As depth estimation is a key stage of our proposed reconstruction pipeline, our work is related to a great amount of works on binocular and multi-view stereo, which have been thoroughly investigated in [27, 31, 33, 35]. REMODE [27] carries out a probabilistic depth measurement model for achieving real-time depth estimation on a laptop computer, with the help of CUDA parallelism. For achieving higher quality depth inference from multi-view images, more recent works such as [5, 10–12, 34, 40, 41, 43] employ deep neural networks (DNN) to address depth estimation or refinement problems. However, the generalization performance of most DNN-based methods will be affected by camera pose errors or texturelessness in practical applications. Besides, most of these depth estimation works are not applicable to mobile phones as they do not meet the underlying efficiency requirements of mobile platforms. Online depth from motion on mobile devices has been almost exclusively studied in the literature of dense 3D reconstruction, such as [18, 26, 32, 36, 42]. Due to the limited computing power provided by mobile platforms, sparse depth maps are estimated online in these methods, with historical frames selected as references for stereo.


Fig. 2: System framework.

A multi-resolution scheme is used to speed up dense stereo matching in [36], with GPU acceleration. Valentin et al. [39] proposed a low latency dense depth map estimation method using a single CPU on mid-range mobile phones. The proposed method incorporates a planar bilateral solver to filter planar depths while refining depth boundaries, which is more suitable for occlusion rendering purposes. But for our dense reconstruction purpose, accurate depths should be estimated online with limited mobile platform resources, which is indeed a difficult problem for most existing depth estimation works.

Another important stage of our pipeline is surface extraction. Some real-time dense reconstruction works such as [18, 36, 37] only generate a surfel cloud. Poisson surface reconstruction [16] should be done as a post-process to extract a surface model from surfels. However, AR applications usually require real-time surface mesh generation for interactive anchor placement and collision. Although works such as [26] update the TSDF volume in real-time, surface mesh extraction from TSDF is done offline by Marching cubes [21]. [42] uses the TSDF only to maintain a global occupancy map instead of a 3D mesh. CHISEL [17] proposed a simple incremental Marching cubes scheme, in which polygonal meshes are generated only for the part of the scene that needs to be rendered. In comparison, our work achieves a truly incremental real-time surface mesh generation that ensures both incremental mesh updating and concise mesh topology, which satisfies the requirements of graphics operations for realistic AR effects to a larger extent.

3 SYSTEM OVERVIEW

We now outline the steps of the proposed real-time scene reconstruction pipeline as in Fig. 2.

As a user navigates through the environment with a mobile phone with a monocular camera, our pipeline tracks 6DoF poses of the mobile phone using a keyframe-based visual-inertial SLAM system, which tracks the 6DoF poses on the front end, while maintaining a keyframe pool on the back end with a global optimization module to refine the poses of all the keyframes as feedback to the front-end tracking. We use SenseAR SLAM [SenseTime] (http://openar.sensetime.com) in our pipeline for pose tracking. Note that any other keyframe-based VIO or SLAM system such as ARCore [Google] can be used at this stage.

After the 6DoF tracking is initialized normally on the front end, for the latest incoming keyframe from the keyframe pool with its globally optimized pose, its dense depth map is estimated online by multi-view SGM, with a part of the previous keyframes selected as reference frames. A convolutional neural network follows the multi-view SGM to refine depth noise. The refined keyframe depth map is then fused to generate the dense surface mesh of the surrounding environment. Our pipeline performs an incremental online mesh generation, which is more suitable for the requirement of real-time 3D reconstruction by AR applications on mobile phone platforms. Both the depth estimation and the incremental meshing are done as back-end modules. As a dense mesh is gradually reconstructed at the back end, high-level AR applications can utilize this real-time dense mesh and the 6DoF SLAM poses to achieve realistic AR effects for the user on the front end, including AR occlusions and collisions, as shown in Fig. 1. In the following sections, the main steps of our pipeline will be described in detail.

4 MONOCULAR DEPTH ESTIMATION

With the globally optimized keyframe poses from the SenseAR SLAM platform, we estimate a depth map for each incoming keyframe online. Some recent works [5, 10, 43] tried to solve the multi-view stereo problem with DNNs, but they turn out to have generalization problems in practical applications with pose errors and textureless regions. As can be seen in Fig. 5(a), especially for online SLAM tracking, there are usually inevitable inaccurate poses with epipolar errors of more than 2 pixels, which will certainly affect the final multi-view depth estimation results. Instead, our system solves multi-view depth estimation by a generalized multi-view SGM algorithm. Different from the general two-view SGM [7, 8], our proposed SGM approach is more robust because depth noise can be suppressed by means of multi-view costs. Besides, no stereo rectification needs to be done beforehand, as we compute costs in depth space directly instead of disparity space. Moreover, considering the inevitable depth errors caused by pose errors, textureless regions or occlusions, a DNN-based depth refinement scheme is used to filter and refine the initial depth noise. We will show that our scheme of a multi-view SGM followed by a DNN-based refinement performs better than end-to-end multi-view stereo networks.
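Before diving into the individual stages, the following is a minimal sketch of the back-end flow described above: consume globally optimized keyframes, estimate a depth map, and fuse it into the incremental mesh. All type and function names here are illustrative assumptions rather than the actual Mobile3DRecon or SenseAR SLAM API.

```cpp
// Hedged sketch of the back-end worker of Section 3 (assumed names, not the real API).
#include <vector>

struct Keyframe { int id = 0; /* image, intrinsics, globally optimized pose, ... */ };
struct DepthMap { /* per-pixel depths of one keyframe */ };

// Stubs standing in for the SLAM keyframe pool and the two back-end stages.
bool nextKeyframe(Keyframe& kf) { (void)kf; return false; }                          // ~5 keyframes per second
std::vector<Keyframe> selectReferenceFrames(const Keyframe&) { return {}; }          // Section 4.1
DepthMap estimateDepth(const Keyframe&, const std::vector<Keyframe>&) { return {}; } // Section 4
void fuseDepthIncrementally(const Keyframe&, const DepthMap&) {}                     // Section 5

// Runs on a single back-end CPU thread; the front end only reads the resulting mesh.
void backEndWorker() {
    Keyframe kf;
    while (nextKeyframe(kf)) {
        auto refs = selectReferenceFrames(kf);
        DepthMap depth = estimateDepth(kf, refs);
        fuseDepthIncrementally(kf, depth);
    }
}
```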


The whole depth estimation process runs on a single back-end CPU thread to avoid occupying front-end resources used by SLAM tracking or GPU graphics. Since depth estimation is done for each keyframe, we only need to keep up with the frequency of keyframes (almost 5 keyframes-per-second) in order to achieve real-time depth estimation performance on mid-range mobile platforms such as the OPPO R17 Pro. We now describe our monocular depth estimation in detail.

4.1 Reference Keyframes Selection

For each incoming keyframe, we seek a set of neighboring keyframes as reference frames that are good for multi-view stereo. First, the reference keyframes should provide sufficient parallax for stable depth estimation. This is because in a very short baseline camera setting, a small pixel matching error in the image domain can cause a large fluctuation in depth space. Second, to achieve complete scene reconstruction, the current selected keyframe should have as large an overlap as possible with the reference ones. In other words, we want to make sure there is enough overlap between the current keyframe and the selected reference ones.

To meet the requirements above, we select neighboring keyframes far away from the current keyframe, while avoiding too large a baseline, which may cause low overlap. For simplicity, we use t to denote a keyframe at time t. Therefore, for each keyframe t, we exploit a baseline score between t and another keyframe t′ as:

w_b(t, t′) = \exp\left( -\frac{(b(t, t′) - b_m)^2}{\delta^2} \right),    (1)

where b(t, t′) is the baseline between t and t′, b_m is the expectation and δ is the standard deviation. We set b_m = 0.6 m and δ = 0.2 in our experiments. Meanwhile, high overlap should be kept between t and t′ to cover as large a common perspective of view as possible for better matching. We compute the angle between the optical axes of the two frames. Generally, a larger angle indicates more different perspectives. To encourage a higher overlap, we define a viewing score between t and t′ as:

w_v(t, t′) = \max\left( \alpha_m / \alpha(t, t′),\; 1 \right),    (2)

where α(t, t′) is the angle between the optical viewing directions of t and t′. The function scores the angles below α_m, which is set to 10 degrees in our experiments.

The scoring function is simply the product of these two terms:

S(t, t′) = w_b(t, t′) \cdot w_v(t, t′).    (3)

For each new keyframe, we search the historical keyframe list for reference frames. The historical keyframes are sorted by the scores computed by Eq. (3), and the top keyframes are chosen as the reference frames for stereo. More reference frames are helpful for ensuring the accuracy of depth estimation, but will certainly slow down the stereo matching computation, especially on mobile phones. Therefore, we only choose the top two as a trade-off between accuracy and computation efficiency, as shown in the "Indoor stairs" case of Fig. 5(a-b). We compute a final score for each new keyframe t as the sum of the scores between t and its reference keyframes as follows:

S(t) = \sum_{t′ \in N(t)} S(t, t′),    (4)

where N(t) denotes the reference keyframes of t. The computed final score will be used as a weight in the following multi-view stereo cost computation.

4.2 Multi-View Stereo Matching

For each new keyframe, we propose to estimate its depth map using an SGM-based multi-view stereo approach. We uniformly sample the inverse depth space into L levels. Suppose the depth measurement is bounded to a range from d_min to d_max. The l-th sampled depth can be computed as follows:

d_l = \frac{(L - 1)\, d_{\min}\, d_{\max}}{(L - 1 - l)\, d_{\min} + l\, d_{\max}},    (5)

where l ∈ {0, 1, 2, ..., L − 1}, and d_l is the sampled depth at the l-th level. Given a pixel x with depth d_l on keyframe t, its projection pixel x_{t→t′}(d_l) on frame t′ by d_l can be calculated by:

x_{t→t′}(d_l) \sim d_l K_{t′} R_{t′} R_t^T K_t^{-1} \hat{x} + K_{t′} (T_{t′} - R_{t′} R_t^T T_t),    (6)

where K_t, K_{t′} are the camera intrinsic matrices of keyframes t and t′, R_t, R_{t′} are the rotation matrices, and T_t, T_{t′} are the translation vectors respectively. Note that x̂ is the homogeneous coordinate of x.

We resort to a variant of the Census Transform (CT) [6] as the feature descriptor to compute patch similarity. Compared to other image similarity measurements such as Normalized Cross Correlation, CT has the characteristic of boundary preservation. Besides, mobile applications considerably benefit from the binary representation of CT. Therefore, our matching cost is determined as follows:

C(x, l) = \sum_{t′ \in N(t)} w_{t′}\, CT(x, x_{t→t′}(d_l)),  \quad  w_{t′} = S(t′) / \sum_{\tau \in N(t)} S(\tau),    (7)

where N(t) is the set of selected reference keyframes, w_{t′} is the cost weight of reference frame t′, and S(t′) is the final score of t′. CT(x, x_{t→t′}(d_l)) is the Census cost of the two patches centered at x and x_{t→t′}(d_l) respectively. We use a lookup table to calculate the Hamming distance between two Census bit strings. We traverse every pixel corresponding to each slice with label l and compute its matching cost according to Eq. (7). After that, we get a cost volume with size W × H × L, where W and H are the width and height of the frame. Then the cost volume is aggregated, with a Winner-Take-All strategy taken subsequently to obtain the initial depth map.

Compared to the conventional binocular setting [8], we select multiple keyframes as references to accumulate costs for suppression of depth noise caused by camera pose errors or textureless regions. Although pose errors can be suppressed by multi-view matching to some extent, there are still generally ambiguous matches in repetitive patterns or textureless regions, which result in noisy depth maps. Inspired by the method proposed in [7, 8], we add an additional regularization to support smoothness by penalizing depth labeling changes in a pixel's neighborhood. Specifically, for image pixel x with label l, the cost aggregation is done by recursive computation of costs in neighboring directions as:

\hat{C}_r(x, l) = C(x, l) + \min\left\{ \hat{C}_r(x - r, l),\; \hat{C}_r(x - r, l - 1) + P_1,\; \hat{C}_r(x - r, l + 1) + P_1,\; \min_i \hat{C}_r(x - r, i) + P_2 \right\} - \min_k \hat{C}_r(x - r, k),    (8)

where Ĉ_r(x, l) is the aggregated cost of x with label l in neighboring direction r, and r ∈ N_r, which is the neighboring direction set (we use an 8-neighborhood). P_1 and P_2 are the penalty values. Since an intensity difference usually indicates a depth discontinuity, we set P_2 = −a · |∇I_t(x)| + b. Here ∇I_t(x) is the intensity gradient of x in image I_t at keyframe t, which we find useful for preserving depth boundaries. The value of Ĉ_r increases along the path, which may lead to extremely large values or even overflow. Therefore, the last term min_k Ĉ_r(x − r, k) is subtracted to avoid a permanent increase of the aggregated cost without changing the minimum-cost depth level. The aggregated costs of x at depth label l are recursively computed for all the neighboring directions and summed up. The final cost volume is formed by accumulating each pixel in the image as follows:

\hat{C}(x, l) = \sum_{r \in N_r} \hat{C}_r(x, l).    (9)

The final depth label l̂(x) is given by a Winner-Take-All strategy as:

\hat{l}(x) = \arg\min_l \hat{C}(x, l).    (10)
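Putting the reference-frame scoring of Section 4.1 together (Eqs. (1)-(4)), a minimal sketch of the selection step might look as follows; the Candidate structure and the way baselines and viewing angles are obtained are simplifying assumptions, not the paper's implementation.

```cpp
// Hedged sketch of reference keyframe selection (Section 4.1).
#include <algorithm>
#include <cmath>
#include <vector>

struct Candidate { int id; double baseline; double viewAngleDeg; };   // precomputed per historical keyframe

double baselineScore(double b, double bm = 0.6, double delta = 0.2) {
    return std::exp(-(b - bm) * (b - bm) / (delta * delta));          // Eq. (1)
}
double viewingScore(double angleDeg, double alphaM = 10.0) {
    return std::max(alphaM / angleDeg, 1.0);                          // Eq. (2), as given in the paper
}

// Picks the top-2 reference keyframes and returns the final score S(t) of Eq. (4).
double selectReferences(std::vector<Candidate> history, std::vector<Candidate>& refs) {
    for (auto& c : history)
        c.viewAngleDeg = std::max(c.viewAngleDeg, 1e-3);              // guard against division by zero (not in the paper)
    std::sort(history.begin(), history.end(), [](const Candidate& a, const Candidate& b) {
        return baselineScore(a.baseline) * viewingScore(a.viewAngleDeg) >
               baselineScore(b.baseline) * viewingScore(b.viewAngleDeg);   // Eq. (3)
    });
    double finalScore = 0.0;
    for (size_t i = 0; i < history.size() && i < 2; ++i) {            // top-2 trade-off from the paper
        refs.push_back(history[i]);
        finalScore += baselineScore(history[i].baseline) * viewingScore(history[i].viewAngleDeg);
    }
    return finalScore;                                                 // S(t) of Eq. (4)
}
```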
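For the stereo matching itself, the sketch below illustrates the inverse-depth sampling of Eq. (5) and one directional pass of the cost aggregation of Eq. (8) along an image scanline; the flat cost-volume layout and the function names are assumptions for illustration, and P2 is kept constant here although the paper modulates it by the intensity gradient.

```cpp
// Sketch of Eq. (5) and a single-direction SGM-style aggregation pass of Eq. (8).
#include <algorithm>
#include <vector>

// Eq. (5): uniformly sample the inverse depth range into L levels.
double sampledDepth(int l, int L, double dmin, double dmax) {
    return (L - 1) * dmin * dmax / ((L - 1 - l) * dmin + l * dmax);
}

// One aggregation pass along one direction r for a scanline of W pixels.
// cost[x*L + l] holds C(x,l); aggr is overwritten with the aggregated cost of Eq. (8).
void aggregateScanline(const std::vector<float>& cost, std::vector<float>& aggr,
                       int W, int L, float P1, float P2) {
    aggr = cost;                                                      // first pixel: no predecessor
    for (int x = 1; x < W; ++x) {
        const float* prev = &aggr[(x - 1) * L];
        float prevMin = *std::min_element(prev, prev + L);
        for (int l = 0; l < L; ++l) {
            float best = prev[l];                                     // same label
            if (l > 0)     best = std::min(best, prev[l - 1] + P1);   // label changes of 1 penalized by P1
            if (l < L - 1) best = std::min(best, prev[l + 1] + P1);
            best = std::min(best, prevMin + P2);                      // larger jumps penalized by P2
            aggr[x * L + l] = cost[x * L + l] + best - prevMin;       // subtract the min to avoid overflow
        }
    }
}
```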


In order to get a sub-level depth value, we use parabola fitting to acquire a refined depth level l̂_s(x) as:

\hat{l}_s(x) = \hat{l}(x) + \frac{\hat{C}(x, \hat{l}(x)-1) - \hat{C}(x, \hat{l}(x)+1)}{2\hat{C}(x, \hat{l}(x)-1) - 4\hat{C}(x, \hat{l}(x)) + 2\hat{C}(x, \hat{l}(x)+1)}.    (11)

We replace l in Eq. (5) with the refined depth level l̂_s(x) to get a more accurate sub-level depth measurement for x. With the sub-level depths, we get an initial depth map D_t^i for each keyframe t, as shown in Fig. 5(b).

On the mobile platform, we use the NEON instruction set to significantly accelerate the cost volume computation and the aggregation optimization. Taking the OPPO R17 Pro with a Qualcomm Snapdragon 710 chip as an example, NEON acceleration makes the computation 2 times faster for cost volume computation and 8 times faster for aggregation, which ensures real-time performance on the mobile phone.

4.3 Depth Refinement

Although our multi-view stereo matching is robust to unstable SLAM tracking to some extent, there are still noisy depths in the initial depth maps, which come from the mistaken matching costs caused by camera pose errors or textureless regions, as seen in Fig. 5(b). Therefore, a depth refinement strategy is introduced as a post-processing step to correct depth noise. The proposed refinement scheme consists of a confidence-based depth filtering and a DNN-based depth refinement, which will be described in more detail.

4.3.1 Confidence Based Depth Filtering

Fig. 3: Confidence-based depth map filtering compared to Drory et al. [2]. (a) A representative keyframe. (b) The depth map estimated by multi-view SGM. (c) The confidence map measured by [2]. (d) The depth map filtered by (c). (e) Our confidence map. (f) Our depth map filtered by (e).

Although the depth maps obtained by our semi-global optimization have good completeness, the depth noise in textureless regions is obvious, which requires a confidence measurement for further depth filtering. Drory et al. [2] proposed an uncertainty measurement for SGM, in which they assume the difference between the sum of the lower-bound aggregated costs and the minimum of the aggregated costs equals zero. The uncertainty measurement for pixel x can be expressed as follows:

U(x) = \min_l \sum_{r \in N_r} \hat{C}_r(x, l) - \sum_{r \in N_r} \min_l \left( \hat{C}_r(x, l) - \frac{N_r - 1}{N_r} C(x, l) \right).    (12)

However, this method ignores the neighborhood depth information when calculating the uncertainty measure U(x), resulting in some isolated noise in the depth map. Considering the fact that neighboring pixel depths do not change greatly for correct depth measurements, we calculate a weight ω(x) for the uncertainty measurement of x in its 5 × 5 local window Ω(x), which measures the neighboring depth level differences as:

\omega(x) = \frac{1}{|\Omega(x)|} \sum_{y \in \Omega(x)} \left[ |\hat{l}_s(x) - \hat{l}_s(y)| > 2 \right],    (13)

where |Ω(x)| is the number of pixels in the local window Ω(x). We normalize the weighted uncertainty measurement as the final confidence M(x) of x as:

M(x) = 1 - \frac{\omega(x)\, U(x)}{\max_{u \in I_t} \left( \omega(u)\, U(u) \right)}.    (14)

As can be seen in Fig. 3, compared to [2], the proposed confidence-based depth filtering method handles the depth noise more effectively, especially along the depth map borders.

In our implementation, we remove all pixels with confidence lower than 0.15 to get a filtered depth map D_t^f with fewer abnormal noisy depths for each keyframe t, as shown in Fig. 5(c). With this confidence-based filtering, most depth outliers caused by pose errors or textureless regions can be suppressed.

4.3.2 DNN-based Depth Refinement

Fig. 4: Depth refinement network flow.

After the confidence-based depth filtering, we employ a deep neural network to refine the remaining depth noise. Our network can be regarded as a two-stage refinement structure, as illustrated in Fig. 4. The first stage is an image-guided sub-network CNN_G which combines the filtered depth D_t^f with the corresponding gray image I_t of keyframe t to reason a coarse refinement result D_t^c. Here, the gray image plays the role of guidance for depth refinement. It provides prior knowledge of object edges and semantic information for CNN_G. With this guidance, the sub-network CNN_G is able to distinguish the actual depth noise to be refined. The second stage is a residual U-Net CNN_R which further refines the previous coarse result D_t^c to get the final refined depth D_t. The U-Net structure mainly contributes to making the learning process more stable and overcoming the feature degradation problem.

To satisfy our refinement purpose, we follow [4, 9, 14] and exploit three spatial loss functions to penalize depth noise while maintaining object edges. The training loss is defined as:

\Phi = \Phi_{edge} + \Phi_{pd} + \lambda \Phi_{vgg}.    (15)

Φ_edge is an edge-maintenance loss, which contains three terms respectively defined as:

\Phi_{edge} = \Phi_x + \Phi_y + \Phi_n,
\Phi_x = \frac{1}{|I_t|} \sum_{x \in I_t} \ln\left( |\nabla_x D_t(x) - \nabla_x D_t^g(x)| + 1 \right),
\Phi_y = \frac{1}{|I_t|} \sum_{x \in I_t} \ln\left( |\nabla_y D_t(x) - \nabla_y D_t^g(x)| + 1 \right),    (16)
\Phi_n = \frac{1}{|I_t|} \sum_{x \in I_t} \left( 1 - \eta_t(x) \cdot \eta_t^g(x) \right),

where |I_t| is the number of valid pixels in keyframe image I_t, D_t^g(x) represents the ground-truth depth value of pixel x, and D_t(x) is the final refined depth of x. ∇_x and ∇_y denote the Sobel derivatives along the x-axis and y-axis respectively. The depth normal η_t(x) is approximated by η_t(x) = (−∇_x D_t(x), −∇_y D_t(x), 1), and η_t^g(x) is the ground-truth normal estimated in the same way. Φ_x and Φ_y measure the depth differences along edges. Φ_n measures the similarity between surface normals.

Φ_pd is a photometric-depth loss, which minimizes depth noise by forcing gradient consistency between the gray image I_t and the refined depth D_t as follows:

\Phi_{pd} = \frac{1}{|I_t|} \sum_{x \in I_t} \left( \left| \nabla_x^2 D_t(x) \right| e^{-\alpha |\nabla_x I_t(x)|} + \left| \nabla_y^2 D_t(x) \right| e^{-\alpha |\nabla_y I_t(x)|} \right),    (17)

where α = 0.5 in our experiments. ∇_x^2 and ∇_y^2 are the second derivatives along the x-axis and y-axis respectively. As demonstrated in Eq. (17), Φ_pd encourages D_t to be consistent with the corresponding gray image I_t in pixel areas with smaller gradients. We use second derivatives to make the refinement process more stable.

Φ_vgg is the high-level perceptual loss commonly used in generative adversarial networks like [13, 14]. By minimizing the difference between high-level features, the perceptual loss contributes to maintaining the global data distribution and avoiding artifacts.
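As a concrete illustration of the sub-level refinement in Eq. (11), the following is a minimal sketch of the parabola-fitting step around the winning label; the function name and the per-pixel cost layout are assumptions for illustration, not the paper's implementation.

```cpp
// Hedged sketch of the parabola fitting of Eq. (11): refine the integer winning
// label l to a sub-level value using the three costs around it.
#include <vector>

// cost[l] holds the summed aggregated cost C^(x, l) for one pixel x.
double refineSubLevel(const std::vector<float>& cost, int l) {
    if (l <= 0 || l + 1 >= static_cast<int>(cost.size()))
        return static_cast<double>(l);                       // no neighbors: keep the integer label
    double num   = cost[l - 1] - cost[l + 1];
    double denom = 2.0 * cost[l - 1] - 4.0 * cost[l] + 2.0 * cost[l + 1];
    if (denom == 0.0)
        return static_cast<double>(l);                       // degenerate (flat) cost profile
    return l + num / denom;                                  // sub-level label l_s(x)
}
```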
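Similarly, the confidence filtering of Eqs. (12)-(14) can be sketched as below; the uncertainty U(x) is assumed to be precomputed per pixel following [2], the 5 × 5 window and the 0.15 threshold follow the values given in the text, and the names and data layout are illustrative assumptions.

```cpp
// Sketch of the confidence-based depth filtering (Eqs. 13-14): weight the per-pixel
// SGM uncertainty U by local depth-level variation, normalize it into a confidence M,
// and invalidate pixels with M below the threshold.
#include <algorithm>
#include <cmath>
#include <vector>

void filterDepth(std::vector<float>& subLevel,        // refined labels l_s(x), size W*H, <0 marks invalid
                 const std::vector<float>& U,         // uncertainty per pixel, Eq. (12)
                 int W, int H, float threshold = 0.15f) {
    std::vector<float> weighted(W * H, 0.f);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            int count = 0, valid = 0;
            for (int dy = -2; dy <= 2; ++dy)                  // 5x5 window Omega(x)
                for (int dx = -2; dx <= 2; ++dx) {
                    int xx = x + dx, yy = y + dy;
                    if (xx < 0 || yy < 0 || xx >= W || yy >= H) continue;
                    ++valid;
                    if (std::fabs(subLevel[y * W + x] - subLevel[yy * W + xx]) > 2.f) ++count;
                }
            float omega = valid > 0 ? static_cast<float>(count) / valid : 0.f;   // Eq. (13)
            weighted[y * W + x] = omega * U[y * W + x];
        }
    float maxWU = *std::max_element(weighted.begin(), weighted.end());
    if (maxWU <= 0.f) return;                                 // nothing to normalize against
    for (int i = 0; i < W * H; ++i) {
        float M = 1.f - weighted[i] / maxWU;                  // Eq. (14)
        if (M < threshold) subLevel[i] = -1.f;                // drop low-confidence depths
    }
}
```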


We train the proposed depth refinement network using the Demon dataset proposed by the University of Freiburg [38] together with our own dataset. We first run our multi-view SGM and confidence-based filtering on the Demon dataset to generate a set of initially filtered depth maps and reassemble them with the corresponding ground truth (GT) as training pairs to pre-train the proposed network. The pre-trained refinement model is then finetuned on our own captured sequences, which contain 3,700 training pairs. These pairs consist of 13 indoor and outdoor scenes, which are all captured by an OPPO R17 Pro. Each training pair contains an initially filtered depth map and a GT depth map from the embedded ToF sensor. Finally, the Adam policy is adopted to optimize the depth refinement network.

Fig. 5(d) shows the depth maps refined by our DNN-based refinement network, which contain less spatial depth noise. We will show in the following section that our DNN-based depth refinement network can also improve the final surface mesh reconstruction results. As further seen in Fig. 10(d-f) and Table 1, our depth refinement network following the multi-view SGM performs better in generalization than directly using end-to-end learning-based depth estimation algorithms like [11, 40]. We also compare the time cost of our scheme with other end-to-end networks. On OPPO R17 Pro, our monocular depth estimation takes 70.46 ms/frame, while MVDepthNet [40] takes 5.91 s/frame. DPSNet [11] is unable to run on OPPO R17 Pro or MI8 because of its complicated network structure. Therefore, applying a DNN for depth refinement is a more economical way in terms of time efficiency for a real-time mobile system with limited computing resources.

Fig. 5: Our monocular depth estimation results on two representative keyframes from sequences "Indoor stairs" and "Sofa": (a) The source keyframe image and its two selected reference keyframe images. Two representative pixels and their epipolar lines in the reference frames of "Indoor stairs" are drawn to demonstrate certain camera pose errors from 6DoF tracking on the front end. (b) The depth estimation result of multi-view SGM and the corresponding point cloud by back-projection. (c) The result after confidence-based depth filtering and its corresponding point cloud. (d) The final depth estimation result after DNN-based refinement with its corresponding point cloud.

5 INCREMENTAL MESH GENERATION

The estimated depths are then fused simultaneously to generate an online surface mesh. Real-time surface mesh generation is required by AR applications for interactive graphics operations such as anchor placement, occlusions and collisions on mobile phones. Although the TSDF is maintained and updated online in many real-time dense reconstruction systems such as [26, 42], the mesh is generated offline by these works. Some of the systems like [26] render or interact with the reconstructed model by directly raytracing the TSDF on the GPU without meshing, which requires the TSDF data to always be stored on the GPU. However, with the limited computing resources on mid-range mobile phones, dense reconstruction is usually required to run with only the CPU on the back end, so as not to occupy resources of front-end modules or GPU rendering. In such cases, an explicit surface meshing should be done on the CPU for the realistic AR operations like occlusions and collisions. To achieve an online incremental meshing on the CPU, CHISEL [17] generates a polygon mesh for each chunk of the scene, and meshing is only done for the chunks that will participate in rendering. This kind of incremental meshing cannot guarantee consistency and non-redundancy of the mesh topology, and is difficult to combine with a real-time updating TSDF. In this section, we present a novel incremental mesh generation approach which can update the surface mesh in real-time and is more suitable for AR applications. Our mesh generation performs a scalable TSDF voxel fusion to avoid voxel hashing conflicts, while incrementally updating the surface mesh according to the TSDF variation of the newly fused voxels. With this incremental mesh updating, real-time mesh expansion can be done on mobile phones using only a single CPU core. In the following, we describe the incremental meshing approach in detail.

Fig. 6: (a) The structure of TSDF cubes, each of which consists of 8 voxels, and each voxel is shared by 6 neighboring cubes. (b) The purpose of the new scalable hash function. A cube with integer coordinates (x y z) located inside the volume can be indexed normally by the conventional hash function h(x, y, z). But for another cube (x′ y′ z′) outside the volume, the function h will lead to a key conflict. Instead, using the newly proposed scalable hash function ĥ avoids the conflict for both cubes.

5.1 Scalable Voxel Fusion

TSDF fusion has demonstrated its effectiveness in the literature [1, 24]. Despite the simplicity of the conventional TSDF, its poor scalability and memory consumption prevent its further applications, such as large-scale reconstruction. Besides, the complex computation and large memory requirement make it difficult to achieve real-time performance on mobile phones, especially for outdoor environments. Voxel hashing [25] has been proved to be more scalable for large-scale scenes because of its dynamic voxel allocation scheme and lower memory consumption, but it needs to deal with conflicts caused by the hash function. Inspired by the above approaches, we propose a new scalable hashing scheme that combines both the simplicity of conventional TSDF voxel indexing and the scalability of voxel hashing.


Fig. 7: Illustration of dynamic object removal in two cases: the first row shows an object removed out of view, and the second row shows a pedestrian who walks by, stands for a while and walks away. In both cases, our voxel fusion algorithm can gradually remove the unwanted dynamic object from the reconstructed mesh when it finally goes away.

The basic idea is to dynamically enlarge the bound of the volume whenever a 3D point falls outside the pre-defined volume range. According to this scheme, we do not need to handle voxel hashing conflicts.

5.1.1 Scalable Hash Function

Marching cubes [21] extracts the surface from cubes, each of which consists of 8 voxels of the TSDF, as shown in Fig. 6(a). In our meshing process, each cube and its associated voxels can be indexed by a unique code generated with a novel scalable hash function.

Suppose we have a 3D volume with pre-defined size γ, each dimension of which has the range [−γ/2, +γ/2). G = γ/δ is the corresponding volume size of each dimension in voxels, with δ the actual voxel size, such as 0.06 meter. The cube containing a 3D point V = (f_x f_y f_z) inside the volume can be indexed by a hash function as follows:

h(x, y, z) = g(x) + g(y) \cdot G + g(z) \cdot G^2,    (18)

where (x y z) are the lower integer coordinates of V divided by the voxel size δ, i.e., (x y z) = (⌊f_x/δ⌋ ⌊f_y/δ⌋ ⌊f_z/δ⌋), and g(i) = i + G/2, which converts i ∈ [−G/2, +G/2) to the range [0, G).

With Eq. (18), we obtain a unique identifier for a 3D point located inside the pre-defined volume. However, a key conflict will happen when a point falls outside the volume. Suppose we have defined a volume with G = 5. A point V_a inside the volume with coordinates (g(x) g(y) g(z)) = (1, 1, 0) will have the same identifier 6 as another point V_b with coordinates (6, 0, 0) outside the volume according to Eq. (18), as depicted in Fig. 6(b). To also handle the case outside the volume range for better scalability, we propose a new hash function by reformulating the hash function in Eq. (18) into the following form:

\hat{h}(x, y, z) = O_C + \hat{g}(x) + \hat{g}(y) \cdot G + \hat{g}(z) \cdot G^2,  \quad  \hat{g}(i) = i + G\, O_G / 2,    (19)

where O_G is a local offset for a larger voxel index range in each dimension, and O_C is a global offset to ensure the uniqueness of voxel indexing, which are defined as:

O_G = \begin{cases} \lfloor 2\hat{i}/G \rfloor + 1 & \hat{i} > 0 \\ \lfloor -(2\hat{i}+1)/G \rfloor + 1 & \hat{i} \le 0 \end{cases},  \quad  \hat{i} = \arg\max_{i \in \{x, y, z\}} |i|,  \quad  O_C = G^3 \sum_{k=1}^{O_G - 1} k^3.    (20)

With the new hash function of Eq. (19), V_b will have new coordinates (ĝ(x) ĝ(y) ĝ(z)) = (11, 5, 5) and a new identifier 286, different from V_a. We avoid voxel index conflict handling by enlarging the indexing range from [0, G³) to [0, (G O_G)³) for the case outside the volume, with the help of the local and global offsets. Therefore, unique identifiers are generated by the newly proposed hashing approach for two arbitrary different points without conflict, which ensures a more efficient way of voxel hashing than the conventional hashing scheme. Besides, the reconstruction is no longer bounded to the predefined volume size G, thanks to the scalability provided by the proposed hash function. Using this scalable hashing scheme, we can expand the real-time reconstruction freely in 3D space without the limitation caused by the volume range.

Fig. 8: Illustration of incremental mesh updating on three incoming keyframes. For each keyframe, the triangles colored light yellow are updated by the current depth map, and the green color indicates the newly generated triangles.

5.1.2 Voxel Fusion with Dynamic Objects Removal

Following our scalable voxel hashing scheme, we integrate the estimated depth measurements into the TSDF voxels that their corresponding global 3D points occupy.

Suppose we have an estimated depth map D_t at time t. For each depth measurement d ∈ D_t at pixel x = (u, v), we project it back to get a global 3D point by P = M_t^{-1} ρ(u, v, d), where ρ(u, v, d) = ((u − c_u)/f_u · d, (v − c_v)/f_v · d, d) is the back-projection function, with (f_u, f_v) the focal lengths in the u and v directions, and (c_u, c_v) the optical center. M_t is the transformation matrix from global 3D space to the local camera space at time t. The hash index of the cube occupied by P is determined by Eq. (19). As illustrated in Fig. 6, the occupied cube has eight associated voxels. Each voxel V is either created when traversed for the first time or updated as follows:

T_t(V) = T_{t-1}(V) + \left( d(M_t V) - D_t(\pi(M_t V)) \right),  \quad  W_t(V) = W_{t-1}(V) + 1,    (21)

where π(x, y, z) = (x/z · f_u + c_u, y/z · f_v + c_v) is the projection function. For clarity, we rewrite π(M_t V) as x. d(M_t V) represents the projection depth of V in the local camera space of keyframe t. D_t(x) is the depth measurement at pixel x. T_t(V) and W_t(V) represent the TSDF value and weight of V respectively at time t. For a newly generated voxel, T_t(V) = d(M_t V) − D_t(π(M_t V)) and W_t(V) = 1.

With the TSDF voxel updating method in Eq. (21), we gradually generate or update all the voxels associated with the cubes occupied by the depth measurements from every incoming estimated depth map in real-time. Specifically, we maintain a cube list to record all the generated cubes. For each cube in the list, its associated TSDF voxel hash indices are also recorded, so that two neighboring cubes can share voxels with the same hash indices. The isosurface is extracted from the cube list by the Marching cubes algorithm [21]. Note that if any associated voxel of a cube is found to be projected outside the depth map border or onto an invalid depth pixel, all the TSDF voxel updates associated with this cube caused by the depth measurement need to be reverted. This rolling-back strategy effectively reduces the probability of broken triangles. Besides, a cube will be removed if the updated TSDF values are lower than zero for all of its 8 voxels, because no triangle should be extracted in that cube.

Fig. 9: The surface mesh generation results of our four experimental sequences "Indoor stairs", "Sofa", "Desktop" and "Cabinet" captured by OPPO R17 Pro. (a) Some representative keyframes of each sequence. (b) The generated global surface mesh of each sequence without DNN-based depth refinement. (c) Our generated global surface mesh with DNN-based depth refinement.

When users perform real-time reconstruction with some AR applications on mobile phones, there are usually dynamic objects such as walking pedestrians or moved objects, as shown in Fig. 7. These dynamic objects do not follow the multi-view geometry prerequisites over time. However, a more complicated case occurs when a pedestrian walks into the front of the camera, stands for a while, and then walks away, as illustrated in the second row of Fig. 7. Multi-view geometry is satisfied when the pedestrian stands still, so the TSDF voxels are updated to implicitly contain the human body, which will be reflected in the reconstructed surface mesh. Works such as [20, 39, 42] are customized for static scenes, whose voxel fusion schemes cannot handle such a complicated situation well. As far as we know, ours is the first work to deal with the influence of dynamic objects on mobile phones with a monocular camera.

In addition to updating the voxels associated with the current estimated depth measurements, we also project each existing voxel V to the current frame t for depth visibility checking. If d(M_t V) < D_t(x), which means the depth measurement is behind the model, a visibility conflict occurs. This happens when an object enters the camera view and stands still for some previous frames, and then disappears in the following frames. The voxels fused in the previous frames, when the object is almost static, will conflict with the following frames when it goes away. Similar situations could also be caused by some broken triangle pieces generated with an inaccurate camera pose. We solve this situation by further updating the TSDF value of voxel V in the same way as Eq. (21). If d(M_t V) ≪ D_t(x), the TSDF value will drop quickly below zero, and its associated cubes become invalid. This strategy makes the cubes occupied by the dynamic human disappear quickly. With this visibility checking strategy, users can continue to scan the area occupied by the human body or the moved object after it goes away. The reconstructed foreground will gradually be removed and the background structure will soon come out, as can be seen from the changes of the reconstructed surface meshes shown in Fig. 7.

Fig. 10: Comparison of our monocular depth estimation with other state-of-the-art methods: (a) A representative keyframe in sequence "Outdoor stairs" by OPPO R17 Pro. (b) Ground truth ToF depth map by OPPO R17 Pro. (c) REMODE [27]. (d) DPSNet [11]. (e) MVDepthNet [40]. (f) Our Mobile3DRecon.
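To make the voxel update of Eq. (21) and the visibility check above concrete, here is a minimal sketch of the per-measurement fusion with the dynamic-object handling; the container types, the Voxel fields, the functor-based camera accessors, and the simplified roll-back are illustrative assumptions rather than the paper's actual data structures.

```cpp
// Hedged sketch of scalable TSDF fusion (Eq. 21) with the visibility-conflict update
// used for dynamic object removal. Pose and intrinsics handling is simplified.
#include <cmath>
#include <cstdint>
#include <functional>
#include <unordered_map>

struct Voxel { float tsdf = 0.f; int weight = 0; };

int64_t hashScalable(int64_t x, int64_t y, int64_t z, int64_t G);   // from the earlier hashing sketch

// worldToCamera applies M_t; project applies pi(.) and reports the pixel;
// depthAt returns D_t at a pixel (non-positive means invalid).
struct FusionContext {
    std::function<void(const float P[3], float out[3])> worldToCamera;
    std::function<bool(const float Pc[3], float& u, float& v)> project;
    std::function<float(float u, float v)> depthAt;
    float voxelSize;   // delta
    int64_t G;         // base volume size in voxels
};

// Update the 8 corner voxels of the cube containing the back-projected point P (Eq. 21).
void fusePoint(std::unordered_map<int64_t, Voxel>& voxels, const FusionContext& ctx, const float P[3]) {
    int cx = (int)std::floor(P[0] / ctx.voxelSize);
    int cy = (int)std::floor(P[1] / ctx.voxelSize);
    int cz = (int)std::floor(P[2] / ctx.voxelSize);
    for (int dz = 0; dz <= 1; ++dz)
        for (int dy = 0; dy <= 1; ++dy)
            for (int dx = 0; dx <= 1; ++dx) {
                float Vw[3] = { (cx + dx) * ctx.voxelSize, (cy + dy) * ctx.voxelSize, (cz + dz) * ctx.voxelSize };
                float Vc[3];
                ctx.worldToCamera(Vw, Vc);                  // M_t * V
                float u, v;
                if (!ctx.project(Vc, u, v)) return;         // outside depth map: the real system rolls back the cube
                float meas = ctx.depthAt(u, v);             // D_t(pi(M_t V))
                if (meas <= 0.f) return;                    // invalid depth: same roll-back caveat
                Voxel& vox = voxels[hashScalable(cx + dx, cy + dy, cz + dz, ctx.G)];  // created on first visit
                vox.tsdf += Vc[2] - meas;                   // d(M_t V) - D_t(pi(M_t V)), as written in Eq. (21)
                vox.weight += 1;
            }
}

// Visibility check for an already existing voxel: if d(M_t V) < D_t(x) the measurement is
// behind the model, and the same update drives the TSDF of a vanished object below zero.
void visibilityUpdate(Voxel& vox, float projDepth, float measuredDepth) {
    if (measuredDepth > 0.f && projDepth < measuredDepth) {
        vox.tsdf += projDepth - measuredDepth;              // negative contribution
        vox.weight += 1;
    }
}
```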


Fig. 11: Comparison of the final fused surface meshes obtained by fusing the estimated depth maps of our Mobile3DRecon and those of [11, 40] on sequence "Outdoor stairs" by OPPO R17 Pro. (a) Some representative keyframes. (b) Surface mesh generated by fusing ToF depth maps. (c) Mesh by DPSNet [11]. (d) Mesh by MVDepthNet [40]. (e) Our generated surface mesh.

5.2 Incremental Mesh Updating

Marching cubes [21] is an effective algorithm to extract an iso-surface from a TSDF volume, and is widely used in dense reconstruction systems like [1, 24–26]. However, most of these systems perform surface extraction as a post-process after the real-time reconstruction has finished, due to the performance issue caused by frequent interpolation operations. Raycasting is employed in [24] to render the isosurface for the current frame, but no mesh is actually extracted. Most AR applications require real-time mesh generation which supports incremental updating, especially on mobile phones. We propose an incremental mesh updating strategy, which is particularly suitable for real-time performance on mobile devices. Considering that the surface extraction process is done on the back end, we run our incremental mesh generation on a single CPU thread of the mobile phone, so as not to occupy resources of front-end modules or GPU rendering. The incremental updating strategy is based on the observation that only part of the cubes need to be updated for each keyframe. The iso-surface therefore only needs to be extracted for these cubes.

In order to know which cubes should participate in surface extraction, a status variable χ(V) ∈ {ADD, UPDATE, NORMAL, DELETE} is assigned to each voxel V. If V is a newly allocated voxel, χ(V) is set to ADD. If the TSDF value T_t(V) is updated for V at time t, χ(V) is set to UPDATE. If T_t(V) ≤ 0 or W_t(V) ≤ 0, χ(V) is set to DELETE, which means that V has to be deleted from the existing voxel list and moved to an empty voxel list for new voxel reallocation. We define a cube as updated if it has at least one associated voxel whose status is UPDATE or ADD. After finishing the TSDF voxel fusion and depth visibility handling mentioned in Section 5.1.2, we extract mesh triangles only from the updated cubes. If an updated cube already has triangles extracted, these triangles are removed and replaced with the newly extracted ones. After triangle extraction finishes for all the updated cubes, the status variables of the updated voxels are all set to NORMAL. If a cube has at least one associated voxel whose status is DELETE, the cube is considered deleted, and its extracted triangles are removed. Fig. 8 illustrates the incremental mesh updating process.

Fig. 9 shows the surface mesh reconstruction results of our four sequences: "Indoor stairs", "Sofa", "Desktop" and "Cabinet". All these sequences are captured and processed in real-time with OPPO R17 Pro, which demonstrates the robustness of our real-time reconstruction system for large-scale and textureless scenes. We also compare the results with and without DNN-based depth refinement. As can be seen in Figs. 9(b) and (c), our DNN-based depth refinement not only reduces depth noise, but also eliminates noisy mesh triangles to significantly improve the final surface mesh quality.

6 EXPERIMENTAL EVALUATION

In this section, we perform an evaluation of our Mobile3DRecon pipeline, which is implemented in C++ and uses the third-party libraries OpenCV 2 for image I/O and Eigen 3 for numerical computation. We report quantitative as well as qualitative comparisons of our work with the state-of-the-art methods on our experimental benchmark captured by mid-range mobile phones, showing that our Mobile3DRecon is among the top performers on the benchmark. We also report the time consumption of each stage of our approach to show the real-time performance of our pipeline on mid-range mobile phones. Finally, we show how the method performs in AR applications on some mid-range mobile platforms.

6.1 Quantitative and Qualitative Evaluations

We qualitatively and quantitatively compare our monocular depth estimation approach to other state-of-the-art algorithms on the generated depth maps and surface meshes of the five sequences captured by OPPO R17 Pro. Since OPPO R17 Pro is equipped with a rear ToF sensor, the ToF depth measurements can be used as GT for the quantitative evaluation. In Fig. 10, we compare our estimated depth map against REMODE [27], DPSNet [11] and MVDepthNet [40]. We use the pretrained MVDepthNet and DPSNet models to generate depth maps for comparison. Both models are released on their GitHub websites (https://github.com/HKUST-Aerial-Robotics/MVDepthNet and https://github.com/sunghoonim/DPSNet) and were trained with the Demon dataset [38]. Since DPSNet cannot be run on OPPO R17 Pro with its limited computing resources due to its heavy network structure, we run it on a PC for comparison. As shown in Fig. 10, REMODE can ensure certain depth accuracy only in well-textured regions. Our depth estimation performs better in generalization than DPSNet and MVDepthNet, and produces depth measurements with more accurate details. We further give the comparison of the surface meshes obtained by fusing our depth maps and the depth maps estimated by other state-of-the-art methods [11, 40] in Fig. 11. Our system performs better than the other works in the final generated surface structure, with less noisy triangles. We can also see from the depth and mesh accuracy evaluation in Table 1 that our Mobile3DRecon reconstructs the scenes with centimeter-level accuracy on depths and surface meshes, which is competitive in both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), even on the "Desktop" and "Cabinet" sequences with textureless regions.

Table 2 gives the time statistics of our method in the stages of monocular depth estimation and incremental meshing separately on two mid-range mobile platforms: OPPO R17 Pro with a Qualcomm Snapdragon 710 (SDM710) computing chip and MI8 with a Qualcomm Snapdragon 845 (SDM845), both with Android OS. All the time statistics are collected on the two mobile platforms at runtime, with 6DoF tracking on the front end and other modules such as global pose optimization, monocular keyframe depth estimation and incremental mesh generation on the back end. Generally, our Mobile3DRecon performs almost 2 times faster on MI8 than on OPPO R17 Pro because of the more powerful SDM845 computing chip. Note that even with the slower performance on OPPO R17 Pro, our Mobile3DRecon can still achieve real-time performance, because our monocular depth estimation and incremental mesh generation steps are done for each new keyframe and the reported time consumption on OPPO R17 Pro is fast enough to keep up with the keyframe frequency, which is almost 5 keyframes-per-second in the SenseAR SLAM framework.

6.2 AR Applications

With our Mobile3DRecon system integrated into the Unity mobile platform, we can achieve real-time realistic AR effects such as occlusions and collisions in various scenes with different geometric structures on OPPO R17 Pro and MI8, as illustrated in Fig. 12. Note that the example of the "Indoor stairs" shows the interesting effect of virtual balls rolling down the stairs, which verifies the physically correct interactions between virtual objects and the real environment based on our accurate 3D reconstruction. Please refer to the supplementary materials for the complete demo videos. It is worth mentioning that a real-time surface mesh is crucial for simple implementation of occlusion and collision effects in most graphics engines like Unity, which is difficult to fulfill with other 3D representations such as surfels or a TSDF volume.
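Returning to the incremental updating strategy of Section 5.2, a minimal sketch of the per-keyframe pass over the cube list is given below; the container types, the Cube/Voxel layout, and the extractTriangles helper are assumptions made for illustration only.

```cpp
// Hedged sketch of the incremental mesh updating of Section 5.2: only cubes with an
// ADD/UPDATE voxel are re-triangulated, and cubes touching a DELETE voxel drop their triangles.
#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class Status { ADD, UPDATE, NORMAL, DELETE_ };   // DELETE_ avoids clashing with common macros

struct Voxel { float tsdf = 0.f; int weight = 0; Status status = Status::ADD; };
struct Triangle { /* three vertices */ };
struct Cube { std::array<int64_t, 8> voxelKeys{}; std::vector<Triangle> triangles; };

// Assumed per-cube Marching-cubes step (definition not shown here).
std::vector<Triangle> extractTriangles(const Cube& cube, const std::unordered_map<int64_t, Voxel>& voxels);

void updateMesh(std::vector<Cube>& cubes, std::unordered_map<int64_t, Voxel>& voxels) {
    for (Cube& cube : cubes) {
        bool updated = false, deleted = false;
        for (int64_t key : cube.voxelKeys) {
            const Voxel& v = voxels.at(key);
            if (v.status == Status::ADD || v.status == Status::UPDATE) updated = true;
            if (v.status == Status::DELETE_) deleted = true;
        }
        if (deleted) {
            cube.triangles.clear();                              // remove triangles of deleted cubes
        } else if (updated) {
            cube.triangles = extractTriangles(cube, voxels);     // replace old triangles with new ones
        }
    }
    // Reset statuses after the pass; DELETE voxels would be moved to an empty list (not modeled here).
    for (auto& kv : voxels)
        if (kv.second.status != Status::DELETE_) kv.second.status = Status::NORMAL;
}
```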
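The depth errors reported in Table 1 are standard per-pixel RMSE and MAE against the ToF ground truth over valid pixels; a small sketch of that metric, under the assumption that invalid pixels are marked with non-positive depths, is shown below.

```cpp
// Sketch of the per-frame RMSE/MAE computation used for depth evaluation, considering
// only pixels that are valid in both the estimate and the ToF ground truth.
#include <cmath>
#include <utility>
#include <vector>

std::pair<double, double> rmseMae(const std::vector<float>& est, const std::vector<float>& gt) {
    double se = 0.0, ae = 0.0;
    size_t n = 0;
    for (size_t i = 0; i < est.size() && i < gt.size(); ++i) {
        if (est[i] <= 0.f || gt[i] <= 0.f) continue;   // skip invalid depths
        double e = est[i] - gt[i];
        se += e * e;
        ae += std::fabs(e);
        ++n;
    }
    if (n == 0) return {0.0, 0.0};
    return { std::sqrt(se / n), ae / n };              // {RMSE, MAE}
}
```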


Table 1: We report RMSEs and MAEs of the depth and surface mesh results of our Mobile3DRecon and [11, 27, 40] on our five experimental sequences captured by OPPO R17 Pro, with ToF depth measurements as GT. For depth evaluation, only the pixels with valid depths in both the GT and the estimated depth map participate in the error calculation. For common depth evaluation, only the pixels with common valid depths in all the methods and the GT participate in the evaluation. Note that for REMODE, we only take into account those depths with errors smaller than 35 cm. For mesh evaluation, we use CloudCompare (http://cloudcompare.org) to compare the mesh results obtained by fusing the depths of each method to the GT mesh obtained by fusing the ToF depths. For REMODE, we are unable to get a depth fusion result due to its severe depth errors.

Sequences        RMSE/MAE [cm]   REMODE [27]   DPSNet [11]   MVDepthNet [40]   Mobile3DRecon
Indoor stairs    Depth           23.38/18.95   12.48/7.71    10.54/7.82        7.41/3.98
                 Common depth    24.11/19.15    9.78/6.30     9.25/7.43        7.11/4.19
                 Mesh                  /         6.34/8.67     6.04/8.98        4.51/4.40
Sofa             Depth           22.19/14.86   12.82/8.54    11.74/8.01        9.66/6.10
                 Common depth    24.72/19.78    9.87/6.55     9.27/6.64        9.19/5.59
                 Mesh                  /         5.92/7.20     5.90/8.10        5.31/5.74
Outdoor stairs   Depth           19.39/12.91    9.09/6.42     7.46/5.06        5.69/3.01
                 Common depth    24.57/19.57    7.88/6.03     6.81/5.10        6.05/3.44
                 Mesh                  /         6.22/8.10     5.23/5.10        4.17/3.86
Desktop          Depth           25.59/23.39   13.42/9.99    11.14/8.91        9.45/5.42
                 Common depth    25.55/23.43   12.51/9.36    10.46/8.49        9.42/5.41
                 Mesh                  /         5.93/9.88     5.76/9.65        5.58/6.91
Cabinet          Depth           19.22/16.48   11.89/8.57     9.58/7.12       10.43/6.07
                 Common depth    18.76/16.07   10.14/7.22    10.46/8.48        9.54/5.92
                 Mesh                  /         5.65/10.96    5.48/11.13       5.02/7.96
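For clarity, the following is a small sketch of the masked depth-error computation described in the caption of Table 1; it is our reading of the protocol rather than the exact evaluation code. Depth maps are assumed to be stored as row-major float buffers in centimeters, with non-positive values marking invalid pixels.

#include <cmath>
#include <cstddef>
#include <vector>

// RMSE and MAE over pixels that have a valid depth in both the ToF ground
// truth and the estimated depth map; invalid pixels (<= 0) are skipped.
struct DepthError { double rmse; double mae; };

DepthError evaluateDepth(const std::vector<float>& gt, const std::vector<float>& est) {
    double sumSq = 0.0, sumAbs = 0.0;
    std::size_t count = 0;
    const std::size_t n = gt.size() < est.size() ? gt.size() : est.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (gt[i] <= 0.0f || est[i] <= 0.0f) continue;   // count pixels valid in both maps only
        const double diff = static_cast<double>(est[i]) - static_cast<double>(gt[i]);
        sumSq  += diff * diff;
        sumAbs += std::fabs(diff);
        ++count;
    }
    if (count == 0) return {0.0, 0.0};
    return {std::sqrt(sumSq / static_cast<double>(count)),
            sumAbs / static_cast<double>(count)};
}

The "Common depth" rows simply tighten the mask so that a pixel is counted only if it is valid in the estimates of all compared methods as well as in the GT.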

Table 2: Detailed per-keyframe time consumption (in milliseconds) of our Mobile3DRecon for all substeps. The time statistics are given on two mobile platforms: OPPO R17 Pro with SDM710 and MI8 with SDM845.

Time [ms/keyframe]        Monocular depth estimation                                                    Incremental mesh   Total
                          Cost volume    Cost volume    Confidence-based   DNN-based     Total          generation
                          computation    aggregation    filtering          refinement
OPPO R17 Pro (SDM710)     16.75          28.55          2.26               22.9          70.46          31.13              101.59
MI8 (SDM845)              11.92          17.68          1.1                7.62          38.32          18.89              57.21
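As a simple sanity check on these numbers: at roughly 5 keyframes per second, the per-keyframe time budget is 1000 ms / 5 = 200 ms. The measured per-keyframe totals of 101.59 ms on OPPO R17 Pro and 57.21 ms on MI8 therefore leave considerable headroom, which is consistent with the back-end depth estimation and meshing keeping up with the keyframe rate of the front-end 6DoF tracking.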

Fig. 12: AR applications of Mobile3DRecon on mobile platforms: The first row shows the 3D reconstruction and an occlusion effect of an indoor scene on OPPO R17 Pro. The second and third rows illustrate AR occlusion and collision effects of another two scenes on MI8.

7 CONCLUSION

We have presented a novel real-time surface mesh reconstruction system which can run on mid-range mobile phones. Our system allows users to reconstruct a dense surface mesh of the environment with a mobile device using only an embedded monocular camera. Unlike existing state-of-the-art methods which produce only surfels or a TSDF volume in real time, our Mobile3DRecon is unique in that it performs online incremental mesh generation and is therefore more suitable for achieving seamless AR effects such as occlusions and collisions between virtual objects and real scenes. Due to the limitation of TSDF integration, our dense surface mesh reconstruction is currently unable to keep the reconstructed mesh consistently updated with changes of the global keyframe poses after bundle adjustment. An online deintegration and reintegration mechanism is preferable in the future to make the incremental mesh generation consistent with global pose optimization and accumulated error compensation. Additionally, how to reasonably handle the limited computation and memory resources on mobile platforms as the reconstruction scale grows is a problem worth further study in our future work.
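As a minimal illustration of the deintegration/reintegration idea mentioned above (a sketch in the spirit of BundleFusion [1], not part of our current system), a keyframe's contribution to a TSDF voxel can be removed by inverting the weighted-average update, after which the keyframe can be re-integrated with its bundle-adjusted pose:

// One TSDF voxel: weighted running average of observed signed distances.
struct Voxel {
    float sdf = 0.0f;
    float weight = 0.0f;
};

// Fuse one observation (truncated signed distance sdfObs with weight wObs > 0).
inline void integrate(Voxel& v, float sdfObs, float wObs) {
    const float wNew = v.weight + wObs;
    v.sdf = (v.sdf * v.weight + sdfObs * wObs) / wNew;
    v.weight = wNew;
}

// Remove that observation again, e.g. before re-integrating the keyframe
// with its refined pose after bundle adjustment.
inline void deintegrate(Voxel& v, float sdfObs, float wObs) {
    const float wNew = v.weight - wObs;
    if (wNew <= 0.0f) { v = Voxel{}; return; }  // the observation was the only contribution
    v.sdf = (v.sdf * v.weight - sdfObs * wObs) / wNew;
    v.weight = wNew;
}

Re-integration then simply calls integrate again, with the signed distance recomputed from the updated keyframe pose.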


ACKNOWLEDGMENTS
The authors would like to thank Feng Pan and Li Zhou for their kind help in the development of the mobile reconstruction system and the experimental evaluation. This work was partially supported by NSF of China (Nos. 61672457 and 61822310), and the Fundamental Research Funds for the Central Universities (No. 2019XZZX004-09).

REFERENCES
[1] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics, 36(4):1, 2017.
[2] A. Drory, C. Haubold, S. Avidan, and F. A. Hamprecht. Semi-global matching: A principled derivation in terms of message passing. In German Conference on Pattern Recognition, pp. 43–53. Springer, 2014.
[3] V. Garro, G. Pintore, F. Ganovelli, E. Gobbetti, and R. Scopigno. Fast metric acquisition with mobile devices. In Vision, Modeling and Visualization, pp. 29–36, 2016.
[4] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279, 2017.
[5] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. ArXiv preprint arXiv:1912.06378, 2019.
[6] M. Heikkilä, M. Pietikäinen, and C. Schmid. Description of interest regions with local binary patterns. Pattern Recognition, 42(3):425–436, 2009.
[7] H. Hirschmüller. Accurate and efficient stereo processing by semi-global matching and mutual information. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 807–814, 2005.
[8] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2007.
[9] J. Hu, M. Ozay, Y. Zhang, and T. Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In IEEE Winter Conference on Applications of Computer Vision, pp. 1043–1051, 2019.
[10] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang. DeepMVS: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830, 2018.
[11] S. Im, H.-G. Jeon, S. Lin, and I. S. Kweon. DPSNet: End-to-end deep plane sweep stereo. In International Conference on Learning Representations, 2019.
[12] J. Jeon and S. Lee. Reconstruction-based pairwise depth dataset for depth image enhancement using CNN. In European Conference on Computer Vision, pp. 422–438, 2018.
[13] Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou, and Z. Wang. EnlightenGAN: Deep light enhancement without paired supervision. ArXiv preprint arXiv:1906.06972, 2019.
[14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Springer, 2016.
[15] O. Kahler, V. Prisacariu, C. Ren, X. Sun, P. Torr, and D. Murray. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Transactions on Visualization and Computer Graphics, 21(11):1–1, 2015.
[16] M. Kazhdan. Poisson surface reconstruction. In Eurographics Symposium on Geometry Processing, 2006.
[17] M. Klingensmith, I. Dryanovski, S. S. Srinivasa, and J. Xiao. Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. In Robotics: Science and Systems, 2015.
[18] K. Kolev, P. Tanskanen, P. Speciale, and M. Pollefeys. Turning mobile phones into 3D scanners. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[19] P. Li, T. Qin, B. Hu, F. Zhu, and S. Shen. Monocular visual-inertial state estimation for mobile augmented reality. In IEEE International Symposium on Mixed and Augmented Reality, pp. 11–21, 2017.
[20] Y. Ling, K. Wang, and S. Shen. Probabilistic dense reconstruction from a moving camera. In IEEE International Conference on Intelligent Robots and Systems, pp. 6364–6371, 2018.
[21] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987.
[22] P. Merrell, A. Akbarzadeh, W. Liang, P. Mordohai, J. M. Frahm, R. Yang, D. Nister, and M. Pollefeys. Real-time visibility-based fusion of depth maps. In IEEE International Conference on Computer Vision, pp. 1–8, 2007.
[23] O. Muratov, Y. V. Slynko, V. V. Chernov, M. M. Lyubimtseva, A. Shamsuarov, and V. Bucha. 3DCapture: 3D reconstruction for a smartphone. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 893–900, 2016.
[24] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, and A. W. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality, 2011.
[25] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, 32(6):169:1–169:11, 2013.
[26] P. Ondruska, P. Kohli, and S. Izadi. MobileFusion: Real-time volumetric surface reconstruction and dense tracking on mobile phones. IEEE Transactions on Visualization and Computer Graphics, 21(11):1–1, 2015.
[27] M. Pizzoli, C. Forster, and D. Scaramuzza. REMODE: Probabilistic, monocular dense reconstruction in real time. In IEEE International Conference on Robotics and Automation, 2014.
[28] F. Poiesi, A. Locher, P. Chippendale, E. Nocerino, F. Remondino, and L. Van Gool. Cloud-based collaborative 3D reconstruction using smartphones. In European Conference on Visual Media Production, 2017.
[29] M. Pollefeys, D. Nister, J. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. J. Kim, P. Merrell, et al. Detailed real-time urban 3D reconstruction from video. International Journal of Computer Vision, 78(2):143–167, 2008.
[30] V. Pradeep, C. Rhemann, S. Izadi, C. Zach, M. Bleyer, and S. Bathiche. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera. In IEEE International Symposium on Mixed and Augmented Reality, 2013.
[31] D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1):7–42, 2001.
[32] T. Schöps, T. Sattler, C. Häne, and M. Pollefeys. Large-scale outdoor 3D reconstruction on a mobile device. Computer Vision and Image Understanding, 157:151–166, 2017.
[33] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.
[34] V. Sterzentsenko, L. Saroglou, A. Chatzitofis, S. Thermos, N. Zioulis, A. Doumanoglou, D. Zarpalas, and P. Daras. Self-supervised deep depth denoising. In IEEE International Conference on Computer Vision, pp. 1242–1251, 2019.
[35] C. Strecha, W. V. Hansen, L. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[36] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys. Live metric 3D reconstruction on mobile phones. In IEEE International Conference on Computer Vision, pp. 65–72, 2013.
[37] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-time dense SLAM and light source estimation. The International Journal of Robotics Research, 35(14):1697–1716, 2016.
[38] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5038–5047, 2017.
[39] J. Valentin, A. Kowdle, J. T. Barron, N. Wadhwa, M. Dzitsiuk, M. Schoenberg, V. Verma, A. Csaszar, E. Turner, I. Dryanovski, et al. Depth from motion for smartphone AR. ACM Transactions on Graphics, 37(6):1–19, 2018.
[40] K. Wang and S. Shen. MVDepthNet: Real-time multiview depth estimation neural network. In IEEE International Conference on 3D Vision, pp. 248–257, 2018.
[41] S. Yan, C. Wu, L. Wang, F. Xu, L. An, K. Guo, and Y. Liu. DDRNet: Depth map denoising and refinement for consumer depth cameras using cascaded CNNs. In European Conference on Computer Vision, pp. 151–167, 2018.
[42] Z. Yang, G. Fei, and S. Shen. Real-time monocular dense mapping on aerial robots using visual-inertial fusion. In IEEE International Conference on Robotics and Automation, 2017.
[43] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision, pp. 767–783, 2018.

