
ISPRS Journal of Photogrammetry and Remote Sensing 209 (2024) 398–414


A cluster-based disambiguation method using pose consistency verification for structure from motion

Ye Gong a, Pengwei Zhou a, Changfeng Liu b, Yan Yu c, Jian Yao a,d, Wei Yuan e, Li Li a,∗

a School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
b State Key Laboratory of Intelligent Vehicle Safety Technology, Chongqing 400023, PR China
c University of California, Berkeley 94720, USA
d Wuhan University Shenzhen Research Institute, Shenzhen 518057, China
e International Research Institute of Disaster Science, Tohoku University, Sendai 980-8572, Japan

ARTICLE INFO

Keywords: Structure from motion; Duplicate structure disambiguation; Pose consistency verification; Image-based 3D reconstruction

ABSTRACT

Structure from motion (SfM) recovers scene structures and camera poses based on feature matching, and faces challenges from ambiguous scenes. There are a large number of ambiguous scenes in real environments, which contain many duplicate structures and textures. The ambiguity leads to incorrect feature matches between images with similar appearance, and causes geometric misalignment in SfM. To address this problem, recent methods have focused on investigating the inconsistencies in feature topology among multi-view images. However, the feature topology is directly derived from 2D images, so it is susceptible to feature occlusion caused by changes in perspective. Therefore, we propose a new method that disambiguates scenes using pose consistency rather than feature consistency. The pose consistency is evaluated in 3D geometric space, which is less sensitive to feature occlusion; pose consistency is therefore more robust than feature consistency. Our core motivation is that incorrect matches between ambiguous images cause pose deviation from the global poses generated by correct matches. To detect this pose deviation, we first combine local and global information of the scene to generate reliable global camera poses. The local information of each image is obtained by image clustering, and it strengthens the global information, which is represented as the verified maximum spanning tree of clusters. Then, the global poses serve as the reference for further pose consistency verification. The global poses also enable us to perform both rotation and translation consistency verification for uncertain matches. During the pose consistency verification, the pose deviation calculated at image-level may be too small to be noticed. Thus, we propose to perform pose consistency verification at cluster-level instead of image-level to amplify the pose deviation. In the experiments, we compared our approach with several state-of-the-art methods, including COLMAP, Geodesic-SfM and TC-SfM, on both ambiguous and regular datasets. The results demonstrate that our approach achieves the best robustness: only our approach succeeds on all ambiguous image sequences (14/14). The quantitative evaluation results on image sequences with ground truth also show that our approach achieves the best accuracy (average RMSE of translation = 0.109, average RMSE of rotation = 0.827) among all methods. The source code of our approach is publicly available at https://github.com/gongyeted/MA-SfM.

1. Introduction

Structure from motion (SfM) is a method for recovering camera poses and reconstructing sparse structures from a set of two-dimensional (2D) images (Schonberger and Frahm, 2016; Chen et al., 2020). It has been extensively applied in areas such as three-dimensional (3D) reconstruction (Seitz et al., 2006; James and Robson, 2012), autonomous driving (Geiger et al., 2012; Yurtsever et al., 2020), augmented reality (Carmigniani et al., 2011; Yang et al., 2013) and photogrammetry (Cui et al., 2019; Jiang et al., 2020). However, these methods face challenges from duplicate structures (Yan et al., 2017), which are especially common in ambiguous scenes such as buildings, temples and office areas. Duplicate structures cause different objects to share similar appearance, leading to deceptive correspondences between ambiguous images. However, SfM methods largely rely on correct visual connectivity across images. Deceptive correspondences cause SfM to register similar but distinct structures to the same location,

∗ Corresponding author.
E-mail address: li.li@whu.edu.cn (L. Li).

https://doi.org/10.1016/j.isprsjprs.2024.02.016
Received 8 July 2023; Received in revised form 19 February 2024; Accepted 20 February 2024
Available online 27 February 2024
0924-2716/© 2024 Published by Elsevier B.V. on behalf of International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).

Fig. 1. An example of duplicate structures in Street data (Roberts et al., 2011). (a): Three duplicate facades share similar appearance; (b): COLMAP registers the first and the third facades to the same location; (c): The correct result provided by our approach. The dashed box in (b) highlights the cameras with wrong poses.

resulting in wrong poses and structures. In Fig. 1, we visually show an example of this problem.
Disambiguating ambiguous scenes has been a challenging task for SfM methods in recent years (Yan et al., 2017). To evaluate the ambiguity of the matches between an image pair, the most convenient way is to use their background areas that have different image textures, namely, visual contradictions (Zach et al., 2008). However, many ambiguous images lack noticeable visual contradictions, as depicted in Fig. 2. The information of the two images presented in Fig. 2(a) is not sufficient for disambiguation. Some recent state-of-the-art methods (Wang et al., 2019, 2022) apply feature topology consistency among multi-view images to address this issue, but features are highly sensitive to feature occlusion caused by changes in perspective. In contrast, human observers usually disambiguate images using pose information constructed in 3D geometric space. In general, there are two stages for human observers. In the first stage, we try to find the several most closely related images and group them into a cluster, as shown in Fig. 2(b). A cluster helps to establish a brief knowledge of the local area for the involved images. In the second stage, we read through all images carefully to build a global pose prior, as shown in Fig. 2(c). The global pose prior represents the approximate rotation and location of each camera in the whole scene. Using this pose prior as the reference, ambiguous matches are easy to detect, because they connect similar but distinct clusters and result in pose deviation from the global prior.
Inspired by the above-mentioned disambiguation strategy of human observers, we propose a novel approach that disambiguates ambiguous scenes with the use of pose consistency verification. Pose consistency (Zach et al., 2010; Shen et al., 2016; Cui et al., 2021) can effectively avoid the influence of feature occlusion caused by changes in perspective. The motivation of our method is that the matches between ambiguous images will cause pose deviation from the global pose prior generated by correct matches. Therefore, the core of our method is how to establish a reliable global pose prior and how to detect the pose deviation. In this paper, we propose to combine local and global information of the scene to build a global pose prior, as human observers do. The local information is obtained by image clustering; a cluster represents a local area that consists of several closely connected images. The global information is established by extracting a verified maximum spanning tree (VMST) from the cluster graph. The reliable edges involved in the VMST are applied to recover the global pose prior of the scene. Based on the global pose prior, we further perform rotation and translation consistency verification at cluster-level to filter out matches that connect ambiguous images. The major contributions of this paper are summarized as follows:

• We propose a cluster-based disambiguation method to filter out ambiguous matches between duplicate structures and textures using pose consistency. This method can significantly improve the robustness of SfM methods on ambiguous scenes.
• As human observers do, we propose to combine local and global information of the scene to construct the reliable global pose prior, which serves as the reference for disambiguation.
• We propose to perform both rotation and translation pose consistency verification to find incorrect matches. In addition, the pose deviation from the global reference is detected at cluster-level instead of image-level.


Fig. 2. An example of how to disambiguate images using local and global information of the scene. The global pose prior presented in (c) is represented by the colorful dots, and each dot indicates a camera. The global pose prior represents the pose of each camera in the whole scene. The two small dashed boxes in (c) correspond to the clusters (b) of the two ambiguous images presented in (a). The matches between these two clusters are not used when generating the global pose prior. They will cause pose deviation from the global pose prior, and this characteristic is applied to detect ambiguous matches. The Palace data is provided by Knapitsch et al. (2017).

2. Related work

To handle duplicate structures, many SfM disambiguation methods have been proposed. In general, we can divide these methods into two categories: the first type disambiguates images using visual features, and the second type relies on geometric reasoning. We name them the visual feature-based method and the geometric reasoning-based method, respectively.

2.1. Visual feature-based method

Visual feature-based methods detect ambiguous images based on the inconsistency of feature topology among multi-view images, such as contradictory visual context. Zach et al. (2008) explored the co-occurrence of features among image triplets: for three overlapping images, if a large proportion of the matched features of two images do not appear in the third image, the third image is more likely to be mismatched. However, changes of perspective may also cause this effect, so this method may filter out correct matches. Roberts et al. (2011) incorporated image triplets and image timestamps into


an Expectation-Maximization based estimation to label pairwise image matches as correct or erroneous. While this method improves accuracy, it is restricted to ordered image sequences because it uses image timestamps to solve the problem. In addition, both Zach et al. (2008) and Roberts et al. (2011) analyzed incorrect image correspondences locally, so some incorrect matches may pass this local verification. To solve this problem, Jiang et al. (2012) defined a new objective function that uses the complete 3D reconstruction instead of local image triplet information. However, this method solves the function with greedy search and may get stuck at a local minimum. Wilson and Snavely (2013) assessed the reliability of feature tracks using a new measurement inspired by social network theory. The method first constructs a visibility graph and then detects unreliable tracks in this graph using the bipartite local clustering coefficient. However, the method highly relies on background contextual cues to calculate the coefficients, and may fail in disambiguating images with small overlap. Wang et al. (2019) proposed the hypothesis that correctly-matched image pairs have more conjugate points than ambiguous pairs given a constant overlap, and detect incorrect relative orientations based on this hypothesis. Kataria et al. (2020) observed that long feature tracks usually correspond to repetitive structures, so the weight of these features should be reduced when recovering poses and structures. Wang et al. (2022) proposed a method named TC-SfM which explores the scene contextual information from track communities to disambiguate images: for each track community, its surrounding contents are applied to identify whether it is erroneous. However, the above methods largely rely on the visual contradictions of images. If the visual contradictions between images are few or difficult to detect, these methods may fail to recover the correct camera poses and scene structures.

2.2. Geometric reasoning-based method

Geometric reasoning-based methods detect ambiguous images using high-level spatial contextual information instead of image features. Zach et al. (2010) proposed to verify loop consistency in the view graph, considering that the accumulated rotations along a loop should form an identity matrix. However, this measurement may fail when the loops are large, due to the accumulated geometric errors. Ceylan et al. (2014) proposed another SfM method based on the idea of loop constraints. This method first detects repetitive elements based on a user-marked template in a single image, then performs a graph-based optimization to yield a globally consistent 3D geometry reconstruction. However, this method relies on a user-marked pattern and is limited to the regular repetitions appearing on planar facades. Shen et al. (2016) also applied loop consistency in image triplets to detect ambiguous images. This method first constructs a maximum spanning tree (MST) to build local relative poses. Then, the tree is incrementally expanded to form locally consistent strong triplets by checking loop consistency. Finally, a global community-based graph algorithm is introduced to perform longer loop consistency checking. However, this method solely applies pairwise rotations to check loop consistency and ignores translation verification. Yan et al. (2017) proposed a novel method, named Geodesic-SfM, that discriminates ambiguous images based on geodesic context information instead of loop consistency. The method builds a path network based on the geodesic relationships between images; when the connected matches in the network fail to form a consistent feature track, the current image is potentially ambiguous. However, deriving the path network from images is still challenging, especially when the overlap is not sufficient, so the final results are prone to be split into disconnected parts. Instead of disambiguating images during the initial registration, Heinly et al. (2014) proposed a post-processing method that detects potential ambiguity using the consistency among projections of 3D points, but this method introduces additional computation costs to obtain reconstructions. Cui et al. (2021) prioritized the construction of the view graph by first connecting images with a large number of matches until all images are included in a connected tree structure; geometric reconstructions of all image pairs are subsequently conducted. To filter incorrect matches, the pose change is assessed before and after bundle adjustment optimization when registering images to two-view models. This method relies on image pair reconstruction and is sensitive to initial parameters. Most of these geometric reasoning-based methods only use the geometric information of local image pairs or triplets to disambiguate images, and ignore the global geometric information of the scene. In addition, translation verification is ignored by most methods; as a result, they may fail when there is only pure translational motion between ambiguous images. To solve these problems, we propose a cluster-based disambiguation method for SfM using both rotation and translation consistency verification. In our method, the local and global information of the scene are fused together to generate a robust global pose prior. The global prior enables us to perform both rotation and translation consistency verification for uncertain matches.

3. Method

The overview of our pipeline is shown in Fig. 3. At the beginning, we use the feature matching module of COLMAP (Schonberger and Frahm, 2016) to find matches for the input images. Then an original view graph G = {𝒱, ℰ} is constructed based on the matching relationships. As shown in Fig. 3(b), view vertices are denoted by the blue dots, and the blue edges represent view edges. Although basic verification of the feature matching has been conducted, there are still erroneous edges in this view graph. Our goal is to detect these incorrect edges and remove them from the view graph. Based on the view graph, a cluster graph shown in Fig. 3(c) is constructed by image clustering to build local information. The red circles represent clusters. We construct a verified maximum spanning tree (VMST) on this cluster graph to build the global information of the scene, as shown in Fig. 3(d). The global pose prior is further obtained based on this VMST. The reliable cluster edges involved in this VMST are shown in green, and the other cluster edges that still need to be verified are colored in light blue in Fig. 3(e). After the rotation and translation verification, we obtain the final reliable cluster edges, as shown in Fig. 3(g). The corresponding view graph is the output of our disambiguation method, and is subsequently used as the input of SfM pose optimization.
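To make the pipeline concrete, the following is a minimal, illustrative sketch of the stages of Fig. 3. It is not the released implementation: all helper names (build_view_graph, build_cluster_graph, build_vmst, motion_averaging, rotation_ok, translation_ok) are hypothetical placeholders for the steps detailed in Sections 3.1–3.3, and the per-edge boolean checks simplify the ranked thresholding actually used.

```python
def disambiguate(images):
    """Pipeline of Fig. 3: match -> cluster -> VMST -> pose prior -> verify."""
    view_graph = build_view_graph(images)               # COLMAP matching, Fig. 3(b)
    cluster_graph = build_cluster_graph(view_graph)     # local information, Section 3.1
    vmst = build_vmst(cluster_graph)                    # global information, Section 3.2
    R_ref, t_ref = motion_averaging(vmst.view_edges())  # global pose prior (reference)
    for edge in cluster_graph.edges_not_in(vmst):       # uncertain cluster edges
        if rotation_ok(edge, R_ref) and translation_ok(edge, R_ref, t_ref):
            vmst.add(edge)                              # keep reliable loop closures
    return vmst.view_graph()                            # fed to SfM pose optimization
```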
3.1. Cluster graph construction

The first step of our approach is to construct a cluster graph from the view graph. Each cluster consists of a set of closely matched images, and can provide sufficient local information for disambiguation. Matches within the same cluster are reliable because the images involved in the cluster are close enough; they provide a great number of reliable view edges for the subsequent global pose prior construction. Matches between the same cluster pair are treated as a whole in the process of pose deviation detection. The pipeline of cluster graph construction is presented in Fig. 4. The details are introduced in the following sections.

3.1.1. Similarity forest

In recent years, several light-weight image clustering methods (Omran et al., 2005; Dehariya et al., 2010; Yang et al., 2010; Chang et al., 2017) have been proposed to generate clusters of images. However, these methods directly perform clustering on the original view graph, which contains unreliable edges caused by ambiguous matches. This makes them vulnerable to ambiguous matches, so ambiguous images may still end up in the same cluster. In our proposed method, we select only the most similar image for each image in the view graph to enhance the internal reliability of clustering. This strategy is inspired by the statistical observation that, although there may be numerous ambiguous images, the image pair with the highest similarity is likely to be the most reliable


Fig. 3. Overview of our pipeline. A cluster graph is constructed from the original view graph to extract local information of the scene. Then a verified maximum spanning tree
is established to obtain global information. Other uncertain cluster edges are verified by pose consistency verification.

Fig. 4. Pipeline of cluster graph construction. A similarity forest is constructed at first using the replaceability metric (Gong et al., 2023). The view vertices are divided into
different trees. After the cut of over-sized trees, each tree represents an image cluster. Then, a cluster graph is constructed.

match (Cui et al., 2021; Shen et al., 2016). The direct approach is to use the number of matched features to evaluate the similarity of two images. However, this approach is easily affected by different image textures and incorrect matches. To address this problem, we adopt the replaceability metric presented in our previous work (Gong et al., 2023) to identify the most similar image for each image. The metric uses grids to avoid the effect of different textures and introduces a third image to improve robustness.
Here, we briefly introduce the definition of the replaceability metric. For two image vertices $v_a$ and $v_b$, the similarity score $M_{ab}$ is calculated as:
$$M_{ab} = \frac{\sum_{k=1}^{K} (w_k \cdot M_{abc_k})}{\sum_{k=1}^{K} w_k}, \quad (1)$$
where $M_{abc_k}$ is the replaceability score of $v_a$ in an image triplet $\langle v_a, v_b, v_{c_k} \rangle$. Gong et al. (2023) introduce a third vertex $v_{c_k} \in \mathcal{N}(v_a)$ to calculate the similarity score, where $\mathcal{N}(v_a)$ denotes the set of adjacent vertices of $v_a$ and $K$ is the number of vertices in $\mathcal{N}(v_a)$. $M_{abc_k}$ is calculated as:
$$M_{abc_k} = 0.5 \cdot \left( \frac{g_{ab}}{g_a} + \frac{n_{bac_k}}{n_{bc_k}} \right), \quad (2)$$
where $g_a$ indicates the number of grids containing features in $v_a$, $g_{ab}$ denotes the number of grids in $v_a$ matched with $v_b$, $n_{bac_k}$ indicates the number of features in $v_a$ simultaneously matched with $v_b$ and $v_{c_k}$, and $n_{bc_k}$ indicates the number of matched features between $v_b$ and $v_{c_k}$. In Eq. (1), $w_k$ is a weight that is positively related to the value of $n_{bac_k}$. For a detailed description of the replaceability metric, please refer to Gong et al. (2023).
In Fig. 4(b), we present a similarity forest constructed by our method. In the similarity forest, each vertex still denotes an image, and each edge represents the most reliable matching relationship; the other edges of the view graph are removed. The weight of each edge is the value of the replaceability metric. Since the edges in the similarity forest represent the most-similar relationships, they are rarely ambiguous.

402
Y. Gong et al. ISPRS Journal of Photogrammetry and Remote Sensing 209 (2024) 398–414

Fig. 5. The illustration of over-sized tree pruning. Gray ellipses represent the initial sub-trees, and red dashed ellipses denote the final merged sub-trees. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
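As a concrete illustration of the replaceability metric of Section 3.1.1, the sketch below transcribes Eqs. (1)–(2) and the most-similar-neighbor selection that yields the similarity forest. The exact form of the weight w_k is an assumption (the paper only states that it grows with n_back), and stats_for is a hypothetical helper that returns the grid and feature counts (g_a, g_ab, n_back, n_bck) of each triplet ⟨v_a, v_b, v_ck⟩.

```python
import numpy as np

def m_abck(g_a, g_ab, n_back, n_bck):
    # Eq. (2): replaceability score of v_a in the triplet <v_a, v_b, v_ck>.
    return 0.5 * (g_ab / g_a + n_back / n_bck)

def m_ab(triplet_stats):
    # Eq. (1): weighted mean of the triplet scores over all v_ck in N(v_a).
    # Assumption: w_k = n_back, since w_k is only said to increase with n_back.
    scores = np.array([m_abck(*t) for t in triplet_stats])
    weights = np.array([t[2] for t in triplet_stats], dtype=float)
    return float((weights * scores).sum() / weights.sum())

def most_similar_neighbor(v_a, neighbors, stats_for):
    # The similarity forest keeps, for each image, only its best neighbor.
    return max(neighbors, key=lambda v_b: m_ab(stats_for(v_a, v_b)))
```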

Table 1
Some symbols defined for better description.

Definition                                               Symbol
Cluster edge between clusters C_I and C_J                E_IJ
View vertices that belong to clusters C_I and C_J        𝒱_I and 𝒱_J
Intra-cluster edges that belong to clusters C_I and C_J  ℰ_I and ℰ_J
Inter-cluster edges between C_I and C_J                  ℰ_IJ
All view edges of C_I and C_J                            ℰ*_IJ = ℰ_I ∪ ℰ_J ∪ ℰ_IJ
All view vertices of C_I and C_J                         𝒱*_IJ = 𝒱_I ∪ 𝒱_J

3.1.2. Cluster graph construction

In the similarity forest, each vertex is linked to its most similar vertex, resulting in many similarity trees. Although each similarity tree is typically small, some trees may occasionally grow to a large size. A tree with a large size means that the feature tracks in this tree are long, which may leave ambiguous matches inside the same tree. To avoid this problem, we prune trees whose size is larger than a threshold T_s (empirically set to 5).
To achieve this, we follow the steps presented in Fig. 5 to prune the over-sized trees. First, we merge all leaf vertices (degree = 1) with their connected vertices to form sub-trees. Different leaf vertices that share a connected vertex are merged into the same sub-tree. If a vertex does not belong to any sub-tree, it is also regarded as a sub-tree. In Fig. 5(b), the sub-trees are highlighted by gray ellipses. Next, we sort the edges connecting different sub-trees according to their weights in descending order, and then iteratively merge the two sub-trees whose total size is smaller than T_s. In this way, the large trees are cut into many small ones.
Vertices that belong to the same tree form a cluster. The cluster contains local information of its images. Images in the same cluster are unambiguous because they are connected by the most reliable matching relationships and the cluster size is limited. Then, we categorize the edges of the original view graph into two groups: intra-cluster edges and inter-cluster edges. An intra-cluster edge connects images included in the same cluster, while an inter-cluster edge connects two images in different clusters, as shown in Fig. 4(c). Intra-cluster edges are reliable since the images in the same cluster are unambiguous; they provide a great number of reliable view edges for the subsequent global pose prior construction.

Fig. 6. Visual illustration of some concepts used in our method. The blue dots denote view vertices. The intra-cluster edges and inter-cluster edges are represented as gray and green lines. Two clusters C_I and C_J are highlighted by red ellipses, and the cluster edge is denoted by a red line. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
We build a cluster graph Ĝ = {𝒞, ℰ̂} based on these clusters, where 𝒞 is the set of clusters and ℰ̂ denotes all cluster edges. For each cluster edge E_IJ ∈ ℰ̂ between two clusters C_I and C_J, the weight is defined as:
$$W_{IJ} = \frac{2 \cdot N_{IJ}}{N_I + N_J}, \quad (3)$$
where $N_{IJ}$ denotes the maximum number of matched features among the inter-cluster edges between C_I and C_J, and $N_I$ and $N_J$ indicate the maximum number of matched features among the intra-cluster edges in C_I and C_J, respectively. The weight evaluates the reliability of the cluster edge between the two clusters.
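The cluster-edge weight of Eq. (3) is a direct computation; a minimal sketch, assuming the per-edge feature-match counts are available from the matching stage:

```python
def cluster_edge_weight(inter_counts, intra_counts_i, intra_counts_j):
    # Eq. (3): W_IJ = 2 * N_IJ / (N_I + N_J); each N is the maximum number
    # of matched features over the corresponding set of view edges.
    n_ij = max(inter_counts)    # over inter-cluster edges between C_I and C_J
    n_i = max(intra_counts_i)   # over intra-cluster edges of C_I
    n_j = max(intra_counts_j)   # over intra-cluster edges of C_J
    return 2.0 * n_ij / (n_i + n_j)
```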
In Fig. 6, we visually illustrate the concepts of the cluster graph, intra-cluster edge, inter-cluster edge and cluster edge. A cluster C_I includes a set of view vertices 𝒱_I and a set of intra-cluster edges ℰ_I, where ℰ_I is the set of view edges between the view vertices included in 𝒱_I. Similarly, cluster C_J includes 𝒱_J and ℰ_J. The set of inter-cluster edges between C_I and C_J is denoted as ℰ_IJ, which comprises the view edges between 𝒱_I and 𝒱_J. In addition, there is an edge E_IJ between C_I and C_J, which we name the cluster edge. We further define the union of ℰ_I, ℰ_J and ℰ_IJ as ℰ*_IJ; namely, all view edges in Fig. 6 are denoted as ℰ*_IJ, and we call ℰ*_IJ all view edges of C_I and C_J. Likewise, we define the union of 𝒱_I and 𝒱_J as 𝒱*_IJ, and call 𝒱*_IJ all view vertices of C_I and C_J. All of these symbols are listed in Table 1.

3.2. Verified maximum spanning tree construction

The cluster graph contains all global information of the scene. However, some cluster edges may be ambiguous. Therefore, we extract a verified maximum spanning tree (VMST) from the cluster graph to build the reliable global pose prior. We apply an improved Kruskal's algorithm (Kruskal, 1956) to construct the maximum spanning tree. There are three steps in the VMST construction. In the first step, we sort the cluster edges according to their weights in descending order. In the second step, we select the edge with the largest weight and check whether it can be added to the spanning tree. Two conditions should be met before adding this edge: first, no cycle may be produced after adding the edge; second, the edge should pass the rotation consistency verification, which is designed to ensure that the cluster edges of the final VMST are reliable. Specifically, we compute the rotation error of the edge according to the definition in Eq. (5); if the rotation error is larger than 15°, the edge is rejected. In the last step, we repeat step 2 until all clusters are connected by the VMST. The final reliable VMST is shown in Fig. 7(b). Let Ĝ_vmst = {𝒞, ℰ̂_vmst} represent the VMST, where 𝒞 is the set of all clusters and ℰ̂_vmst denotes all cluster edges included in the VMST.
In our method, we actually use the view edges ℰ_vmst involved in the VMST to build global poses. We first construct the corresponding view graph for the VMST and then collect all view edges ℰ_vmst. There are two kinds of view edges in ℰ_vmst: intra-cluster view edges inside the clusters, and inter-cluster view edges between two connected clusters. In this way, the view edges ℰ_vmst include both local and global information of the scene. In Fig. 7, we illustrate why we build the VMST at cluster-level to find reliable view edges ℰ_vmst instead of directly building the MST of the view graph at image-level. The original view graph and its MST are presented in Fig. 7(a) and (d), respectively. In Fig. 7(b) and (e), we present the VMST generated by our method and its corresponding view graph. From Fig. 7(d) and (e), it can be seen that the number of edges in ℰ_vmst is greater than that in the MST. These additional but reliable edges provide accurate and robust constraints for the global pose prior construction.
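The VMST construction of Section 3.2 can be sketched as a modified Kruskal's algorithm with union-find; the rotation gate corresponds to the 15° check on Eq. (5). The edge objects (with weight, ci, cj fields) and rotation_error are illustrative placeholders, and the sketch omits the paper's final step of repeating the scan until all clusters are connected.

```python
def build_vmst(clusters, cluster_edges, rotation_error, max_err_deg=15.0):
    parent = {c: c for c in clusters}

    def find(c):  # union-find with path compression
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    vmst = []
    for e in sorted(cluster_edges, key=lambda e: e.weight, reverse=True):
        ri, rj = find(e.ci), find(e.cj)
        if ri == rj:                         # condition 1: would close a cycle
            continue
        if rotation_error(e) > max_err_deg:  # condition 2: Eq. (5) exceeds 15 deg
            continue
        parent[ri] = rj                      # merge the two components
        vmst.append(e)
    return vmst
```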


Fig. 7. (a) is the original view graph and (d) is its MST. (b) is the VMST constructed from (a), and its corresponding view graph is shown in (e). (c) shows the uncertain cluster
edges (light blue) that need to be verified. (f) is the corresponding view graph of (c), and the edges colored by light blue can provide extra loop closures but they are uncertain.
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Some reliable and robust view edges may be removed in the MST of the view graph. In addition, there are still some reliable inter-cluster view edges that are not included in the VMST. These extra edges contain important loop closures that benefit pose and structure recovery, as illustrated in Fig. 7(c) and (f). This is also the reason why we need to further perform pose consistency verification to find these extra reliable edges.
To verify the ambiguity of the cluster edges excluded from the VMST, we need to build the global pose prior using the view edges ℰ_vmst. The global poses are recovered by motion averaging (Zhu et al., 2018), a technique commonly used in global SfM. It consists of sequential rotation averaging (Sweeney et al., 2015) and translation averaging (Zhuang et al., 2018). The illustration of motion averaging is presented in Fig. 8. The inputs of motion averaging are the relative poses computed from the epipolar geometry of the view edges; the outputs are the absolute rotations and translations of all cameras. Let N represent the number of all vertices in the view graph. Then the global rotations are represented by R^vmst = {R_1^vmst, …, R_N^vmst}, and the translations are represented by t^vmst = {t_1^vmst, …, t_N^vmst}. These global poses of all images are regarded as the reference for the following pose consistency verification.

3.3. Pose consistency verification

The pose consistency of the uncertain cluster edges is verified using the global poses R^vmst and t^vmst as the reference. During the verification, poses are recovered by motion averaging using the uncertain ambiguous cluster edge. Then, we detect the pose deviation from the reference and use it to determine whether the edge is ambiguous. If a cluster edge between two clusters is identified as ambiguous, the corresponding inter-cluster edges are regarded as unreliable.
The verification is feasible for three reasons. First, the reference poses are accurate and robust enough because they are constructed using the local and global information of the scene. Second, motion averaging is sensitive to outliers; thus, every incorrect edge will cause a pose deviation. Third, we use the cluster edge instead of the view edge as the detection unit: a cluster edge contains multiple view edges, which together enlarge the pose deviation, and these multiple view edges can be processed in batch and in parallel.

3.3.1. Rotation consistency verification

For each uncertain cluster edge E_IJ ∈ {ℰ̂ − ℰ̂_vmst} between two clusters C_I and C_J, we perform rotation consistency verification to determine whether it is ambiguous. Rotation averaging (Sweeney et al., 2015) is performed using ℰ*_IJ (the definition is listed in Table 1) as input. The outputs are denoted as R^edge, and we use R_i^edge ∈ R^edge to represent the rotation of each involved view v_i ∈ 𝒱*_IJ. If the cluster edge E_IJ is ambiguous, the incorrect inter-cluster view edges will cause a drastic pose deviation from the reference global rotations R^vmst. However, we cannot directly compare absolute rotations that belong to different coordinate systems. Thus, we evaluate the deviation using relative rotations. Specifically, the rotation error error^R_mn of each view edge e_mn ∈ ℰ*_IJ is defined as:
$$error^R_{mn} = \left\| R_n^{vmst} \cdot (R_m^{vmst})^T \cdot R_m^{edge} \cdot (R_n^{edge})^T \right\|, \quad (4)$$
where $R_n^{vmst} \cdot (R_m^{vmst})^T$ computes the relative rotation between $v_m$ and $v_n$ in the reference coordinate system, and $R_m^{edge} \cdot (R_n^{edge})^T$ has the analogous meaning for the edge-induced rotations. This error evaluates the pose deviation caused by view edge $e_{mn}$. The rotation error $Error^R_{IJ}$ of the cluster edge is defined as the maximum rotation error of all involved view edges:
$$Error^R_{IJ} = \max_{e_{mn} \in ℰ^*_{IJ}} \left\{ error^R_{mn} \right\}. \quad (5)$$
This error enlarges the difference between the correct and incorrect cluster edges, and makes rotation inconsistency easy to detect.
We evaluate each cluster edge depending on its error, and reserve only the promising cluster edges ℰ̂_rot for the following translation verification. We first reserve the cluster edges whose errors are less than 5°. If the ratio of reserved edges is less than 60%, we relax the angle threshold to 15° and then iteratively push the cluster edge with the smallest error to ℰ̂_rot until 60% of the edges are reserved.
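A minimal sketch of the rotation errors of Eqs. (4)–(5), measuring the angle of the residual rotation with SciPy; rotations are 3×3 matrices indexed by view id, and reporting the error in degrees is our choice of unit.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def view_edge_rotation_error(R_vmst_m, R_vmst_n, R_edge_m, R_edge_n):
    # Eq. (4): residual between the reference relative rotation and the one
    # induced by the candidate cluster edge; its angle is the error.
    residual = (R_vmst_n @ R_vmst_m.T) @ (R_edge_m @ R_edge_n.T)
    return np.degrees(Rotation.from_matrix(residual).magnitude())

def cluster_edge_rotation_error(view_edges, R_vmst, R_edge):
    # Eq. (5): the cluster-edge error is the maximum over its view edges,
    # which amplifies the deviation of an ambiguous cluster edge.
    return max(view_edge_rotation_error(R_vmst[m], R_vmst[n],
                                        R_edge[m], R_edge[n])
               for (m, n) in view_edges)
```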


Fig. 8. Illustration of motion averaging. The relative poses of ℰ_vmst are the inputs, and the absolute poses R^vmst and t^vmst are the outputs.

Fig. 9. The incorrect matches between two similar walls in opposite directions (180◦ ) can be easily detected by the rotation consistency verification.

This strategy filters out edges with large rotation error, and ensures that sufficient edges are reserved for the next translation consistency verification. It should be noted that our approach is insensitive to the angle threshold, because the errors of correct and incorrect cluster edges are in general significantly different; a detailed illustration is given in Section 4.5. The rotation consistency verification is summarized in Algorithm 1. The method can effectively find incorrect rotations caused by ambiguous matches, as illustrated in Fig. 9.

Algorithm 1 Rotation consistency verification
Input: Cluster graph Ĝ = {𝒞, ℰ̂}, VMST Ĝ_vmst = {𝒞, ℰ̂_vmst}, global rotations R^vmst
Output: Reserved cluster edges ℰ̂_rot
 1: for each E_IJ ∈ {ℰ̂ − ℰ̂_vmst} do
 2:   perform rotation averaging on ℰ*_IJ to compute R^edge
 3:   for each e_mn ∈ ℰ*_IJ do
 4:     compute error^R_mn = ‖R_n^vmst · (R_m^vmst)^T · R_m^edge · (R_n^edge)^T‖
 5:   end for
 6:   Error^R_IJ = max over e_mn ∈ ℰ*_IJ of {error^R_mn}
 7: end for
 8: for each E_IJ ∈ {ℰ̂ − ℰ̂_vmst} do
 9:   if Error^R_IJ ≤ 5° then
10:     reserve this edge, and add it to ℰ̂_rot
11:   end if
12: end for
13: compute the ratio R_res of reserved edges
14: if R_res ≥ 60% then
15:   return ℰ̂_rot
16: else
17:   relax the angle threshold to 15°, and iteratively push the cluster edge with the smallest error to ℰ̂_rot until R_res ≥ 60%
18:   return ℰ̂_rot
19: end if

3.3.2. Translation consistency verification

In general, rotation consistency verification can filter out a large portion of the outlier edges. However, many ambiguous scenes are observed with almost pure translation, as shown in Fig. 1. In this case, the ambiguous matches cannot be detected by rotation consistency verification. The method proposed by Shen et al. (2016) fails on such data because it cannot disambiguate matches between images with pure translational motion. To solve this problem, we propose to further perform translation consistency verification for the cluster edges.
We use translation averaging (Zhuang et al., 2018) to perform the translation consistency verification. Translation averaging takes the global rotations and the relative translations of matched image pairs as inputs, and attempts to find the absolute locations of all cameras (up to a gauge freedom). Translation averaging is sensitive to outliers (Zhuang et al., 2018), so we can use it to detect incorrect edges. However, since the scales and coordinate systems of the reference global poses and of the translation averaging output are different, we have to align the two coordinate systems before detecting incorrect edges. Our reference global poses can provide enough reliable camera poses to align the two coordinate systems; this is also the main reason that the proposed method can perform translation verification.
For each cluster edge E_IJ ∈ ℰ̂_rot, translation averaging is performed using ℰ_vmst ∪ ℰ*_IJ and the rotations R^vmst as inputs. The outputs are denoted as t^edge. The scales and coordinate systems of t^edge and of the reference t^vmst are different, and we need sufficiently many reliable camera locations to align t^edge and t^vmst; this is why we use ℰ_vmst and ℰ*_IJ together to perform translation averaging. To align t^edge and t^vmst, we need to estimate the transformation between the two coordinate systems. Because the locations of the view vertices 𝒱*_IJ may be inaccurate if the current cluster edge E_IJ is an outlier, we estimate a SIM(3) transformation using the remaining view vertices that are not included in 𝒱*_IJ, as illustrated in Fig. 10. The transformation parameters (rotation, translation and scale) are denoted as R, t and s, respectively.


Fig. 10. Illustration of transformation matrix calculation. This matrix is calculated using vertices (blue dots in (b)) that do not belong to 𝐶𝐼 and 𝐶𝐽 . (For interpretation of the
references to color in this figure legend, the reader is referred to the web version of this article.)

Subsequently, we transform the coordinate system of t^edge to that of t^vmst with the use of R, t and s. For each view vertex v_m ∈ 𝒱*_IJ, the translation error error^t_m is defined as:
$$error^t_m = \left\| R \cdot s \cdot t_m^{edge} + t - t_m^{vmst} \right\|_2, \quad (6)$$
where $R \cdot s \cdot t_m^{edge} + t$ is the coordinate transformation: it transforms $t_m^{edge}$ into the reference coordinate system, and $t_m^{vmst}$ is the reference location of $v_m$. Thus, $error^t_m$ is the Euclidean distance between the transformed $t_m^{edge}$ and $t_m^{vmst}$. We then define the translation error $Error^t_{IJ}$ of $E_{IJ}$ as the mean rather than the maximum error over $v_m ∈ 𝒱^*_{IJ}$, because the translation error of a single vertex is not stable enough:
$$Error^t_{IJ} = \frac{\sum_{v_m \in 𝒱^*_{IJ}} error^t_m}{M}, \quad (7)$$
where M is the number of view vertices included in 𝒱*_IJ. After the errors of all cluster edges have been computed, we sort them in ascending order and reserve the first 80% of the edges. Ultimately, the cluster edges that pass both the rotation and the translation consistency verification are added into the VMST. Then, we construct the corresponding view graph of the final VMST. This view graph is further fed into an SfM method such as COLMAP (Schonberger and Frahm, 2016) to obtain the final accurate and complete structures and camera poses.
By removing the view edges that cause drastic pose deviation from the global poses, the pose consistency verification improves the robustness of SfM in ambiguous scenes. Furthermore, the accuracy of pose estimation in SfM is also improved even for unambiguous scenes, because potential incorrect edges that may cause pose deviation are filtered out as well.

4. Experimental results

4.1. Experimental datasets

Although the proposed SfM approach is designed for images collected from ambiguous scenes, it should also be applicable to regular scenes. Therefore, we conducted experiments on both ambiguous datasets and regular datasets. Brief information on all datasets is listed in Table 2. The WHU-XT20 consists of two self-collected ambiguous image sequences captured at Wuhan University; some of their images are shown in Fig. 11. For Books, Cereal and Street, which are three small data, we manually construct ground truth camera poses: we first manually remove ambiguous edges from the original view graphs and then directly run COLMAP using the correct image matches to obtain ground truth camera poses. For the rest of the ambiguous data, because the number of included images is large, it is really hard to manually remove ambiguous edges from the huge view graphs. The WHU-BL (Gong et al., 2023) dataset is captured in regular scenes, and its images have ground truth camera poses. Thus, we performed quantitative evaluation for the different methods on the Roberts et al. (2011) dataset and the WHU-BL dataset.

4.2. Comparison methods and evaluation metrics

For all experiments, we adopted COLMAP¹ (Schonberger and Frahm, 2016) as the baseline method because of its competitive performance reported in Michelini and Mayer (2020). We integrated the proposed cluster-based disambiguation method into COLMAP and name it MA-SfM for comparison. Geodesic-aware disambiguation (Yan et al., 2017) is a state-of-the-art method for ambiguous scenes in SfM. The original code² is implemented on Bundler (Snavely et al., 2006); for fairness, we fuse its output into COLMAP and name it Geodesic-SfM. In addition, we also compared our method with a newly proposed disambiguation method, TC-SfM (Wang et al., 2022), which uses track consistency to disambiguate matches. In the ablation studies, we designed two new methods for comparison. The first one, named NoClustPose, directly uses the view graph instead of the cluster graph to generate global poses. The second one, named NoClustVeri, directly uses the view edge error instead of the cluster edge error to perform consistency verification; namely, NoClustVeri performs pose consistency verification for each view edge instead of each cluster edge. These two methods are applied to illustrate that global pose generation and pose consistency verification at cluster-level are effective. It should be noted that all of the above-mentioned methods use the same feature matching module with the default parameters of COLMAP to find matches for the input images.

Although the ground truth is unknown in most of the ambiguous datasets, mismatches of ambiguous images are visually obvious in the final 3D reconstruction results, appearing as misaligned or superimposed structures. As a result, the analysis of the experimental results on ambiguous datasets primarily relies on qualitative evaluation. To convincingly illustrate the effectiveness of our method, we also perform quantitative evaluation on the Books, Cereal and Street data. In addition, the regular dataset used in our experiments has ground truth camera poses. Therefore, we further conducted a quantitative evaluation of pose accuracy on this dataset as a supplementary analysis. We assessed pose accuracy using the absolute pose error (APE) metric calculated by

¹ Downloaded from https://github.com/colmap/colmap/releases.
² Downloaded from https://github.com/yanqingan/SfM_Disambiguation.


Table 2
The datasets used in our experiments. ✓ and × mean that ground truth is known and unknown, respectively.

Dataset                 Image sequence  Source          Ambiguity        Image resolution  Ground truth
Roberts et al. (2011)   Books           Public          High             1067 × 800        ✓
                        Cereal          Public          High             1067 × 800        ✓
                        Street          Public          High             800 × 1067        ✓
Jiang et al. (2012)     Temple          Public          High             4368 × 2912       ×
Heinly et al. (2014)    Alex            Public          Relatively high  972 × 694         ×
                        Berliner        Public          Relatively high  964 × 690         ×
Wang et al. (2019)      B1              Shared          Relatively high  3696 × 2448       ×
                        B2              Shared          Relatively high  3936 × 2624       ×
                        B3              Shared          Relatively high  3936 × 2624       ×
Blendedmvs              Buildingcar     Public          Medium           768 × 576         ×
Quad6k                  Quad6k          Public          Medium           1788 × 1210       ×
Tank-and-Temples        Auditorium      Public          Relatively high  3840 × 2160       ×
                        Palace          Public          Relatively high  3840 × 2160       ×
                        Courthouse      Public          Relatively high  1920 × 1080       ×
WHU-XT20                Luojia          Self-collected  Relatively high  960 × 540         ×
                        Xingzheng       Self-collected  Relatively high  960 × 540         ×
WHU-BL                  Building        Self-collected  Regular scenes   1920 × 1080       ✓
                        Library         Self-collected  Regular scenes   1920 × 1080       ✓

Fig. 11. Some images of the self-collected ambiguous dataset.

the evo-toolbox.³ It follows the same process as ATE (Sturm et al., 2012) by first aligning the two trajectories; then, the translation distances and rotation errors of the frames are computed. In this way, we analyzed the consistency between the SfM poses and the ground truth poses. When evaluating efficiency, the computational times of feature extraction and matching are not included in the final cost because they are the same for all experiments. All tests are conducted on the same computer, whose configuration specifications are listed in Table 3.

Table 3
Computer configuration specifications.

Processor (CPU)      i9-10920 (12 cores, 3.50 GHz)
Graphics Card (GPU)  2 × NVIDIA GeForce RTX 3090 (10496 CUDA cores, 1.70 GHz)
Memory               128 GB DDR4
Operating System     Ubuntu 18.04.1

³ https://github.com/MichaelGrupp/evo/blob/master/notebooks/metrics.py_API_Documentation.ipynb
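For the pose accuracy numbers, an APE-style computation in the spirit of the evo toolbox can be sketched as below. This is an illustrative re-implementation (reusing the umeyama_sim3 helper sketched in Section 3.3.2), not evo's own API, and it omits rotating the estimated camera orientations by the alignment for brevity.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def ape_rmse(est_pos, gt_pos, est_rot, gt_rot):
    # Align the estimated trajectory to the ground truth, then report the
    # translation RMSE and the rotation RMSE (degrees) over all frames.
    s, R, t = umeyama_sim3(est_pos, gt_pos)
    aligned = (s * (R @ est_pos.T)).T + t
    trans_rmse = float(np.sqrt(np.mean(np.sum((aligned - gt_pos) ** 2, axis=1))))
    rot_errs = [np.degrees(Rotation.from_matrix(Rg @ Re.T).magnitude())
                for Re, Rg in zip(est_rot, gt_rot)]
    rot_rmse = float(np.sqrt(np.mean(np.square(rot_errs))))
    return trans_rmse, rot_rmse
```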


Table 4
The comparison results of all methods on the ambiguous datasets. #images represents the number of images included in the data. #reg represents the number of registered images in the SfM results. #pts represents the number of reconstructed 3D points. rt represents runtime in minutes. ✓ represents success; × represents failure.

Index  Name          #images | COLMAP (#reg / #pts / rt)  | Geodesic-SfM (#reg / #pts / rt) | TC-SfM (#reg / #pts / rt) | MA-SfM (#reg / #pts / rt)
(1)    Books         21      | 21 / 9K / 0.15 ×           | 21 / 8K / 0.32 ×                | 21 / 9K / 0.25 ✓          | 21 / 9K / 0.18 ✓
(1)    Cereal        25      | 25 / 12K / 0.25 ×          | 25 / 11K / 0.48 ×               | 25 / 12K / 0.58 ×         | 25 / 12K / 0.25 ✓
(1)    Street        19      | 19 / 5K / 0.08 ×           | 19 / 5K / 0.15 ×                | 19 / 4K / 0.28 ×          | 19 / 5K / 0.13 ✓
(2)    Temple        341     | 341 / 197K / 85 ×          | 340 / 179K / 80 ✓               | 341 / 198K / 96 ✓         | 341 / 214K / 48 ✓
(3)    Alex          448     | 446 / 227K / 23 ×          | 425 / 98K / 19 ✓                | 440 / 86K / 30 ×          | 447 / 120K / 21 ✓
(3)    Berliner_dom  1618    | 1603 / 238K / 99 ×         | 1525 / 223K / 155 ✓             | 1541 / 256K / 541 ✓       | 1600 / 234K / 145 ✓
(4)    B3            342     | 342 / 85K / 39 ×           | 138 / 64K / 29 ×                | 111 / 37K / 21 ×          | 342 / 87K / 14 ✓
(5)    Buildingcar   181     | 181 / 71K / 5 ×            | 154 / 50K / 4 ✓                 | 142 / 32K / 3 ×           | 181 / 63K / 3 ✓
(6)    Quad6k        6514    | 6152 / 1,386K / 579 ×      | 3541 / 827K / 10,080 ×          | more than one week        | 6051 / 1,362K / 1,204 ✓
(7)    Auditorium    298     | 298 / 36K / 10 ✓           | 291 / 26K / 22 ✓                | 296 / 31K / 6 ✓           | 296 / 34K / 3 ✓
(7)    Palace        730     | 719 / 222K / 40 ×          | 172 / 41K / 13 ×                | 158 / 31K / 19 ×          | 653 / 151K / 33 ✓
(7)    Courthouse    1105    | 1105 / 461K / 153 ×        | 1105 / 194K / 168 ✓             | 1105 / 212K / 69 ✓        | 1105 / 209K / 126 ✓
(8)    Luojia        872     | 872 / 215K / 38 ×          | 785 / 96K / 45 ×                | 232 / 62K / 25 ×          | 872 / 196K / 36 ✓
(8)    Xingzheng     713     | 713 / 123K / 32 ×          | 712 / 108K / 40 ✓               | 713 / 117K / 23 ✓         | 713 / 119K / 26 ✓

4.3. Experiments on ambiguous datasets

We illustrate the robustness and effectiveness of the proposed approach on the ambiguous datasets, and qualitatively compare all methods on them. All results are summarized in Table 4. If there are obvious misaligned or superimposed structures in the SfM reconstructions, or the results are obviously incomplete, the method is marked as a failure (×) in the table. It can be seen that the COLMAP algorithm itself has poor robustness on ambiguous data: COLMAP generates the correct SfM result only for the Auditorium data, and fails on the remaining 13 data (1/14). The Geodesic-SfM method successfully recovers poses and structures for half of the data (7/14). However, when the data is large and the matching relationships are dense, the efficiency of this method is greatly reduced; the computational time cost increases drastically on Quad6k (6514 images). In addition, the method tends to eliminate correct edges between images with perspective change, so its final number of registered images is significantly lower than that of our approach. TC-SfM succeeds on less than half (6/14) of the data. In addition, the computational time of this method is high for large data: for Quad6k, although this method takes more than one week, it still fails to give a final result. In contrast, our approach (MA-SfM) can successfully remove the wrong edges caused by ambiguous images, so it obtains correct results for all these data (14/14). In addition, the time efficiency is also improved compared with the baseline algorithm on most of the data (11/14). This is because the cost constraints introduced by the wrong edges have been removed, so the efficiency of bundle adjustment can be improved to a certain extent. Moreover, our approach registers more images and 3D points in almost all successful reconstructions.
The visual results of some image sequences are shown in Fig. 12. The scenes of these image sequences are various, and their scales and ambiguities are different. In addition, the cameras of the Street data are in pure translational motion. In Fig. 12, some obvious structure errors are marked by green circles. From Fig. 12, we observe that the proposed MA-SfM generates complete and correct results for all these image sequences, and significantly outperforms COLMAP, Geodesic-SfM, and TC-SfM. In addition, both Geodesic-SfM and TC-SfM tend to make a compromise between the completeness of reconstruction and disambiguation; they generate incomplete reconstruction results on the Buildingcar, B3, Auditorium, and Palace data. We selected several representative results from Fig. 12 to illustrate the superiority of our method. For Cereal, all methods except MA-SfM reconstruct completely wrong structures and poses. For the Street data, only MA-SfM reconstructs three instead of two facades; the rest of the methods fail on this data because it is difficult for them to disambiguate images that are in pure translational motion. For B3, the result of COLMAP has a large geometric misalignment because of ambiguous matches, while Geodesic-SfM and TC-SfM generate broken results because they delete some correct matches; only MA-SfM succeeds on this data. The Palace data includes numerous ambiguous images that lack noticeable visual contradictions, as illustrated in Fig. 2. Our approach robustly generates complete and correct structures for this data, while the other three methods all fail on this image sequence: COLMAP generates misaligned structures because of the similar pillars, Geodesic-SfM reconstructs two separate parts, and TC-SfM only recovers a part of the scene.
Quad6k is a large-scale data consisting of over 6,000 images. Although TC-SfM takes more than one week to process this data, it still fails to generate any results; thus, we only present the results of COLMAP, Geodesic-SfM and MA-SfM in Fig. 13. This data is challenging because there are many similar symmetrical walls, as shown in Fig. 13(a). In addition, this data covers a large spatial area, and the image correlation between some local regions is weak: some local regions are only connected by a limited number of feature matches. From Fig. 13(b), we observe that Geodesic-SfM generates incomplete structures for this data, especially in the top-left of the picture. Because Geodesic-SfM uses feature topology to perform disambiguation, it tends to eliminate some correct matches between images with weak correlation. In Fig. 13(c), we present the reconstruction result of COLMAP. Although COLMAP generates complete structures, a ghost structure appears; the main reason is that the ambiguous edges are not removed in COLMAP before performing pose and structure optimization. However, for this challenging data, the proposed MA-SfM still successfully recovers the correct structures and camera poses, as shown in Fig. 13(d). MA-SfM can effectively eliminate ambiguous matches while preserving correct matches as much as possible. Thus, MA-SfM avoids the appearance of ghost structures and generates a complete and correct reconstruction result.
The visual results of all methods on the self-collected ambiguous WHU-XT20 are shown in Fig. 14. The Luojia data is relatively difficult due to the highly repetitive textures and geometric structures. For this data, COLMAP, Geodesic-SfM and TC-SfM generate incorrect results. In the result of COLMAP, the cameras located in the green circles are wrongly aligned because the bilateral symmetrical walls are highly similar; in addition, a large number of cameras are dislocated or missing. Geodesic-SfM alleviates this problem to a certain extent, but still has some misaligned cameras. TC-SfM relies on a strict mechanism to distinguish different local regions, but these local reconstructions cannot be merged correctly, resulting in incomplete final results. However, for this challenging data, the proposed method still generates a correct and complete


Fig. 12. Visual results of all methods on several ambiguous data. From the second column to the last column, the pictures are results reconstructed by COLMAP (Schonberger
and Frahm, 2016), Geodesic-SfM (Yan et al., 2017), TC-SfM (Wang et al., 2022) and the proposed MA-SfM, respectively. The camera poses are shown in red. The green circles
mark some incorrect parts. Only our method can succeed on all these image collections. (For interpretation of the references to color in this figure legend, the reader is referred
to the web version of this article.)

reconstruction result. For the Xingzheng data, the MA-SfM and Geodesic-SfM methods generate correct results. However, COLMAP recovers dislocated camera poses in the areas marked by the green circles because of ambiguous images, and TC-SfM generates discontinuous camera poses, as marked by the green circle.
At last, to convincingly illustrate the superiority of our method, we quantitatively evaluated the performance of the different methods on the Books, Cereal and Street data. The evaluation results are reported in Table 5. The proposed MA-SfM offers the best scores for all three data. For the Books data, both COLMAP and Geodesic-SfM reconstruct wrong structures and poses, as shown in Fig. 12, so their scores are very low. Although TC-SfM generates correct structures for this data, its scores are slightly lower than ours. For the Cereal and Street data, only the proposed MA-SfM generates correct structures, so the scores of our method are significantly better than those of the other methods.


Fig. 13. Visual results of all methods on Quad6k. The result of Geodesic-SfM is incomplete, while COLMAP’s result has a ghost structure (highlighted by purple ellipse). Only our
method generates the correct result. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 5
The pose error comparison on the Books (K), Cereal (C) and Street (S) data. The best results are set in bold. 𝜇(𝑒) represents the mean error, 𝑚(𝑒) the median error and 𝑟(𝑒) the RMSE.

Data  Method         Translation (m)              Rotation (◦)
                     𝜇(𝑒)    𝑚(𝑒)    𝑟(𝑒)         𝜇(𝑒)    𝑚(𝑒)    𝑟(𝑒)
K     COLMAP         3.08    3.07    3.40         94.9    81.8    96.3
K     Geodesic-SfM   3.18    3.03    3.45         92.7    89.4    98.8
K     TC-SfM         0.0033  0.0030  0.0037       0.134   0.136   0.136
K     MA-SfM (ours)  0.0032  0.0028  0.0035       0.124   0.125   0.125
C     COLMAP         3.12    3.28    3.49         108.9   105.1   109.0
C     Geodesic-SfM   3.06    2.63    3.38         75.8    75.3    76.7
C     TC-SfM         1.768   1.619   1.972        160.6   160.6   161.6
C     MA-SfM (ours)  0.0624  0.0541  0.069        0.854   0.855   0.863
S     COLMAP         3.11    3.25    3.58         174.9   174.9   174.9
S     Geodesic-SfM   3.06    3.23    3.58         167.9   167.9   168.0
S     TC-SfM         2.951   2.934   2.97         152.5   152.6   152.6
S     MA-SfM (ours)  0.167   0.158   0.17         1.314   1.321   1.327
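
For clarity, the error statistics reported in Tables 5, 6 and 8 (mean, median and RMSE of the per-camera rotation and translation errors) can be computed as in the following sketch. This is a minimal illustration of ours, not code from the paper; it assumes the estimated poses have already been expressed in the ground-truth coordinate frame.

```python
import numpy as np

def rotation_angle_deg(R):
    # Geodesic angle of a 3x3 rotation matrix, recovered from its trace
    # (clamped to [-1, 1] for numerical safety).
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

def pose_error_stats(R_est, t_est, R_gt, t_gt):
    # R_*: iterables of 3x3 rotation matrices; t_*: iterables of 3D camera
    # positions, both in the ground-truth frame.
    rot = np.array([rotation_angle_deg(Re @ Rg.T)
                    for Re, Rg in zip(R_est, R_gt)])
    trans = np.array([np.linalg.norm(te - tg)
                      for te, tg in zip(t_est, t_gt)])
    def stats(e):
        return {"mean": e.mean(),
                "median": np.median(e),
                "rmse": np.sqrt((e ** 2).mean())}
    return {"rotation_deg": stats(rot), "translation_m": stats(trans)}
```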

4.4. Experiments on regular datasets

In addition to the ambiguous datasets, we also conducted experiments on a regular dataset (Gong et al., 2023). This dataset contains two image collections, Building and Library, and provides ground-truth camera poses, so we can perform a quantitative evaluation of all methods. The visual results of the different methods on this dataset are shown in Fig. 15. We observed that COLMAP and the proposed MA-SfM successfully recover the correct structures and poses, which demonstrates that the proposed disambiguation approach does not degrade the performance of SfM on regular data. We also found that Geodesic-SfM recovers two separate reconstructions on the Library data and that TC-SfM fails on both image collections. These results demonstrate that both Geodesic-SfM and TC-SfM are prone to removing correct matches even when they are not ambiguous. To quantitatively evaluate the performance of the different methods, we computed their pose errors on this dataset; the results are reported in Table 6. The pose errors of TC-SfM are not reported because it fails on both image collections. The proposed MA-SfM achieves the best accuracy on both image collections. This is because our approach not only eliminates the influence of ambiguous images but also filters out other mismatches that affect the pose optimization.
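
The paper does not spell out the evaluation protocol, but comparing an SfM result against ground-truth poses generally requires first registering the two coordinate frames. A common choice, shown below as a hedged sketch (the function and its use here are our own assumptions, not taken from the paper), is the Umeyama similarity alignment of the estimated camera centers to the ground-truth centers before computing the errors in Table 6.

```python
import numpy as np

def umeyama_alignment(X, Y):
    # Find scale s, rotation R and translation t such that Y ~ s * R @ X + t,
    # where X, Y are (N, 3) arrays of estimated / ground-truth camera centers.
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    cov = Yc.T @ Xc / len(X)                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # guard against reflections
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / len(X)
    s = np.trace(np.diag(D) @ S) / var_x
    t = mu_y - s * R @ mu_x
    return s, R, t
```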


Fig. 14. Visual results of the self-collected ambiguous dataset WHU-XT20. The green circles mark some incorrect parts. Only our method succeeds on both image sequences. (For
interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 15. Visual results of all methods on WHU-BL dataset. The camera poses are shown in red. The green circles mark some incorrect parts. Our method and COLMAP succeed
on both image sequences. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.5. Ablation studies

To convincingly illustrate the effectiveness of the cluster graph used in the proposed disambiguation method, we compared MA-SfM with NoClustPose and NoClustVeri.

4.5.1. Ablation study of global pose generation

To explore the effectiveness of the cluster graph in generating global poses, we designed NoClustPose, which directly uses the MST of the view graph instead of the VMST of the cluster graph to generate global poses. In Table 7, we report the experimental results of NoClustPose on the ambiguous datasets.


Fig. 16. The histograms of edge rotation errors. It is easy to distinguish reliable and unreliable edges in (b) and (c) because their errors are very different. In (b) and (c), we use
the red bins to represent unreliable edges. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 6
The pose error comparison on the WHU-BL dataset. The best results are set in bold. B represents Building and L represents Library. 𝜇(𝑒) represents the mean error, 𝑚(𝑒) the median error and 𝑟(𝑒) the RMSE.

Data  Method         Translation (m)           Rotation (◦)
                     𝜇(𝑒)   𝑚(𝑒)   𝑟(𝑒)        𝜇(𝑒)   𝑚(𝑒)   𝑟(𝑒)
B     COLMAP         0.206  0.166  0.249       0.691  0.538  0.832
B     Geodesic-SfM   0.225  0.194  0.257       3.618  3.042  3.596
B     MA-SfM (ours)  0.174  0.132  0.213       0.514  0.375  0.612
L     COLMAP         0.100  0.100  0.150       1.983  1.857  2.047
L     Geodesic-SfM   0.129  0.097  0.217       4.810  3.970  4.778
L     MA-SfM (ours)  0.074  0.062  0.090       1.402  0.915  1.208

NoClustPose succeeds on only 4 of the 14 ambiguous datasets, whereas MA-SfM generates correct results on all of them, as shown in Table 2. The improvement of MA-SfM arises from the combination of local and global information offered by the cluster graph. To further explore this improvement, we also performed quantitative experiments on the WHU-BL dataset, which provides ground-truth camera poses. The reference pose errors of NoClustPose and MA-SfM are reported in Table 8. The results illustrate that the local and global information used in our method significantly improves the accuracy of the reference poses, leading to more accurate pose consistency verification.

Table 7
The results of NoClustPose on the ambiguous datasets.

Index  Name          Result    Index  Name         Result
(1)    Books         ✓         (5)    Buildingcar  ×
(1)    Cereal        ✓         (6)    Quad6k       ×
(2)    Street        ×         (7)    Audi         ×
(2)    Temple        ✓         (7)    Palace       ×
(3)    Alex          ×         (8)    Courthouse   ✓
(3)    Berliner_dom  ×         (8)    Luojia       ×
(4)    B3            ×         (8)    Xingzheng    ×
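
For reference, the MST construction that the NoClustPose baseline relies on is the classic Kruskal algorithm (Kruskal, 1956) applied to the view graph. The sketch below is a generic illustration with an assumed edge weight (the inverse of the number of verified matches, so that stronger image pairs are preferred); it does not reproduce the paper's VMST construction on the cluster graph.

```python
def kruskal_mst(num_views, edges):
    # edges: list of (weight, u, v) over view indices; a lower weight means
    # a stronger image pair. Returns the edges of the spanning tree.
    parent = list(range(num_views))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((u, v))
    return mst

# Example with an assumed weighting of 1 / (#inlier matches):
view_edges = [(1 / 250, 0, 1), (1 / 120, 1, 2), (1 / 80, 0, 2)]
print(kruskal_mst(3, view_edges))   # [(0, 1), (1, 2)]
```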

4.5.2. Ablation study of pose consistency verification

To illustrate the effectiveness of the cluster graph in pose consistency verification, we tested NoClustVeri on the ambiguous datasets. The reference poses used in NoClustVeri are the same as those in MA-SfM. The qualitative experimental results of NoClustVeri are listed in Table 9. It succeeds on 9 of the 14 datasets. In contrast, MA-SfM generates correct results on all of these data, as shown in Table 4. This is because MA-SfM performs pose consistency verification on cluster edges instead of directly on view edges. To further explore this improvement, we compared the distributions of rotation errors in NoClustVeri and MA-SfM. In Fig. 16, we present the rotation errors of NoClustVeri and MA-SfM on the Cereal data. Most edges are within an error of no more than 25◦ for both methods. However, the errors of the view graph edges in NoClustVeri are distributed continuously from 0◦ to 180◦, while the rotation errors in MA-SfM have a more favorable variance, so their distribution is more reasonable. In addition, because the error difference between reliable and unreliable edges is large, our method is insensitive to the angle threshold. These experimental results demonstrate that performing pose consistency verification at the cluster level benefits the detection of unreliable edges.
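
As a concrete illustration of the rotation consistency check discussed above, the sketch below flags a view-graph edge as unreliable when its measured relative rotation disagrees with the rotation predicted from the reference (global) poses. It is a simplified rendering of ours under assumed conventions (world-to-camera rotations, with the 25◦ value taken from the discussion of Fig. 16 as an indicative threshold), not the paper's exact procedure, which operates on cluster edges.

```python
import numpy as np

def relative_rotation_error_deg(R_i, R_j, R_ij):
    # R_ij: measured relative rotation of edge (i, j) from two-view geometry;
    # R_j @ R_i.T: relative rotation predicted by the global reference poses.
    dR = R_ij @ (R_j @ R_i.T).T
    cos_theta = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

def split_edges(edges, global_R, angle_thresh_deg=25.0):
    # edges: list of (i, j, R_ij). Returns (reliable, unreliable) edge lists.
    reliable, unreliable = [], []
    for i, j, R_ij in edges:
        err = relative_rotation_error_deg(global_R[i], global_R[j], R_ij)
        (reliable if err <= angle_thresh_deg else unreliable).append((i, j))
    return reliable, unreliable
```

A translation consistency check can be applied analogously by comparing the measured relative translation direction of an edge with the direction implied by the reference camera positions.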


Table 8
The reference pose errors of NoClustPose and MA-SfM (cluster-based). 𝜇(𝑒) represents the mean error, 𝑚(𝑒) the median error and 𝑟(𝑒) the RMSE.

Data      Method       Translation (m)       Rotation (◦)
                       𝜇(𝑒)  𝑚(𝑒)  𝑟(𝑒)      𝜇(𝑒)  𝑚(𝑒)  𝑟(𝑒)
Building  NoClustPose  4.66  3.90  5.23      7.79  6.91  6.91
Building  MA-SfM       0.82  0.74  0.92      1.60  1.39  1.60
Library   NoClustPose  2.81  2.76  2.77      3.64  3.32  3.34
Library   MA-SfM       0.53  0.51  0.51      0.50  0.48  0.49

Table 9
The results of NoClustVeri on the ambiguous datasets.

Index  Name          Result    Index  Name         Result
(1)    Books         ×         (5)    Buildingcar  ✓
(1)    Cereal        ✓         (6)    Quad6k       ×
(2)    Street        ×         (7)    Audi         ✓
(2)    Temple        ✓         (7)    Palace       ×
(3)    Alex          ✓         (8)    Courthouse   ✓
(3)    Berliner_dom  ✓         (8)    Luojia       ✓
(4)    B3            ×         (8)    Xingzheng    ✓

5. Conclusions

Incorrect feature matches between images with similar, repeated textures significantly affect the results of SfM and can cause geometric misalignment. To alleviate this problem, we propose a disambiguation method that removes the incorrect view edges between ambiguous images. We first combine the local and global information of the scene to construct a reliable global pose prior. We then perform pose (rotation and translation) consistency verification to robustly filter out incorrect matches based on this global pose prior. The experimental results illustrate that the proposed method significantly improves the robustness of SfM because it accurately removes incorrect matches while preserving correct ones. In addition, the pose accuracy of SfM is also improved because the proposed method filters out other mismatches that affect the pose optimization. The comparison experiments on various datasets convincingly illustrate that the proposed disambiguation method outperforms the state-of-the-art methods and achieves the best performance.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (No. U22A2009, No. 42271445), the Fundamental Research Funds for the Central Universities, China (No. 2042023kf0174), the State Key Laboratory of Intelligent Vehicle Safety Technology, the Chongqing Technical Innovation and Application Development Special Project (CSTC2021JSCX-DXWTBX0023), the Foundation of Anhui Province Key Laboratory of Physical Geographic Environment (2022PGE008), the Shenzhen Science and Technology Program (JCYJ20230807090201003), and the Japan Society for the Promotion of Science, Japan (No. 23k13419).

References

Carmigniani, J., Furht, B., Anisetti, M., Ceravolo, P., Damiani, E., Ivkovic, M., 2011. Augmented reality technologies, systems and applications. Multimedia Tools Appl. 51, 341–377.
Ceylan, D., Mitra, N.J., Zheng, Y., Pauly, M., 2014. Coupled structure-from-motion and 3D symmetry detection for urban facades. ACM Trans. Graph. 33 (1), 1–15.
Chang, J., Wang, L., Meng, G., Xiang, S., Pan, C., 2017. Deep adaptive image clustering. In: IEEE International Conference on Computer Vision, ICCV.
Chen, Y., Shen, S., Chen, Y., Wang, G., 2020. Graph-based parallel large scale structure from motion. Pattern Recognit. 107, 107537.
Cui, H., Shen, S., Gao, W., Liu, H., Wang, Z., 2019. Efficient and robust large-scale structure-from-motion via track selection and camera prioritization. ISPRS J. Photogramm. Remote Sens. 156, 202–214.
Cui, H., Shi, T., Zhang, J., Xu, P., Meng, Y., Shen, S., 2021. View-graph construction framework for robust and efficient structure-from-motion. Pattern Recognit. 114, 107712.
Dehariya, V.K., Shrivastava, S.K., Jain, R., 2010. Clustering of image data set using k-means and fuzzy k-means algorithms. In: International Conference on Computational Intelligence and Communication Networks.
Geiger, A., Lenz, P., Urtasun, R., 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Gong, Y., Zhou, P., Liu, Y., Dong, H., Li, L., Yao, J., 2023. View-graph key-subset extraction for efficient and robust structure from motion. Photogramm. Rec. 00, 1–45.
Heinly, J., Dunn, E., Frahm, J., 2014. Correcting for duplicate scene structure in sparse 3D reconstruction. In: European Conference on Computer Vision, ECCV.
James, M.R., Robson, S., 2012. Straightforward reconstruction of 3D surfaces and topography with a camera: Accuracy and geoscience application. J. Geophys. Res.: Earth Surf. 117 (F3).
Jiang, N., Tan, P., Cheong, L.F., 2012. Seeing double without confusion: Structure-from-motion in highly ambiguous scenes. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Jiang, S., Jiang, C., Jiang, W., 2020. Efficient structure from motion for large-scale UAV images: A review and a comparison of SfM tools. ISPRS J. Photogramm. Remote Sens. 167, 230–251.
Kataria, R., DeGol, J., Hoiem, D., 2020. Improving structure from motion with reliable resectioning. In: International Conference on 3D Vision, 3DV.
Knapitsch, A., Park, J., Zhou, Q., Koltun, V., 2017. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph. 36 (4), 1–13.
Kruskal, J.B., 1956. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Amer. Math. Soc. 7 (1), 48–50.
Michelini, M., Mayer, H., 2020. Structure from motion for complex image sets. ISPRS J. Photogramm. Remote Sens. 166, 140–152.
Omran, M., Engelbrecht, A.P., Salman, A., 2005. Particle swarm optimization method for image clustering. Int. J. Pattern Recognit. Artif. Intell. 19 (03), 297–321.
Roberts, R., Sinha, S.N., Szeliski, R., Steedly, D., 2011. Structure from motion for scenes with large duplicate structures. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Schonberger, J.L., Frahm, J., 2016. Structure-from-motion revisited. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R., 2006. A comparison and evaluation of multi-view stereo reconstruction algorithms. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Shen, T., Zhu, S., Fang, T., Zhang, R., Quan, L., 2016. Graph-based consistent matching for structure-from-motion. In: European Conference on Computer Vision, ECCV.
Snavely, N., Seitz, S.M., Szeliski, R., 2006. Photo tourism: Exploring photo collections in 3D. In: ACM SIGGRAPH, pp. 835–846.
Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D., 2012. A benchmark for the evaluation of RGB-D SLAM systems. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS.
Sweeney, C., Sattler, T., Hollerer, T., Turk, M., Pollefeys, M., 2015. Optimizing the viewing graph for structure-from-motion. In: IEEE International Conference on Computer Vision, ICCV.
Wang, L., Ge, L., Luo, S., Yan, Z., Cui, Z., Feng, J., 2022. TC-SfM: Robust track-community-based structure-from-motion. arXiv preprint arXiv:2206.05866.
Wang, X., Xiao, T., Gruber, M., Heipke, C., 2019. Robustifying relative orientations with respect to repetitive structures and very short baselines for global SfM. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW.
Wilson, K., Snavely, N., 2013. Network principles for SfM: Disambiguating repeated structures with local context. In: IEEE International Conference on Computer Vision, ICCV.
Yan, Q., Yang, L., Zhang, L., Xiao, C., 2017. Distinguishing the indistinguishable: Exploring structural ambiguities via geodesic context. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Yang, M.D., Chao, C.F., Huang, K.S., Lu, L.Y., Chen, Y.P., 2013. Image-based 3D scene reconstruction and exploration in augmented reality. Autom. Constr. 33, 48–60.
Yang, Y., Xu, D., Nie, F., Yan, S., Zhuang, Y., 2010. Image clustering using local discriminant models and global integration. IEEE Trans. Image Process. 19 (10), 2761–2773.
Yurtsever, E., Lambert, J., Carballo, A., Takeda, K., 2020. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 8, 58443–58469.
Zach, C., Irschara, A., Bischof, H., 2008. What can missing correspondences tell us about 3D structure and motion? In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Zach, C., Klopschitz, M., Pollefeys, M., 2010. Disambiguating visual relations using loop constraints. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Zhu, S., Zhang, R., Zhou, L., Shen, T., Fang, T., Tan, P., Quan, L., 2018. Very large-scale global SfM by distributed motion averaging. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Zhuang, B., Cheong, L.F., Lee, G.H., 2018. Baseline desensitizing in translation averaging. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
