
Spatial AI Network
November 24, 2025

Exploring ways to enhance robustnessof 3D reconstruction using COLMAP

- Neural-network-based 3D reconstruction methods such as VGGT are robust and powerful, but their large VRAM usage constrains the environments they can run in, so we worked on making COLMAP more robust instead
- Using HLoC, we experimentally searched for combinations of feature extractor and feature matcher that enable robust 3D reconstruction
- With the cooperation of a store, we confirmed that RGB camera data captured in the store can be reconstructed in 3D with high accuracy


Transcript

  1. Spatial AI Network study group - 2025/11/4: Exploring ways to enhance robustness of 3D reconstruction using COLMAP. Presenter: Li Dongyue (Deguchi Laboratory, Nagoya University). Mentor: Takeshi Ishita (CyberAgent AI Lab).
  2. Exploring ways to enhance robustness of 3D reconstruction using COLMAP. Agenda: 01 Background / 02 Related works / 03 Analyze the necessary conditions for successful reconstruction / 04 Identify a suitable feature matching method / 05 Future work / 06 Conclusion
  3. Abstract. Motivation: • We want to realize robust reconstruction from images for generating 2D floor maps. Baseline method: • We use COLMAP so that reconstruction can run in a low-computational-resource environment. What we did: • Found the conditions required for successful reconstruction • Found the best performing feature extraction and matching methods that satisfy those conditions • Improved the robustness of COLMAP reconstruction What we achieved: • Robust point cloud reconstruction from actual cleaner robot images
  4. Automatic image collection by a cleaner robot • We are experimenting with automatic image collection in an actual grocery store called EZOHUB TOKYO. • A camera is mounted on a cleaner robot to automatically collect grocery store images. • Much lower cost than installing a dedicated data collection robot, because it utilizes a cleaner robot already operating in the store. A smartphone camera is mounted on the cleaner robot.
  5. We want to create a floor map only from camera images: • Cameras are generally cheaper than LiDAR. • We can navigate customers in a store. • We can add semantic information to the map by recognizing products and objects. • We therefore investigated 3D reconstruction for floor map generation. What we did: Robust 3D reconstruction from cleaner robot images. An example image of the grocery store.
  6. What we achieved: Successfully reconstructed the point cloud from the cleaner robot images by improving the feature matching performance. Shown: reconstruction with DISK + LightGlue (the best performing feature matching method we found) vs. reconstruction with classic SIFT image features.
  7. Neural network based methods. There exist neural-network-based methods such as VGGT, DUSt3R, and MASt3R. • Advantages: They can robustly create dense reconstruction results. • Disadvantages: They require heavy computational resources. • Even with 100 images, the memory demand exceeds what a high-end GPU such as an A100 can handle. (Figures: the pipeline of VGGT; a reconstruction example of VGGT.) [1] Wang, Jianyuan, Chen, Minghao, Karaev, Nikita, Vedaldi, Andrea, et al., VGGT: Visual Geometry Grounded Transformer, CVPR, 2025.
  8. COLMAP [2]. COLMAP is the de facto standard software for 3D reconstruction. Flaws of COLMAP: • Its feature extraction and matching performance is limited ◦ COLMAP uses the classic SIFT feature • We needed to improve the feature matching performance for robust grocery store reconstruction. (Figure: an example of feature extraction and matching in COLMAP.) [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  9. We use HLoC [4] for efficient experiments. We use HLoC as a feature extraction & matching library for improving feature extraction and matching robustness. • HLoC contains several feature extraction and matching methods. • We can easily switch between them and find the best performing one. → HLoC simplifies debugging and performance analysis. [4] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, Marcin Dymczyk, From Coarse to Fine: Robust Hierarchical Localization at Large Scale, CVPR, 2019.
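A minimal sketch of the kind of extractor/matcher sweep that a library like HLoC makes easy. The method names, compatibility rules, and skip list below are illustrative placeholders, not HLoC's actual configuration keys:

```python
# Illustrative registry-style sweep over extractor/matcher pairs.
# Names and compatibility rules are invented for this sketch; they are
# NOT HLoC's real config identifiers.

EXTRACTORS = ["sift", "superpoint", "disk"]
MATCHERS = ["NN-mutual", "NN-ratio", "superglue", "lightglue"]

# Learned matchers are typically trained for specific descriptors
# (illustrative rule, not an exhaustive compatibility table).
COMPATIBLE = {
    "superglue": {"superpoint"},
    "lightglue": {"superpoint", "disk"},
}

def valid_combinations(extractors, matchers, skip=("superglue",)):
    """Enumerate extractor/matcher pairs, dropping incompatible ones
    and any matcher we exclude (e.g. as too computationally demanding)."""
    combos = []
    for m in matchers:
        if m in skip:
            continue
        for e in extractors:
            allowed = COMPATIBLE.get(m)  # None => NN matchers accept any descriptor
            if allowed is None or e in allowed:
                combos.append((e, m))
    return combos

print(valid_combinations(EXTRACTORS, MATCHERS))
```

Sweeping such a registry is how one would mechanically cover every valid combination instead of hand-picking a few.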
  10. How we improved COLMAP [2]. COLMAP consists of two main parts: correspondence search and incremental reconstruction. What we did to improve the performance: • Analyzed the reconstruction pipeline and investigated the key conditions for successful reconstruction. • Investigated the best performing feature extraction & matching methods that satisfy those conditions, improving feature matching robustness with HLoC. [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  11. Investigation of required conditions for successful reconstruction. We investigated the key conditions for successful reconstruction. [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  12. Details of the COLMAP reconstruction pipeline. The incremental reconstruction process of COLMAP has four main steps: • Initialization: create initial landmarks from keypoint matches of two cameras. • Image registration: associate new camera keypoints with existing landmarks, then estimate the camera pose from the correspondences. • Triangulation: add new landmarks from new keypoint matches. • Bundle adjustment: refine camera poses and 3D points at once to minimize reprojection error. (Figure: illustration of incremental reconstruction.)
  13. Investigation of the conditions for successful reconstruction. Method: 1. Construct two extremely challenging datasets. 2. Analyze the reconstruction results on these two datasets. We found four key conditions for successful reconstruction: 1. Sufficient 2D-2D inlier matches 2. Sufficient triangulation angles 3. A sufficient number of 3D landmarks 4. Geometrically accurate, sufficient 2D-3D correspondences
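The four conditions can be made concrete as explicit checks on reconstruction statistics. This is only a sketch: the threshold values and the `stats` dictionary format are arbitrary placeholders invented here, not values from the talk or from COLMAP:

```python
# Hypothetical checker for the four success conditions.
# All thresholds are illustrative placeholders, not tuned values.

DEFAULT_THRESHOLDS = {
    "min_inlier_matches": 100,       # 2D-2D inlier matches
    "min_triangulation_deg": 4.0,    # triangulation angle in degrees
    "min_landmarks": 1000,           # 3D landmarks available for registration
    "min_2d3d_correspondences": 30,  # 2D-3D matches for pose estimation
}

def failed_conditions(stats, thresholds=DEFAULT_THRESHOLDS):
    """Return the names of the success conditions that `stats` violates."""
    checks = {
        "2D-2D inlier matches":
            stats["inlier_matches"] >= thresholds["min_inlier_matches"],
        "triangulation angle":
            stats["triangulation_deg"] >= thresholds["min_triangulation_deg"],
        "3D landmarks":
            stats["landmarks"] >= thresholds["min_landmarks"],
        "2D-3D correspondences":
            stats["correspondences_2d3d"] >= thresholds["min_2d3d_correspondences"],
    }
    return [name for name, ok in checks.items() if not ok]
```

For example, a small-baseline scene may pass every check except the triangulation angle, which is exactly the failure mode described on the next slides.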
  14. Creation of a small-baseline, hard-illumination dataset. To find the necessary conditions for successful reconstruction, we created a dataset with small baselines and difficult illumination conditions.
  15. Matching visualization. Keypoints are partially matched even under these extreme conditions, but the viewpoint angles are insufficient because of the small baselines.
  16. Found key conditions for initial image selection. Reconstruction failed because COLMAP could not find an initial image pair. Through the reconstruction experiment on dataset 1, we found two conditions for successful initial image selection: • Sufficient 2D-2D inlier matches • Sufficient triangulation angles If COLMAP cannot find a good initial image pair, the overall reconstruction fails.
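To make the triangulation-angle condition concrete: for a 3D point, it is the angle at the point between the rays back to the two camera centers, and a small baseline relative to scene depth yields a tiny angle and poorly constrained depth. A minimal sketch (camera centers and the point as plain 3D tuples; an assumed toy geometry, not data from the talk):

```python
import math

def triangulation_angle_deg(cam1, cam2, point):
    """Angle (degrees) at `point` between the rays toward the two camera centers."""
    r1 = [c - p for c, p in zip(cam1, point)]
    r2 = [c - p for c, p in zip(cam2, point)]
    dot = sum(a * b for a, b in zip(r1, r2))
    n1 = math.sqrt(sum(a * a for a in r1))
    n2 = math.sqrt(sum(a * a for a in r2))
    cos_a = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp for numerical safety
    return math.degrees(math.acos(cos_a))

# A small baseline (cameras 0.1 apart, point 10 away) gives a tiny angle:
print(triangulation_angle_deg((0, 0, 0), (0.1, 0, 0), (0, 0, 10)))
# A much larger baseline (5 apart) gives a far wider angle:
print(triangulation_angle_deg((0, 0, 0), (5, 0, 0), (0, 0, 10)))
```

This is why the small-baseline dataset failed at initialization even though some keypoints matched: the matches existed, but the angles they subtended were too small.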
  17. Creation of a very-large-baseline dataset. We created a dataset with very large baselines and little scene overlap.
  18. Found key conditions for image registration. Reconstruction failed at the image registration step. We found two more conditions for successful reconstruction: • A sufficient number of 3D landmarks • Geometrically accurate, sufficient 2D-3D correspondences
  19. Summary of required conditions for successful reconstruction: • Sufficient 2D-2D inlier matches • Sufficient triangulation angles • A sufficient number of 3D landmarks • Geometrically accurate, sufficient 2D-3D correspondences To satisfy these conditions, we need: • Input images with large overlaps and sufficiently large viewpoint angles → Almost uncontrollable, because we cannot control the cleaner robot • A large number of geometrically accurate keypoint matches → Controllable: we can replace the feature extraction and matching algorithm. → We therefore tried to identify a feature extraction and matching method that generates a large number of accurate matches.
  20. Improve robustness of feature matching. We improved feature matching robustness with HLoC. [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  21. How we improved the robustness • Investigated the most robust feature extraction and matching method • Ran experiments on various grocery store datasets to find the best method ◦ Handheld camera dataset ◦ Actual cleaner robot dataset We replaced the feature extraction and matching method in each experiment.
  22. Dataset 1: Handheld dataset (10 sequences) • Images captured in EZOHUB TOKYO • Handheld RGB images captured with a RealSense D455i • Created 10 small image sets ◦ Each contains 100 images • Contains both easy and difficult cases ◦ Difficult cases have little texture (Figure labels: easy vs. difficult — viewpoint variation, less texture.)
  23. Dataset 2: Actual cleaner robot dataset (20 sequences) • RGB images captured by a D455i mounted on a cleaner robot (raw stream is 30 FPS) • We subsampled the video into 1 FPS images • For detailed investigation, we divided one cleaning session into 20 subsequences
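The 30 FPS → 1 FPS subsampling step amounts to keeping one frame per second. A minimal sketch on frame indices (a real pipeline would read video timestamps rather than assume a perfectly constant frame rate):

```python
def subsample(frames, src_fps=30, dst_fps=1):
    """Keep every (src_fps // dst_fps)-th frame, e.g. 1 of every 30 at 30->1 FPS."""
    step = src_fps // dst_fps
    return frames[::step]

frames = list(range(90))     # 3 seconds' worth of 30 FPS frame indices
print(subsample(frames))     # -> [0, 30, 60]
```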
  24. Searching for the best option among various features + matchers. We applied the feature extractors and matchers contained in HLoC: • Used the 21 available combinations of extractors and matchers • Ignored incompatible extractor & matcher combinations • Ignored SuperGlue because it is computationally demanding Explanation of the NN family (nearest neighbor search): • NN-mutual: mutual nearest neighbor matching • NN-ratio: nearest neighbor matching with ratio filtering
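The two NN-family matchers named above can be sketched on toy 2D descriptors. This is a minimal illustration of the ideas (mutual nearest neighbors, and Lowe-style ratio filtering), not the vectorized implementation a real system would use:

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(d, descs):
    """Index of the descriptor in `descs` closest to `d`."""
    return min(range(len(descs)), key=lambda j: dist(d, descs[j]))

def nn_mutual(descs_a, descs_b):
    """Keep (i, j) only if j is i's nearest neighbor AND i is j's."""
    fwd = {i: nearest(d, descs_b) for i, d in enumerate(descs_a)}
    bwd = {j: nearest(d, descs_a) for j, d in enumerate(descs_b)}
    return [(i, j) for i, j in fwd.items() if bwd[j] == i]

def nn_ratio(descs_a, descs_b, ratio=0.8):
    """Ratio test: accept a match only if the best distance is clearly
    smaller than the second-best, dropping ambiguous matches."""
    matches = []
    for i, d in enumerate(descs_a):
        ds = sorted((dist(d, e), j) for j, e in enumerate(descs_b))
        if len(ds) >= 2 and ds[0][0] < ratio * ds[1][0]:
            matches.append((i, ds[0][1]))
    return matches
```

Mutual checking and ratio filtering attack the same problem (rejecting spurious nearest neighbors) from different directions, which is why both appear as separate options in the comparison.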
  25. The metrics we used for evaluation: • Registration rate → Good metric for success / failure of reconstruction. The ratio of correctly registered images to input images. • Number of reconstructed 3D points → Important for robust image registration. The more 3D landmarks the point cloud has, the more robust the pose estimation. • Track length → Indicates the strength of the image descriptor and matcher. A higher mean track length indicates that the same 3D points are tracked across more camera views, which generally suggests a more robust reconstruction with better triangulation. • Reprojection error → Geometric accuracy of the reconstruction result. The image-space distance between a projected point and the measured one; a smaller reprojection error denotes a more accurate reconstruction.
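The four metrics can be computed from a per-scene reconstruction summary. The input format below (a list of per-landmark track lengths plus squared reprojection errors) is invented for illustration; real tools export these statistics in their own formats:

```python
def evaluate(n_input, registered, tracks, sq_errors):
    """Toy evaluation summary.
    tracks: per-landmark observation counts (track lengths).
    sq_errors: squared reprojection errors in pixels^2."""
    return {
        "registration_rate": registered / n_input,
        "num_points": len(tracks),
        "mean_track_length": sum(tracks) / len(tracks),
        "rmse_reproj_px": (sum(sq_errors) / len(sq_errors)) ** 0.5,
    }

stats = evaluate(n_input=100, registered=97,
                 tracks=[2, 3, 5, 4], sq_errors=[0.25, 1.0, 0.25, 0.5])
print(stats)
```

Note that the first three metrics reward robustness (more images registered, more and longer tracks) while the last rewards accuracy, which is the trade-off the cleaner-robot results hinge on.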
  26. Experimental results for the Handheld Dataset (Dataset 1). Comparison of the registration rate and the number of reconstructed 3D points. DISK+LightGlue produces the densest point cloud and a 100% registration rate. → Effective for camera registration and robust reconstruction.
  27. Experimental results for the Handheld Dataset (Dataset 1). Comparison of the track length and the reprojection error. DISK+LightGlue achieves the longest track length. → Robustly tracks 2D keypoints and improves map consistency.
  28. Experimental results for the Cleaner Robot Dataset (Dataset 2). Comparison of the registration rate and the number of reconstructed 3D points. DISK+LightGlue produces the densest point cloud. → Effective for camera registration and robust reconstruction.
  29. Experimental results for the Cleaner Robot Dataset (Dataset 2). Comparison of the track length and the reprojection error. We prioritize robust, successful reconstruction over highly accurate reconstruction: DISK+LightGlue has a larger reprojection error, but the registration rate and the number of reconstructed points are more important.
  30. Visualization results: SIFT + NN-mutual, DISK + LightGlue, and SuperPoint_InLoc + LightGlue. Visualizations on small image set 02 are presented. The 3D point size is kept the same for fair comparison.
  31. A discussion of the strong performance of DISK + LightGlue. → DISK+LightGlue had the best performance in terms of dense keypoint extraction and matching. About the LightGlue feature matcher: compared to matching on classic descriptors like SIFT, LightGlue has two main advantages. 1. LightGlue produces more globally aware features, while the classic SIFT descriptor captures only very local context; its descriptors are therefore more discriminative. 2. Classic descriptors use only the information in a single image, whereas LightGlue extracts descriptive information through cross-image interaction. About the DISK feature extractor: DISK is designed to obtain many keypoints from images, allowing dense reconstruction results. It formulates local feature matching as a reinforcement learning problem, where the objective is to maximize the number of correct matches.
  32. Summary of contributions • Found the four required conditions for successful reconstruction from the COLMAP source code and reconstruction behavior: ◦ Sufficient 2D-2D inlier matches ◦ Sufficient triangulation angles ◦ A sufficient number of 3D landmarks ◦ Geometrically accurate, sufficient 2D-3D correspondences • Found, through experiments, the best performing feature extraction and matching method that satisfies the conditions above. • Achieved robust reconstruction from actual cleaner robot images for floor map generation.
  33. Experiments on the full dataset. Comparison of reconstructed maps: SIFT + NN-mutual vs. DISK + LightGlue. We can robustly reconstruct the point cloud for scenes where the ordinary SIFT feature fails.
  34. Future work • Realize high-quality reconstruction at high speed. • Run the reconstruction process using only low computational resources, such as CPUs. • Create a floor map from low-cost sensor observations, such as a web camera.
  35. Conclusion • Successful reconstruction requires four key conditions: ◦ Sufficient 2D-2D inlier matches ◦ Sufficient triangulation angles ◦ A sufficient number of 3D landmarks ◦ Geometrically accurate, sufficient 2D-3D correspondences • DISK+LightGlue was identified as the most robust combination after testing 21 feature and matcher combinations. • When applied to a real-world cleaner robot dataset, DISK+LightGlue ensured successful reconstruction.