
Spatial AI Network
November 24, 2025

Exploring ways to enhance robustnessof 3D reconstruction using COLMAP

- Neural-network-based 3D reconstruction methods such as VGGT are robust and powerful, but their large VRAM usage constrains the environments they can run in, so we worked on making COLMAP more robust instead
- Using HLoC, we experimentally searched for combinations of feature extractor and feature matcher that enable robust 3D reconstruction
- With the cooperation of a store, we confirmed that RGB camera data captured in the store can be reconstructed in 3D with high accuracy


Transcript

  1. Spatial AI Network study group - 2025/11/4: Exploring ways to enhance robustness of 3D reconstruction using COLMAP. Presenter: Li Dongyue (Deguchi Laboratory, Nagoya University). Mentor: Takeshi Ishita (CyberAgent AI Lab).
  2. Exploring ways to enhance robustness of 3D reconstruction using COLMAP. Agenda: 01 Background / 02 Related works / 03 Analyze the necessary conditions for successful reconstruction / 04 Identify a suitable feature matching method / 05 Future work / 06 Conclusion
  3. Abstract. Motivation: • We want to realize robust reconstruction from images for generating 2D floor maps. Baseline method: • We use COLMAP so that reconstruction can run in a low-computational-resource environment. What we did: • Found the conditions required for successful reconstruction • Found the best performing feature extraction and matching methods that satisfy those conditions • Improved the robustness of COLMAP reconstruction What we achieved: • Robust point cloud reconstruction from actual cleaner robot images
  4. Automatic image collection by a cleaner robot • We are experimenting with automatic image collection in an actual grocery store called EZOHUB TOKYO. • A camera is mounted on a cleaner robot to automatically collect grocery store images. • Much lower cost than installing a dedicated data collection robot, because it utilizes a cleaner robot already operating in the store. A smartphone camera is mounted on the cleaner robot.
  5. We want to create a floor map only from camera images: • Cameras are generally cheaper than LiDAR. • We can navigate customers in a store. • We can add semantic information to the map by recognizing products and objects. • We therefore investigated 3D reconstruction for floor map generation. What we did: Robust 3D reconstruction from cleaner robot images. An example image of the grocery store.
  6. What we achieved: Successfully reconstructed the point cloud from the cleaner robot images by improving the feature matching performance. Shown: reconstruction with DISK + LightGlue (the best performing feature matching method we found) vs. reconstruction with classic SIFT image features.
  7. Neural network based methods. There exist neural-network-based methods such as VGGT, DUSt3R, and MASt3R. • Advantages: They can robustly create dense reconstruction results. • Disadvantages: They require heavy computational resources. • Even with 100 images, the memory demand exceeds what a high-end GPU such as an A100 can handle. (Figures: the pipeline of VGGT; a reconstruction example of VGGT.) [1] Wang, Jianyuan, Chen, Minghao, Karaev, Nikita, Vedaldi, Andrea, et al., VGGT: Visual Geometry Grounded Transformer, CVPR, 2025.
  8. COLMAP [2]. COLMAP is the de facto standard software for 3D reconstruction. Flaws of COLMAP: • Its feature extraction and matching performance is limited ◦ COLMAP uses the classic SIFT feature • We needed to improve the feature matching performance for robust grocery store reconstruction. (Figure: an example of feature extraction and matching in COLMAP.) [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  9. We use HLoC [4] for efficient experiments. We use HLoC as a feature extraction & matching library for improving feature extraction and matching robustness. • HLoC contains several feature extraction and matching methods. • We can easily switch between them and find the best performing one. → HLoC simplifies debugging and performance analysis. [4] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, Marcin Dymczyk, From Coarse to Fine: Robust Hierarchical Localization at Large Scale, CVPR, 2019.
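A minimal sketch of the kind of extractor/matcher sweep that a library like HLoC makes easy. The method names, compatibility rules, and skip list below are illustrative placeholders, not HLoC's actual configuration keys:

```python
# Illustrative registry-style sweep over extractor/matcher pairs.
# Names and compatibility rules are invented for this sketch; they are
# NOT HLoC's real config identifiers.

EXTRACTORS = ["sift", "superpoint", "disk"]
MATCHERS = ["NN-mutual", "NN-ratio", "superglue", "lightglue"]

# Learned matchers are typically trained for specific descriptors
# (illustrative rule, not an exhaustive compatibility table).
COMPATIBLE = {
    "superglue": {"superpoint"},
    "lightglue": {"superpoint", "disk"},
}

def valid_combinations(extractors, matchers, skip=("superglue",)):
    """Enumerate extractor/matcher pairs, dropping incompatible ones
    and any matcher we exclude (e.g. as too computationally demanding)."""
    combos = []
    for m in matchers:
        if m in skip:
            continue
        for e in extractors:
            allowed = COMPATIBLE.get(m)  # None => NN matchers accept any descriptor
            if allowed is None or e in allowed:
                combos.append((e, m))
    return combos

print(valid_combinations(EXTRACTORS, MATCHERS))
```

Sweeping such a registry is how one would mechanically cover every valid combination instead of hand-picking a few.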
  10. How we improved COLMAP [2]. COLMAP consists of two main parts: correspondence search and incremental reconstruction. What we did to improve the performance: • Analyzed the reconstruction pipeline and investigated the key conditions for successful reconstruction. • Investigated the best performing feature extraction & matching methods that satisfy those conditions, improving feature matching robustness with HLoC. [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  11. Investigation of required conditions for successful reconstruction. We investigated the key conditions for successful reconstruction. [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  12. Details of the COLMAP reconstruction pipeline. The incremental reconstruction process of COLMAP has four main steps: • Initialization: create initial landmarks from keypoint matches of two cameras. • Image registration: associate new camera keypoints with existing landmarks, then estimate the camera pose from the correspondences. • Triangulation: add new landmarks from new keypoint matches. • Bundle adjustment: refine camera poses and 3D points at once to minimize reprojection error. (Figure: illustration of incremental reconstruction.)
  13. Investigation of the conditions for successful reconstruction. Method: 1. Construct two extremely challenging datasets. 2. Analyze the reconstruction results on these two datasets. We found four key conditions for successful reconstruction: 1. Sufficient 2D-2D inlier matches 2. Sufficient triangulation angles 3. A sufficient number of 3D landmarks 4. Geometrically accurate, sufficient 2D-3D correspondences
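The four conditions can be made concrete as explicit checks on reconstruction statistics. This is only a sketch: the threshold values and the `stats` dictionary format are arbitrary placeholders invented here, not values from the talk or from COLMAP:

```python
# Hypothetical checker for the four success conditions.
# All thresholds are illustrative placeholders, not tuned values.

DEFAULT_THRESHOLDS = {
    "min_inlier_matches": 100,       # 2D-2D inlier matches
    "min_triangulation_deg": 4.0,    # triangulation angle in degrees
    "min_landmarks": 1000,           # 3D landmarks available for registration
    "min_2d3d_correspondences": 30,  # 2D-3D matches for pose estimation
}

def failed_conditions(stats, thresholds=DEFAULT_THRESHOLDS):
    """Return the names of the success conditions that `stats` violates."""
    checks = {
        "2D-2D inlier matches":
            stats["inlier_matches"] >= thresholds["min_inlier_matches"],
        "triangulation angle":
            stats["triangulation_deg"] >= thresholds["min_triangulation_deg"],
        "3D landmarks":
            stats["landmarks"] >= thresholds["min_landmarks"],
        "2D-3D correspondences":
            stats["correspondences_2d3d"] >= thresholds["min_2d3d_correspondences"],
    }
    return [name for name, ok in checks.items() if not ok]
```

For example, a small-baseline scene may pass every check except the triangulation angle, which is exactly the failure mode described on the next slides.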
  14. Creation of a small-baseline, hard-illumination dataset. To find the necessary conditions for successful reconstruction, we created a dataset with small baselines and difficult illumination conditions.
  15. Matching visualization. Keypoints are partially matched even under these extreme conditions, but the viewpoint angles are insufficient because of the small baselines.
  16. Found key conditions for initial image selection. Reconstruction failed because COLMAP could not find an initial image pair. Through the reconstruction experiment on dataset 1, we found two conditions for successful initial image selection: • Sufficient 2D-2D inlier matches • Sufficient triangulation angles If COLMAP cannot find a good initial image pair, the overall reconstruction fails.
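To make the triangulation-angle condition concrete: for a 3D point, it is the angle at the point between the rays back to the two camera centers, and a small baseline relative to scene depth yields a tiny angle and poorly constrained depth. A minimal sketch (camera centers and the point as plain 3D tuples; an assumed toy geometry, not data from the talk):

```python
import math

def triangulation_angle_deg(cam1, cam2, point):
    """Angle (degrees) at `point` between the rays toward the two camera centers."""
    r1 = [c - p for c, p in zip(cam1, point)]
    r2 = [c - p for c, p in zip(cam2, point)]
    dot = sum(a * b for a, b in zip(r1, r2))
    n1 = math.sqrt(sum(a * a for a in r1))
    n2 = math.sqrt(sum(a * a for a in r2))
    cos_a = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp for numerical safety
    return math.degrees(math.acos(cos_a))

# A small baseline (cameras 0.1 apart, point 10 away) gives a tiny angle:
print(triangulation_angle_deg((0, 0, 0), (0.1, 0, 0), (0, 0, 10)))
# A much larger baseline (5 apart) gives a far wider angle:
print(triangulation_angle_deg((0, 0, 0), (5, 0, 0), (0, 0, 10)))
```

This is why the small-baseline dataset failed at initialization even though some keypoints matched: the matches existed, but the angles they subtended were too small.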
  17. Creation of a very-large-baseline dataset. We created a dataset with very large baselines and little scene overlap.
  18. Found key conditions for image registration. Reconstruction failed at the image registration step. We found two more conditions for successful reconstruction: • A sufficient number of 3D landmarks • Geometrically accurate, sufficient 2D-3D correspondences
  19. Summary of required conditions for successful reconstruction: • Sufficient 2D-2D inlier matches • Sufficient triangulation angles • A sufficient number of 3D landmarks • Geometrically accurate, sufficient 2D-3D correspondences To satisfy these conditions, we need: • Input images with large overlaps and sufficiently large viewpoint angles → Almost uncontrollable, because we cannot control the cleaner robot • A large number of geometrically accurate keypoint matches → Controllable: we can replace the feature extraction and matching algorithm. → We therefore tried to identify a feature extraction and matching method that generates a large number of accurate matches.
  20. Improve robustness of feature matching. We improved feature matching robustness with HLoC. [2] Schonberger, Johannes Lutz and Frahm, Jan-Michael, Structure-from-Motion Revisited, CVPR, 2016.
  21. How we improved the robustness • Investigated the most robust feature extraction and matching method • Ran experiments on various grocery store datasets to find the best method ◦ Handheld camera dataset ◦ Actual cleaner robot dataset We replaced the feature extraction and matching method in each experiment.
  22. Dataset 1: Handheld dataset (10 sequences) • Images captured in EZOHUB TOKYO • Handheld RGB images captured with a RealSense D455i • Created 10 small image sets ◦ Each contains 100 images • Contains both easy and difficult cases ◦ Difficult cases have little texture (Figure labels: easy vs. difficult — viewpoint variation, less texture.)
  23. Dataset 2: Actual cleaner robot dataset (20 sequences) • RGB images captured by a D455i mounted on a cleaner robot (raw stream is 30 FPS) • We subsampled the video into 1 FPS images • For detailed investigation, we divided one cleaning session into 20 subsequences
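The 30 FPS → 1 FPS subsampling step amounts to keeping one frame per second. A minimal sketch on frame indices (a real pipeline would read video timestamps rather than assume a perfectly constant frame rate):

```python
def subsample(frames, src_fps=30, dst_fps=1):
    """Keep every (src_fps // dst_fps)-th frame, e.g. 1 of every 30 at 30->1 FPS."""
    step = src_fps // dst_fps
    return frames[::step]

frames = list(range(90))     # 3 seconds' worth of 30 FPS frame indices
print(subsample(frames))     # -> [0, 30, 60]
```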
  24. Searching for the best option among various features + matchers. We applied the feature extractors and matchers contained in HLoC: • Used the 21 available combinations of extractors and matchers • Ignored incompatible extractor & matcher combinations • Ignored SuperGlue because it is computationally demanding Explanation of the NN family (nearest neighbor search): • NN-mutual: mutual nearest neighbor matching • NN-ratio: nearest neighbor matching with ratio filtering
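The two NN-family matchers named above can be sketched on toy 2D descriptors. This is a minimal illustration of the ideas (mutual nearest neighbors, and Lowe-style ratio filtering), not the vectorized implementation a real system would use:

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(d, descs):
    """Index of the descriptor in `descs` closest to `d`."""
    return min(range(len(descs)), key=lambda j: dist(d, descs[j]))

def nn_mutual(descs_a, descs_b):
    """Keep (i, j) only if j is i's nearest neighbor AND i is j's."""
    fwd = {i: nearest(d, descs_b) for i, d in enumerate(descs_a)}
    bwd = {j: nearest(d, descs_a) for j, d in enumerate(descs_b)}
    return [(i, j) for i, j in fwd.items() if bwd[j] == i]

def nn_ratio(descs_a, descs_b, ratio=0.8):
    """Ratio test: accept a match only if the best distance is clearly
    smaller than the second-best, dropping ambiguous matches."""
    matches = []
    for i, d in enumerate(descs_a):
        ds = sorted((dist(d, e), j) for j, e in enumerate(descs_b))
        if len(ds) >= 2 and ds[0][0] < ratio * ds[1][0]:
            matches.append((i, ds[0][1]))
    return matches
```

Mutual checking and ratio filtering attack the same problem (rejecting spurious nearest neighbors) from different directions, which is why both appear as separate options in the comparison.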
  25. The metrics we used for evaluation: • Registration rate → Good metric for success / failure of reconstruction. The ratio of correctly registered images to input images. • Number of reconstructed 3D points → Important for robust image registration. The more 3D landmarks the point cloud has, the more robust the pose estimation. • Track length → Indicates the strength of the image descriptor and matcher. A higher mean track length indicates that the same 3D points are tracked across more camera views, which generally suggests a more robust reconstruction with better triangulation. • Reprojection error → Geometric accuracy of the reconstruction result. The image-space distance between a projected point and the measured one; a smaller reprojection error denotes a more accurate reconstruction.
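The four metrics can be computed from a per-scene reconstruction summary. The input format below (a list of per-landmark track lengths plus squared reprojection errors) is invented for illustration; real tools export these statistics in their own formats:

```python
def evaluate(n_input, registered, tracks, sq_errors):
    """Toy evaluation summary.
    tracks: per-landmark observation counts (track lengths).
    sq_errors: squared reprojection errors in pixels^2."""
    return {
        "registration_rate": registered / n_input,
        "num_points": len(tracks),
        "mean_track_length": sum(tracks) / len(tracks),
        "rmse_reproj_px": (sum(sq_errors) / len(sq_errors)) ** 0.5,
    }

stats = evaluate(n_input=100, registered=97,
                 tracks=[2, 3, 5, 4], sq_errors=[0.25, 1.0, 0.25, 0.5])
print(stats)
```

Note that the first three metrics reward robustness (more images registered, more and longer tracks) while the last rewards accuracy, which is the trade-off the cleaner-robot results hinge on.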
  26. Experimental results for the Handheld Dataset (Dataset 1). Comparison of the registration rate and the number of reconstructed 3D points. DISK+LightGlue produces the densest point cloud and a 100% registration rate. → Effective for camera registration and robust reconstruction.
  27. Experimental results for the Handheld Dataset (Dataset 1). Comparison of the track length and the reprojection error. DISK+LightGlue achieves the longest track length. → Robustly tracks 2D keypoints and improves map consistency.
  28. Experimental results for the Cleaner Robot Dataset (Dataset 2). Comparison of the registration rate and the number of reconstructed 3D points. DISK+LightGlue produces the densest point cloud. → Effective for camera registration and robust reconstruction.
  29. Experimental results for the Cleaner Robot Dataset (Dataset 2). Comparison of the track length and the reprojection error. We prioritize robust, successful reconstruction over highly accurate reconstruction: DISK+LightGlue has a larger reprojection error, but the registration rate and the number of reconstructed points are more important.
  30. Visualization results: SIFT + NN-mutual, DISK + LightGlue, and SuperPoint_InLoc + LightGlue. Visualizations on small image set 02 are presented. The 3D point size is kept the same for fair comparison.
  31. A discussion of the strong performance of DISK + LightGlue. → DISK+LightGlue had the best performance in terms of dense keypoint extraction and matching. About the LightGlue feature matcher: compared to matching on classic descriptors like SIFT, LightGlue has two main advantages. 1. LightGlue produces more globally aware features, while the classic SIFT descriptor captures only very local context; its descriptors are therefore more discriminative. 2. Classic descriptors use only the information in a single image, whereas LightGlue extracts descriptive information through cross-image interaction. About the DISK feature extractor: DISK is designed to obtain many keypoints from images, allowing dense reconstruction results. It formulates local feature matching as a reinforcement learning problem, where the objective is to maximize the number of correct matches.
  32. Summary of contributions • Found the four required conditions for successful reconstruction from the COLMAP source code and reconstruction behavior: ◦ Sufficient 2D-2D inlier matches ◦ Sufficient triangulation angles ◦ A sufficient number of 3D landmarks ◦ Geometrically accurate, sufficient 2D-3D correspondences • Found, through experiments, the best performing feature extraction and matching method that satisfies the conditions above. • Achieved robust reconstruction from actual cleaner robot images for floor map generation.
  33. Experiments on the full dataset. Comparison of reconstructed maps: SIFT + NN-mutual vs. DISK + LightGlue. We can robustly reconstruct the point cloud for scenes where the ordinary SIFT feature fails.
  34. Future work • Realize high-quality reconstruction at high speed. • Run the reconstruction process using only low computational resources, such as CPUs. • Create a floor map from low-cost sensor observations, such as a web camera.
  35. Conclusion • Successful reconstruction requires four key conditions: ◦ Sufficient 2D-2D inlier matches ◦ Sufficient triangulation angles ◦ A sufficient number of 3D landmarks ◦ Geometrically accurate, sufficient 2D-3D correspondences • DISK+LightGlue was identified as the most robust combination after testing 21 feature and matcher combinations. • When applied to a real-world cleaner robot dataset, DISK+LightGlue ensured successful reconstruction.