GSVisLoc: Generalizable Visual Localization for Gaussian Splatting Scene Representations

* Equal contribution
1Weizmann Institute of Science, 2NVIDIA Research
ICCVW 2025 (CALIPOSE workshop)
GSVisLoc teaser image

Given an RGB query image and a 3D Gaussian Splatting (3DGS) scene, GSVisLoc estimates the camera pose by matching 2D features with 3D Gaussian descriptors, followed by a 3DGS-based pose refinement step.

Abstract

We present GSVisLoc, a visual localization method tailored to 3D Gaussian Splatting (3DGS) scene representations. Given a 3DGS model and a query image, we estimate the camera pose via coarse-to-fine 3D–2D matching between encoded 3D Gaussian regions and multi-scale 2D image features, followed by PnP+RANSAC and a 3DGS-based pose refinement. GSVisLoc requires no modification or retraining of the 3DGS model and discards reference images after training. It achieves state-of-the-art results among 3DGS-based methods and competitive accuracy on standard indoor and outdoor benchmarks, generalizing to novel scenes without additional training.

GSVisLoc model pipeline
  • 3DGS Encoder: Encodes local Gaussian neighborhoods using a KPConv-based architecture, producing compact region-level descriptors that capture geometry, appearance, and opacity information.
  • 2D Image Encoder: Extracts hierarchical image features at coarse and fine resolutions with a ConvFormer backbone, enabling both global context and pixel-level precision.
  • 3D–2D Alignment: Interleaved self- and cross-attention layers align 3D and 2D features, establishing coarse correspondences between scene regions and image patches.
  • Fine Matching: Local refinement modules predict dense matching heatmaps, improving correspondence accuracy to the sub-pixel level.
  • Pose Estimation and Refinement: Camera pose is recovered via PnP + RANSAC from the predicted matches, followed by a 3DGS-based pose refinement step to ensure metric consistency with the Gaussian scene.
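The fine-matching step above turns dense heatmaps into sub-pixel correspondences. The exact module is described in the paper; as an illustrative sketch only, a common way to extract a sub-pixel peak from such a heatmap is a soft-argmax (the function name and temperature parameter here are our own, not from GSVisLoc):

```python
import numpy as np

def soft_argmax_2d(heatmap, temperature=1.0):
    """Sub-pixel peak extraction from a matching heatmap via soft-argmax.

    Illustrative sketch, not the paper's exact module: the heatmap is
    softmax-normalized and the expected (x, y) coordinate is returned,
    which lands between pixels when the peak mass is shared.
    """
    h, w = heatmap.shape
    logits = heatmap.reshape(-1) / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    x = float((probs * xs.reshape(-1)).sum())
    y = float((probs * ys.reshape(-1)).sum())
    return x, y

# A peak shared equally by pixels x=3 and x=4 in row y=5 yields a
# sub-pixel estimate near (3.5, 5.0).
hm = np.zeros((8, 8))
hm[5, 3] = hm[5, 4] = 10.0
x, y = soft_argmax_2d(hm)
```

Because the soft-argmax is differentiable, this style of peak extraction also lets a matching loss supervise the heatmap end to end.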

Generalization on ScanNet++

GSVisLoc is trained on multiple ScanNet++ scenes and evaluated on novel test scenes to assess cross-scene generalization. We report the median translation (cm) / rotation (°) errors per scene. Bold indicates the best result, and underline indicates the second best. “Avg. Med ↓” is the average of the per-scene medians (lower is better).

Scene         GSplatLoc          GSVisLoc (Ours)    GSVisLoc (Ours)
              (single-scene)     (single-scene)     (cross-scene)
ebc200e928     2.02 / 0.76        0.27 / 0.15        0.33 / 0.22
2a496183e1     1.29 / 0.31        0.34 / 0.12        0.22 / 0.11
b26e64c4b0     0.79 / 0.20        0.27 / 0.09        0.32 / 0.10
9b74afd2d2     6.18 / 1.29        0.26 / 0.14        0.35 / 0.13
bc400d86e1    17.57 / 2.09        0.52 / 0.16        0.62 / 0.36
1204e08f17     1.96 / 0.53        0.39 / 0.10        0.65 / 0.13
f8f12e4e6b     3.69 / 0.83        0.39 / 0.13        0.34 / 0.13
52599ae063    21.95 / 5.54        2.31 / 0.29        0.82 / 0.18
94ee15e8ba     1.92 / 0.29        0.33 / 0.13        0.30 / 0.09
9f139a318d     1.19 / 0.22        0.24 / 0.05        0.22 / 0.07
30f4a2b44d     4.79 / 0.74        0.44 / 0.18        0.46 / 0.16
37ea1c52f0     3.59 / 0.50        0.49 / 0.10        0.60 / 0.10
480ddaadc0     1.51 / 0.37        0.28 / 0.10        0.39 / 0.12
a24f64f7fb     1.08 / 0.38        0.54 / 0.14        0.46 / 0.13
ab11145646     8.64 / 0.88        0.38 / 0.13        0.95 / 0.21
Avg. Med ↓     5.21 / 0.99        0.50 / 0.13        0.47 / 0.15
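The “Avg. Med ↓” row is simply the unweighted mean of the per-scene medians. For example, for GSplatLoc's translation column:

```python
# Per-scene median translation errors (cm) for GSplatLoc, from the table above.
gsplatloc_trans = [2.02, 1.29, 0.79, 6.18, 17.57, 1.96, 3.69, 21.95,
                   1.92, 1.19, 4.79, 3.59, 1.51, 1.08, 8.64]

# Average of per-scene medians, matching the 5.21 reported in the table.
avg_med = sum(gsplatloc_trans) / len(gsplatloc_trans)
```

Note that averaging per-scene medians weights every scene equally, regardless of how many query images each scene contains.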

Qualitative Results

Each slide shows qualitative results for GSVisLoc on unseen ScanNet++ scenes: two training images (used to construct the 3DGS model), a query image, and the view rendered from the estimated pose. The query images come from scenes not seen during training, i.e., they are out-of-distribution (OOD) with respect to the training scenes, demonstrating GSVisLoc’s ability to generalize to novel environments and localize the camera accurately.

BibTeX

@inproceedings{khatib2025generalizable,
  title={Generalizable Visual Localization for Gaussian Splatting Scene Representations},
  author={Khatib, Fadi and Moran, Dror and Trostianetsky, Guy and Kasten, Yoni and Galun, Meirav and Basri, Ronen},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={178--189},
  year={2025}
}