Given an RGB query image and a 3D Gaussian Splatting (3DGS) scene, GSVisLoc estimates the camera pose by matching 2D features with 3D Gaussian descriptors, followed by a 3DGS-based pose refinement step.
We present GSVisLoc, a visual localization method tailored to 3D Gaussian Splatting (3DGS) scene representations. Given a 3DGS model and a query image, we estimate the camera pose via coarse-to-fine 3D–2D matching between encoded 3D Gaussian regions and multi-scale 2D image features, followed by PnP+RANSAC and a 3DGS-based pose refinement. GSVisLoc requires no modification or retraining of the 3DGS model and discards reference images after training. It achieves state-of-the-art results among 3DGS-based methods and competitive accuracy on standard indoor and outdoor benchmarks, generalizing to novel scenes without additional training.
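The PnP+RANSAC step mentioned above relies on a reprojection-error inlier test: each candidate pose is scored by how many 3D–2D matches it reprojects to within a pixel threshold. The following is a minimal sketch of that test only, not the authors' implementation. The pinhole model without distortion, the convention `x_cam = R @ x_world + t`, the 2-pixel threshold, and all function and variable names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def project(point_w: Vec3, R: Sequence[Sequence[float]], t: Vec3,
            fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Project a world point with an assumed pinhole model (x_cam = R x_w + t)."""
    xc, yc, zc = (sum(R[i][j] * point_w[j] for j in range(3)) + t[i]
                  for i in range(3))
    if zc <= 0:
        # Point is behind the camera, so it can never count as an inlier.
        return (float("inf"), float("inf"))
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def inlier_mask(points_3d: Sequence[Vec3], points_2d: Sequence[Vec2],
                R: Sequence[Sequence[float]], t: Vec3,
                fx: float, fy: float, cx: float, cy: float,
                thresh_px: float = 2.0) -> List[bool]:
    """Mark each 3D-2D match whose reprojection error is within thresh_px."""
    mask = []
    for pw, (u, v) in zip(points_3d, points_2d):
        pu, pv = project(pw, R, t, fx, fy, cx, cy)
        mask.append(math.hypot(pu - u, pv - v) <= thresh_px)
    return mask


# Toy example (hypothetical values): identity pose, two exact matches, one outlier.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = (0.0, 0.0, 0.0)
pts3d = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 4.0)]
pts2d = [(320.0, 240.0), (570.0, 240.0), (400.0, 400.0)]
mask = inlier_mask(pts3d, pts2d, R_identity, t_zero,
                   fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

In a full RANSAC loop, a minimal PnP solver would propose poses from small random subsets of matches, and the pose with the most inliers under this test would be kept and passed on to refinement.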
GSVisLoc is trained on multiple ScanNet++ scenes and evaluated on novel test scenes to assess cross-scene generalization. We report the median translation (cm) and rotation (°) errors per scene. Bold indicates the best result, and underline indicates the second best. “Avg. Med ↓” is the average of per-scene medians.
| Scene | GSplatLoc (single-scene) | GSVisLoc (Ours, single-scene) | GSVisLoc (Ours, cross-scene) |
|---|---|---|---|
| ebc200e928 | 2.02 / 0.76 | 0.27 / 0.15 | 0.33 / 0.22 |
| 2a496183e1 | 1.29 / 0.31 | 0.34 / 0.12 | 0.22 / 0.11 |
| b26e64c4b0 | 0.79 / 0.20 | 0.27 / 0.09 | 0.32 / 0.10 |
| 9b74afd2d2 | 6.18 / 1.29 | 0.26 / 0.14 | 0.35 / 0.13 |
| bc400d86e1 | 17.57 / 2.09 | 0.52 / 0.16 | 0.62 / 0.36 |
| 1204e08f17 | 1.96 / 0.53 | 0.39 / 0.10 | 0.65 / 0.13 |
| f8f12e4e6b | 3.69 / 0.83 | 0.39 / 0.13 | 0.34 / 0.13 |
| 52599ae063 | 21.95 / 5.54 | 2.31 / 0.29 | 0.82 / 0.18 |
| 94ee15e8ba | 1.92 / 0.29 | 0.33 / 0.13 | 0.30 / 0.09 |
| 9f139a318d | 1.19 / 0.22 | 0.24 / 0.05 | 0.22 / 0.07 |
| 30f4a2b44d | 4.79 / 0.74 | 0.44 / 0.18 | 0.46 / 0.16 |
| 37ea1c52f0 | 3.59 / 0.50 | 0.49 / 0.10 | 0.60 / 0.10 |
| 480ddaadc0 | 1.51 / 0.37 | 0.28 / 0.10 | 0.39 / 0.12 |
| a24f64f7fb | 1.08 / 0.38 | 0.54 / 0.14 | 0.46 / 0.13 |
| ab11145646 | 8.64 / 0.88 | 0.38 / 0.13 | 0.95 / 0.21 |
| Avg. Med ↓ | 5.21 / 0.99 | 0.50 / 0.13 | 0.47 / 0.15 |
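The error metrics in the table can be sketched as follows. This is an illustrative computation under stated assumptions, not the authors' evaluation code: translation error is taken as the Euclidean distance between estimated and ground-truth camera positions (assumed to be in metres and converted to cm), and rotation error as the angle of the relative rotation between the two rotation matrices. The function names are hypothetical. The "Avg. Med" value is then the mean of the per-scene medians, checked here against the cross-scene translation column of the table.

```python
import math
from statistics import mean


def translation_error_cm(t_est, t_gt):
    """Euclidean distance between two positions given in metres, returned in cm."""
    return 100.0 * math.dist(t_est, t_gt)


def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the relative rotation R_est^T R_gt."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


# Per-scene median translation errors (cm) copied from the
# cross-scene GSVisLoc column of the table.
cross_scene_translation_medians = [
    0.33, 0.22, 0.32, 0.35, 0.62, 0.65, 0.34, 0.82,
    0.30, 0.22, 0.46, 0.60, 0.39, 0.46, 0.95,
]
avg_med = round(mean(cross_scene_translation_medians), 2)  # 0.47, as in the table
```

Within a scene, the reported value would be the median of these per-image errors over all query images; the averaging step above is applied only across scenes.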
Each slide visualizes qualitative results for GSVisLoc on unseen ScanNet++ scenes. Two training images (used to construct the 3DGS model), a query image, and the view rendered from the estimated pose are shown. The query images are out-of-distribution (OOD) with respect to the training scenes, demonstrating GSVisLoc’s ability to generalize to novel environments and achieve accurate camera localization.
[Image carousel, five slides. Panels per slide: Train image, Train image, Test image, Rendered from est. pose.]
@inproceedings{khatib2025generalizable,
title={Generalizable Visual Localization for Gaussian Splatting Scene Representations},
author={Khatib, Fadi and Moran, Dror and Trostianetsky, Guy and Kasten, Yoni and Galun, Meirav and Basri, Ronen},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={178--189},
year={2025}
}