GaussianVLM

Scene-centric 3D Vision-Language Models using
Language-aligned Gaussian Splats
for Embodied Reasoning and Beyond

Anna-Maria Halacheva¹, Jan-Nico Zaech¹, Xi Wang^{1, 2, 3}, Danda Pani Paudel ¹, Luc Van Gool¹

¹INSAIT, Sofia University "St. Kliment Ohridski" ²ETH Zurich ³TU Munich

Preprint Data (Coming Soon) Code (Coming soon)

TL;DR

We present GaussianVLM, the first 3D VLM operating on Gaussian splats. Each Gaussian in the scene is enriched with language features, forming a dense, scene-centric representation. A novel dual sparsifier reduces ~40k language-augmented Gaussians to just 132 tokens, retaining task-relevant and location-relevant information. This enables open-vocabulary, detector-free reasoning and yields state-of-the-art performance on both scene- and object-centric embodied benchmarks.

Abstract

As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images, demonstrating strong generalization: it improves performance of prior 3D VLM (LL3DA) five folds, in out-of-the-domain settings.

🔍 Project Highlights

🗣️ Built on Visuo-Linguistic 3D Maps

Constructs maps of 3D Gaussian splats enriched with language features for vision-language spatial reasoning.

🔍 Object Detector-Free

GaussianVLM does NOT rely on object detectors, ensuring scene-centric representations and open vocabulary support.

💡 Novel Dual Sparsification

Sparsifiers reduce ~40k tokens to 132, making maps interpretable for LLMs via two sparsifiers - task-guided, and location-aware.

🌎 OOD Generalization

Outperforms prior methods in real-world RGB-derived 3D datasets due to Gaussian splat utilization.

🚀 State-of-the-Art Performance

Outperforms leading baselines across scene- and object-centric tasks, including SQA3D and 3D-LLM embodied benchmarks.

GaussianVLM Architecture

The GaussianVLM architecture takes as input a user-defined task prompt—consisting of a query and an optional spatial location—and a 3D scene represented using Gaussians. A 3D vision module, the SceneSplat Transformer, first predicts per-Gaussian language features across the scene.

These dense language features are then processed by a dual sparsifier. This sparsifier includes two parallel components: (1) a location-guided pathway that selects Gaussians within a spatial Region of Interest (ROI) to produce ROI tokens; and (2) a task-guided pathway that uses cross-attention with task tokens to extract 128 task-selected tokens based on the decoder’s hidden states and dense scene features.

The resulting sparse representation—composed of both ROI and task-selected tokens—is then combined with the task tokens and passed to a multimodal decoder, enabling precise and grounded reasoning over the 3D scene for downstream language-vision tasks.

Quantitative Results

Our evaluation comprises scene-centric [situated QA & embodied tasks] and object-centric tasks.
To assess real-world robustness, we also test generalization to out-of-domain (OOD) data on scenes reconstructed from RGB images, a more realistic input setting compared to traditional point cloud capture methods.

Scene-centric: require comprehensive understanding of entire 3D environments—including spatial layout, agent context, and multi-turn interaction—covering benchmarks like embodied dialogue, planning, and situated question answering. These tasks demand holistic reasoning beyond individual objects.

Object-centric: emphasize detailed reasoning about specific objects through captioning and question answering using localized queries and object annotations.

While our model is primarily designed to excel at scene-centric understanding without relying on explicit object detectors, it remarkably retains strong performance on object-centric benchmarks, matching or surpassing baselines that use dedicated object detection modules.

OOD generalization: We evaluate on ScanNet++ scenes reconstructed from RGB images using an object counting task (in-house dataset). Given a question targeting an object category (e.g., “How many chairs are in the scene?”), the model predicts the correct count. QA pairs are automatically generated from ScanNet++ instance segmentation annotations.

**Situated QA, SQA3D Benchmark**
Model	EM1	C	B-4	M	R
GPT3	41.0	-	-	-	-
ClipBERT	43.3	-	-	-	-
SQA3D	46.6	-	-	-	-
3D-VisTA	48.5	-	-	-	-
PQ3D	47.1	-	-	-	-
LEO*	47.0	124.7	9.4	25.5	48.4
Ours	49.4	129.6	17.1	26.4	50.2

**3D-LLM Embodied Tasks**
Model	Embodied Dialogue					Embodied Planning					Scene Captioning
Model	Sim	C	B-4	M	R	Sim	C	B-4	M	R	Sim	C	B-4	M	R
OPT-1.3B	-	0.31	0.23	5.62	4.83	-	0.16	0.13	0.24	3.56	-	0.0	0.84	8.40	11.7
OPT-2.7B	-	0.38	0.39	7.38	6.28	-	0.10	0.26	3.59	4.35	-	0.11	0.00	6.60	12.32
OPT-6.7B	-	0.25	0.43	6.88	6.16	-	0.00	0.28	3.65	3.94	-	0.06	1.13	8.99	16.96
LLAMA-7B	-	0.27	0.50	7.81	6.68	-	0.04	0.29	3.53	4.71	-	0.20	0.92	7.00	12.31
LL3DA*	48.2	145.9	22.2	40.9	36.7	50.2	65.1	7.1	20.8	32.2	66.4	0.2	3.0	19.4	18.4
Ours	72.3	270.1	31.5	55.7	48.6	59.0	220.4	20.3	44.5	48.0	65.8	0.8	6.4	23.5	21.1

**Evaluation on object-centric LL3DA benchmarks**
	ScanRefer			ScanQA			Nr3D
Model	Sim	M	R	EM1	M	R	Sim	M	R
Scan2Cap	-	21.4	43.5	-	-	-	-	-	-
VoteNet+MCAN	-	-	-	17.3	11.4	29.8	-	-	-
ScanQA	-	-	-	-	13.14	33.3	-	-	-
3D-LLM	-	13.1	33.2	19.3	13.8	34.0	-	-	-
3D-VLP	-	-	-	-	13.5	34.5	-	-	-
Scene-LLM	-	21.8	45.6	-	15.8	-	-	-	-
LL3DA*	55.9	51.6	54.8	14.3	22.8	34.7	48.1	5.8	9.9
Ours	59.1	52.4	57.4	14.4	22.9	34.8	48.2	20.8	19.2