GeoProg3D: Compositional Visual Reasoning for
City-Scale 3D Language Fields


Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Masato Taki and Yutaka Matsuo
scaled image

Overview of the proposed compositional geographic reasoning task. This task enables natural language interaction with city-scale 3D scenes, supporting diverse geographic reasoning scenarios. Given a natural language query, GeoProg3D decomposes it into modular steps, executes structured visual programs using geographic APIs, and integrates the results for interpretable and accurate reasoning.

Abstract

The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language.


Proposed Framework

teaser image

Framework overview. Given a user query, GeoProg3D generates a visual program via LLM in-context learning. The program operates GCLF by combining Geographical Vision APIs (GV-APIs) and answers the query.
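To make the program-generation step concrete, the following is a minimal sketch of what an LLM-composed visual program for a counting query could look like. The API names (`locate_landmark`, `filter_by_direction`, `detect_objects`) and the toy scene dictionary are illustrative assumptions, not the actual GV-API signatures or GCLF data format.

```python
# Hypothetical sketch of an LLM-generated visual program for the query:
#   "How many white cars are there east of the library?"
# The GV-API names below are illustrative stand-ins, not the paper's real API.

def locate_landmark(scene, name):
    # Look up a named landmark via geographic metadata (stub).
    return scene["landmarks"][name]

def filter_by_direction(scene, anchor, direction):
    # Keep scene regions lying in the given compass direction from the anchor (stub).
    ax, ay = anchor["center"]
    if direction == "east":
        return [r for r in scene["regions"] if r["center"][0] > ax]
    raise ValueError(f"unsupported direction: {direction}")

def detect_objects(regions, category, attribute=None):
    # Run an object detector over candidate regions (stub).
    hits = []
    for r in regions:
        for obj in r["objects"]:
            if obj["category"] == category and (
                attribute is None or attribute in obj["attributes"]
            ):
                hits.append(obj)
    return hits

def program(scene):
    # Modular steps an LLM might emit: locate, filter, detect, aggregate.
    anchor = locate_landmark(scene, "library")
    regions = filter_by_direction(scene, anchor, "east")
    cars = detect_objects(regions, "car", attribute="white")
    return len(cars)

# Toy scene standing in for a GCLF; real inputs are city-scale 3D fields.
scene = {
    "landmarks": {"library": {"center": (0.0, 0.0)}},
    "regions": [
        {"center": (5.0, 1.0), "objects": [
            {"category": "car", "attributes": {"white"}},
            {"category": "car", "attributes": {"red"}},
        ]},
        {"center": (-3.0, 0.0), "objects": [
            {"category": "car", "attributes": {"white"}},
        ]},
    ],
}
print(program(scene))  # → 1 (one white car east of the landmark)
```

The point of the decomposition is that each step is an interpretable, reusable tool call, so the intermediate results (the anchor, the filtered regions, the detections) can be inspected when the final answer is wrong.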

teaser image

GCLF structure. (a) Coarse-to-fine tree structure to represent 3D scenes. Each node represents a pair of a 3D Gaussian and a language embedding. (b) Language alignment using CLIP features. (c) Geographical alignment using OpenStreetMap.
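As a rough illustration of the coarse-to-fine idea, the sketch below pairs each tree node with a language embedding and prunes whole subtrees whose embedding does not match a text query. The node layout, the cosine threshold, and the 2-D toy embeddings are assumptions for illustration only; the actual GCLF pairs 3D Gaussians with CLIP-aligned features and OpenStreetMap geography.

```python
# Minimal sketch of a coarse-to-fine language-field tree in the spirit of GCLF.
# Node fields and the traversal rule are illustrative assumptions.
import numpy as np

class Node:
    def __init__(self, embedding, children=None):
        self.embedding = np.asarray(embedding, dtype=float)  # language embedding
        self.children = children or []                       # finer-level nodes

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def query(node, text_emb, threshold=0.5):
    """Descend the tree, expanding only nodes whose embedding matches the query."""
    if cosine(node.embedding, text_emb) < threshold:
        return []        # prune this coarse region and its entire subtree
    if not node.children:
        return [node]    # matching leaf (finest level)
    hits = []
    for child in node.children:
        hits += query(child, text_emb, threshold)
    return hits

# Toy 2-D embeddings: [building-ness, vegetation-ness].
leaf_building = Node([0.9, 0.1])
leaf_tree = Node([0.1, 0.9])
root = Node([0.6, 0.6], [leaf_building, leaf_tree])

building_query = np.array([1.0, 0.0])
print(len(query(root, building_query)))  # → 1: only the building-like leaf matches
```

The payoff of this structure is memory efficiency at city scale: a query touching one neighborhood never expands the rest of the tree, since mismatched coarse nodes are discarded before their children are visited.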


Visualization of Learned Features on GCLFs


Evaluation Dataset: GeoEval3D

teaser image

We propose GeoEval3D, a new evaluation dataset collected by human annotators. GeoEval3D contains 952 unique queries covering five tasks. Its queries contain more words than those in previous evaluation datasets, reflecting the complexity of the proposed benchmark.


Results on the GeoEval3D Dataset

Grounding (GRD)

scaled image

Query: red lettering signboard within 100 meters of The View

scaled image

Query: The canopy shop southeast of New York Mart

Counting (CNT)

scaled image

Query: How many white automobiles are there around Elmer Holmes Bobst Library?

scaled image

Query: How many solar panels are there directly east of Vanderbilt Hall?

Height (MES-H)

scaled image

Query: How tall is the skyscraper that is closest to 45-45 Center Blvd?

Distance (MES-D)

scaled image

Query: How far is DO, Cookie Dough Confections from building with solar panels?




Quantitative Results

teaser image

This table shows the 3D semantic segmentation performance on the GRD task. Average IoU scores (%) are reported.

teaser image

This table shows the performance of spatial reasoning (SPR), comparison (CMP), counting (CNT), and measurement (MES) tasks.


Language Guided 3D Scene Editing

scaled image

scaled image

scaled image

BibTeX

@inproceedings{yasuki2025geoprog3d,
    title={GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields},
    author={Yasuki, Shunsuke and Miyanishi, Taiki and Inoue, Nakamasa and Kurita, Shuhei and Sakamoto, Koya and Azuma, Daichi and Taki, Masato and Matsuo, Yutaka},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year={2025},
    url={https://arxiv.org/abs/2506.23352},
}