Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

Johns Hopkins University

NeurIPS 2025 (Spotlight)

*Equal Contribution, Corresponding Author

Abstract

How to encode visual-spatial intelligence (VSI) into representative and informative features remains an open challenge. Instead of representing VSI solely in a Visual Question Answering (VQA) style, we introduce the spatial intelligence grid (SIG): a structured, grid-based data schema that embeds the geometric spatial relationships among objects along with physical priors from the human world, as a complementary representation. We further derive a set of SIG-informed evaluation metrics that rigorously quantify a model's true VSI capabilities. In few-shot in-context learning experiments on state-of-the-art multimodal LLMs (e.g., GPT-4o, Gemini-2.5-Pro), SIG yields consistently larger, more stable, and more comprehensive improvements across all VSI metrics than VQA-style representations, demonstrating its potential as a novel data schema for learning VSI. Based on SIG, we create SIGBench, a benchmark containing 1.4K driving frames annotated with ground-truth SIG labels and human gaze attention, supporting both grid-based machine VSI tasks and human-like attention-driven VSI tasks in autonomous-driving scenarios.

Highlight

Overview of the Human-like SIG in an AD Scenario. On the left, SIG represents the spatial relations of traffic signs, traffic lights, vehicles, and the ego-vehicle in the image. On the right, a homographic transformation maps human gaze attention from the image onto the SIG grid. Combining the two yields the human-like SIG, from which the human-like SRG and SRP in the middle are extracted. The order denotes the rank of an object within its category in the image from left to right (e.g., black truck 1 is the left-most vehicle).
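To make the homographic mapping concrete, below is a minimal NumPy sketch of how gaze fixations in image coordinates could be warped onto a grid of SIG size; the 3x3 matrix H, the grid dimensions, and the rounding scheme are illustrative assumptions, not the paper's actual calibration.

```python
import numpy as np

# Hypothetical 3x3 homography H mapping image pixel coordinates to SIG grid
# coordinates; in practice it would be estimated from corresponding points
# between the camera image and the grid layout, not hard-coded like this.
H = np.array([[0.010, 0.000, 0.0],
              [0.000, 0.012, 0.0],
              [0.000, 0.000, 1.0]])

def gaze_to_grid(gaze_xy, grid_rows=12, grid_cols=16):
    """Warp (N, 2) gaze points from image coordinates into SIG grid cells."""
    pts = np.asarray(gaze_xy, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coordinates
    warped = pts_h @ H.T
    warped = warped[:, :2] / warped[:, 2:3]            # perspective divide
    cols = np.clip(np.round(warped[:, 0]), 0, grid_cols - 1).astype(int)
    rows = np.clip(np.round(warped[:, 1]), 0, grid_rows - 1).astype(int)
    return np.stack([rows, cols], axis=1)

# Example: three gaze fixations in a 1920x1080 driving frame.
print(gaze_to_grid([[960.0, 540.0], [300.0, 700.0], [1500.0, 400.0]]))
```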

Evaluation Metrics

Based on SIG, we can extract a directed spatial relation graph (SRG) that describes the spatial relation (direction plus distance in grid cells) of each object, and a spatial relation paragraph (SRP) that describes the spatial relation of each object in textual form. To quantitatively assess a model's VSI, we propose three novel evaluation metrics: multi-level spatial matching (MLSM), spatial relation graph similarity (SRGS), and semantic relational distance (SRD). MLSM compares object positions directly within the SIG representation, capturing absolute localization accuracy. SRGS measures both node-wise and edge-wise correspondence between the predicted and ground-truth (GT) SRGs, emphasizing relation classification and structure. SRD computes a semantic relational distance between predicted and ground-truth prepositions in the SRP, evaluating the fidelity of both directional and proximal relations.
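As a rough illustration of the matching step behind MLSM, the sketch below pairs predicted and ground-truth grid positions for one object category via minimum-cost bipartite matching and derives precision, recall, and F1; the distance threshold and cost function are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mlsm_prf1(pred_cells, gt_cells, dist_thresh=1.0):
    """Hypothetical MLSM-style scoring for one object category.

    pred_cells, gt_cells: lists of (row, col) grid positions from the
    predicted and ground-truth SIG. Objects are paired by minimum-cost
    bipartite matching on grid distance; pairs within dist_thresh count
    as true positives, unmatched predictions as false positives, and
    unmatched ground-truth objects as false negatives.
    """
    pred = np.asarray(pred_cells, dtype=float).reshape(-1, 2)
    gt = np.asarray(gt_cells, dtype=float).reshape(-1, 2)
    if len(pred) == 0 or len(gt) == 0:
        tp, fp, fn = 0, len(pred), len(gt)
    else:
        cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)
        tp = int(np.sum(cost[rows, cols] <= dist_thresh))
        fp = len(pred) - tp
        fn = len(gt) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: two predicted vehicles vs. three in the ground-truth SIG.
print(mlsm_prf1([(3, 5), (7, 2)], [(3, 5), (7, 3), (1, 9)]))
```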

Illustrative Examples of MLSM and SRGS. Both MLSM and SRGS first match objects between the predicted and GT SIGs via bipartite matching. For MLSM, the upper part gives an example of computing TP, FP, and FN for the vehicles in the boxed area. For SRGS, the lower part highlights the nodes and edges that must be inserted or substituted, together with their total edit distance.
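For intuition only, one way an edit-distance-style graph similarity could be scored after node matching is sketched below, treating nodes and relation-labeled edges as sets and counting elements present in only one graph as edits; the paper's exact edit costs and normalization may differ.

```python
def srgs_similarity(pred_nodes, pred_edges, gt_nodes, gt_edges):
    """Hypothetical edit-distance-style SRG similarity.

    Nodes are object labels (e.g. "vehicle_1"); edges are (src, dst, relation)
    triples such as ("vehicle_1", "ego", "front-left by 3 cells"). Elements
    present in only one graph count as edits (so a substitution costs two
    here), and the edit count is normalized by the larger graph's size.
    """
    pn, gn = set(pred_nodes), set(gt_nodes)
    pe, ge = set(pred_edges), set(gt_edges)
    edits = len(pn ^ gn) + len(pe ^ ge)                   # symmetric differences
    denom = max(len(pn) + len(pe), len(gn) + len(ge)) or 1
    return max(0.0, 1.0 - edits / denom)

# Example: the prediction misses one vehicle and mislabels one relation.
pred_n = ["ego", "vehicle_1"]
pred_e = [("vehicle_1", "ego", "front-left by 3 cells")]
gt_n = ["ego", "vehicle_1", "vehicle_2"]
gt_e = [("vehicle_1", "ego", "front-left by 2 cells"),
        ("vehicle_2", "ego", "front-right by 4 cells")]
print(srgs_similarity(pred_n, pred_e, gt_n, gt_e))
```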

SIGBench Dataset

We introduce SIGBench, a benchmark for quantifying both grid-based and human-like VSI in MLLMs in AD scenarios. SIGBench comprises 1,423 frames, each annotated with (i) a SIG and a human-like SIG, (ii) an SRP and a human-like SRP, and (iii) a gaze attention map at image resolution. SIGBench contains two main task clusters: grid-based VSI tasks, namely spatial intelligence grid creation (SIGC) and spatial relation paragraph filling (SRPF), and human-like VSI tasks, namely human-like SIGC, human-like SRPF, and gaze prediction.
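For concreteness, a single SIGBench frame record might look roughly like the dictionary below; all field names, the grid size, and the file layout are assumptions for illustration, and the released format may differ.

```python
# Hypothetical layout of one SIGBench frame record; the actual field names,
# grid size, and file formats in the released benchmark may differ.
example_record = {
    "frame_id": "000123",
    "image": "frames/000123.jpg",
    # SIG: a grid of object tokens ("empty" for unoccupied cells).
    "sig": [
        ["empty", "vehicle_1", "empty"],
        ["traffic_light_1", "empty", "vehicle_2"],
        ["empty", "ego", "empty"],
    ],
    # Human-like SIG: the same grid restricted to gaze-attended objects.
    "human_like_sig": [
        ["empty", "vehicle_1", "empty"],
        ["empty", "empty", "empty"],
        ["empty", "ego", "empty"],
    ],
    "srp": "vehicle 1 is one cell in front of the ego vehicle ...",
    "human_like_srp": "vehicle 1 is one cell in front of the ego vehicle ...",
    "gaze_map": "gaze/000123.png",  # attention map at image resolution
}
```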

(a) The annotation pipeline of SIGBench; (b) illustration of the SIGC and SRPF tasks in SIGBench.

Experiment

We evaluate several top-tier MLLMs on SIGBench, drawn mainly from five model families: 1) open-source models such as InternVL and Qwen-VL; 2) proprietary models including OpenAI GPT, Google Gemini, and Anthropic Claude. We conduct both SIG-based and multiple-choice (MC, VQA-style) in-context learning (ICL) on GPT-4o and Gemini-2.5-Pro.
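As an illustration of how SIG-based ICL could be set up, the helper below assembles a few-shot text prompt from (frame, SIG) exemplars; the function name, prompt template, and the way images are attached to the actual MLLM API calls are assumptions, not the paper's exact protocol.

```python
def build_sig_icl_prompt(exemplars, query_frame, instruction):
    """Hypothetical assembly of a SIG-based few-shot (ICL) prompt.

    exemplars: list of (frame_reference, sig_text) pairs used as in-context
    demonstrations; in a real MLLM request the frames would be attached as
    images alongside the text rather than referenced by name.
    """
    parts = [instruction]
    for i, (frame, sig_text) in enumerate(exemplars, start=1):
        parts.append(f"Example {i}:\nFrame: {frame}\nSIG:\n{sig_text}")
    parts.append(f"Now produce the SIG for this frame:\nFrame: {query_frame}\nSIG:")
    return "\n\n".join(parts)

prompt = build_sig_icl_prompt(
    exemplars=[("exemplar_001.jpg",
                "row1: empty | vehicle_1 | empty\nrow2: ego | empty | empty")],
    query_frame="query_042.jpg",
    instruction="Fill in the spatial intelligence grid (SIG) for the driving frame.",
)
print(prompt)
```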

Results of Zero-shot Inference on SIGBench for Grid-based VSI tasks

Quantitative comparison of different MLLMs on SIGBench for general VSI tasks. P, R, F1, and AssA denote precision, recall, F1-score, and association accuracy, respectively. S and WS denote graph similarity and weighted graph similarity. Acc denotes accuracy. Dark blue and light blue indicate the best and second-best results among all models.

Results of SIG-based ICL using Random Sample Selection

Quantitative comparison of 3-shot ICL for general VSI tasks on SIGBench-tiny. Z-S denotes zero-shot, ICL-MC denotes ICL using multiple-choice VQA, and ICL-SIG denotes ICL using SIG. Light red indicates results that are worse than zero-shot after applying ICL on GPT-4o and Gemini-2.5-Pro.

Visualization of Grid-based VSI tasks Results

Visualization of SIG-empowered VSI results. SRD-Dir and SRD-Prox denote the accuracy of SRD (directional) and SRD (proximal), respectively. (a) compares human performance with that of different models on grid-based tasks in SIGBench; even leading MLLMs show a substantial gap to human VSI performance. (b) and (c) show the ICL results of GPT-4o and Gemini-2.5-Pro on SIGBench-tiny. ICL-SIG outperforms the zero-shot baseline on all VSI metrics and delivers more comprehensive improvements than ICL-MC.

Results of Zero-shot Inference on SIGBench for Human-Like VSI tasks

Quantitative comparison of different MLLMs on SIGBench for human-like visual-spatial intelligence tasks. H denotes human-like and KL-D denotes KL divergence.

Acknowledgement

The author team would like to express sincere thanks to Wei Zhang from the U.S. Department of Transportation (USDOT) for providing the valuable U.S. Federal Highway Administration driving dataset, on which our proposed benchmark is built. The author team also appreciates the valuable suggestions from Junyue Jiang, Linshen Liu, and Yibo Zhao during discussions about this project.

BibTeX citation

@inproceedings{wu2025towards,
  title     = {Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study},
  author    = {Guanlin Wu and Boyan Su and Yang Zhao and Pu Wang and Yichen Lin and Hao Frank Yang},
  booktitle = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}