Visual Grounding in 2D and 3D: A unified perspective and survey

  • Keyu Guo
  • Yongle Huang
  • Tinglei Jia
  • Xiangyu Song*
  • Shijie Sun
  • Hongkai Wei
  • Xian Feng Han
  • Shuwen Huang
  • Nicola Strisciuglio
  • Shuyan Li

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Visual Grounding (VG), the task of localizing specific image or scene regions based on natural language descriptions, is crucial for bridging the semantic gap between vision and language in Artificial Intelligence. Despite substantial progress in both 2D and 3D domains, existing surveys often focus narrowly on one dimension and lack a unified perspective. This survey provides the first comprehensive review offering such a unified viewpoint, systematically integrating and analyzing research across both 2D and 3D Visual Grounding. We provide a structured categorization of core methodologies, detailing the evolution of two-stage and one-stage paradigms and their representative techniques. Furthermore, we review emerging trends, including the integration of Large Language Models (LLMs) for enhanced semantic reasoning, strategies for cross-dimensional knowledge transfer between 2D and 3D VG, and the nascent area of monocular 3D VG. The survey also covers benchmark datasets, evaluation metrics, current performance levels, and open challenges. By offering this holistic and systematically organized review, we aim to give researchers a clear understanding of the current landscape, help them identify promising research avenues, and inspire further innovation in this dynamic and impactful cross-modal research area.

Original language: English
Article number: 103625
Journal: Information Fusion
Volume: 126
Issue number: Part B
Publication status: Published - Feb 2026

Keywords

  • Scene understanding
  • Vision and language model
  • Visual grounding
  • Multimodal learning
