Visual Grounding in 2D and 3D: A unified perspective and survey

Abstract
Visual Grounding (VG), the task of localizing specific image or scene regions based on natural language descriptions, is crucial for bridging the semantic gap between vision and language in Artificial Intelligence. Despite substantial progress in both the 2D and 3D domains, existing surveys typically focus on one dimension alone and lack a unified perspective. This survey provides the first comprehensive review to offer such a unified viewpoint, systematically integrating and analyzing research across both 2D and 3D Visual Grounding. We provide a structured categorization of core methodologies, detailing the evolution of two-stage and one-stage paradigms and their representative techniques. Furthermore, we review emerging trends, including the integration of Large Language Models (LLMs) for enhanced semantic reasoning, strategies for cross-dimensional knowledge transfer between 2D and 3D VG, and the nascent area of monocular 3D VG. The survey also encompasses an overview of benchmark datasets, a discussion of evaluation metrics, an analysis of current performance levels, and an articulation of open challenges. By offering this holistic and systematically organized review, we aim to give researchers a clear understanding of the current landscape, facilitate the identification of promising research avenues, and inspire further innovation in this dynamic and impactful cross-modal research area.
| Original language | English |
|---|---|
| Article number | 103625 |
| Journal | Information Fusion |
| Volume | 126 |
| Issue number | Part B |
| DOIs | |
| Publication status | Published - Feb 2026 |
Keywords
- Scene understanding
- Vision and language model
- Visual grounding
- Multimodal learning