Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.
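
The idea of approximating embeddings with geometry-aware compositional structures can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal example under stated assumptions: embeddings are unit-normalized so the latent space is treated as a sphere, there is exactly one (already denoised) embedding per attribute-object pair, and all function names (`log_map`, `exp_map`, `intrinsic_mean`, `decompose`, `compose`) are hypothetical. Embeddings are mapped to the tangent space at their intrinsic mean, split there into an attribute part and an object part, and mapped back to the sphere.

```python
import numpy as np

def log_map(mu, x):
    """Sphere log map: tangent vector at mu pointing toward x (batched over leading axes)."""
    cos_t = np.clip(np.sum(x * mu, axis=-1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos_t)                      # geodesic distance from mu to x
    residual = x - cos_t * mu                     # component of x orthogonal to mu
    norm = np.linalg.norm(residual, axis=-1, keepdims=True)
    return theta * residual / np.maximum(norm, 1e-12)

def exp_map(mu, v):
    """Sphere exp map: walk from mu along tangent vector v."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.cos(norm) * mu + np.sin(norm) * v / np.maximum(norm, 1e-12)

def intrinsic_mean(points, iters=100):
    """Intrinsic (Karcher) mean of unit vectors via fixed-point iteration."""
    mu = points.mean(axis=0)
    mu /= np.linalg.norm(mu)
    for _ in range(iters):
        mu = exp_map(mu, log_map(mu, points).mean(axis=0))
        mu /= np.linalg.norm(mu)                  # guard against numerical drift
    return mu

def decompose(embeddings):
    """embeddings: array of shape (n_attrs, n_objs, dim), unit-norm (assumed layout).

    Returns the intrinsic mean plus one tangent vector per attribute and per object.
    Tangent vectors at the intrinsic mean average to zero, so row and column
    means give the least-squares additive split.
    """
    dim = embeddings.shape[-1]
    mu = intrinsic_mean(embeddings.reshape(-1, dim))
    tangent = log_map(mu, embeddings)
    attr_vecs = tangent.mean(axis=1)              # average over objects
    obj_vecs = tangent.mean(axis=0)               # average over attributes
    return mu, attr_vecs, obj_vecs

def compose(mu, attr_vec, obj_vec):
    """Composite embedding for an (attribute, object) pair, back on the sphere."""
    return exp_map(mu, attr_vec + obj_vec)
```

A linear counterpart would simply add vectors in the ambient space; the sketch differs only in performing the addition in the tangent space at the intrinsic mean and returning to the sphere with `exp_map`. A composite such as `compose(mu, attr_vecs[i], obj_vecs[j])` could then be compared with query image embeddings by cosine similarity.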
Original language: English
Title of host publication: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE
Pages: 24917-24927
Number of pages: 11
ISBN (Electronic): 979-8-3315-4364-8
ISBN (Print): 979-8-3315-4365-5
DOIs
Publication status: Published - 13 Aug 2025
Event: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025: 8th Multimodal Learning and Applications Workshop - Nashville, TN, United States
Duration: 11 Jun 2025 to 15 Jun 2025

Workshop

Workshop: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
Abbreviated title: CVPR 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 to 15/06/25

Keywords

  • 2025 OA procedure
  • Visualization
  • Noise
  • Natural languages
  • Image representation
  • Controllability
  • Robustness
  • Encoding
  • Pattern recognition
  • Noise measurement
  • Geometry
