
MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Research output: Working paper, Preprint, Academic


Abstract

Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC), a framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the autoregressive processing of VLMs. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

Original language: English
Publisher: ArXiv.org
Number of pages: 7
Publication status: Published - 11 Aug 2025
