Visual Representation Alignment for Multimodal Large Language Models

11/09/2025

Daily Paper Cast

Duration: 26:13

🤗 Upvotes: 54 | cs.CV

Authors:
Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim

Title:
Visual Representation Alignment for Multimodal Large Language Models

Arxiv:
http://arxiv.org/abs/2509.07979v1

Abstract:
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
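The abstract describes VIRAL only at a high level: a regularization term that pulls the MLLM's internal visual representations toward features from a pre-trained vision foundation model, applied alongside the usual text supervision. The sketch below is one plausible reading of that idea, not the authors' implementation. The cosine-similarity objective, the linear projection `weight`, the weighting factor `lam`, and every function name are assumptions made for illustration; the abstract does not specify any of them.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero

def project(hidden, weight):
    """Hypothetical linear projection from MLLM hidden space to VFM feature space."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def alignment_loss(mllm_visual_states, vfm_features, weight):
    """Assumed objective: mean (1 - cosine similarity) over visual tokens."""
    losses = [
        1.0 - cosine_similarity(project(hidden, weight), target)
        for hidden, target in zip(mllm_visual_states, vfm_features)
    ]
    return sum(losses) / len(losses)

def total_loss(text_loss, align_loss, lam=0.5):
    """Text supervision plus the alignment regularizer; lam is an assumed weight."""
    return text_loss + lam * align_loss

# Toy usage: two visual tokens with 2-dimensional features.
states = [[1.0, 0.0], [0.0, 2.0]]       # stand-in for MLLM visual hidden states
targets = [[1.0, 0.0], [0.0, 2.0]]      # stand-in for VFM features
identity = [[1.0, 0.0], [0.0, 1.0]]     # stand-in for a learned projection
print(total_loss(1.0, alignment_loss(states, targets, identity)))
```

In this toy setup the alignment term is near zero when the projected hidden states point in the same direction as the VFM features, and it grows as they diverge, so the regularizer only penalizes the visual pathway when it drifts from the reference features.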

More episodes of "Daily Paper Cast"

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Reconstruction Alignment Improves Unified Multimodal Models

UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward

Reverse-Engineered Reasoning for Open-Ended Generation

Does DINOv3 Set a New Medical Vision Standard?

Symbolic Graphics Programming with Large Language Models

Set Block Decoding is a Language Model Inference Accelerator

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth