Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

29/10/2025

Daily Paper Cast

Duration: 23:09

🤗 Upvotes: 147 | cs.CV

Authors:
Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao

Title:
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Arxiv:
http://arxiv.org/abs/2510.23607v1

Abstract:
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.

More episodes of "Daily Paper Cast"

The End of Manual Decoding: Towards Truly End-to-End Language Models

Kimi Linear: An Expressive, Efficient Attention Architecture

Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

Language Models are Injective and Hence Invertible