
M2H: Multi-task learning with efficient window-based cross-task attention for monocular spatial perception

Research output: Working paper, Preprint, Academic


Abstract

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. In this paper, we introduce MultiMono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the backbone for monocular spatial perception systems, a framework for 3D scene graph construction in dynamic environments.
Comprehensive evaluations demonstrate that M2H outperforms state-of-the-art (SOTA) multi-task models on NYUDv2, exceeds single-task depth and semantic baselines on Hypersim, and achieves superior performance on Cityscapes, all while maintaining computational efficiency on laptop hardware. Beyond curated benchmarks, we validate M2H on real-world data, demonstrating its practicality in spatial perception tasks. We provide our implementation and pretrained models at https://github.com/UAV-Centre-ITC/M2H.git
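
The abstract does not spell out how the Window-Based Cross-Task Attention Module works, so the following is only an illustrative sketch of the general idea of window-restricted attention between two task branches, not the M2H implementation. The function name `window_cross_task_attention`, the non-overlapping window layout, the absence of learned projections, and the residual sum are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_cross_task_attention(target, source, window):
    """Hypothetical sketch: let one task's features attend to another task's.

    target, source: H x W grids (nested lists) of equal-length feature vectors
    coming from two task branches. Each target position attends only to the
    source positions inside its own non-overlapping window x window tile.
    The attended source features are added to the target features (residual),
    so the target branch keeps its own task-specific signal.
    """
    height, width = len(target), len(target[0])
    dim = len(target[0][0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    out = [[None] * width for _ in range(height)]

    for top in range(0, height, window):
        for left in range(0, width, window):
            # All grid positions that fall inside the current window.
            cells = [(r, c)
                     for r in range(top, min(top + window, height))
                     for c in range(left, min(left + window, width))]
            for r, c in cells:
                query = target[r][c]
                scores = [scale * sum(q * k for q, k in zip(query, source[kr][kc]))
                          for kr, kc in cells]
                weights = softmax(scores)
                mixed = [sum(w * source[kr][kc][d]
                             for w, (kr, kc) in zip(weights, cells))
                         for d in range(dim)]
                out[r][c] = [t + m for t, m in zip(query, mixed)]
    return out


# Toy 4 x 4 feature maps with 2 channels each (made-up values).
depth_feats = [[[float(r + c), 1.0] for c in range(4)] for r in range(4)]
seg_feats = [[[1.0, float(r * c)] for c in range(4)] for r in range(4)]
fused = window_cross_task_attention(depth_feats, seg_feats, window=2)
```

Restricting attention to windows means each position compares against at most `window * window` others rather than all `H * W` positions, which is the usual motivation for windowed attention when compute is limited.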
Original language: English
Publisher: ArXiv.org
Number of pages: 8
Publication status: Published - 20 Oct 2025
