Multimodal fusion architectures for pedestrian detection

Dayan Guan, Jiangxin Yang, Yanlong Cao, Michael Ying Yang, Yanpeng Cao

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Academic › peer-review

Abstract

Pedestrian detection is a crucial capability in many human-centric applications, such as video surveillance, urban scene analysis, and autonomous driving. Multimodal pedestrian detection has recently received extensive attention because fusing the complementary information captured by visible and infrared sensors enables robust detection of human targets in both daytime and nighttime scenes. In this chapter, we systematically evaluate the performance of different multimodal fusion architectures in order to identify the optimal solutions for pedestrian detection. We make two important observations: (1) it is useful to combine the most commonly used concatenation fusion scheme with a global scene-aware mechanism to learn both human-related features and the correlation between visible and thermal feature maps; (2) two-stream segmentation supervision without multimodal fusion is the most effective scheme for infusing segmentation information as supervision for learning human-related features. Based on these studies, we present a unified multimodal fusion framework for joint training of target detection and segmentation supervision, which achieves state-of-the-art multimodal pedestrian detection performance on the public KAIST benchmark dataset.
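As a rough illustration of the concatenation fusion mentioned in the abstract, the sketch below stacks a visible and a thermal feature map along the channel axis using plain Python lists. It is not the chapter's implementation: the names `concat_fusion`, `scene_weighted_fusion`, and `day_weight` are hypothetical, and the scalar weighting is only an assumed stand-in for a scene-aware mechanism, whose actual design the abstract does not specify.

```python
from typing import List

# A feature map is stored as channels x height x width nested lists.
FeatureMap = List[List[List[float]]]


def concat_fusion(visible: FeatureMap, thermal: FeatureMap) -> FeatureMap:
    """Concatenate two feature maps along the channel axis."""
    if not visible or not thermal:
        raise ValueError("both feature maps need at least one channel")
    vis_size = (len(visible[0]), len(visible[0][0]))
    thr_size = (len(thermal[0]), len(thermal[0][0]))
    if vis_size != thr_size:
        raise ValueError("spatial sizes of the two streams must match")
    # Channel-axis concatenation: visible channels first, then thermal.
    return visible + thermal


def scene_weighted_fusion(
    visible: FeatureMap, thermal: FeatureMap, day_weight: float
) -> FeatureMap:
    """Assumed scene-aware variant (hypothetical, not from the chapter).

    The visible stream is scaled by day_weight and the thermal stream by
    1 - day_weight before the two are concatenated.
    """
    if not 0.0 <= day_weight <= 1.0:
        raise ValueError("day_weight must lie in [0, 1]")
    scaled_vis = [[[day_weight * v for v in row] for row in ch] for ch in visible]
    scaled_thr = [
        [[(1.0 - day_weight) * v for v in row] for row in ch] for ch in thermal
    ]
    return concat_fusion(scaled_vis, scaled_thr)


if __name__ == "__main__":
    # Toy 1-channel, 2x2 feature maps for each modality.
    vis = [[[1.0, 2.0], [3.0, 4.0]]]
    thr = [[[10.0, 20.0], [30.0, 40.0]]]
    fused = scene_weighted_fusion(vis, thr, day_weight=0.25)
    print(len(fused), "channels after fusion")
```

In a real detector the weight would presumably be predicted from the input scene rather than passed in by hand; a fixed scalar is used here only to keep the sketch self-contained.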

Original language: English
Title of host publication: Multimodal Scene Understanding
Subtitle of host publication: Algorithms, Applications and Deep Learning
Editors: Michael Ying Yang, Bodo Rosenhahn, Vittorio Murino
Publisher: Elsevier
Chapter: 5
Pages: 101-133
Number of pages: 33
ISBN (Electronic): 9780128173589
DOI: https://doi.org/10.1016/B978-0-12-817358-9.00011-1
Publication status: Published - 2 Aug 2019

Keywords

  • Deep neural networks
  • Multimodal fusion
  • Pedestrian detection
  • Segmentation supervision

  • Cite this

    Guan, D., Yang, J., Cao, Y., Yang, M. Y., & Cao, Y. (2019). Multimodal fusion architectures for pedestrian detection. In M. Y. Yang, B. Rosenhahn, & V. Murino (Eds.), Multimodal Scene Understanding: Algorithms, Applications and Deep Learning (pp. 101-133). Elsevier. https://doi.org/10.1016/B978-0-12-817358-9.00011-1