Abstract
Pedestrian detection provides a crucial functionality in many human-centric applications, such as video surveillance, urban scene analysis, and autonomous driving. Recently, multimodal pedestrian detection has received extensive attention since the fusion of complementary information captured by visible and infrared sensors enables robust human target detection under both daytime and nighttime scenes. In this chapter, we systematically evaluate the performance of different multimodal fusion architectures in order to identify the optimal solutions for pedestrian detection. We make two important observations: (1) it is useful to combine the most commonly used concatenation fusion scheme with a global scene-aware mechanism to learn both human-related features and the correlation between visible and thermal feature maps; (2) two-stream segmentation supervision without multimodal fusion provides the most effective scheme for infusing segmentation information as supervision for learning human-related features. Based on these studies, we present a unified multimodal fusion framework for joint training of target detection and segmentation supervision, which achieves state-of-the-art multimodal pedestrian detection performance on the public KAIST benchmark dataset.
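The sketch below illustrates the kind of fusion block described in observation (1): visible and thermal feature maps are concatenated, and a global scene descriptor re-weights the fused channels before projection. This is a minimal PyTorch sketch under assumed layer names and sizes (`SceneAwareConcatFusion`, 256 channels, a squeeze-and-excitation-style gate), not the chapter's exact architecture.

```python
# Illustrative sketch of concatenation fusion combined with a global
# scene-aware gating; all names and dimensions are assumptions for
# illustration, not the chapter's reported design.
import torch
import torch.nn as nn

class SceneAwareConcatFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        fused = 2 * channels  # visible + thermal feature maps, concatenated
        # Global scene descriptor via global average pooling, followed by a
        # small bottleneck that predicts per-channel weights for the fused maps.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, kernel_size=1),
            nn.Sigmoid(),
        )
        # Project the re-weighted concatenation back to the original width.
        self.project = nn.Conv2d(fused, channels, kernel_size=1)

    def forward(self, feat_vis: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_vis, feat_ir], dim=1)   # concatenation fusion
        x = x * self.gate(x)                        # scene-aware channel re-weighting
        return self.project(x)

# Example: fuse 256-channel visible and thermal feature maps.
if __name__ == "__main__":
    fusion = SceneAwareConcatFusion(channels=256)
    vis = torch.randn(1, 256, 64, 80)
    ir = torch.randn(1, 256, 64, 80)
    print(fusion(vis, ir).shape)  # torch.Size([1, 256, 64, 80])
```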
| Original language | English |
| --- | --- |
| Title of host publication | Multimodal Scene Understanding |
| Subtitle of host publication | Algorithms, Applications and Deep Learning |
| Editors | Michael Ying Yang, Bodo Rosenhahn, Vittorio Murino |
| Publisher | Elsevier |
| Chapter | 5 |
| Pages | 101-133 |
| Number of pages | 33 |
| ISBN (Print) | 978-0-12-817358-9 |
| DOIs | |
| Publication status | Published - 2 Aug 2019 |
Keywords
- Multimodal fusion
- Pedestrian detection
- Segmentation supervision
- Deep neural networks