Locality guided cross-modal feature aggregation and pixel-level fusion for multispectral pedestrian detection

Yanpeng Cao, Xing Luo, Jiangxin Yang, Yanlong Cao*, Michael Ying Yang

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

17 Citations (Scopus)
162 Downloads (Pure)


Multispectral pedestrian detection has received much attention in recent years due to its superiority in detecting targets under adverse lighting/weather conditions. In this paper, we aim to generate highly discriminative multi-modal features by aggregating human-related clues from all available samples present in multispectral images. To this end, we present a novel multispectral pedestrian detector performing locality guided cross-modal feature aggregation and pixel-level detection fusion. Given a number of bounding boxes covering pedestrians in both modalities, we deploy two segmentation sub-branches to predict the existence of pedestrians in the visible and thermal channels. By referring to the important locality information in the reference modality, we perform locality guided cross-modal feature aggregation to learn highly discriminative human-related features in the complementary modality by exploiting the clues of all available pedestrians. Moreover, we utilize the obtained spatial locality maps to provide prediction confidence scores in the visible and thermal channels and conduct pixel-wise adaptive fusion of the detection results of the complementary modalities. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms current state-of-the-art detectors on both the KAIST and CVC-14 multispectral pedestrian detection datasets.
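The pixel-wise adaptive fusion described above can be illustrated with a minimal sketch: per-pixel detection scores from the two modalities are combined using the predicted locality maps as confidence weights. This is an illustrative simplification, not the paper's actual implementation; the function name, array shapes, and normalization scheme are assumptions for demonstration.

```python
import numpy as np

def pixelwise_adaptive_fusion(det_vis, det_thm, conf_vis, conf_thm, eps=1e-8):
    """Illustrative sketch (not the paper's code): fuse per-pixel detection
    scores from the visible and thermal branches, weighting each modality by
    its predicted locality/confidence map. All inputs are H x W arrays."""
    # Normalize the two confidence maps per pixel so the weights sum to 1.
    total = conf_vis + conf_thm + eps
    w_vis = conf_vis / total
    w_thm = conf_thm / total
    # Weighted sum of the per-modality detection scores at every pixel.
    return w_vis * det_vis + w_thm * det_thm

# Toy example: at night the thermal branch is confident, the visible one is not,
# so the fused score leans toward the thermal prediction.
det_vis = np.array([[0.2]])
det_thm = np.array([[0.9]])
conf_vis = np.array([[0.1]])
conf_thm = np.array([[0.9]])
fused = pixelwise_adaptive_fusion(det_vis, det_thm, conf_vis, conf_thm)
```

Here the fused score is dominated by the thermal prediction (about 0.83 rather than the 0.55 a plain average would give), which mirrors the paper's motivation for confidence-weighted rather than uniform fusion.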

Original language: English
Pages (from-to): 1-11
Number of pages: 11
Journal: Information Fusion
Early online date: 13 Jul 2022
Publication status: Published - Dec 2022


  • Deep neural networks
  • Feature aggregation
  • Multispectral fusion
  • Pedestrian detection
  • Pixel-wise guidance

