Multi-Modal Earth Observation and Deep Learning for Urban Scene Understanding

Research output: PhD Thesis - Research UT, graduation UT



This research explores semantic segmentation of remote sensing data through deep learning, with a focus on multi-modal data integration, the impact of label noise, and the need for diverse datasets in Earth Observation (EO). It introduces a novel model named TransFusion, designed to fuse 2D images and 3D point clouds directly, avoiding the complexities common to traditional fusion methods used in the context of semantic segmentation. This approach improved segmentation accuracy, demonstrated by higher mean Intersection over Union (mIoU) scores on the Vaihingen and Potsdam datasets, indicating the model's ability to better interpret spatial and structural information from multi-modal data.
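The mIoU metric reported above averages the per-class intersection-over-union between predicted and reference label maps. A minimal sketch of that computation (the function name and the toy arrays are illustrative, not from the thesis):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy flattened label maps with three classes.
pred   = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, num_classes=3))  # 0.5
```

In benchmark practice the same formula is applied to the accumulated confusion matrix over all test tiles rather than to a single image.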
The study also investigates the effects of label noise (incorrect annotations in training data), a prevalent issue in remote sensing. In experiments with high-resolution aerial images whose labels were intentionally corrupted, label noise was found to affect model performance differently across object classes, with object size significantly influencing a model's ability to tolerate annotation errors. The results show that models are somewhat resilient to random noise, although accuracy declines even with a small proportion of incorrect labels.
Addressing the geographic bias of urban semantic segmentation datasets, which focus primarily on Europe and North America, the research introduces the UAVPal dataset of Bhopal, India. This effort, along with the development of a new dense predictor head for semantic segmentation, aims to better represent diverse urban landscapes globally. The new segmentation head, which efficiently leverages multi-scale features while notably reducing computational demands, showed improved mIoU scores across various classes and datasets.
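A dense predictor head of this kind consumes feature maps at several backbone strides. The core idea of multi-scale fusion can be sketched as upsampling every scale to a common resolution and concatenating channels; this NumPy toy (nearest-neighbour upsampling, invented names and shapes, not the thesis's actual head) shows only that skeleton:

```python
import numpy as np

def upsample(feat, scale):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(scale, axis=1).repeat(scale, axis=2)

def fuse_multiscale(features, base_hw):
    """Bring each (C_i, H_i, W_i) map to base_hw x base_hw resolution
    and concatenate along the channel axis."""
    fused = []
    for feat in features:
        scale = base_hw // feat.shape[1]
        fused.append(upsample(feat, scale))
    return np.concatenate(fused, axis=0)

f1 = np.random.rand(16, 32, 32)  # e.g. stride-4 features
f2 = np.random.rand(32, 16, 16)  # e.g. stride-8 features
f3 = np.random.rand(64, 8, 8)    # e.g. stride-16 features
out = fuse_multiscale([f1, f2, f3], base_hw=32)
print(out.shape)  # (112, 32, 32)
```

A learned head would follow the concatenation with convolutions and a per-pixel classifier; efficiency gains like those reported in the thesis come from how that subsequent processing is structured, which this sketch does not model.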
Overall, the study contributes to the field of semantic segmentation for EO by improving data fusion methods, offering insights into the effects of label noise, and encouraging the inclusion of diverse geographic data for broader representation. These efforts are steps toward more accurate and efficient remote sensing applications.
Original language: English
Qualification: Doctor of Philosophy
Awarding institution: University of Twente
Supervisor: Vosselman, George
Co-supervisor: Oude Elberink, Sander
Award date: 9 Apr 2024
Place of publication: Enschede, The Netherlands
Print ISBN: 978-90-365-6028-3
Electronic ISBN: 978-90-365-6029-0
Publication status: Published - 9 Apr 2024


  • Earth Observation
  • Deep Learning
  • Data Fusion
  • Point Cloud
  • Photogrammetry
  • Computer Vision
  • Remote Sensing
  • Scene Understanding

