Scene understanding is a fundamental research field in computer vision with many applications in photogrammetry and remote sensing. It focuses on locating and classifying objects in images and on understanding the relationships between them. The higher goal is to interpret what event happens in the scene, when and why it happens, and what we should do based on that information. Dynamic scene understanding uses information from different points in time to interpret scenes and answer these questions. In modern scene understanding, deep learning has shown great potential for such tasks. "Deep" in deep learning refers to the use of multiple layers in neural networks. Deep neural networks are powerful because they are highly non-linear functions that, after proper training, can map from one domain to a quite different one. Deep learning is currently the best-performing solution for many fundamental research tasks in scene understanding, and this PhD research also takes advantage of it for dynamic scene understanding. Temporal information plays an important role in dynamic scene understanding. Compared with static scene understanding from single images, information distilled from the time dimension adds value in several ways. Consecutive frames are highly correlated, i.e., objects observed in one frame are very likely to be observed and identified in nearby frames as well. Such redundancy in observation can reduce the uncertainty of object recognition with deep learning based methods, resulting in more consistent inference. High correlation across frames can also improve the chance of recognizing objects correctly. If the camera or the object moves, the object can be observed from multiple views with different poses and appearances.
The information captured for object recognition is then more diverse and complementary, and it can be aggregated to jointly infer the categories and properties of objects. This PhD research involves several tasks related to dynamic scene understanding in computer vision, including semantic segmentation for aerial platform images (Chapters 2 and 3), video object segmentation and video object detection for common objects in natural scenes (Chapters 4 and 5), and multi-object tracking and segmentation for cars and pedestrians in driving scenes (Chapter 6). Chapter 2 investigates how to establish a semantic segmentation benchmark for UAV images, which includes data collection, data labeling, dataset construction, and performance evaluation with baseline deep neural networks and the proposed multi-scale dilation net. A conditional random field with feature space optimization is used to achieve consistent semantic segmentation predictions in videos. Chapter 3 investigates how to better extract scene context information for better object recognition performance by proposing novel bidirectional multiscale attention networks. These achieve better performance by inferring features and attention weights for feature fusion from both higher-level and lower-level branches. Chapter 4 investigates how to simultaneously segment multiple objects across multiple frames by combining memory modules with instance segmentation networks. The method learns to propagate the target object labels without auxiliary data such as optical flow, which simplifies the model. Chapter 5 investigates how to improve the performance of well-trained object detectors with a lightweight and efficient plug-and-play tracker for object detection in video. This chapter also investigates how the proposed model performs when video training data is lacking. Chapter 6 investigates how to improve the performance of detection, segmentation, and tracking by jointly considering top-down and bottom-up inference.
The whole pipeline follows a multi-task design, i.e., a single feature extraction backbone with multiple heads for different sub-tasks. Overall, this manuscript delves into several computer vision tasks that share fundamental research problems, including detection, segmentation, and tracking. Based on the research experiments and knowledge from the literature review, several reflections regarding dynamic scene understanding are discussed: the range of object context influences the quality of object recognition; the quality of video data affects the choice of method for a specific computer vision task; and detection and tracking are complementary to each other. For future work, a unified dynamic scene understanding task could become a trend, and transformers combined with self-supervised learning are one promising research direction. Real-time processing for dynamic scene understanding requires further research before the methods can be used in real-world applications.
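As an illustration of the multi-task design mentioned for Chapter 6 (one shared feature extraction backbone feeding several task heads), the following is a minimal, hypothetical sketch and not the network used in the thesis. The class name `MultiTaskModel`, the three head names, the layer sizes, and the random weights are all assumptions chosen for illustration; a real model would use convolutional layers and trained parameters.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class MultiTaskModel:
    """Hypothetical toy model: one shared backbone, three task heads."""

    def __init__(self, in_dim: int, feat_dim: int, num_classes: int,
                 embed_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Shared feature extractor (stands in for a deep backbone).
        self.backbone = random_matrix(feat_dim, in_dim, rng)
        # Task-specific heads that all read the same features.
        self.det_head = random_matrix(num_classes, feat_dim, rng)
        self.seg_head = random_matrix(1, feat_dim, rng)
        self.track_head = random_matrix(embed_dim, feat_dim, rng)

    def forward(self, x: Vector) -> Dict[str, object]:
        feats = relu(linear(self.backbone, x))  # computed once, reused by every head
        mask_logit = linear(self.seg_head, feats)[0]
        return {
            "class_probs": softmax(linear(self.det_head, feats)),  # detection head
            "mask_prob": 1.0 / (1.0 + math.exp(-mask_logit)),      # segmentation head
            "embedding": linear(self.track_head, feats),           # tracking head
        }


if __name__ == "__main__":
    model = MultiTaskModel(in_dim=4, feat_dim=8, num_classes=3, embed_dim=2)
    print(model.forward([0.1, 0.2, 0.3, 0.4]))
```

The point of the design is that the comparatively expensive feature computation happens once per input, while each sub-task adds only a small head on top of the shared features.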
|Qualification|Doctor of Philosophy|
|Award date|8 Sept 2021|
|Place of Publication|Enschede|
|Publication status|Published - 8 Sept 2021|