TY - JOUR
T1 - AST: Adaptive Self-supervised Transformer for optical remote sensing representation
AU - He, Qibin
AU - Sun, Xian
AU - Yan, Zhiyuan
AU - Wang, Bing
AU - Zhu, Zicong
AU - Diao, Wenhui
AU - Yang, Michael Ying
N1 - Publisher Copyright:
© 2023 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS)
PY - 2023/6
Y1 - 2023/6
N2 - Owing to variation in spatial resolution and the diversity of object scales, interpreting optical remote sensing images is extremely challenging. Deep learning has become the mainstream solution for interpreting such complex scenes. However, the explosion of deep learning model architectures has created a demand for hundreds of millions of remote sensing images, for which labels are costly to obtain and often not publicly available. This paper provides an in-depth analysis of the main reasons for this data thirst: (i) limited representational power for model learning, and (ii) underutilization of unlabeled remote sensing data. To overcome these difficulties, we present a scalable and adaptive self-supervised Transformer (AST) for optical remote sensing image interpretation. By performing masked image modeling during pre-training, the proposed AST unlocks the rich supervision signals in massive unlabeled remote sensing data and learns useful multi-scale semantics. Specifically, a cross-scale Transformer architecture with a pyramid structure is designed to jointly learn global dependencies and local details, facilitating multi-granular feature interaction and yielding scale-invariant representations. Furthermore, a token masking strategy based on correlation mapping is proposed to adaptively mask a subset of patches without disrupting key structures, which improves the understanding of visually important regions. Extensive experiments on various optical remote sensing interpretation tasks show that AST achieves strong generalization and competitive performance.
AB - Owing to variation in spatial resolution and the diversity of object scales, interpreting optical remote sensing images is extremely challenging. Deep learning has become the mainstream solution for interpreting such complex scenes. However, the explosion of deep learning model architectures has created a demand for hundreds of millions of remote sensing images, for which labels are costly to obtain and often not publicly available. This paper provides an in-depth analysis of the main reasons for this data thirst: (i) limited representational power for model learning, and (ii) underutilization of unlabeled remote sensing data. To overcome these difficulties, we present a scalable and adaptive self-supervised Transformer (AST) for optical remote sensing image interpretation. By performing masked image modeling during pre-training, the proposed AST unlocks the rich supervision signals in massive unlabeled remote sensing data and learns useful multi-scale semantics. Specifically, a cross-scale Transformer architecture with a pyramid structure is designed to jointly learn global dependencies and local details, facilitating multi-granular feature interaction and yielding scale-invariant representations. Furthermore, a token masking strategy based on correlation mapping is proposed to adaptively mask a subset of patches without disrupting key structures, which improves the understanding of visually important regions. Extensive experiments on various optical remote sensing interpretation tasks show that AST achieves strong generalization and competitive performance.
KW - Interpretation
KW - Masked image modeling
KW - Optical remote sensing
KW - Representation learning
KW - Cross-scale transformer
U2 - 10.1016/j.isprsjprs.2023.04.003
DO - 10.1016/j.isprsjprs.2023.04.003
M3 - Article
AN - SCOPUS:85156272002
SN - 0924-2716
VL - 200
SP - 41
EP - 54
JO - ISPRS Journal of Photogrammetry and Remote Sensing
JF - ISPRS Journal of Photogrammetry and Remote Sensing
ER -