Image captioning through image transformer

Sen He, Wentong Liao*, Hamed R. Tavakoli, Michael Ying Yang, Bodo Rosenhahn, Nicolas Pugeault

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

16 Citations (Scopus)
156 Downloads (Pure)


Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous works have proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) differs from the structure between those in sentences (individual words). Limited work has been done to adapt the transformer’s internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions. Our design widens the original transformer layer’s inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks. The code is available at

Original language: English
Title of host publication: Computer Vision – ACCV 2020
Subtitle of host publication: 15th Asian Conference on Computer Vision, 2020, Revised Selected Papers
Editors: Hiroshi Ishikawa, Cheng-Lin Liu, Tomas Pajdla, Jianbo Shi
Number of pages: 17
ISBN (Print): 9783030695378
Publication status: Published - 25 Feb 2021
Event: 15th Asian Conference on Computer Vision, ACCV 2020 - Virtual, Online
Duration: 30 Nov 2020 – 4 Dec 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12625 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 15th Asian Conference on Computer Vision, ACCV 2020
City: Virtual, Online

