Exploiting attention for visual relationship detection

Tongxin Hu*, Wentong Liao, Michael Ying Yang, Bodo Rosenhahn

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

3 Citations (Scopus)
16 Downloads (Pure)


Visual relationship detection aims to predict the categories of predicates and object pairs, and to localize the object pairs. Recognizing the relationships between individual objects is important for describing visual scenes in static images. In this paper, we propose a novel end-to-end framework for the visual relationship detection task. First, we design a spatial attention model for specializing predicate features; compared to a standard ROI-pooling layer, this structure significantly improves Predicate Classification performance. Second, to extract the relative spatial configuration, we propose mapping simple geometric representations to a high-dimensional space, which boosts relationship detection accuracy. Third, we implement a feature embedding model with a bidirectional RNN that treats subject, predicate and object as a sequence. We evaluate our method on three tasks, and the experiments demonstrate that it achieves competitive results compared to state-of-the-art methods.
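The second contribution, mapping simple geometric representations of a subject/object box pair into a high-dimensional space, can be illustrated with a minimal sketch. The four relative-geometry terms and the sinusoidal encoding below are assumptions for illustration (the paper may use different features or a learned mapping); the function names `relative_geometry` and `high_dim_embedding` are hypothetical.

```python
import numpy as np

def relative_geometry(box_s, box_o):
    """Relative spatial configuration of a subject/object box pair.

    Boxes are (x, y, w, h). These four scale-invariant terms are a
    common choice in the literature; the paper's exact features may differ.
    """
    xs, ys, ws, hs = box_s
    xo, yo, wo, ho = box_o
    return np.array([
        (xo - xs) / ws,    # relative x offset, normalized by subject width
        (yo - ys) / hs,    # relative y offset, normalized by subject height
        np.log(wo / ws),   # log width ratio
        np.log(ho / hs),   # log height ratio
    ])

def high_dim_embedding(geom, dim=64, wave=1000.0):
    """Map the 4-d geometry vector to a `dim`-dimensional vector using
    sinusoidal encodings at multiple frequencies (one plausible way to
    realize the 'map to a high dimension' step; a learned MLP is another).
    """
    per_term = dim // (2 * len(geom))          # sin/cos pair per frequency
    freqs = wave ** (np.arange(per_term) / per_term)
    feats = []
    for g in geom:
        feats.append(np.sin(g * freqs))
        feats.append(np.cos(g * freqs))
    return np.concatenate(feats)               # shape: (dim,)

# Example: subject box and object box in (x, y, w, h) form.
emb = high_dim_embedding(relative_geometry((10, 20, 50, 80), (30, 25, 40, 60)))
```

The intuition is the same as positional encodings: a low-dimensional geometric signal becomes easier for a classifier to separate once spread across many frequency channels.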

Original language: English
Title of host publication: Pattern Recognition
Subtitle of host publication: 41st DAGM German Conference, DAGM GCPR 2019, Dortmund, Germany, September 10–13, 2019, Proceedings
Editors: Gernot A. Fink, Simone Frintrop, Xiaoyi Jiang
Place of Publication: Cham
Number of pages: 14
ISBN (Electronic): 978-3-030-33676-9
ISBN (Print): 978-3-030-33675-2
Publication status: Published - 25 Oct 2019
Event: 41st DAGM German Conference on Pattern Recognition, DAGM GCPR 2019 - Dortmund, Germany
Duration: 10 Sep 2019 – 13 Sep 2019

Publication series

Name: Lecture Notes in Computer Science
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 41st DAGM German Conference on Pattern Recognition, DAGM GCPR 2019

