Predicting semantic labels of text regions in heterogeneous document images

Somtochukwu Enendu, Johannes Scholtes, Jeroen Smeets, Djoerd Hiemstra, Mariet Theune

Research output: Contribution to conference › Paper › peer-review


This paper describes the use of sequence labeling methods to predict the semantic labels of text regions extracted from heterogeneous electronic documents, using features related to each semantic label. In this study, we construct a novel dataset consisting of real-world documents from multiple domains. We test the performance of the methods on this dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the neural network method slightly outperforms the Conditional Random Field method when only limited training data is available. Regarding generalizability, our experiments show that including textual features improves performance.
Original language: English
Number of pages: 9
Publication status: Published - 2019
Event: 15th Conference on Natural Language Processing, KONVENS 2019: Bridging the gap between NLP and human understanding - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Duration: 9 Oct 2019 to 11 Oct 2019
Conference number: 15


Conference: 15th Conference on Natural Language Processing, KONVENS 2019
Abbreviated title: KONVENS 2019

