Predicting semantic labels of text regions in heterogeneous document images

Somtochukwu Enendu, Johannes Scholtes, Jeroen Smeets, Djoerd Hiemstra, Mariet Theune

Research output: Contribution to conference (Paper)

Abstract

This paper describes the use of sequence labeling methods to predict the semantic labels of text regions extracted from heterogeneous electronic documents, using features related to each semantic label. In this study, we construct a novel dataset consisting of real-world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the neural network method slightly outperforms the Conditional Random Field method when limited training data is available. Regarding generalizability, our experiments show that including textual features improves performance.
Original language: English
Pages: 203-11
Number of pages: 9
Publication status: Published - 2019
Event: 15th Conference on Natural Language Processing, KONVENS 2019: Bridging the gap between NLP and human understanding - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Duration: 9 Oct 2019 to 11 Oct 2019
Conference number: 15

Conference

Conference: 15th Conference on Natural Language Processing, KONVENS 2019
Abbreviated title: KONVENS 2019
Country: Germany
City: Erlangen
Period: 9/10/19 to 11/10/19

Cite this

Enendu, S., Scholtes, J., Smeets, J., Hiemstra, D., & Theune, M. (2019). Predicting semantic labels of text regions in heterogeneous document images. 203-11. Paper presented at 15th Conference on Natural Language Processing, KONVENS 2019, Erlangen, Germany.