This paper describes the use of sequence labeling methods to predict the semantic labels of text regions extracted from heterogeneous electronic documents, using features related to each semantic label. We construct a novel dataset of real-world documents from multiple domains, evaluate the methods on it, and offer a novel investigation into the influence of textual features on performance across domains. The experiments show that the neural network method slightly outperforms the Conditional Random Field method when only limited training data is available. Regarding generalizability, our experiments show that including textual features improves performance.
|Number of pages||9|
|Publication status||Published - 2019|
|Event||15th Conference on Natural Language Processing, KONVENS 2019: Bridging the gap between NLP and human understanding - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Duration: 9 Oct 2019 → 11 Oct 2019; Conference number: 15|
|Conference||15th Conference on Natural Language Processing, KONVENS 2019|
|Abbreviated title||KONVENS 2019|
|Period||9/10/19 → 11/10/19|
Enendu, S., Scholtes, J., Smeets, J., Hiemstra, D., & Theune, M. (2019). Predicting semantic labels of text regions in heterogeneous document images. 203-211. Paper presented at 15th Conference on Natural Language Processing, KONVENS 2019, Erlangen, Germany.