Towards a Digital Infrastructure for Illustrated Handwritten Archives

Andreas Weber, Mahya Ameryan, Katherine Wolstencroft, Lise Stork, Maarten Heerlien, Lambert Schomaker

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Large and important parts of cultural heritage are stored in archives that are difficult to access, even after digitization. Documents and notes are written in hard-to-read historical handwriting and are often interspersed with illustrations. Such collections are weakly structured and largely inaccessible to scholars and a wider public. Traditionally, humanities researchers treat text and images separately. This separation extends to traditional handwriting recognition systems. Many of them use a segmentation-free OCR approach, which can only resolve manuscripts that are homogeneous in layout, style and linguistic content. In contrast, our infrastructure aims to resolve heterogeneous handwritten manuscript pages in which different scripts and images are closely intertwined. The authors in our use case, a 17,000-page account of the exploration of the Indonesian Archipelago between 1820 and 1850 (“Natuurkundige Commissie voor Nederlands-Indië”), tried to follow a semantic way of recording their knowledge and observations; however, this discipline does not exist in the handwriting script. The use of different languages, such as German, Latin, Dutch, Malay, Greek, and French, makes interpretation more challenging. Our infrastructure takes the state-of-the-art word retrieval system MONK as its starting point. Owing to its visual approach, MONK can handle the diversity of material we encounter in our use case and many other historical collections: text, drawings and images. By combining text and image recognition, we go significantly beyond the state of the art and provide meaningful additions to integrated manuscript recognition. This paper describes the infrastructure and presents early results.

Keywords: Deep learning · Digital heritage · Natural history · Biodiversity heritage
Language: English
Title of host publication: Digital Cultural Heritage
Editors: Marinos Ioannides
Publisher: Springer International
Pages: 155-166
Number of pages: 12
ISBN (Electronic): 978-3-319-75826-8
ISBN (Print): 978-3-319-75825-1
DOI: 10.1007/978-3-319-75826-8_13
State: Published - Mar 2018

Publication series

Name: Lecture Notes in Computer Science LNCS
Publisher: Springer
Volume: 10605

Keywords

  • deep learning
  • Digital Heritage
  • Natural History
  • Biodiversity heritage

Cite this

Weber, A., Ameryan, M., Wolstencroft, K., Stork, L., Heerlien, M., & Schomaker, L. (2018). Towards a Digital Infrastructure for Illustrated Handwritten Archives. In M. Ioannides (Ed.), Digital Cultural Heritage (pp. 155-166). (Lecture Notes in Computer Science LNCS; Vol. 10605). Springer International. DOI: 10.1007/978-3-319-75826-8_13
Weber, Andreas ; Ameryan, Mahya ; Wolstencroft, Katherine ; Stork, Lise ; Heerlien, Maarten ; Schomaker, Lambert. / Towards a Digital Infrastructure for Illustrated Handwritten Archives. Digital Cultural Heritage. editor / Marinos Ioannides. Springer International, 2018. pp. 155-166 (Lecture Notes in Computer Science LNCS).
@inbook{854499a7afdc461d8e2b2c439450aa47,
title = "Towards a Digital Infrastructure for Illustrated Handwritten Archives",
abstract = "Large and important parts of cultural heritage are stored in archives that are difficult to access, even after digitization. Documents and notes are written in hard-to-read historical handwriting and are often interspersed with illustrations. Such collections are weakly structured and largely inaccessible to scholars and a wider public. Traditionally, humanities researchers treat text and images separately. This separation extends to traditional handwriting recognition systems. Many of them use a segmentation-free OCR approach, which can only resolve manuscripts that are homogeneous in layout, style and linguistic content. In contrast, our infrastructure aims to resolve heterogeneous handwritten manuscript pages in which different scripts and images are closely intertwined. The authors in our use case, a 17,000-page account of the exploration of the Indonesian Archipelago between 1820 and 1850 (“Natuurkundige Commissie voor Nederlands-Indi{\"e}”), tried to follow a semantic way of recording their knowledge and observations; however, this discipline does not exist in the handwriting script. The use of different languages, such as German, Latin, Dutch, Malay, Greek, and French, makes interpretation more challenging. Our infrastructure takes the state-of-the-art word retrieval system MONK as its starting point. Owing to its visual approach, MONK can handle the diversity of material we encounter in our use case and many other historical collections: text, drawings and images. By combining text and image recognition, we go significantly beyond the state of the art and provide meaningful additions to integrated manuscript recognition. This paper describes the infrastructure and presents early results. Keywords: Deep learning · Digital heritage · Natural history · Biodiversity heritage",
keywords = "deep learning, Digital Heritage, Natural History, Biodiversity heritage",
author = "Andreas Weber and Mahya Ameryan and Katherine Wolstencroft and Lise Stork and Maarten Heerlien and Lambert Schomaker",
year = "2018",
month = "3",
doi = "10.1007/978-3-319-75826-8_13",
language = "English",
isbn = "978-3-319-75825-1",
series = "Lecture Notes in Computer Science LNCS",
volume = "10605",
publisher = "Springer International",
pages = "155--166",
editor = "Ioannides, Marinos",
booktitle = "Digital Cultural Heritage",
}

Weber, A, Ameryan, M, Wolstencroft, K, Stork, L, Heerlien, M & Schomaker, L 2018, Towards a Digital Infrastructure for Illustrated Handwritten Archives. in M Ioannides (ed.), Digital Cultural Heritage. Lecture Notes in Computer Science LNCS, vol. 10605, Springer International, pp. 155-166. DOI: 10.1007/978-3-319-75826-8_13

TY  - CHAP
T1  - Towards a Digital Infrastructure for Illustrated Handwritten Archives
AU  - Weber, Andreas
AU  - Ameryan, Mahya
AU  - Wolstencroft, Katherine
AU  - Stork, Lise
AU  - Heerlien, Maarten
AU  - Schomaker, Lambert
PY  - 2018/3
Y1  - 2018/3
AB  - Large and important parts of cultural heritage are stored in archives that are difficult to access, even after digitization. Documents and notes are written in hard-to-read historical handwriting and are often interspersed with illustrations. Such collections are weakly structured and largely inaccessible to scholars and a wider public. Traditionally, humanities researchers treat text and images separately. This separation extends to traditional handwriting recognition systems. Many of them use a segmentation-free OCR approach, which can only resolve manuscripts that are homogeneous in layout, style and linguistic content. In contrast, our infrastructure aims to resolve heterogeneous handwritten manuscript pages in which different scripts and images are closely intertwined. The authors in our use case, a 17,000-page account of the exploration of the Indonesian Archipelago between 1820 and 1850 (“Natuurkundige Commissie voor Nederlands-Indië”), tried to follow a semantic way of recording their knowledge and observations; however, this discipline does not exist in the handwriting script. The use of different languages, such as German, Latin, Dutch, Malay, Greek, and French, makes interpretation more challenging. Our infrastructure takes the state-of-the-art word retrieval system MONK as its starting point. Owing to its visual approach, MONK can handle the diversity of material we encounter in our use case and many other historical collections: text, drawings and images. By combining text and image recognition, we go significantly beyond the state of the art and provide meaningful additions to integrated manuscript recognition. This paper describes the infrastructure and presents early results. Keywords: Deep learning · Digital heritage · Natural history · Biodiversity heritage
KW  - deep learning
KW  - Digital Heritage
KW  - Natural History
KW  - Biodiversity heritage
UR  - https://www.springer.com/gb/book/9783319758251#aboutBook
U2  - 10.1007/978-3-319-75826-8_13
DO  - 10.1007/978-3-319-75826-8_13
M3  - Chapter
SN  - 978-3-319-75825-1
T3  - Lecture Notes in Computer Science LNCS
SP  - 155
EP  - 166
BT  - Digital Cultural Heritage
PB  - Springer International
ER  - 

Weber A, Ameryan M, Wolstencroft K, Stork L, Heerlien M, Schomaker L. Towards a Digital Infrastructure for Illustrated Handwritten Archives. In Ioannides M, editor, Digital Cultural Heritage. Springer International. 2018. p. 155-166. (Lecture Notes in Computer Science LNCS). doi: 10.1007/978-3-319-75826-8_13