An exploration of language identification techniques for the Dutch folktale database

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

Abstract

The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages, including (Middle and 17th-century) Dutch, Frisian, and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that, in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low. The studied dataset, consisting of over 39,000 documents in 16 languages and dialects, is available on request for follow-up research.

Original language: Undefined
Title of host publication: Proceedings of the Workshop on Adaptation of Language Resources and Tools for Processing Cultural Heritage (LREC 2012)
Editors: P. Osenova, S. Piperidis, M. Slavcheva, C. Vertan
Place of Publication: Istanbul, Turkey
Publisher: LREC organization
Pages: 47-51
Number of pages: 5
ISBN (Print): not assigned
Publication status: Published - 26 May 2012

Keywords

  • METIS-289720
  • IR-82013
  • EWI-22321
  • Text classification
  • HMI-SLT: Speech and Language Technology
  • Language detection

Cite this

Trieschnigg, R. B., Hiemstra, D., Theune, M., de Jong, F. M. G., & Meder, T. (2012). An exploration of language identification techniques for the Dutch folktale database. In P. Osenova, S. Piperidis, M. Slavcheva, & C. Vertan (Eds.), Proceedings of the Workshop on Adaptation of Language Resources and Tools for Processing Cultural Heritage (LREC 2012) (pp. 47-51). Istanbul, Turkey: LREC organization.