From D-Coi to SoNaR: A reference corpus for Dutch

N. Oostdijk, M. Reynaert, P. Monachesi, G. van Noord, Roeland J.F. Ordelman, I. Schuurman, V. Vandeghinste

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

    19 Citations (Scopus)

    Abstract

    The computational linguistics community in the Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project, or D-Coi, was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once the corpus had been converted to a suitable XML format, further linguistic annotation, based on the state-of-the-art tools developed by the consortium partners either before or during the pilot, proved easy and fruitful to apply. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.
    Original language: Undefined
    Title of host publication: Proceedings on the sixth international conference on language resources and evaluation (LREC 2008)
    Publisher: ELRA
    Pages: 1437-1444
    Number of pages: 8
    ISBN (Print): 2-9517408-4-0
    Publication status: Published - 31 May 2008
    Event: 6th International Conference on Language Resources and Evaluation 2008 - Marrakech, Morocco
    Duration: 28 May 2008 to 30 May 2008
    Conference number: 6
    http://www.lrec-conf.org/lrec2008/

    Publication series

    Publisher: ELRA
    Number: 07-04

    Conference

    Conference: 6th International Conference on Language Resources and Evaluation 2008
    Abbreviated title: LREC 2008
    Country: Morocco
    City: Marrakech
    Period: 28/05/08 to 30/05/08

    Keywords

    • HMI-SLT: Speech and Language Technology
    • EWI-15099
    • LR national/international projects
    • METIS-255893
    • organizational/policy issues
    • IR-62741
    • Corpus (creation, annotation, etc.)
    • Standards for LRs

    Cite this

    Oostdijk, N., Reynaert, M., Monachesi, P., van Noord, G., Ordelman, R. J. F., Schuurman, I., & Vandeghinste, V. (2008). From D-Coi to SoNaR: A reference corpus for Dutch. In Proceedings on the sixth international conference on language resources and evaluation (LREC 2008) (pp. 1437-1444). ELRA.