Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

Martin Reynaert, Nelleke Oostdijk, Orphée De Clercq, Henk van den Heuvel, Franciska M.G. de Jong

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review



    In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Besides its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts (SMS, chat) the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller (of commissioned size), more privacy-compliant SoNaR, IPR-cleared for commercial purposes as well.
    Original language: Undefined
    Title of host publication: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)
    Place of Publication: Paris
    Publisher: European Language Resources Association (ELRA)
    Number of pages: 6
    ISBN (Print): 2-9517408-6-7
    Publication status: Published - May 2010
    Event: 7th International Conference on Language Resources and Evaluation 2010 - Mediterranean Conference Centre, Valletta, Malta
    Duration: 19 May 2010 to 21 May 2010
    Conference number: 7



    Conference: 7th International Conference on Language Resources and Evaluation 2010
    Abbreviated title: LREC 2010


    • IR-72111
    • METIS-270850
    • Corpus design
    • natural language processing
    • Dutch language
    • EWI-18001
