TwNC: a Multifaceted Dutch News Corpus

    Research output: Contribution to journal › Article › Academic

    Abstract

    This contribution describes the Twente News Corpus (TwNC), a multifaceted corpus for Dutch that is being deployed in a number of NLP research projects, among them tracks within the Dutch national research programme MultimediaN, the NWO programme CATCH, and the Dutch-Flemish programme STEVIN. Development of the corpus started in 1998 within a predecessor project, DRUID, and the corpus currently has a size of 530M words. The text part has been built from four different sources: Dutch national newspapers, television subtitles, teleprompter (autocue) files, and both manually and automatically generated broadcast news transcripts, along with the broadcast news audio. TwNC plays a crucial role in the development and evaluation of a wide range of tools and applications for the domain of multimedia indexing, such as large-vocabulary speech recognition, cross-media indexing, and cross-language information retrieval. Part of the corpus was fed into the Dutch written text corpus in the context of the Dutch-Belgian STEVIN project D-COI, which was completed in 2007. The sections below describe the rationale that was the starting point for the corpus development, outline the cross-media linking approach adopted within MultimediaN, and finally provide some facts and figures about the corpus.
    Original language: Undefined
    Number of pages: 10
    Journal: ELRA Newsletter
    Volume: 12
    Issue number: 3-4
    Publication status: Published - 2007

    Keywords

    • HMI-MR: MULTIMEDIA RETRIEVAL
    • EWI-15098
    • IR-68090
    • Speech Recognition
    • Text corpora
    • Multimedia Retrieval

    Cite this

    @article{42e3c5016cab421281a9029a774fffae,
    title = "TwNC: a Multifaceted Dutch News Corpus",
    abstract = "This contribution describes the Twente News Corpus (TwNC), a multifaceted corpus for Dutch that is being deployed in a number of NLP research projects among which tracks within the Dutch national research programme MultimediaN, the NWO programme CATCH, and the Dutch-Flemish programme STEVIN. The development of the corpus started in 1998 within a predecessor project DRUID and has currently a size of 530M words. The text part has been built from texts of four different sources: Dutch national newspapers, television subtitles, teleprompter (auto-cues) files, and both manually and automatically generated broadcast news transcripts along with the broadcast news audio. TwNC plays a crucial role in the development and evaluation of a wide range of tools and applications for the domain of multimedia indexing, such as large vocabulary speech recognition, cross-media indexing, cross-language information retrieval etc. Part of the corpus was fed into the Dutch written text corpus in the context of the Dutch-Belgian STEVIN project D-COI that was completed in 2007. The sections below will describe the rationale that was the starting point for the corpus development; it will outline the cross-media linking approach adopted within MultimediaN, and finally provide some facts and figures about the corpus.",
    keywords = "HMI-MR: MULTIMEDIA RETRIEVAL, EWI-15098, IR-68090, Speech Recognition, Text corpora, Multimedia Retrieval",
    author = "Ordelman, {Roeland J.F.} and {de Jong}, {Franciska M.G.} and {van Hessen}, {Adrianus J.} and G.H.W. Hondorp",
    year = "2007",
    language = "Undefined",
    volume = "12",
    journal = "ELRA Newsletter",
    number = "3-4",

    }

    TwNC: a Multifaceted Dutch News Corpus. / Ordelman, Roeland J.F.; de Jong, Franciska M.G.; van Hessen, Adrianus J.; Hondorp, G.H.W.

    In: ELRA Newsletter, Vol. 12, No. 3-4, 2007.

    TY - JOUR

    T1 - TwNC: a Multifaceted Dutch News Corpus

    AU - Ordelman, Roeland J.F.

    AU - de Jong, Franciska M.G.

    AU - van Hessen, Adrianus J.

    AU - Hondorp, G.H.W.

    PY - 2007

    Y1 - 2007

    N2 - This contribution describes the Twente News Corpus (TwNC), a multifaceted corpus for Dutch that is being deployed in a number of NLP research projects among which tracks within the Dutch national research programme MultimediaN, the NWO programme CATCH, and the Dutch-Flemish programme STEVIN. The development of the corpus started in 1998 within a predecessor project DRUID and has currently a size of 530M words. The text part has been built from texts of four different sources: Dutch national newspapers, television subtitles, teleprompter (auto-cues) files, and both manually and automatically generated broadcast news transcripts along with the broadcast news audio. TwNC plays a crucial role in the development and evaluation of a wide range of tools and applications for the domain of multimedia indexing, such as large vocabulary speech recognition, cross-media indexing, cross-language information retrieval etc. Part of the corpus was fed into the Dutch written text corpus in the context of the Dutch-Belgian STEVIN project D-COI that was completed in 2007. The sections below will describe the rationale that was the starting point for the corpus development; it will outline the cross-media linking approach adopted within MultimediaN, and finally provide some facts and figures about the corpus.

    KW - HMI-MR: MULTIMEDIA RETRIEVAL

    KW - EWI-15098

    KW - IR-68090

    KW - Speech Recognition

    KW - Text corpora

    KW - Multimedia Retrieval

    M3 - Article

    VL - 12

    JO - ELRA Newsletter

    JF - ELRA Newsletter

    IS - 3-4

    ER -