Abstract
This contribution describes the Twente News Corpus (TwNC), a multifaceted corpus for Dutch that is being deployed in a number of NLP research projects, including tracks within the Dutch national research programme MultimediaN, the NWO programme CATCH, and the Dutch-Flemish programme STEVIN.
Development of the corpus started in 1998 within a predecessor project, DRUID, and the corpus currently has a size of 530M words. The text part has been built from texts
of four different sources: Dutch national newspapers, television subtitles, teleprompter
(autocue) files, and both manually and automatically generated broadcast news transcripts
along with the broadcast news audio. TwNC plays a crucial role in the
development and evaluation of a wide range of tools and applications for the
domain of multimedia indexing, such as large vocabulary speech recognition,
cross-media indexing, and cross-language information retrieval. Part of the corpus was fed into
the Dutch written text corpus in the context of the Dutch-Belgian STEVIN project D-COI that
was completed in 2007. The sections below describe the rationale that was the starting point
for the corpus development, outline the cross-media linking approach adopted within
MultimediaN, and finally provide some facts and figures about the corpus.
Original language | English |
---|---|
Number of pages | 10 |
Journal | ELRA Newsletter |
Volume | 12 |
Issue number | 3-4 |
Publication status | Published - 2007 |
Keywords
- HMI-MR: MULTIMEDIA RETRIEVAL
- EWI-15098
- IR-68090
- Speech Recognition
- Text corpora
- Multimedia Retrieval
Datasets
- Twente Nieuws Corpus (TwNC)
Ordelman, R. J. F. (Creator) & Hondorp, G. H. W. (Data Collector), Centre for Telematics and Information Technology (CTIT), 1 Jan 2003
Dataset