As data storage capacities grow to nearly unlimited sizes thanks to ever ongoing hardware and software improvements, an increasing amount of information is being stored in multimedia and spoken-word collections. Assuming that the intention of data storage is to use (portions of) it some later time, these collections must also be searchable in one way or another. For multimedia and spoken-word collections, traditional text-oriented information retrieval (IR) strategies inevitably fall short, as the amount of textual information included with these types of documents is usually very limited. However, when automatic speech recognition (ASR) can be used to convert the speech occurring in these documents into text, textual representations can be created that in turn can be searched using the traditional text-based search strategies. As ASR systems label recognized words with exact time information as a standard accessory, detailed searching within multimedia and spoken-word collections can be enabled. This type of retrieval is commonly referred to as Spoken Document Retrieval (SDR). Typically, large vocabulary speaker independent continuous speech recognition systems (LVCSR) are deployed for creating textual representations of the spoken audio in multimedia an spoken-word collections. For Dutch however, such a system was not available when this research was started. As creating a Dutch system from scratch was not feasible given the available resources, an existing English system, refered to as the ABBOT system, was ported to Dutch. A significant part of this thesis is dedicated to a complete run-down of the porting work, involving the collection and preparation of suitable training data and the actual training and evaluation of the acoustic models and language models. The broadcast news domain was chosen as domain of focus, as this domain has also been extensively used as a benchmark domain for both international ASR research and SDR. A complicating factor for ASR in the news domain, is that word usage is highly variable. As a consequence, besides using large vocabularies, it is important to adjust these vocabularies regularly, so that they reflect the content of the news programs well. Therefore, it has been investigated which word selection strategies are best suited for making these vocabulary adjustments. Moreover, as dynamic vocabularies require a flexible generation of accurate word pronunciations, the development of a grapheme-to-phoneme converter is addressed. Another vocabulary related issue that is investigated, stems from a well-known characteristic of the Dutch language, word compounding: Dutch words can almost freely be joined together to form new words. As a result of this phenomenon, the number of distinct words in Dutch is relatively large, which reduces the coverage of vocabularies compared to those of the same size of other languages, such as English, that do not have word compounding. This thesis investigates whether splitting Dutch compound words could be a remedy for the relatively limited coverage of vocabularies, so that ASR performance could be improved. Next to a brief history of SDR research and a review of possible SDR approaches, this thesis demonstrates the use of a Dutch LVCSR in SDR by providing an illustrative example of an SDR evaluation given a collection of Dutch broadcast news shows. It is shown that Dutch speech recognition can successfully be deployed for content-based retrieval of broadcast news programs. The experience obtained with the research described in this thesis, and the experience that will emerge from future research efforts must contribute to the long-term accessibility of the increasing amount of information being stored in Dutch multimedia and spoken-word collections.
|Award date||10 Oct 2003|
|Place of Publication||Enschede|
|Publication status||Published - 10 Oct 2003|
- Multimedia Retrieval
- HMI-MR: MULTIMEDIA RETRIEVAL
- HMI-SLT: Speech and Language Technology
- Speech Recognition