Language modeling and transcription of the TED corpus lectures

Erwin Leeuwis, Marcello Federico, Mauro Cettolo

Research output: Contribution to conference > Paper > Academic

35 Citations (Scopus)

Abstract

Transcribing lectures is a challenging task, both in acoustic and in language modeling. In this work, we present our first results on the automatic transcription of lectures from the TED corpus, recently released by ELRA and LDC. In particular, we concentrated our effort on language modeling. Baseline acoustic and language models were developed using, respectively, 8 hours of TED transcripts and various types of texts: conference proceedings, lecture transcripts, and conversational speech transcripts. Then, adaptation of the language model to single speakers was investigated by exploiting different kinds of information: automatic transcripts of the talk, the title of the talk, the abstract and, finally, the paper. In the last case, a 39.2% WER was achieved.
Original language: English
Pages: 232-235
Publication status: Published - 2003
Event: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003 - Hong Kong Exhibition and Convention Centre, Hong Kong, Hong Kong
Duration: 6 Apr 2003 to 10 Apr 2003

Other

Other: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003
Abbreviated title: ICASSP
Country: Hong Kong
City: Hong Kong
Period: 6/04/03 to 10/04/03
