TY - JOUR
T1 - Clearing the Transcription Hurdle in Dialect Corpus Building
T2 - The Corpus of Southern Dutch Dialects as Case Study
AU - Ghyselen, Anne Sophie
AU - Breitbarth, Anne
AU - Farasyn, Melissa
AU - Van Keymeulen, Jacques
AU - van Hessen, Arjan
N1 - Publisher Copyright: © 2020 Ghyselen, Breitbarth, Farasyn, Van Keymeulen and van Hessen.
PY - 2020
Y1 - 2020/04/15
AB - This paper discusses how the transcription hurdle in dialect corpus building can be cleared. While corpus analysis has strongly gained in popularity in linguistic research, dialect corpora are still relatively scarce. This scarcity can be attributed to several factors, one of which is the challenging nature of transcribing dialects, given a lack of both orthographic norms for many dialects and speech technological tools trained on dialect data. This paper addresses the questions (i) how dialects can be transcribed efficiently and (ii) whether speech technological tools can lighten the transcription work. These questions are tackled using the Southern Dutch dialects (SDDs) as case study, for which the usefulness of automatic speech recognition (ASR), respeaking, and forced alignment is considered. Tests with these tools indicate that dialects still constitute a major speech technological challenge. In the case of the SDDs, the decision was made to use speech technology only for the word-level segmentation of the audio files, as the transcription itself could not be sped up by ASR tools. The discussion does however indicate that the usefulness of ASR and other related tools for a dialect corpus project is strongly determined by the sound quality of the dialect recordings, the availability of statistical dialect-specific models, the degree of linguistic differentiation between the dialects and the standard language, and the goals the transcripts have to serve.
KW - ASR
KW - corpus research
KW - dialect
KW - Dutch
KW - Flanders
KW - forced alignment
KW - respeaking
KW - transcription
UR - http://www.scopus.com/inward/record.url?scp=85109543581&partnerID=8YFLogxK
U2 - 10.3389/frai.2020.00010
DO - 10.3389/frai.2020.00010
M3 - Article
AN - SCOPUS:85109543581
SN - 2624-8212
VL - 3
JO - Frontiers in Artificial Intelligence
JF - Frontiers in Artificial Intelligence
M1 - 10
ER -