Abstract
In the growing domain of natural language processing, low-resource languages like Northern Kurdish remain largely unexplored due to the lack of resources needed to be part of this growth. In particular, the tasks of part-of-speech (POS) tagging and tokenization for Northern Kurdish are still insufficiently addressed. In this study, we aim to bridge this gap by evaluating a range of statistical, neural, and fine-tuned models specifically tailored for Northern Kurdish. We leverage limited but valuable datasets, including the Universal Dependencies Kurmanji treebank and a novel manually annotated and tokenized gold-standard dataset consisting of 136 sentences (2,937 tokens). We evaluate several POS tagging models and report that the fine-tuned transformer-based model outperforms the others, achieving an accuracy of 0.87 and a macro-averaged F1 score of 0.77. Data and models are publicly available under an open license at https://github.com/peshmerge/northern-kurdish-pos-tagging.
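The abstract reports both token-level accuracy and a macro-averaged F1 score, and the two can diverge noticeably when some tags are rare. The following is a minimal sketch of how these two metrics are commonly computed for POS tagging. It is not taken from the paper or its repository: the function names `accuracy` and `macro_f1` and the toy tag sequences are assumptions made for illustration only.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over all tags seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong here
            fn[g] += 1  # gold tag g was missed here
    scores = []
    for tag in set(gold) | set(pred):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        # A tag that is never predicted correctly gets F1 = 0.
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: one rare tag (ADJ) is mislabeled as the frequent tag.
gold_tags = ["NOUN", "NOUN", "NOUN", "NOUN", "VERB", "ADJ"]
pred_tags = ["NOUN", "NOUN", "NOUN", "NOUN", "VERB", "NOUN"]

print(f"accuracy = {accuracy(gold_tags, pred_tags):.3f}")
print(f"macro F1 = {macro_f1(gold_tags, pred_tags):.3f}")
```

In this toy case a single error on a rare tag leaves accuracy high while pulling the macro average down, because every tag contributes equally regardless of its frequency. This is one plausible reason a tagger's macro-averaged F1 can sit below its accuracy.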
Original language | English |
---|---|
Title of host publication | Joint Workshop on Multiword Expressions and Universal Dependencies, MWE-UD 2024 at LREC-COLING 2024 - Workshop Proceedings |
Editors | Archna Bhatia, Gosse Bouma, A. Seza Dogruoz, Kilian Evang, Marcos Garcia, Voula Giouli, Lifeng Han, Joakim Nivre, Alexandre Rademacher |
Publisher | European Language Resources Association (ELRA) |
Pages | 70-80 |
Number of pages | 11 |
ISBN (Electronic) | 9782493814203 |
Publication status | Published - 2024 |
Event | 2024 Joint Workshop on Multiword Expressions and Universal Dependencies, MWE-UD 2024 - Torino, Italy Duration: 25 May 2024 → 25 May 2024 |
Workshop
Workshop | 2024 Joint Workshop on Multiword Expressions and Universal Dependencies, MWE-UD 2024 |
---|---|
Abbreviated title | MWE-UD 2024 |
Country/Territory | Italy |
City | Torino |
Period | 25/05/24 → 25/05/24 |
Keywords
- low-resource NLP
- morphosyntactic analysis
- Northern Kurdish
- Part-of-Speech tagging