A Neural Network Based Dutch Part of Speech Tagger

Mannes Poel, Egwin Boschman, Rieks op den Akker

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review

Abstract

This paper presents the design of a neural network for part-of-speech tagging of Dutch text. Our approach uses the Corpus Gesproken Nederlands (CGN), which consists of almost 9 million transcribed words of spoken Dutch divided into 15 different categories. The resulting neural network has an input window of size 8 (4 words back and 3 words ahead, plus the current word) and a hidden layer of 370 neurons. The words ahead are coded by the relative frequency of each tag for that word in the training set. Special attention is paid to unknown words (words not in the training set), for which such a relative frequency cannot be determined. An approximation of the relative frequency of tags for unknown words is determined by 10-fold cross-validation. The performance of the neural network is 97.35% overall, 97.88% on known words and 41.67% on unknown words. This is comparable to state-of-the-art performance reported in the literature. The special coding of unknown words resulted in an increase of almost 13% in tagging performance on unknown words.
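
The coding scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy corpus, the toy tag labels and the way folds are split are all assumptions made for illustration, and the abstract does not say how the cross-validation estimate was computed in detail. The sketch maps each known word to its vector of relative tag frequencies and estimates a single fallback vector for unknown words by treating held-out words that are absent from the remaining folds as unknown.

```python
from collections import Counter, defaultdict


def tag_frequencies(tagged_words, tagset):
    """Map each word to its relative tag-frequency vector over `tagset`."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    vectors = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        vectors[word] = [tag_counts[tag] / total for tag in tagset]
    return vectors


def unknown_word_vector(tagged_words, tagset, folds=10):
    """Estimate a tag distribution for unknown words by cross-validation.

    Assumed procedure: in each fold, held-out words that do not occur in
    the other folds are treated as unknown and their tags are counted.
    """
    counts = Counter()
    for k in range(folds):
        held_out = tagged_words[k::folds]
        known = {w for i, (w, _) in enumerate(tagged_words) if i % folds != k}
        for word, tag in held_out:
            if word not in known:
                counts[tag] += 1
    total = sum(counts.values())
    if total == 0:
        # No unknown words observed: fall back to a uniform distribution.
        return [1 / len(tagset)] * len(tagset)
    return [counts[tag] / total for tag in tagset]


def encode(word, vectors, unknown_vector):
    """Return the input coding for a word, using the fallback if unknown."""
    return vectors.get(word, unknown_vector)


# Toy data; the tag labels are illustrative only.
tagset = ["LID", "N", "WW"]
corpus = [("de", "LID"), ("fiets", "N"), ("de", "LID"),
          ("fiets", "WW"), ("loopt", "WW"), ("huis", "N")]
vectors = tag_frequencies(corpus, tagset)
fallback = unknown_word_vector(corpus, tagset, folds=2)
```

With this toy corpus, `encode("fiets", vectors, fallback)` returns the word's own tag distribution, while a word that never occurs in the corpus receives the shared fallback vector. In a tagger of the kind described, such vectors would be concatenated over the input window before being fed to the network.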
Original language: English
Title of host publication: BNAIC 2008
Subtitle of host publication: Proceedings of BNAIC 2008, the twentieth Belgian-Dutch Artificial Intelligence Conference, Enschede/Bad Boekelo, October 30-31, 2008
Editors: Anton Nijholt, Maja Pantic, Mannes Poel, Hendri Hondorp
Place of Publication: Enschede
Publisher: Twente University Press (TUP)
Pages: 217-224
Number of pages: 8
Publication status: Published - 2008
Event: 20th Benelux Conference on Artificial Intelligence, BNAIC 2008 - Boekelo, Netherlands
Duration: 30 Oct 2008 to 31 Oct 2008
Conference number: 20

Publication series

Name: BNAIC: proceedings of the ... Belgium/Netherlands Artificial Intelligence Conference
Publisher: Twente University Press
Number: 20
ISSN (Print): 1568-7805

Conference

Conference: 20th Benelux Conference on Artificial Intelligence, BNAIC 2008
Abbreviated title: BNAIC
Country: Netherlands
City: Boekelo
Period: 30/10/08 to 31/10/08

Keywords

  • IR-65237
  • METIS-255028
  • EWI-14662

Cite this

Poel, M., Boschman, E., & op den Akker, R. (2008). A Neural Network Based Dutch Part of Speech Tagger. In A. Nijholt, M. Pantic, M. Poel, & H. Hondorp (Eds.), BNAIC 2008: Proceedings of BNAIC 2008, the twentieth Belgian-Dutch Artificial Intelligence Conference, Enschede/Bad Boekelo, October 30-31, 2008 (pp. 217-224). (BNAIC: proceedings of the ... Belgium/Netherlands Artificial Intelligence Conference; No. 20). Enschede: Twente University Press (TUP).
@inproceedings{bf11ad821d4941c5b4202fef701717dd,
title = "A Neural Network Based Dutch Part of Speech Tagger",
abstract = "This paper presents the design of a neural network for part-of-speech tagging of Dutch text. Our approach uses the Corpus Gesproken Nederlands (CGN), which consists of almost 9 million transcribed words of spoken Dutch divided into 15 different categories. The resulting neural network has an input window of size 8 (4 words back and 3 words ahead, plus the current word) and a hidden layer of 370 neurons. The words ahead are coded by the relative frequency of each tag for that word in the training set. Special attention is paid to unknown words (words not in the training set), for which such a relative frequency cannot be determined. An approximation of the relative frequency of tags for unknown words is determined by 10-fold cross-validation. The performance of the neural network is 97.35{\%} overall, 97.88{\%} on known words and 41.67{\%} on unknown words. This is comparable to state-of-the-art performance reported in the literature. The special coding of unknown words resulted in an increase of almost 13{\%} in tagging performance on unknown words.",
keywords = "IR-65237, METIS-255028, EWI-14662",
author = "Mannes Poel and Egwin Boschman and {op den Akker}, Rieks",
note = "http://eprints.ewi.utwente.nl/14662",
year = "2008",
language = "English",
series = "BNAIC: proceedings of the ... Belgium/Netherlands Artificial Intelligence Conference",
publisher = "Twente University Press (TUP)",
number = "20",
pages = "217--224",
editor = "Anton Nijholt and Maja Pantic and Mannes Poel and Hendri Hondorp",
booktitle = "BNAIC 2008",
address = "Netherlands",

}
