A Support Vector Machine Approach to Dutch Part-of-Speech Tagging

Part-of-Speech tagging, the assignment of Parts-of-Speech to words in a given context of use, is a basic technique in many systems that handle natural languages. This paper describes a method for supervised training of a Part-of-Speech tagger using a committee of Support Vector Machines on a large corpus of annotated transcriptions of spoken Dutch. Special attention is paid to the decomposition of the large data set into parts for common, uncommon and unknown words. This decomposition not only solves the space problems caused by the amount of data, but also reduces tagging time. The resulting tagger reaches an accuracy of 97.54%, and its tagging speed is reasonably good.
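
The decomposition into common, uncommon and unknown words can be illustrated with a minimal sketch. The abstract does not state how the three groups are defined, so the frequency criterion, the `common_threshold` value and all function names below are assumptions made for illustration only; they are not the authors' implementation. The idea shown is simply that each word is routed to one of three groups, each of which could then be handled by its own classifier trained on a smaller part of the data.

```python
from collections import Counter


def build_frequency_table(training_words):
    """Count how often each (lowercased) word occurs in the training data."""
    return Counter(word.lower() for word in training_words)


def word_class(word, freq, common_threshold=3):
    """Route a word to 'common', 'uncommon' or 'unknown'.

    Assumption: a word is 'unknown' if it never occurs in the training
    data, 'common' if it occurs at least `common_threshold` times, and
    'uncommon' otherwise. The threshold is a hypothetical value.
    """
    count = freq.get(word.lower(), 0)
    if count == 0:
        return "unknown"
    if count >= common_threshold:
        return "common"
    return "uncommon"


def split_by_class(words, freq, common_threshold=3):
    """Group (position, word) pairs by word class.

    Positions are kept so that per-group predictions could later be
    merged back into the original word order.
    """
    parts = {"common": [], "uncommon": [], "unknown": []}
    for position, word in enumerate(words):
        parts[word_class(word, freq, common_threshold)].append((position, word))
    return parts


if __name__ == "__main__":
    training = ["de", "de", "de", "kat", "zit", "de"]
    freq = build_frequency_table(training)
    print(split_by_class(["de", "hond", "kat"], freq))
```

Under these assumptions, each group holds only a fraction of the corpus, which is one plausible way the space and tagging-time benefits mentioned above could arise.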
