A Hybrid Morphological Disambiguation System for Turkish

In this paper, we propose a morphological disambiguation method for Turkish, an agglutinative language. The method is hybrid: it combines statistical information with handcrafted rules and learned rules. Disambiguation proceeds in five steps. First, the most likely tags of words are selected. Second, handcrafted rules constrain the possible parses or select the correct parse. Third, for still-ambiguous words that are unseen in the training corpus, the most likely tags are selected according to the words' suffixes. Fourth, we apply transformation-based rules learned by a variant of the Brill tagger. Finally, if a word is still ambiguous, we fall back on heuristics. We constructed a hand-tagged dataset for training and evaluated the method with ten-fold cross-validation on this dataset. We obtained 93.4% accuracy on average when whole morphological parses are considered. The accuracy increases to 94.1% when only part-of-speech tags and the inflections of the last derivations are considered, and to 96.9% for part-of-speech tagging alone.
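The cascade described above can be illustrated with a minimal Python sketch. It shows only the control flow: stages are tried in order until one parse remains. It is not the system's actual implementation. The names `most_likely_parse`, `first_candidate`, and `disambiguate`, the string format of the parses, and the toy counts are all hypothetical. Each stage is simplified to a filter over a single word's candidate parses, so sentence context, which handcrafted and transformation-based rules would normally use, is ignored here.

```python
from typing import Callable, Dict, List

Parse = str  # one morphological parse, written as a plain string (assumed format)
Stage = Callable[[str, List[Parse]], List[Parse]]


def most_likely_parse(counts: Dict[str, Dict[Parse, int]]) -> Stage:
    """Build a statistical stage from hypothetical word -> parse frequency counts."""
    def stage(word: str, candidates: List[Parse]) -> List[Parse]:
        seen = counts.get(word, {})
        scored = [p for p in candidates if p in seen]
        if not scored:
            return candidates  # word unseen in training: leave the ambiguity
        return [max(scored, key=lambda p: seen[p])]
    return stage


def first_candidate(word: str, candidates: List[Parse]) -> List[Parse]:
    """Stand-in for the final heuristic step: keep the first remaining parse."""
    return candidates[:1]


def disambiguate(word: str, candidates: List[Parse], stages: List[Stage]) -> Parse:
    """Run the stages in order, stopping as soon as one parse is left."""
    if not candidates:
        raise ValueError("no candidate parses for " + word)
    remaining = list(candidates)
    for stage in stages:
        if len(remaining) <= 1:
            break
        narrowed = stage(word, remaining)
        if narrowed:  # never let a stage eliminate every parse
            remaining = narrowed
    return remaining[0]


# Toy data, invented for illustration only.
toy_counts = {"masalı": {"masal+Noun+A3sg+Pnon+Acc": 3,
                         "masal+Noun+A3sg+P3sg+Nom": 1}}
stages = [most_likely_parse(toy_counts), first_candidate]
parses = ["masal+Noun+A3sg+P3sg+Nom", "masal+Noun+A3sg+Pnon+Acc"]
chosen = disambiguate("masalı", parses, stages)
```

In a fuller version, the rule-based, suffix-based, and transformation-based steps would be further `Stage` functions placed between the two shown here, in the order given in the abstract.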
