Statistical Morphological Disambiguation for Agglutinative Languages

We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish poses a challenge for statistical models because its productive derivational morphology makes the potential tag set very large. We handle this by breaking the morphosyntactic tags into inflectional groups, each of which contains the inflectional features of one (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag using trigram statistics over the individual inflectional groups and surface roots. Of the four models that we developed and tested, the simplest one, which ignores the local morphotactics within words, performs best. Our best trigram model achieves 93.95% accuracy on our test data when all morphosyntactic and semantic features must be correct. If we consider only the syntactically relevant features and ignore a very small set of semantic features, the accuracy increases to 95.07%.
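
The decomposition of a tag into inflectional groups can be illustrated with a minimal Python sketch. It is not the authors' implementation: the "+"-separated feature notation, the "^DB" derivation-boundary marker, the feature names, and the function name split_into_igs are all assumptions made for illustration.

```python
def split_into_igs(parse, boundary="^DB"):
    """Split one analysis string into (root, list of inflectional groups).

    Assumes a hypothetical format: root+Feat+Feat^DB+Feat+..., where
    `boundary` marks each derivation step. The notation is an assumption.
    """
    segments = parse.split(boundary)
    # The first segment carries the root followed by the first group's features.
    root, _, first_features = segments[0].partition("+")
    # Later segments start with "+", which is dropped to normalize the groups.
    igs = [first_features] + [seg.lstrip("+") for seg in segments[1:]]
    return root, igs


# Illustrative analysis string (hypothetical, not taken from the paper's data).
root, igs = split_into_igs("ev+Noun+A3sg+Pnon+Loc^DB+Adj+Rel")
print(root, igs)
```

Each inflectional group comes from a small inventory compared with the open-ended set of full tags, so statistics collected over groups and roots are far less sparse than statistics over whole tags.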
