A Suffix Based Part-of-Speech Tagger for Turkish

In this paper, we present a stochastic part-of-speech tagger for Turkish. The tagger was developed primarily for information retrieval, but it can also serve as a lightweight PoS tagger for other purposes. It uses a well-established hidden Markov model of the language with a closed lexicon that consists of a fixed number of letters taken from the word endings. We evaluated seven word-ending lengths against 30 training corpus sizes. The best accuracy obtained is 90.2%, with 5-character endings. The main contribution of this paper is a way of constructing a closed vocabulary for part-of-speech tagging that can be useful for highly inflected languages such as Turkish, Finnish, Hungarian, Estonian, and Czech.
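The closed-lexicon idea can be illustrated with a minimal sketch: each word is replaced by its last k characters, and an HMM is trained and decoded over those endings instead of over full word forms. This is not the authors' implementation. The bigram tag transitions, the add-one smoothing, the function names (`suffix`, `train`, `viterbi`), and the toy tagged sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def suffix(word, k=5):
    """Return the last k characters of a word (the whole word if shorter)."""
    return word[-k:]

def train(tagged_sentences, k=5):
    """Count tag-to-tag transitions and tag-to-suffix emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][suffix]
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][suffix(word, k)] += 1
            prev = tag
    return trans, emit

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing scheme)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, k=5):
    """Return the most likely tag sequence for words under the suffix HMM."""
    if not words:
        return []
    tags = list(emit)
    # Size of the suffix lexicon, plus one slot for unseen endings.
    n_suffixes = len({s for c in emit.values() for s in c}) + 1
    obs = [suffix(w, k) for w in words]
    # Each column maps tag -> (best log score, best previous tag).
    columns = [{
        t: (log_prob(trans["<s>"], t, len(tags))
            + log_prob(emit[t], obs[0], n_suffixes), None)
        for t in tags
    }]
    for o in obs[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (columns[-1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags
            )
            column[t] = (score + log_prob(emit[t], o, n_suffixes), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Toy training data, invented for illustration only.
corpus = [
    [("evlerden", "Noun"), ("geldim", "Verb")],
    [("kitaplardan", "Noun"), ("okudum", "Verb")],
]
trans, emit = train(corpus, k=3)
print(viterbi(["okullardan", "geldim"], trans, emit, k=3))
```

Because the lexicon holds only endings of bounded length, a word form never seen in training (here "okullardan") can still be matched through an ending that was seen, which is the property the abstract points to for highly inflected languages.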
