Disambiguating Main POS tags for Turkish

This paper presents the results of main part-of-speech tagging of Turkish sentences using Conditional Ran- dom Fields (CRFs). Although CRFs are applied to many different languages for part-of-speech (POS) tagging, Turkish poses interesting challenges to be modeled with them. The challenges include issues related to the statistical model of the problem as well as issues related to computational complexity and scaling. In this paper, we propose a novel model for main-POS tagging in Turkish. Furthermore, we pro- pose some approaches to reduce the computational complexity and allow better scaling characteristics or improve the performance without increased complexity. These approaches are discussed with respect to their advantages and disadvantages. We show that the best approach is competitive with the current state of the art in accuracy and also in training and test durations. The good results obtained imply a good first step towards full morphological disambiguation.

[1]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[2]  Deniz Yuret,et al.  Learning Morphological Disambiguation Rules for Turkish , 2006, NAACL.

[3]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[4]  Shuly Wintner,et al.  Morphological Disambiguation of Hebrew: A Case Study in Classifier Combination , 2007, EMNLP.

[5]  Gökhan Tür,et al.  Statistical Morphological Disambiguation for Agglutinative Languages , 2000, COLING.

[6]  Noah A. Smith,et al.  Context-Based Morphological Disambiguation with Random Fields , 2005, HLT.

[7]  Erhan Mengusoglu,et al.  Turkish LVCSR: Database Preparation and Language Modeling for an Agglutinative Language , 2001 .

[8]  Nizar Habash,et al.  Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop , 2005, ACL.

[9]  Kemal Oflazer,et al.  Two-level Description of Turkish Morphology , 1993, EACL.

[10]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields , 2010, Found. Trends Mach. Learn..

[11]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[12]  Yuji Matsumoto,et al.  Applying Conditional Random Fields to Japanese Morphological Analysis , 2004, EMNLP.

[13]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[14]  Kemal Oflazer Two-level description of Turkish morphology , 1993 .

[15]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Lei Yu,et al.  Stable and Accurate Feature Selection , 2009, ECML/PKDD.

[17]  Murat Saraclar,et al.  Morphological Disambiguation of Turkish Text with Perceptron Algorithm , 2009, CICLing.