On constituent chunking for Turkish

Abstract Chunking is a task that divides a sentence into non-recursive structures. The primary aim is to specify chunk boundaries and classes. Although chunking generally refers to simple chunks, the concept can be customized. A simple chunk is a small structure, such as a noun phrase, whereas a constituent chunk is a structure that functions as a single unit in a sentence, such as a subject. For an agglutinative language with rich morphology, constituent chunking is a considerably harder problem than simple chunking. Most Turkish studies on this issue use the IOB tagging schema to mark chunk boundaries. In this study, we propose a new, simpler tagging schema for constituent chunking in Turkish, namely OE, where “E” represents the rightmost token of a chunk and “O” stands for all other tokens. As a counterpart to OE, we also used a schema called OB, where “B” represents the leftmost token of a chunk. We aimed to identify both chunk boundaries and chunk classes using the conditional random fields (CRF) method. The initial motivation was to exploit the fact that Turkish phrases are head-final. In this context, we assumed that marking the end of a chunk (OE) would be more advantageous than marking its beginning (OB). The test results support this assumption: OB has the worst performance, while OE is a significantly more successful schema in many cases, and the contrast is especially pronounced in long sentences. Indeed, using OE amounts to simply marking the head of the phrase (chunk). Since the head and the distinctive label “E” are aligned, CRF identifies the chunk class more easily by using the information contained in the head. OE also produced more successful results than the schemas available in the literature. In addition to comparing tagging schemas, we performed four analyses. An examination of the window size, a CRF parameter, shows that a value of 3 is adequate.
A comparison of the evaluation measures for chunking revealed that the F-score is a more balanced measure than token accuracy and sentence accuracy. The feature analysis shows that syntactic features improve chunking performance significantly under all conditions; yet when these features are withdrawn, a pronounced difference between OB and OE emerges. Finally, the flexibility analysis shows that OE is also more successful on different data.
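The difference between the IOB, OB, and OE schemas described above can be made concrete with a small sketch that labels a token sequence from gold chunk spans. This is an illustrative assumption of the encoding, not the paper's implementation; the chunk class names (`SUBJECT`, `ADJUNCT`, `PREDICATE`) and the span format are hypothetical.

```python
def tag_chunks(n_tokens, chunks, schema):
    """Label a sentence's tokens under a chunk-tagging schema.

    chunks: list of (start, end, label) spans, end exclusive.
    schema: "IOB" marks the first token B-label and the rest I-label;
            "OB" marks only the leftmost token of each chunk;
            "OE" marks only the rightmost token, which in head-final
            Turkish phrases coincides with the head of the chunk.
    All unmarked tokens receive the tag "O".
    """
    tags = ["O"] * n_tokens
    for start, end, label in chunks:
        if schema == "IOB":
            tags[start] = "B-" + label
            for i in range(start + 1, end):
                tags[i] = "I-" + label
        elif schema == "OB":
            tags[start] = "B-" + label
        elif schema == "OE":
            tags[end - 1] = "E-" + label
    return tags

# A hypothetical 4-token sentence with three constituent chunks.
spans = [(0, 2, "SUBJECT"), (2, 3, "ADJUNCT"), (3, 4, "PREDICATE")]
print(tag_chunks(4, spans, "IOB"))  # B-SUBJECT I-SUBJECT B-ADJUNCT B-PREDICATE
print(tag_chunks(4, spans, "OE"))   # O E-SUBJECT E-ADJUNCT E-PREDICATE
```

Under OE the distinctive, class-bearing tag always sits on the chunk-final (head) token, which is the alignment the abstract credits for the CRF finding chunk classes more easily.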

[1]  Aung Lwin Moe,et al.  New Phrase Chunking Algorithm for Myanmar Natural Language Processing , 2014 .

[2]  Hao Wu,et al.  Improved Joint Kazakh POS Tagging and Chunking , 2016, CCL.

[3]  Himanshu Gahlot,et al.  Shallow Parsing for Hindi - An extensive analysis of sequential learning algorithms using a large annotated corpus , 2009, 2009 IEEE International Advance Computing Conference.

[4]  Sabine Buchholz,et al.  Introduction to the CoNLL-2000 Shared Task Chunking , 2000, CoNLL/LLL.

[5]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[6]  Erik F. Tjong Kim Sang,et al.  Representing Text Chunks , 1999, EACL.

[7]  Kamal Sarkar,et al.  Bengali noun phrase chunking based on conditional random fields , 2014, 2014 2nd International Conference on Business and Information Management (ICBIM).

[8]  Dimitris N. Metaxas,et al.  Recognizing Facial Expressions by Tracking Feature Shapes , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[9]  Kübra Adali,et al.  A RULE BASED NOUN PHRASE CHUNKER FOR TURKISH , 2014 .

[10]  A. R. Weerasinghe,et al.  A shallow parser for Tamil , 2014, 2014 14th International Conference on Advances in ICT for Emerging Regions (ICTer).

[11]  Eneko Agirre,et al.  Interpretable Semantic Textual Similarity: Finding and explaining differences between sentences , 2016, Knowl. Based Syst..

[12]  Marek Grác,et al.  Using Low-Cost Annotation to Train a Reliable Czech Shallow Parser , 2013, TSD.

[13]  Latesh G. Malik,et al.  ATSSC: Development of an approach based on soft computing for text summarization , 2017, Comput. Speech Lang..

[14]  Patrice Bellot,et al.  INEX Tweet Contextualization task: Evaluation, results and lesson learned , 2016, Inf. Process. Manag..

[15]  Samuel W. K. Chan,et al.  Sentiment analysis in financial texts , 2017, Decis. Support Syst..

[16]  Yaregal Assabie,et al.  Hierarchical Amharic Base Phrase Chunking Using HMM with Error Pruning , 2013, LTC.

[17]  Ming Zhou,et al.  Two-stage NER for tweets with clustering , 2013, Inf. Process. Manag..

[18]  Andrew McCallum,et al.  Piecewise Training for Undirected Models , 2005, UAI.

[19]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields , 2010, Found. Trends Mach. Learn..

[20]  Dilek Z. Hakkani-Tür,et al.  Building a Turkish Treebank , 2003 .

[21]  Ilyas Cicekli,et al.  Noun Phrase Chunking for Turkish Using a Dependency Parser , 2015, ISCIS.

[22]  C T Rekha Raj,et al.  Text chunker for Malayalam using Memory-Based Learning , 2015 .

[23]  Itziar Aduriz,et al.  Morphosyntactic disambiguation and shallow parsing in computational processing of Basque , 2013 .

[24]  Yu-Chieh Wu,et al.  Efficient text chunking using linear kernel with masked method , 2007, Knowl. Based Syst..

[25]  Murat Saraclar,et al.  Resources for Turkish morphological processing , 2011, Lang. Resour. Evaluation.

[26]  Anders Søgaard,et al.  Deep multi-task learning with low level tasks supervised at lower layers , 2016, ACL.

[27]  Steven Abney,et al.  Parsing By Chunks , 1991 .

[28]  Ivan Anisimov,et al.  Chunking in Dependency Model and Spelling Correction in Russian and English , 2016 .

[29]  Mitchell P. Marcus,et al.  Maximum entropy models for natural language ambiguity resolution , 1998 .

[30]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[31]  Maurice van Keulen,et al.  Concept Extraction Challenge: University of Twente at #MSM2013 , 2013, #MSM.

[32]  Ayu Purwarianti,et al.  Indonesian Named-entity Recognition for 15 Classes Using Ensemble Supervised Learning , 2016, SLTU.

[33]  Olcay Taner Yildiz,et al.  Chunking in Turkish with Conditional Random Fields , 2015, CICLing.

[34]  Hanna M. Wallach,et al.  Efficient Training of Conditional Random Fields , 2002 .

[35]  Mitchell P. Marcus,et al.  Text Chunking using Transformation-Based Learning , 1995, VLC@ACL.