Improving Transition-Based Dependency Parsing of Hindi and Urdu by Modeling Syntactically Relevant Phenomena

In recent years, transition-based parsers have shown promise in both efficiency and accuracy. Although these parsers have been explored extensively for several Indian languages, there remains considerable scope for improvement through the proper incorporation of syntactically relevant information. In this article, we enhance transition-based parsing of Hindi and Urdu by redefining the features and feature extraction procedures previously proposed in the parsing literature on Indian languages. We show empirically that properly incorporating syntactically relevant information, such as case marking, complex predication, and grammatical agreement, into an arc-eager parsing model significantly improves parsing accuracy. Our experiments show an absolute improvement of ∼2% in labeled attachment score (LAS) for both Hindi and Urdu over a competitive baseline that uses rich features such as part-of-speech (POS) tags, chunk tags, cluster IDs, and lemmas. We also propose heuristics for identifying ezafe constructions in Urdu text, which show promising results in parsing these constructions.
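For readers unfamiliar with the metric behind the reported gain, the following is a minimal sketch of how unlabeled and labeled attachment scores are conventionally computed for a single sentence. It is not taken from the article's evaluation code: the function name `attachment_scores`, the `(head, label)` token representation, and the labels in the example are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# Assumed representation: one (head_index, dependency_label) pair per token,
# with head index 0 standing for the artificial root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over the tokens of one sentence.

    UAS counts tokens whose predicted head matches the gold head.
    LAS additionally requires the dependency label to match.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Hypothetical four-token sentence; the labels are illustrative, not treebank tags.
gold = [(2, "subj"), (0, "root"), (2, "obj"), (3, "case")]
pred = [(2, "subj"), (0, "root"), (2, "iobj"), (2, "case")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")
```

Under this definition, an absolute improvement of about 2% LAS means that roughly two more tokens per hundred receive both the correct head and the correct label.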
