Penn Treebank-Based Syntactic Parsers for South Dravidian Languages using a Machine Learning Approach

With only limited electronic resources available, developing a syntactic parser that handles all sentence forms is a challenging and demanding task for any natural language. This paper presents the development of Penn Treebank-based statistical syntactic parsers for two South Dravidian languages, namely Kannada and Malayalam. Syntactic parsing is the task of recognizing a sentence and assigning a syntactic structure to it. A syntactic parser is an essential tool for many natural language processing (NLP) applications and for natural language understanding. The well-known Penn Treebank annotation structure was used to create the corpus for the proposed statistical syntactic parsers. Both parsing systems were trained on a Treebank-based corpus consisting of 1,000 carefully constructed Kannada and Malayalam sentences. The corpus is annotated with correct segmentation and part-of-speech (POS) information. We used our own POS tagger generator to assign tags to every word in the training and test sentences. The proposed syntactic parsers were implemented using supervised machine learning and probabilistic context-free grammar (PCFG) approaches. Training, testing, and evaluation were carried out with support vector method (SVM) algorithms. The experiments show that both systems perform well and achieve very competitive accuracy.
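To make the PCFG component concrete, the following is a minimal sketch of how rule probabilities can be read off a Penn Treebank-style bracketed corpus by relative-frequency counting. It is an illustration under stated assumptions, not the training procedure of the systems described here, which rely on SVM-based learning. The function names, the toy trees, and the placeholder tokens `w1`, `w2`, `w3` are all hypothetical and stand in for real annotated Kannada or Malayalam sentences.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def count_rules(tree, counts):
    """Count every production LHS -> RHS that occurs in the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def estimate_pcfg(bracketed_trees):
    """Relative-frequency estimate: P(LHS -> RHS) = count(rule) / count(LHS)."""
    counts = Counter()
    for text in bracketed_trees:
        count_rules(parse_bracketed(text), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy treebank; w1..w3 are placeholder tokens, not real data.
TOY_TREEBANK = [
    "(S (NP (NN w1)) (VP (NN w2) (VB w3)))",
    "(S (NP (NN w1)) (VP (VB w3)))",
]

probs = estimate_pcfg(TOY_TREEBANK)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")
```

In this toy corpus the two VP expansions each occur once, so each receives probability 0.5, while S -> NP VP is the only expansion of S and receives probability 1.0. A real treebank would need smoothing for unseen rules, which this sketch omits.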
