A bootstrapping method for development of Treebank

Using statistical approaches alongside traditional methods of natural language processing (NLP) can significantly improve both the quality and the performance of several NLP tasks. The effective use of these approaches depends on the availability of informative, accurate, and detailed corpora on which the learners are trained. This article introduces a bootstrapping method for developing annotated corpora based on a complex, rich, and linguistically motivated elementary structure called a supertag. To this end, a hybrid method for supertagging is proposed that combines generative and discriminative methods of supertagging. The method was applied to a subset of the Wall Street Journal (WSJ) in order to annotate its sentences with a set of linguistically motivated elementary structures of the English XTAG grammar, which uses a lexicalised tree-adjoining grammar formalism. The empirical results confirm that the bootstrapping method provides a satisfactory way of annotating English sentences with these structures. The experiments show that the method could automatically annotate about 20% of WSJ with an F-measure of about 80%, which is 12% higher than the F-measure of the XTAG Treebank automatically generated by the approach proposed by Basirat and Faili [(2013). Bridge the gap between statistical and hand-crafted grammars. Computer Speech and Language, 27, 1085–1104].

[1]  Alexis Nasr,et al.  MICA: A Probabilistic Dependency Parser Based on Tree Insertion Grammars (Application Note) , 2009, HLT-NAACL.

[2]  Heshaam Faili,et al.  Constructing Linguistically Motivated Structures from Statistical Grammars , 2011, RANLP.

[3]  Ralph Grishman,et al.  A Treebank of Spanish and its Application to Parsing , 2000, LREC.

[4]  Josef van Genabith,et al.  QuestionBank: Creating a Corpus of Parse-Annotated Questions , 2006, ACL.

[5]  Vijay K. Shanker,et al.  Towards efficient statistical parsing using lexicalized grammatical information , 2002 .

[6]  Chris Callison-Burch,et al.  Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk , 2009, EMNLP.

[7]  XTAG Research Group,et al.  A Lexicalized Tree Adjoining Grammar for English , 1998, ArXiv.

[8]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[9]  Tim Buckwalter,et al.  A Dependency Treebank of the Quran using traditional Arabic grammar , 2010, 2010 The 7th International Conference on Informatics and Systems (INFOS).

[10]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[11]  Anne Abeillé,et al.  Enriching a French Treebank , 2004, LREC.

[12]  Martha Palmer,et al.  The Penn Chinese TreeBank: Phrase structure annotation of a large corpus , 2005, Natural Language Engineering.

[13]  Ari Rappoport,et al.  Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets , 2007, ACL.

[14]  Heshaam Faili,et al.  Automatic Enhancement of LTAG Treebank , 2013, RANLP.

[15]  Eric Brill,et al.  Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging , 1995, VLC@ACL.

[16]  Mark Steedman,et al.  Bootstrapping statistical parsers from small datasets , 2003, EACL.

[17]  Richard C. Waters,et al.  Tree Insertion Grammar: A Cubic-Time, Parsable Formalism that Lexicalizes Context-Free Grammar without Changing the Trees Produced , 1995, CL.

[18]  Mark Steedman,et al.  CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank , 2007, CL.

[19]  Kemal Oflazer,et al.  The Annotation Process in the Turkish Treebank , 2003, LINC@EACL.

[20]  Aravind K. Joshi,et al.  LTAG Dependency Parsing with Bidirectional Incremental Construction , 2008, EMNLP.

[21]  Heshaam Faili,et al.  Bridge the gap between statistical and hand-crafted grammars , 2013, Comput. Speech Lang..

[22]  Aravind K. Joshi,et al.  LTAG-spinal and the Treebank , 2008, Lang. Resour. Evaluation.

[23]  Anoop Sarkar.  Combining Supertagging and Lexicalized Tree-Adjoining Grammar Parsing , 2006 .

[24]  Jan Hajic,et al.  The Prague Dependency Treebank , 2003 .

[25]  Srinivas Bangalore,et al.  Supertagging: An Approach to Almost Parsing , 1999, CL.