Dependency Treebank of Urdu and its Evaluation

In this paper we describe an ongoing treebanking effort for Urdu, a South Asian language. The treebank is built from a newspaper corpus and uses a Karaka-based grammatical framework inspired by Paninian grammatical theory. So far, 3366 sentences (0.1M words) have been annotated with linguistic information at the morpho-syntactic (morphological, part-of-speech and chunk information) and syntactico-semantic (dependency) levels. This work also aims to evaluate the correctness and reliability of this manually annotated dependency treebank. Evaluation is done by measuring inter-annotator agreement on a data set of 196 sentences (5600 words) manually annotated by two annotators. We present a qualitative analysis of the agreement statistics and identify possible reasons for disagreement between the annotators. We also show the syntactic annotation of some constructions specific to Urdu, such as Ezafe, and discuss the problem of word segmentation (tokenization).
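
To make the agreement evaluation concrete, the following is a minimal sketch of how agreement between two annotators on dependency labels could be computed. The abstract does not state which statistic is used; Cohen's kappa is assumed here as one common choice. The function name `cohen_kappa` and the sample label sequences are hypothetical illustrations, not data from the treebank.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: each position holds the dependency label that one
    annotator assigned to the same token.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight tokens (illustrative values only).
annotator_1 = ["k1", "k2", "k1", "k7", "k2", "k1", "k2", "k7"]
annotator_2 = ["k1", "k2", "k2", "k7", "k2", "k1", "k1", "k7"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters when a few dependency labels dominate the data. A separate score would typically be computed for each annotation decision (for example, head attachment and dependency label).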
