Toward a Multi-Representation Persian Treebank

We describe the construction of a phrase structure treebank for Persian containing approximately 30,000 sentences. The treebank lets researchers investigate syntactic phenomena, extract grammars, and train and test parsers. A further advantage is that its sentences are drawn from an existing dependency treebank, so the final resource offers two syntactic representations of the same text: phrase structure and dependency structure. The treebank is built with a bootstrapping approach: each dependency tree is automatically converted to a phrase structure tree, and the resulting annotations are then corrected manually. Using the new phrase structure treebank, we train models for constituency parsers. The treebank is freely available for educational purposes.
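The automatic step of the bootstrapping approach, turning a dependency tree into a first-pass phrase structure tree, can be illustrated with a minimal sketch. This is not the conversion procedure used for the treebank, which the abstract does not specify. It assumes a projective dependency tree, 1-based head indices with 0 marking the root, and a hypothetical labeling rule that names each phrase after its head's POS tag plus "P". The function name `dep_to_phrase` and the toy English tags are illustrative only.

```python
from collections import defaultdict


def dep_to_phrase(tokens, tags, heads):
    """Convert a projective dependency tree to a bracketed phrase structure string.

    tokens: list of word forms
    tags:   list of POS tags, one per token
    heads:  list of 1-based head indices, one per token (0 = root)
    """
    children = defaultdict(list)
    root = None
    for i, h in enumerate(heads, start=1):
        if h == 0:
            root = i
        else:
            children[h].append(i)

    def build(i):
        # Every token becomes a preterminal node: (TAG word)
        leaf = f"({tags[i - 1]} {tokens[i - 1]})"
        deps = children.get(i, [])
        if not deps:
            return leaf
        # Project the head: the head's leaf and its dependents' subtrees
        # become sisters, ordered by surface position (valid for projective trees).
        parts = sorted(deps + [i])
        inner = " ".join(leaf if j == i else build(j) for j in parts)
        # Hypothetical labeling rule: phrase label = head tag + "P"
        return f"({tags[i - 1]}P {inner})"

    return build(root)


# Toy example: "the cat sleeps", where "the" depends on "cat" and "cat" on "sleeps"
print(dep_to_phrase(["the", "cat", "sleeps"], ["DET", "N", "V"], [2, 3, 0]))
# (VP (NP (DET the) (N cat)) (V sleeps))
```

A flat projection like this only yields a starting point. In a bootstrapping workflow, the output of such a converter is what annotators would then correct by hand.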
