Informal Persian Universal Dependency Treebank

This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian, showing that the two variants differ in fundamental ways that cannot be attributed solely to pronunciation. Because informal Persian has its own distinctive characteristics, a computational model trained on formal Persian is unlikely to transfer well to it, which makes dedicated treebanks for this variety necessary. We therefore detail the development of the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme. We then investigate the parsing of informal Persian by training two dependency parsers on existing formal treebanks and evaluating them on out-of-domain data, i.e., the development set of our informal treebank. Our results show that parser performance drops substantially when moving across the two domains: the parsers encounter more unknown tokens and structures and fail to generalize well. Furthermore, the dependency relations whose performance deteriorates the most are those that represent the unique properties of the informal variant. More broadly, this study aims to serve as a stepping stone toward recognizing the significance of informal language variants, which have been widely overlooked in natural language processing tools across languages.
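The cross-domain comparison described above rests on attachment scores, both overall and per dependency relation. The following is a minimal sketch of how such scores could be computed from a gold and a predicted CoNLL-U file. It assumes both files share the same gold tokenization, so tokens align one to one; the function names `read_conllu` and `attachment_scores` are hypothetical, and this is not the evaluation script used in the paper.

```python
from collections import defaultdict


def read_conllu(text):
    """Parse CoNLL-U text into sentences; each token becomes (head, deprel)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # CoNLL-U columns: HEAD is the 7th field, DEPREL the 8th.
        current.append((int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold, pred):
    """Return (UAS, LAS, per-relation LAS keyed by the gold relation).

    Assumption: gold and pred have identical tokenization.
    """
    total = uas = las = 0
    per_rel = defaultdict(lambda: [0, 0])  # deprel -> [correct, total]
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_rel), (p_head, p_rel) in zip(g_sent, p_sent):
            head_ok = g_head == p_head
            both_ok = head_ok and g_rel == p_rel
            total += 1
            uas += head_ok
            las += both_ok
            per_rel[g_rel][1] += 1
            per_rel[g_rel][0] += both_ok
    rel_las = {rel: c / t for rel, (c, t) in per_rel.items()}
    return uas / total, las / total, rel_las
```

Running this once with a parser's output on a formal test set and once on the informal development set, then sorting the relations by the difference in per-relation LAS, would point to the relations that degrade most across domains.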
