Statistical Parsing by Machine Learning from a Classical Arabic Treebank

Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics relative to its worldwide reach as the language of the Quran.

The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as iʿrāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state of the art in hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations.

A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an F1-score of 87.47%. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic.

The Quran was chosen for annotation because a large body of existing work provides detailed syntactic analysis of it. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website, http://corpus.quran.com, an educational resource with over two million users per year.
