Orthographic enrichment for Arabic grammatical analysis

Arabic orthography poses two problems for automatic processing: (1) it omits short vowels, so the same written form can be pronounced in several ways, each of which may belong to a different grammatical category, and (2) a single orthographic word may contain several units, such as pronouns, conjunctions, articles, and prepositions, with no intervening white space. The thesis proposes a pre-processing scheme that applies word segmentation and word vocalization in support of grammatical analysis, namely part-of-speech (POS) tagging and parsing. It first examines the impact of human-produced vocalization and segmentation on the grammatical analysis of Arabic, then applies a pipeline of automatic vocalization and segmentation to Arabic POS tagging. The pipeline, together with the POS tags it produces, is then used for dependency parsing, which yields the grammatical relations between the words in a sentence. The study uses memory-based learning for vocalization, segmentation, and POS tagging, and the natural language parser MaltParser for dependency parsing. The thesis represents the first approach to the processing of real-world Arabic, and finds that, with the right choice of features and algorithms, the need for pre-processing for grammatical analysis can be minimized.
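To make the memory-based segmentation step concrete, here is a minimal sketch of the general idea: every character becomes a classification instance described by a window of neighbouring characters, and the class says whether a new segment begins there. The sketch is not the thesis's implementation and does not use TiMBL. The window size, the `B`/`I` label scheme, the plain overlap distance, all function names, and the three training words (written in Buckwalter-style transliteration) are illustrative assumptions.

```python
from collections import Counter


def encode(segmented):
    """Turn 'w+Al+ktAb' into (chars, labels); 'B' opens a segment, 'I' continues it."""
    chars, labels = [], []
    for part in segmented.split("+"):
        for j, ch in enumerate(part):
            chars.append(ch)
            labels.append("B" if j == 0 else "I")
    return chars, labels


def decode(chars, labels):
    """Rebuild a '+'-delimited string from characters and B/I labels."""
    out = []
    for i, (ch, label) in enumerate(zip(chars, labels)):
        if label == "B" and i > 0:
            out.append("+")
        out.append(ch)
    return "".join(out)


def windows(chars, size=2, pad="_"):
    """One fixed-width window per character, with the focus character in the middle."""
    padded = [pad] * size + list(chars) + [pad] * size
    return [tuple(padded[i:i + 2 * size + 1]) for i in range(len(chars))]


def overlap_distance(a, b):
    """Count the feature positions at which two windows differ."""
    return sum(x != y for x, y in zip(a, b))


class MemoryBasedClassifier:
    """Stores all training instances and labels new ones by their k nearest neighbours."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, instances, labels):
        self.memory = list(zip(instances, labels))
        return self

    def predict_one(self, instance):
        # sorted() is stable, so ties keep the order in which instances were stored.
        ranked = sorted(self.memory, key=lambda m: overlap_distance(instance, m[0]))
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]

    def predict(self, instances):
        return [self.predict_one(inst) for inst in instances]


def train(segmented_words, size=2, k=1):
    """Build the instance memory from '+'-segmented training words."""
    instances, labels = [], []
    for word in segmented_words:
        chars, tags = encode(word)
        instances.extend(windows(chars, size))
        labels.extend(tags)
    return MemoryBasedClassifier(k).fit(instances, labels)


def segment(word, clf, size=2):
    """Predict a B/I label for each character and insert '+' at segment starts."""
    chars = list(word)
    return decode(chars, clf.predict(windows(chars, size)))


# Hypothetical toy training set, for illustration only.
TOY_WORDS = ["w+ktb", "w+Al+ktAb", "b+Al+qlm"]

if __name__ == "__main__":
    clf = train(TOY_WORDS)
    print(segment("wAlqlm", clf))
```

The same instance layout could in principle serve vocalization or POS tagging by changing what the class label encodes, which is one reason a single memory-based learner can cover several stages of a pipeline like the one described above.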
