Sentence boundary detection of various forms of Tunisian Arabic

Sentence boundary detection (SBD) is an essential step in many natural language processing applications, such as parsing, information retrieval, automatic summarization, and machine translation. In this paper, we tackle the problem of SBD for dialectal Arabic, focusing on the Tunisian dialect. We compare the performance of three learning algorithms, Deep Neural Networks (DNN), Support Vector Machines (SVM), and Conditional Random Fields (CRF), at detecting the boundaries of sentences written in different types of dialect. The best model, a CRF, achieved an F-measure of 84.37%. CRFs are a popular formalism for structured prediction in NLP and have been widely applied to text segmentation.
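As an illustration of how an F-measure for SBD might be computed, the sketch below treats the task as token-level labeling, where each token is tagged either as ending a sentence or not. The label names (`EOS`, `O`), the function names, and the toy data are assumptions made for this example; they are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Tuple


def boundary_prf(gold: List[str], pred: List[str],
                 positive: str = "EOS") -> Tuple[float, float, float]:
    """Precision, recall, and F-measure for the sentence-final label.

    `gold` and `pred` hold one label per token. The label names are
    hypothetical: "EOS" marks a token that ends a sentence, "O" any other.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def split_sentences(tokens: List[str], labels: List[str],
                    positive: str = "EOS") -> List[List[str]]:
    """Group tokens into sentences, closing one after each positive label."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == positive:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no closing boundary
        sentences.append(current)
    return sentences


# Toy data, invented for illustration only.
tokens = ["t1", "t2", "t3", "t4", "t5"]
gold = ["O", "O", "EOS", "O", "EOS"]
pred = ["O", "EOS", "EOS", "O", "O"]

precision, recall, f_measure = boundary_prf(gold, pred)
gold_sentences = split_sentences(tokens, gold)
```

In this framing, any of the three compared classifiers would produce the `pred` sequence; a CRF differs from the other two in that it scores the whole label sequence jointly rather than each token independently.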
