Sentence Boundary Detection for Transcribed Tunisian Arabic

In this paper, we study the problem of detecting sentence boundaries in transcribed spoken Tunisian Arabic. We compare three methods for detecting sentence boundaries in transcribed speech. The first method uses a set of handcrafted contextual patterns to identify where sentences end. The second method classifies the transcribed words into four classes according to their position in a sentence. Both methods rely only on lexical information and a few prosodic cues, such as silent and filled pauses. The third method combines the results of the first two, and we develop two techniques for doing so. We show that a sentence boundary detection system can improve the accuracy of a POS tagger developed for tagging transcribed Tunisian Arabic.
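To make the word-classification idea concrete, the following is a minimal sketch of how per-word position labels can encode sentence boundaries and be decoded back into sentences. The abstract only states that there are four position classes; the label names used here (`B`, `I`, `E`, `S` for begin, inside, end, and single-word sentence) and both helper functions are assumptions made for illustration, not the authors' scheme or implementation.

```python
from typing import List

# Hypothetical label names: the source only says words fall into four
# classes according to their position in a sentence.
BEGIN, INSIDE, END, SINGLE = "B", "I", "E", "S"


def sentences_to_labels(sentences: List[List[str]]) -> List[str]:
    """Turn segmented sentences into one position label per word."""
    labels: List[str] = []
    for sentence in sentences:
        if not sentence:
            continue  # ignore empty sentences
        if len(sentence) == 1:
            labels.append(SINGLE)
        else:
            labels.extend([BEGIN] + [INSIDE] * (len(sentence) - 2) + [END])
    return labels


def labels_to_sentences(words: List[str], labels: List[str]) -> List[List[str]]:
    """Rebuild sentences, closing one after each END or SINGLE label."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        current.append(word)
        if label in (END, SINGLE):
            sentences.append(current)
            current = []
    if current:  # trailing words that never received a closing label
        sentences.append(current)
    return sentences
```

Under this assumed encoding, a classifier that predicts one label per word yields a segmentation directly: every word labeled `E` or `S` marks a sentence boundary.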
