Paper Information - Automated Sentence Boundary Detection in Modern Standard Arabic Transcripts using Deep Neural Networks

Abstract The increased volume of Arabic data sources available on the Web has boosted the development of Natural Language Processing (NLP) tools for different tasks and applications. However, to take advantage of a vast number of these applications, a prior segmentation task called Sentence Boundary Detection (SBD) is needed. In this paper we focus on SBD over Modern Standard Arabic (MSA) by comparing two approaches based on Deep Neural Networks (DNN), a Convolutional Neural Network and a Recurrent Neural Network with an attention mechanism, trained on out-of-domain and in-domain data using only lexical features (represented as character embeddings). Tuning on a smaller in-domain dataset after training on a large out-of-domain dataset improves performance in general. Our evaluations were based on IWSLT 2017 TED talk transcripts and showed similarities and differences depending on the SBD method. MSA carries certain complications given its rich and complex morphology. However, using only lexical features for Arabic SBD is an acceptable option when the source audio signal is not available and a certain level of language independence needs to be reached.

Juan-Manuel Torres-Moreno | Fatiha Sadat | Elvys Linhares Pontes | Carlos-Emiliano González-Gallardo
