Deep Contextualized Pairwise Semantic Similarity for Arabic Language Questions

Question semantic similarity is a challenging and active research problem that is useful in many NLP applications, such as detecting duplicate questions on community question answering platforms like Quora. Arabic is considered an under-resourced language; it has many dialects and is morphologically rich. Together, these challenges make identifying semantically similar questions in Arabic even more difficult. In this paper, we introduce a novel approach to tackle this problem and test it on two benchmarks: one for Modern Standard Arabic (MSA) and another for the 24 major Arabic dialects. We show that our new system outperforms state-of-the-art approaches, achieving a 93% F1-score on the MSA benchmark and 82% on the dialectal one. This is achieved by utilizing contextualized word representations (ELMo embeddings) trained on a text corpus containing both MSA and dialectal sentences. In combination with a pairwise fine-grained similarity layer, this helps our question-to-question model generalize to dialects while being trained only on our newly developed question-to-question MSA data.
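
To make the idea of a pairwise fine-grained similarity layer more concrete, the sketch below shows one possible formulation over contextualized token embeddings. This is a minimal illustration, not the paper's exact architecture: it assumes the ELMo embeddings for each question's tokens have already been computed, builds a token-to-token cosine interaction matrix, and pools it into a small feature vector. The function name pairwise_similarity_features and the specific pooling choices are illustrative assumptions.

```python
import numpy as np

def pairwise_similarity_features(q1_emb: np.ndarray, q2_emb: np.ndarray) -> np.ndarray:
    """Fine-grained word-to-word similarity features for two questions.

    q1_emb: (len_q1, dim) contextualized token embeddings for question 1.
    q2_emb: (len_q2, dim) contextualized token embeddings for question 2.
    Returns a fixed-size vector summarizing the token interaction matrix.
    """
    # Cosine similarity between every token pair across the two questions.
    q1_norm = q1_emb / (np.linalg.norm(q1_emb, axis=1, keepdims=True) + 1e-8)
    q2_norm = q2_emb / (np.linalg.norm(q2_emb, axis=1, keepdims=True) + 1e-8)
    sim_matrix = q1_norm @ q2_norm.T  # shape: (len_q1, len_q2)

    # For each token, keep its best match in the other question, then aggregate.
    q1_best = sim_matrix.max(axis=1)
    q2_best = sim_matrix.max(axis=0)
    return np.array([
        q1_best.mean(), q1_best.min(),
        q2_best.mean(), q2_best.min(),
        sim_matrix.mean(), sim_matrix.max(),
    ])

# Toy usage with random vectors standing in for ELMo outputs (1024-dim is a
# typical ELMo representation size).
rng = np.random.default_rng(0)
q1 = rng.normal(size=(7, 1024))  # 7 tokens in question 1
q2 = rng.normal(size=(9, 1024))  # 9 tokens in question 2
print(pairwise_similarity_features(q1, q2))
```

In a full model, the interaction matrix would typically be consumed by learned layers feeding a binary similar/not-similar classifier rather than the hand-crafted pooling shown here.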
