Semantic Similarity Analysis for Paraphrase Identification in Arabic Texts

Arabic plagiarism detection is a difficult task because of the richness of the Arabic language. On the one hand, Arabic is a productive, derivational, and inflectional language. On the other hand, a word can belong to more than one lexical category depending on its context, which gives the word different meanings and can change the meaning of the sentence. In this context, Arabic paraphrase identification quantifies how similar a suspect Arabic text and a source Arabic text are, based on their contexts. In this paper, we propose a semantic similarity approach for paraphrase identification in Arabic texts that combines several Natural Language Processing (NLP) techniques: the Term Frequency-Inverse Document Frequency (TF-IDF) technique, to improve the identification of words that are highly descriptive in each sentence; and distributed word vector representations learned with the word2vec algorithm, to reduce computational complexity and to optimize the probability of predicting the context words given the current center word. These word vectors are then used to generate sentence vector representations, to which we apply a similarity measurement based on different comparison metrics, such as Cosine Similarity and Euclidean Distance. Finally, our proposed approach was evaluated on the Open Source Arabic Corpus (OSAC) and obtained a promising
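The pipeline described above (TF-IDF weighting, word vectors, sentence vectors, similarity metrics) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the tiny `TOY_EMBEDDINGS` table is hypothetical and merely stands in for vectors that a trained word2vec model would supply; the sentence vector is taken to be a TF-IDF-weighted average of word vectors, which is one common composition and is not confirmed by the abstract; and the IDF uses a smoothed variant chosen here for convenience. The function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional embeddings standing in for trained word2vec vectors.
TOY_EMBEDDINGS = {
    "student": [0.9, 0.1, 0.0],
    "pupil":   [0.8, 0.2, 0.0],
    "reads":   [0.1, 0.9, 0.0],
    "studies": [0.2, 0.8, 0.1],
    "book":    [0.0, 0.1, 0.9],
    "weather": [-0.7, 0.0, 0.1],
    "cold":    [-0.6, -0.2, 0.0],
}


def tf_idf_weights(sentences):
    """Return one {token: tf-idf weight} dict per tokenized sentence.

    Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1, which is always positive.
    """
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    weights = []
    for sent in sentences:
        tf = Counter(sent)
        weights.append({
            tok: (count / len(sent)) * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return weights


def sentence_vector(weights, embeddings, dim):
    """TF-IDF-weighted average of word vectors; unknown tokens are skipped."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, w in weights.items():
        vec = embeddings.get(tok)
        if vec is None:
            continue
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when no token is in the vocabulary
    return [x / weight_sum for x in total]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy tokenized sentences: a source, a paraphrase of it, and an unrelated one.
corpus = [
    ["student", "reads", "book"],
    ["pupil", "studies", "book"],
    ["weather", "cold"],
]
source_vec, paraphrase_vec, unrelated_vec = (
    sentence_vector(w, TOY_EMBEDDINGS, dim=3) for w in tf_idf_weights(corpus)
)

print("cosine(source, paraphrase):", cosine_similarity(source_vec, paraphrase_vec))
print("cosine(source, unrelated): ", cosine_similarity(source_vec, unrelated_vec))
print("euclid(source, paraphrase):", euclidean_distance(source_vec, paraphrase_vec))
print("euclid(source, unrelated): ", euclidean_distance(source_vec, unrelated_vec))
```

With these toy vectors the paraphrase pair scores a higher cosine similarity and a smaller Euclidean distance than the unrelated pair. In practice the embedding table would come from a word2vec model trained on Arabic text, and the sentences would first go through Arabic-specific tokenization and normalization.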
