Semantic Similarity Analysis of Urdu Documents

Semantic similarity analysis is an emerging research area that plays an important role in document classification, text summarization, and plagiarism detection. At the same time, the volume of digital data on the Internet is growing rapidly, and such unstructured data require efficient tools for finding relevant topics and related content. Many systems have been developed to retrieve documents by semantic similarity in various languages (English, Arabic, Hindi, Turkish, etc.), but no comparable work has been done for Urdu. Effective search over Urdu digital documents therefore needs a system that finds semantically similar documents. This paper reviews the existing systems and proposes an approach for Urdu documents that provides a better semantic similarity score. Our proposed system, the Semantic Similarity System for Urdu (TripleS4Urdu), produced good results in our evaluation.
