State of the art document clustering algorithms based on semantic similarity

The continued growth of the Internet has greatly increased the number of text documents available in electronic form, so techniques for grouping these documents into meaningful clusters have become critical. Traditional clustering methods rely on statistical features and group documents syntactically rather than semantically. As a result, dissimilar documents can end up in the same cluster because of polysemy and synonymy. An important solution to this problem is document clustering based on semantic similarity, in which documents are grouped by meaning rather than by keywords. In this research, eighty papers that use semantic similarity in different fields were reviewed; forty of them, which apply semantic similarity to document clustering and were published in the seven years from 2014 to 2020, were selected for in-depth study. A comprehensive literature review of the selected papers is presented, together with a detailed comparison of their clustering algorithms, the tools they use, and their evaluation methods. This helps in the implementation and evaluation of document clustering. The reviewed research informs the direction of the proposed research. Finally, an intensive discussion comparing the works is presented, and the results of our research are shown in figures.
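The synonymy problem described above can be illustrated with a minimal sketch. It is not taken from any of the reviewed papers: the `CONCEPTS` table, the two sample sentences, and the helper names are hypothetical, and the table only stands in for a real lexical resource such as WordNet. The sketch compares cosine similarity over raw keywords with cosine similarity after mapping words to shared concepts.

```python
import math
from collections import Counter

# Hypothetical toy synonym table (an assumption for illustration only):
# maps surface words to a shared concept label.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle",
    "doctor": "physician", "physician": "physician",
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def keyword_vector(text):
    """Bag-of-words counts over the raw tokens."""
    return Counter(text.lower().split())

def concept_vector(text):
    """Bag-of-concepts counts: synonyms collapse to one concept label."""
    return Counter(CONCEPTS.get(w, w) for w in text.lower().split())

d1 = "the doctor bought a car"
d2 = "a physician purchased an automobile"

keyword_sim = cosine(keyword_vector(d1), keyword_vector(d2))
concept_sim = cosine(concept_vector(d1), concept_vector(d2))
print(f"keyword similarity: {keyword_sim:.2f}")
print(f"concept similarity: {concept_sim:.2f}")
```

With these two sentences the keyword-based score is low because only the word "a" is shared, while the concept-based score is higher because "doctor"/"physician" and "car"/"automobile" map to the same concepts. A clustering algorithm fed the second kind of similarity is more likely to place such documents in the same group.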
