A Synonym-based Approach for the Semantic Indexing of Texts

We present an algorithmic technique for text indexing based on classes of synonyms. The method uses a set of synonym classes to build a more abstract representation of a text, so that texts expressing similar meaning through different terms can be indexed together. Each text is represented by the set of terms obtained by replacing every term in its sentences with the synonym class to which it belongs. These terms are stored in vectors, and both the uniqueness and the multiplicity of each term's appearance in the text are taken into account to define a corresponding similarity metric. In developing the model, we omit words that consist of monograms, digrams and trigrams, and we deploy a novel approach that considers the optimally discriminating words over each class of synonyms that characterize a thematic area; a text is then indexed under a thematic area according to its relevance to semantically similar texts. We describe the proposed approach thoroughly and perform a series of evaluation experiments on text samples from specific thematic areas (business, politics, sports, entertainment and technology) to assess the potential of the model for indexing texts from those areas.
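The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The synonym table `SYNONYM_CLASSES`, the class labels, and the helper names are hypothetical. The abstract does not define the similarity metric, so a Jaccard score over unique class terms and a cosine score over term multiplicities stand in for it here. The sketch also assumes that "monograms, digrams and trigrams" refers to words of one to three characters.

```python
import math
import re
from collections import Counter

# Hypothetical toy synonym classes (word -> class label). A real system
# might derive them from a lexical database such as WordNet.
SYNONYM_CLASSES = {
    "acquire": "BUY", "bought": "BUY", "purchase": "BUY",
    "business": "FIRM", "company": "FIRM", "firm": "FIRM",
    "game": "GAME", "match": "GAME",
}

def to_class_terms(text, min_len=4):
    """Tokenize, drop words shorter than min_len characters (assumed
    reading of the mono/di/tri-gram filter), and replace each remaining
    word with its synonym class when one is known."""
    words = re.findall(r"[a-z]+", text.lower())
    return [SYNONYM_CLASSES.get(w, w) for w in words if len(w) >= min_len]

def class_vector(text):
    """Vector of class terms with their multiplicities."""
    return Counter(to_class_terms(text))

def jaccard(a, b):
    """Similarity over unique terms only (assumed stand-in metric)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def cosine(a, b):
    """Similarity that weighs term multiplicity (assumed stand-in metric)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

if __name__ == "__main__":
    v1 = class_vector("The firm will acquire a rival company")
    v2 = class_vector("A business bought another firm")
    print(jaccard(v1, v2), cosine(v1, v2))
```

The two sentences in the usage example share only one surface word ("firm"), yet after substitution both contain the classes `FIRM` and `BUY`, which is what lets synonym-based vectors match texts that a plain term index would treat as largely unrelated.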
