IMPROVING INFORMATION RETRIEVAL USING DOCUMENT CLUSTERS AND SEMANTIC SYNONYM EXTRACTION

Document clustering has been investigated in several areas of text mining and information retrieval. Initially, it was studied as a way to improve precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbors of a document. More recently, clustering has been proposed for browsing a collection of documents and for organizing the results returned by a search engine in response to a user’s query. This paper presents a new semantic synonym-based correlation indexing method. Documents are first clustered based on their nearest neighbors in the document collection. The clusters are then refined by semantically relating the query terms to the retrieved documents using a thesaurus or ontology model. The aim is to improve the performance of an Information Retrieval System (IRS) by increasing the number of relevant documents retrieved. Results show that the proposed method achieves a significant improvement over existing methods and may place the more relevant documents at the top ranks.
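The two stages described above (grouping documents by nearest neighbors, then relating query terms to documents through synonyms) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy thesaurus, the sample documents, the raw term-count cosine similarity, and the function names `nearest_neighbors`, `expand_query`, and `retrieve` are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus (assumption); a real system would consult
# a thesaurus or ontology model instead.
TOY_THESAURUS = {"car": {"automobile", "vehicle"}}

# Hypothetical sample collection (assumption).
DOCS = [
    "the automobile engine is quick",
    "car engine repair",
    "cooking pasta recipes",
    "automobile engine maintenance",
]

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def nearest_neighbors(vectors, k):
    """For each document, the set of its k most similar other documents."""
    neighbors = []
    for i, v in enumerate(vectors):
        sims = [(cosine(v, u), j) for j, u in enumerate(vectors) if j != i]
        sims.sort(key=lambda p: (-p[0], p[1]))
        # Ignore "neighbors" that share no terms at all.
        neighbors.append({j for s, j in sims[:k] if s > 0})
    return neighbors

def expand_query(terms, thesaurus):
    """Add the synonyms of each query term found in the thesaurus."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded

def retrieve(query, docs, k=1, thesaurus=None):
    """Return document indices ranked by similarity to the expanded query."""
    thesaurus = TOY_THESAURUS if thesaurus is None else thesaurus
    vectors = [Counter(tokenize(d)) for d in docs]
    q = Counter(expand_query(tokenize(query), thesaurus))
    scores = [cosine(q, v) for v in vectors]
    # Seed set: documents that match the expanded query directly.
    seeds = {i for i, s in enumerate(scores) if s > 0}
    # Pull in the nearest neighbors of each seed document.
    candidates = set(seeds)
    for i in seeds:
        candidates |= neighbors_of(i, vectors, k)
    return sorted(candidates, key=lambda i: (-scores[i], i))

def neighbors_of(i, vectors, k):
    """Convenience wrapper: neighbor set of a single document."""
    return nearest_neighbors(vectors, k)[i]

if __name__ == "__main__":
    print(retrieve("car", DOCS))
```

In this sketch, the query "car" literally matches only one document, but synonym expansion and neighbor sharing allow the documents that mention "automobile" to be retrieved as well, while the unrelated cooking document is left out.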
