Fuzzy logic-based approach to develop hybrid similarity measure for efficient information retrieval

Information retrieval systems use a similarity measure to retrieve and rank relevant documents. This paper proposes and implements a new fuzzy logic-based approach to developing a hybrid similarity measure. The proposed approach overcomes the limitations of widely used similarity measures such as Cosine, Jaccard, Euclidean and Okapi-BM25, as well as those of the Genetic Algorithm-based hybrid similarity measures proposed in earlier work. It uses fuzzy rules to infer the weights of the different similarity measures. Experiments are performed on the CACM and CISI benchmark data collections. The performance of the proposed approach is evaluated in terms of the precision, recall, average precision and average recall of the retrieved relevant documents, and the results are compared with those of different similarity measures available in the literature. The results show a marked improvement in the performance of information retrieval systems that use the proposed fuzzy logic-based hybrid similarity measure.
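To make the idea of a hybrid similarity measure concrete, the following is a minimal Python sketch of a weighted combination of two of the measures named above (Cosine and Jaccard). It is an illustration under stated assumptions, not the method of the paper: the function names, the raw term-frequency representation, the normalized weighted sum, and the weight values are all hypothetical. In particular, the weights are fixed placeholders here, whereas the proposed approach infers them with fuzzy rules that the abstract does not specify.

```python
import math
from collections import Counter


def cosine_similarity(query_tokens, doc_tokens):
    """Cosine similarity between raw term-frequency vectors."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(query_tokens, doc_tokens):
    """Jaccard similarity between the sets of distinct terms."""
    q, d = set(query_tokens), set(doc_tokens)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def hybrid_similarity(query_tokens, doc_tokens, weights):
    """Normalized weighted sum of the individual measures.

    `weights` maps a measure name to a non-negative weight. The values
    are placeholders (an assumption of this sketch); the paper infers
    them with fuzzy rules instead.
    """
    scores = {
        "cosine": cosine_similarity(query_tokens, doc_tokens),
        "jaccard": jaccard_similarity(query_tokens, doc_tokens),
    }
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total


# Hypothetical query, document, and weights, for illustration only.
query = ["fuzzy", "retrieval"]
document = ["fuzzy", "logic", "retrieval", "fuzzy"]
placeholder_weights = {"cosine": 0.6, "jaccard": 0.4}
score = hybrid_similarity(query, document, placeholder_weights)
```

Because each individual measure lies in [0, 1] for these inputs and the weights are normalized, the combined score also lies in [0, 1], so documents can be ranked by it directly. Distance-based measures such as Euclidean would first need to be converted to a similarity before being combined this way.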
