Comparison and fusion of retrieval schemes based on different structures, similarity measures and weighting schemes

Many retrieval models and techniques can be applied to retrieve the theses most relevant to a given query or concept. Different retrieval methods have been found to retrieve different sets of relevant documents, so any one method can be expected to retrieve some relevant theses that the others miss. This study therefore retrieves theses using methods that differ in the thesis structure searched, the similarity measure and the weighting scheme. The theses were collected from the FSKSM postgraduate library and were digitized, stripped of stop words, stemmed and indexed; the results of these operations are stored in a database. The study uses 85 theses and 30 queries. Queries were compared with theses using five similarity measures and seven weighting schemes over different thesis structures. The results show that using the bibliography gives poorer results than using the title and abstract alone. For combinations of weighting schemes, the schemes used with the Cosine and Tanimoto similarity measures perform well individually but poorly in combination, whereas the schemes used with the Forbes and Russell similarity measures perform poorly individually but well in combination. For combinations of similarity measures, the best combination on the title structure was Cosine with the LTU weighting scheme and Russell with the LOGG weighting scheme. On the abstract structure, the best combination was Cosine with the TFIDF weighting scheme and Forbes with the ATFA weighting scheme, but it performed worse than the best combination on the title structure.
Overall, the best thesis structure is the title, and the best similarity measure is Cosine with the LTU weighting scheme.
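
To make the idea of fusing two similarity measures concrete, the following is a minimal sketch and not the study's implementation. It scores toy documents against a query with Cosine and with the Russell-Rao coefficient, then combines the two score lists. Several things here are assumptions for illustration: the weights are raw term counts (the abstract does not give the LTU, LOGG, TFIDF or ATFA formulas), "Russell" is read as the binary Russell-Rao coefficient, the fusion rule is a sum of min-max normalized scores (the abstract does not name its fusion rule), and all function names and data are hypothetical.

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def russell_rao(q, d, vocab_size):
    """Russell-Rao coefficient on term presence: shared terms / vocabulary size."""
    shared = len(set(q) & set(d))
    return shared / vocab_size if vocab_size else 0.0

def min_max(scores):
    """Rescale a {doc: score} dict to [0, 1] so different measures are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse_sum(*runs):
    """Assumed fusion rule: sum the normalized scores, then rank by the total."""
    fused = {}
    for run in runs:
        for doc, s in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)

# Toy data (hypothetical): raw term counts stand in for a real weighting scheme.
query = {"fusion": 1, "retrieval": 1}
docs = {
    "t1": {"fusion": 2, "retrieval": 1},
    "t2": {"retrieval": 1, "stemming": 3},
    "t3": {"arabic": 1},
}
vocab = {t for d in docs.values() for t in d} | set(query)

cos_run = {name: cosine(query, d) for name, d in docs.items()}
rr_run = {name: russell_rao(query, d, len(vocab)) for name, d in docs.items()}
ranking = fuse_sum(cos_run, rr_run)
print(ranking)
```

Normalizing before summing matters because the two measures produce scores on different scales; without it, the measure with the larger raw values would dominate the fused ranking.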
