Advances in Information Retrieval

The combination of different text representations and search strategies has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the “meta-search” engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.
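To make the idea of combining the outputs of several retrieval systems concrete, the following is a minimal sketch in Python. It is not the model developed in the paper: it assumes each system returns a dictionary of document scores, rescales each system's scores to [0, 1], and sums them per document. The function names (`min_max_normalize`, `combine_by_sum`) and the sample scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1]; an empty run stays empty."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking information, so contribute nothing.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine_by_sum(runs):
    """Sum normalized scores across runs (an illustrative assumption).

    A document missing from a run contributes 0 for that run.
    Returns (doc, score) pairs sorted by descending combined score,
    with ties broken by document id.
    """
    combined = defaultdict(float)
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            combined[doc] += s
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical outputs of two retrieval systems for the same query.
run_a = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
run_b = {"d2": 10.0, "d3": 5.0}

ranking = combine_by_sum([run_a, run_b])
for doc, score in ranking:
    print(doc, score)
```

In this sketch, a document that both systems score highly can outrank a document that only one system retrieves, which is the kind of effect the classifier-combination view is meant to explain. Other combination rules (weighted sums, products, rank-based merging) would fit the same interface.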
