Topic-Based Language Models for Distributed Retrieval

Effective retrieval in a distributed environment is an important but difficult problem. The lack of effectiveness appears to have two major causes. First, existing collection selection algorithms do not work well on heterogeneous collections. Second, relevant documents are scattered over many collections, so searching only a few collections misses many of them. We propose a topic-oriented approach to distributed retrieval, in which the document set of a distributed retrieval environment is structured around a set of topics. Retrieval for a query involves first selecting the right topics for the query and then dispatching the search to the collections that contain those topics. The content of each topic is characterized by a language model. In environments where documents are not labeled by topic, document clustering is used to identify topics. Based on these ideas, we propose three methods to suit different environments. We show that all three methods improve the effectiveness of distributed retrieval.
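
The two-step flow described above (select topics for a query, then dispatch the search to the collections that hold those topics) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the smoothed unigram model, the query log-likelihood scoring, the mixing weight `lam`, the function names, and the toy topics and collection names. The abstract says only that each topic is characterized by a language model; it does not specify the estimator or the ranking function.

```python
import math
from collections import Counter


def unigram_counts(docs):
    """Count whitespace-separated, lowercased terms over a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def query_log_likelihood(query, topic_counts, background_counts, lam=0.5):
    """Log-probability of the query under a topic model mixed with a background model.

    The mixture (weight `lam`) and the add-one background estimate are
    illustrative smoothing choices, not the paper's stated method.
    """
    topic_total = sum(topic_counts.values())
    bg_total = sum(background_counts.values())
    bg_vocab = len(background_counts)
    score = 0.0
    for term in query.lower().split():
        p_topic = topic_counts[term] / topic_total if topic_total else 0.0
        # Add-one smoothing keeps the background probability strictly positive.
        p_bg = (background_counts[term] + 1) / (bg_total + bg_vocab + 1)
        score += math.log(lam * p_topic + (1 - lam) * p_bg)
    return score


def select_collections(query, topics, topic_locations, k=1):
    """Rank topics by query likelihood, keep the top k, and return their collections."""
    models = {name: unigram_counts(docs) for name, docs in topics.items()}
    background = Counter()
    for model in models.values():
        background.update(model)
    ranked = sorted(
        models,
        key=lambda name: query_log_likelihood(query, models[name], background),
        reverse=True,
    )
    chosen = ranked[:k]
    collections = sorted({c for name in chosen for c in topic_locations[name]})
    return chosen, collections


# Hypothetical toy data: two topics and the collections that contain them.
topics = {
    "medicine": [
        "the patient received a new drug therapy",
        "clinical trial of a drug for heart disease",
    ],
    "finance": [
        "the stock market fell as interest rates rose",
        "bank profits and stock prices",
    ],
}
topic_locations = {"medicine": ["coll_A", "coll_C"], "finance": ["coll_B"]}

chosen, collections = select_collections("drug trial", topics, topic_locations, k=1)
print(chosen, collections)
```

In this toy setting the query is routed only to the collections that hold the best-matching topic. Where documents carry no topic labels, the `topics` mapping would instead come from a clustering step, which this sketch does not attempt.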
