Finding Themes in Medline Documents: Probabilistic Similarity Search

Large on-line document databases, such as Medline, pose a major challenge of retrieving the few documents most relevant to the user's needs, while minimizing the return rate of nonrelevant documents. Retrieval of documents similar to a user-provided example document is a promising query paradigm towards meeting this goal. We present a new theme-based probabilistic approach for finding documents relevant to a given query document, and summarizing their contents. Preliminary experiments conducted over a subset of Medline documents related to AIDS demonstrate the effectiveness of our approach.
