SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection

The ability to perform exploratory search and retrieval of relevant documents from a large collection of domain-specific documents is an important requirement both in medicine and in other fields. In this paper, we present an unsupervised distributional clustering technique called SOPHIA. SOPHIA provides a semantically meaningful visual clustering of the document corpus in conjunction with an intuitive interactive search facility. We assess the effectiveness of SOPHIA's cluster-based information retrieval on the MEDLINE test collection known as OHSUMED.