SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection

The ability to perform exploratory search and retrieve relevant documents from a large collection of domain-specific documents is an important requirement both in medicine and in other fields. In this paper, we present an unsupervised distributional clustering technique called SOPHIA. SOPHIA provides a semantically meaningful visual clustering of the document corpus in conjunction with an intuitive interactive search facility. We assess the effectiveness of SOPHIA's cluster-based information retrieval on OHSUMED, a test collection drawn from MEDLINE.

Mykola Galushka | David W. Patterson | Niall Rooney | Vladimir Dobrynin
