Efficient Semisupervised MEDLINE Document Clustering With MeSH-Semantic and Global-Content Constraints

For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from documents, the global-content (GC) information from the whole MEDLINE collections, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents are not necessarily effective for integrating different types of information, by which only one or two types of information have been used. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining both the LC and MS information. However, the simple linear combination could be ineffective because of the limitation of the representation space for combining different types of information (similarities) with different reliability. To overcome the limitation, we propose a new semisupervised spectral clustering method, i.e., SSNCut, for clustering over the LC similarities, with two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on those with low similarities. We empirically demonstrate the performance of SSNCut on MEDLINE document clustering, by using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, being statistically significant. Furthermore, the performance of SSNCut with constraints from both MS and GC similarities outperformed that from only one type of similarities. Another interesting finding was that ML constraints more effectively worked than CL constraints, since CL constraints include around 10% incorrect ones, whereas this number was only 1% for ML constraints.

[1]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[2]  Lefteris Angelis,et al.  PuReD-MCL: a graph-based PubMed document clustering methodology , 2008, Bioinform..

[3]  Michael K. Ng,et al.  A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering , 2007, DASFAA.

[4]  Graeme Hirst,et al.  Evaluating WordNet-based Measures of Lexical Semantic Relatedness , 2006, CL.

[5]  William R. Hersh,et al.  TREC GENOMICS Track Overview , 2003, TREC.

[6]  Michael R. Seringhaus,et al.  Seeking a New Biology through Text Mining , 2008, Cell.

[7]  Stuart J. Nelson,et al.  The MeSH Translation Maintenance System: Structure, Interface Design, and Implementation , 2004, MedInfo.

[8]  Jing Hua,et al.  Non-negative matrix factorization for semi-supervised data clustering , 2008, Knowledge and Information Systems.

[9]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[10]  Weiqing Wang,et al.  Exploring supervised and unsupervised methods to detect topics in biomedical text , 2006, BMC Bioinformatics.

[11]  Wei Pan,et al.  Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data , 2006, Bioinform..

[12]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[13]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[14]  Wei Pan,et al.  Bioinformatics Original Paper Incorporating Gene Functions as Priors in Model-based Clustering of Microarray Gene Expression Data , 2022 .

[15]  Dan Klein,et al.  Spectral Learning , 2003, IJCAI.

[16]  Xiang Ji,et al.  Document clustering with prior knowledge , 2006, SIGIR.

[17]  Claire Cardie,et al.  Constrained K-means Clustering with Background Knowledge , 2001, ICML.

[18]  A. Valencia,et al.  Linking genes to literature: text mining, information extraction, and retrieval applications for biology , 2008, Genome Biology.

[19]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[20]  Joydeep Ghosh,et al.  Under Consideration for Publication in Knowledge and Information Systems Generative Model-based Document Clustering: a Comparative Study , 2003 .

[21]  Jimmy J. Lin,et al.  PubMed related articles: a probabilistic topic-based model for content similarity , 2007, BMC Bioinformatics.

[22]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information: update , 2004, Nucleic acids research.

[23]  Daniel Hanisch,et al.  Co-clustering of biological networks and gene expression data , 2002, ISMB.

[24]  Bernhard Schölkopf,et al.  Introduction to Semi-Supervised Learning , 2006, Semi-Supervised Learning.

[25]  Ichigaku Takigawa,et al.  Annotating gene function by combining expression data with a modular gene network , 2007, ISMB/ECCB.

[26]  Javed A. Aslam,et al.  Relevance score normalization for metasearch , 2001, CIKM '01.

[27]  Priyanka Gupta,et al.  BioWarehouse: a bioinformatics database warehouse toolkit , 2006, BMC Bioinformatics.

[28]  Monica Chagoyen,et al.  Quantifying the biological significance of gene ontology biological processes - implications for the analysis of systems-wide data , 2010, Bioinform..

[29]  Inderjit S. Dhillon,et al.  Semi-supervised graph clustering: a kernel approach , 2005, Machine Learning.

[30]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[31]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[32]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[33]  Jia Zeng,et al.  Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity , 2009, Bioinform..