Full Text Clustering and Relationship Network Analysis of Biomedical Publications

Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analyzing complete biomedical article texts. To reduce dimensionality, the cosine coefficient is computed on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. A strategy and algorithm for Semi-supervised Affinity Propagation (SSAP) are then introduced to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that, by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. A directed relationship network and a distribution matrix constructed from the clustering results make overlaps in scope and interests among BioMed publications easy to identify, providing a valuable analytical tool for editors, authors and readers.
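
The pairwise similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, and the function name `cosine_coefficient` are assumptions made for illustration only. The sketch shows that the cosine coefficient of two documents needs only the terms occurring in those two documents, so no matrix over the full vocabulary has to be built.

```python
import math
from collections import Counter


def cosine_coefficient(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between two bag-of-words term-frequency vectors.

    Assumption: whitespace tokenization and raw term counts; the paper may
    use a different tokenizer and weighting scheme.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Only terms present in both documents contribute to the dot product,
    # so the computation stays within the two vectors being compared.
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)

    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    # An empty document has no direction; treat its similarity as zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_coefficient("gene expression in cancer cells",
                             "gene regulation in ovarian cancer"))
```

A matrix of such pairwise similarities is the kind of input that Affinity Propagation style algorithms operate on; how SSAP incorporates journal names as supervision is not specified in the abstract and is not modeled here.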
