Density-Based Clustering of Functionally Similar Genes Using Biological Knowledge

Clustering identifies natural groups in data. It has been widely applied to gene expression data to discover gene clusters that may be involved in the same biological processes. This information is important for analyzing data from fatal diseases such as cancers and for identifying potential diagnostic and prognostic markers. Existing clustering methods used for this purpose are computationally efficient but do not always produce biologically meaningful results. Each also has at least one of the following shortcomings: it cannot handle arbitrarily shaped clusters, it requires the number of clusters to be specified in advance, or it does not deal well with the noise present in biological data. This study introduces a new density-based clustering method specific to gene expression data that overcomes these shortcomings and produces biologically enriched clusters of functionally similar genes by incorporating biological information from Gene Ontology (GO). The proposed method integrates GO semantic similarity information with correlation information between genes to obtain clusters. The clusters are further validated for biological relevance using Disease Ontology, KEGG pathway enrichment, and protein-protein interaction network analysis.
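
The abstract does not state how the GO semantic similarity and the expression correlation are combined, so the following is only a minimal sketch under stated assumptions, not the proposed method itself. It assumes a weighted average of the two similarities (the weight `alpha` is hypothetical), turns that into a distance, and runs a plain DBSCAN-style density clustering on it. The expression profiles and GO similarity values are made-up toy data; in practice the GO similarities would come from a semantic similarity tool.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def combined_distance(expr, go_sim, alpha=0.5):
    """Distance = 1 - weighted average of correlation and GO similarity.

    ASSUMPTION: the weighted average and `alpha` are illustrative choices.
    """
    n = len(expr)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            corr_sim = (pearson(expr[i], expr[j]) + 1) / 2  # map [-1, 1] to [0, 1]
            sim = alpha * corr_sim + (1 - alpha) * go_sim[i][j]
            dist[i][j] = dist[j][i] = 1 - sim
    return dist

def dbscan(dist, eps, min_pts):
    """Plain DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = 0
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1  # provisional noise; may become a border point later
            continue
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise reachable from a core point: border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_neighbors)
        cluster += 1
    return labels

# Toy data (made up): two groups of three genes plus one unrelated gene.
expr = [
    [1, 2, 3, 4], [2, 4, 6, 8.5], [1, 2.1, 2.9, 4.2],   # rising profiles
    [4, 3, 2, 1], [8, 6.2, 4, 2], [4.1, 3, 1.9, 1],     # falling profiles
    [1, 3, 1, 3],                                       # unrelated profile
]
group = [0, 0, 0, 1, 1, 1, 2]
# Hypothetical GO similarity: high within a functional group, low otherwise.
go_sim = [[0.9 if group[i] == group[j] else 0.1 for j in range(7)] for i in range(7)]

dist = combined_distance(expr, go_sim, alpha=0.5)
labels = dbscan(dist, eps=0.2, min_pts=3)
print(labels)
```

With this toy input the two gene groups come out as separate clusters and the unrelated gene is labeled as noise, without the number of clusters being given in advance. The neighborhood radius `eps` and the minimum neighborhood size `min_pts` are standard DBSCAN parameters and would need tuning on real data.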
