Biclustering of human cancer microarray data using co-similarity based co-clustering

We propose a novel technique for finding biclusters in gene expression data.
We propose a simple yet effective method for automatically determining discriminating biclusters.
Our proposed method is robust to noise in the data.
We evaluate the empirical and biological significance of the extracted biclusters against known biological processes.

Biclustering of gene expression data aims at finding localized patterns in a subspace. A bicluster (sometimes called a co-cluster), in the context of gene expression data, is a set of genes that exhibit similar expression intensity under a subset of experimental features (conditions). Most biclustering algorithms proposed in the literature search for sub-matrices that exhibit some form of coherence by selecting an initial sub-matrix and iteratively adding or removing rows and columns. These algorithms generally depend on the initial, hard selection of the gene and condition clusters. In this work, we adapt a recently proposed approach for clustering textual data to find biclusters in gene expression data. Our technique is based on the concept of co-similarity between genes (and between conditions), which exploits weighted higher-order paths in a bipartite graph representation of the gene expression data. In this way, we build statistical relations between genes and between conditions by comparing all genes and all conditions before finally extracting biclusters from the data. We show that the proposed technique finds meaningful non-overlapping biclusters both on synthetically generated data and on real cancer data. Our results indicate that the technique is resistant to noise and can successfully retrieve biclusters even in the presence of a relatively large amount of noise. We also analyze our results with respect to the discovered genes and observe that the extracted biclusters are supported by biological evidence, such as enrichment of gene functions and biological processes.
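The co-similarity idea described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a simplified iteration in the spirit of co-similarity measures from the text clustering literature, in which gene similarities are recomputed from condition similarities and vice versa, so that each round takes longer paths in the gene-condition bipartite graph into account. The function names, the normalization by products of row sums, the iteration count, and the toy matrix are all assumptions made for illustration, and the data are assumed to be non-negative.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def identity(size):
    return [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]


def normalized_similarity(data, other_sim):
    """Similarity between rows of `data`, weighted by `other_sim`.

    raw[i][j] sums data[i][k] * other_sim[k][l] * data[j][l] over k and l.
    Dividing by the product of the two row sums (an assumed normalization)
    keeps values in [0, 1] when data is non-negative and other_sim is in [0, 1].
    """
    raw = matmul(matmul(data, other_sim), transpose(data))
    sums = [sum(row) for row in data]
    size = len(data)
    sim = identity(size)  # self-similarity is fixed at 1
    for i in range(size):
        for j in range(size):
            denom = sums[i] * sums[j]
            if i != j and denom > 0:
                sim[i][j] = raw[i][j] / denom
    return sim


def co_similarity(data, iterations=3):
    """Alternately refine gene-gene and condition-condition similarities."""
    data_t = transpose(data)
    gene_sim = identity(len(data))
    cond_sim = identity(len(data_t))
    for _ in range(iterations):
        new_gene_sim = normalized_similarity(data, cond_sim)
        new_cond_sim = normalized_similarity(data_t, gene_sim)
        gene_sim, cond_sim = new_gene_sim, new_cond_sim
    return gene_sim, cond_sim


# Toy expression matrix (rows: genes, columns: conditions) with two
# noisy blocks: genes 0-1 are active under conditions 0-1, genes 2-3
# under conditions 2-3.
toy = [
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
]

gene_sim, cond_sim = co_similarity(toy)
print("same-block gene similarity:", round(gene_sim[0][1], 3))
print("cross-block gene similarity:", round(gene_sim[0][2], 3))
```

In a full pipeline, the two similarity matrices would then be passed to a clustering step (for example hierarchical clustering) on genes and on conditions separately, and biclusters would be read off from pairs of gene and condition clusters. That step is omitted here.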
