A contiguous column coherent evolution biclustering algorithm for time-series gene expression data

As a high-throughput detection technology, gene chips produce huge amounts of gene expression data, and analyzing these data effectively has become an urgent need. Biclustering techniques are important tools for finding local patterns in gene expression data: they search for submatrices in which a subset of genes shows highly correlated behavior across a subset of conditions. However, most existing biclustering algorithms cannot find biclusters with contiguous columns. Because time-series data carry important internal sequential relationships, these methods are not suitable for analyzing them. To explore the potential biological information in contiguous time points and to find co-expression relationships among genes, this paper presents an efficient and accurate algorithm, named k-CCC, that searches for contiguous column coherent evolution biclusters in time-series data. The algorithm first transforms the original matrix into a difference matrix; then, starting from column patterns consisting of k contiguous columns, it gradually assembles them into patterns composed of more columns. A pattern update strategy is adopted to improve the efficiency of the algorithm. In simulated tests, the algorithm finds all the embedded biclusters and shows good scalability. Experimental results on real datasets show that the algorithm can find biclusters with statistical significance and strong biological relevance.
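The two steps named in the abstract (building a difference matrix, then growing patterns from k contiguous columns) can be illustrated with a minimal Python sketch. This is not the authors' k-CCC implementation, and it omits the pattern update strategy, which the abstract does not specify. The up/down/no-change encoding, the `threshold` and `min_rows` parameters, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def difference_matrix(matrix, threshold=0.0):
    """Encode each step between adjacent time points as 'U', 'D' or 'N'.

    Assumption: a three-symbol encoding with a fixed threshold; the paper
    may define its difference matrix differently.
    """
    encoded = []
    for row in matrix:
        symbols = []
        for prev, curr in zip(row, row[1:]):
            delta = curr - prev
            if delta > threshold:
                symbols.append("U")
            elif delta < -threshold:
                symbols.append("D")
            else:
                symbols.append("N")
        encoded.append("".join(symbols))
    return encoded


def contiguous_patterns(matrix, k=2, min_rows=2, threshold=0.0):
    """Return (start, pattern, rows) for every contiguous column pattern of
    width >= k (in difference columns) shared by at least min_rows rows.

    A pattern starting at difference column `start` with length w covers
    the original time points start .. start + w.
    """
    diff = difference_matrix(matrix, threshold)
    n_cols = len(diff[0]) if diff else 0

    # Seed step: group rows by their pattern over k contiguous columns.
    seeds = defaultdict(list)
    for start in range(n_cols - k + 1):
        for r, symbols in enumerate(diff):
            seeds[(start, symbols[start:start + k])].append(r)
    current = {key: rows for key, rows in seeds.items() if len(rows) >= min_rows}

    results = []
    while current:
        for (start, pattern), rows in current.items():
            results.append((start, pattern, rows))

        # Growth step: extend each surviving pattern by one column to the right.
        grown = defaultdict(list)
        for (start, pattern), rows in current.items():
            end = start + len(pattern)
            if end < n_cols:
                for r in rows:
                    grown[(start, pattern + diff[r][end])].append(r)
        current = {key: rows for key, rows in grown.items() if len(rows) >= min_rows}
    return results


# Hypothetical toy data: 4 genes measured at 4 time points.
example = [
    [1, 2, 3, 2],
    [0, 5, 6, 1],
    [3, 2, 1, 4],
    [2, 3, 4, 5],
]
found = contiguous_patterns(example, k=2, min_rows=2)
```

In this toy input, genes 0 and 1 rise twice and then fall, so they share a pattern spanning all four time points even though their absolute expression values differ; this is the sense in which a coherent evolution bicluster tracks the direction of change rather than the magnitude.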
