Biclustering of High-throughput Gene Expression Data with Bicluster Miner

In recent years, many biclustering algorithms have been developed for the analysis of gene expression data to complement and extend traditional clustering methods. With biclustering, genes with similar expression profiles can be identified not only over the whole data set but also across subsets of experimental conditions, which allows a gene to belong to several expression patterns simultaneously. This property makes biclustering a powerful approach, especially when it is applied to data with a large number of conditions. Despite this clear theoretical benefit, the full potential of biclustering has not been realized within the gene expression research community, and it has never truly become part of standard gene expression data analysis. Possible reasons include a lack of awareness of the various complementary ways in which biclustering can be applied to microarray or next-generation sequencing based gene expression data sets, and the lack of reliable and fast algorithms. In this paper, we first illustrate the various opportunities for applying biclustering within a typical gene expression data analysis pipeline. We then present a new biclustering method (BiclusterMiner) that can be applied to all of the presented cases. The developed method is the first discrete biclustering algorithm that is able to simultaneously handle both up- and down-regulated genes by taking the direction of regulation into account and still discover all possible maximal biclusters. The efficiency of the proposed algorithm is demonstrated on real and synthetic data sets.
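To make the notion of a direction-aware discrete bicluster concrete, the following is a minimal Python sketch and not the BiclusterMiner algorithm itself, which the abstract does not specify. It assumes that expression values are log ratios discretized to +1 (up), -1 (down), or 0 (unchanged) with a hypothetical threshold, and that a bicluster is a set of genes sharing the same non-zero direction in every one of its conditions. The function names `discretize` and `maximal_biclusters`, the threshold, and the toy matrix are all illustrative assumptions. Maximal biclusters are enumerated as the intersection closure of the per-gene patterns.

```python
from typing import FrozenSet, List, Set, Tuple

Item = Tuple[int, int]  # (condition index, direction): +1 = up, -1 = down


def discretize(log_ratios: List[List[float]], threshold: float = 1.0) -> List[List[int]]:
    """Map each value to +1, -1, or 0 using a symmetric (assumed) threshold."""
    return [
        [1 if v >= threshold else -1 if v <= -threshold else 0 for v in row]
        for row in log_ratios
    ]


def maximal_biclusters(
    discrete: List[List[int]], min_genes: int = 2, min_conditions: int = 2
) -> List[Tuple[List[int], List[Item]]]:
    """Enumerate all maximal biclusters in which every gene has the same
    non-zero direction in every selected condition."""
    # Each gene becomes a set of (condition, direction) items.
    gene_items: List[FrozenSet[Item]] = [
        frozenset((c, d) for c, d in enumerate(row) if d != 0) for row in discrete
    ]

    # Closed patterns are exactly the non-empty intersections of gene item sets.
    closed: Set[FrozenSet[Item]] = set()
    for items in gene_items:
        if not items:
            continue
        new = {items} | {items & other for other in closed}
        closed |= {p for p in new if p}

    result = []
    for pattern in closed:
        if len(pattern) < min_conditions:
            continue
        # All genes whose pattern contains this one form the gene set.
        genes = [g for g, items in enumerate(gene_items) if pattern <= items]
        if len(genes) >= min_genes:
            result.append((genes, sorted(pattern)))
    return sorted(result)


# Toy log-ratio matrix: rows are genes, columns are conditions.
log_ratios = [
    [2.0, -1.5, 0.1, 1.2],
    [1.8, -2.0, 0.0, 0.3],
    [1.1, -1.2, 1.5, 1.4],
    [-1.3, 1.6, 0.2, -0.1],
]
result = maximal_biclusters(discretize(log_ratios))
for genes, pattern in result:
    print(genes, pattern)
```

On this toy matrix, genes 0, 1, and 2 are grouped because they are up in condition 0 and down in condition 1, while gene 3, which shows the opposite directions, is kept out of that bicluster. The number of closed patterns can grow exponentially with the number of genes, so this exhaustive enumeration is only suitable for small illustrative inputs.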
