BicSPAM: flexible biclustering using sequential patterns

Background: Biclustering is a critical task for biomedical applications. Order-preserving biclusters, submatrices where the values of rows induce the same linear ordering across columns, capture local regularities with constant, shifting, scaling and sequential assumptions. Biclustering approaches that rely on pattern mining also deliver exhaustive solutions with an arbitrary number and positioning of biclusters. However, existing order-preserving approaches suffer from robustness, scalability and/or flexibility issues. In addition, they cannot discover biclusters with symmetries or with parameterizable levels of noise.

Results: We propose new biclustering algorithms to perform flexible, exhaustive and noise-tolerant biclustering based on sequential patterns (BicSPAM). Strategies are proposed to allow for symmetries and to seize efficiency gains from item-indexable properties and/or from partitioning methods with conservative distance guarantees. Results show BicSPAM's ability to capture symmetries, handle planted noise, and scale in terms of memory and time. BicSPAM also achieves the best match-scores for the recovery of hidden biclusters in synthetic datasets with varying noise distributions and levels of missing values. Finally, results on gene expression data lead to complete solutions, delivering new biclusters corresponding to putative modules with heightened biological relevance.

Conclusions: BicSPAM provides an exhaustive way to discover flexible structures of order-preserving biclusters. To the best of our knowledge, BicSPAM is the first attempt to deal with order-preserving biclusters that allow for symmetries and that are robust to varying levels of noise.
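To make the definition of an order-preserving bicluster concrete, the following minimal Python sketch checks whether a chosen set of rows induces the same linear ordering across a chosen set of columns. It illustrates the definition only and is not the BicSPAM algorithm; the function names `row_ordering` and `is_order_preserving`, the toy matrix, and the tie-breaking rule (ties resolved by column index) are assumptions made for this example.

```python
def row_ordering(row, cols):
    """Return the columns in `cols` sorted by the row's values.

    Ties are broken by column index (an assumption of this sketch).
    """
    return tuple(sorted(cols, key=lambda c: (row[c], c)))


def is_order_preserving(matrix, rows, cols):
    """True if every selected row induces the same column ordering."""
    orderings = {row_ordering(matrix[r], cols) for r in rows}
    return len(orderings) == 1


# Toy data: rows 0-2 differ by shifting/scaling but share one ordering;
# row 3 orders the columns differently.
matrix = [
    [1.0, 5.0, 3.0, 9.0],
    [2.0, 8.0, 4.0, 10.0],
    [0.5, 2.5, 1.5, 4.5],
    [7.0, 1.0, 3.0, 2.0],
]

print(is_order_preserving(matrix, [0, 1, 2], [0, 1, 2, 3]))     # True
print(is_order_preserving(matrix, [0, 1, 2, 3], [0, 1, 2, 3]))  # False
```

Viewing each row as the sequence of its column indices sorted by value is what lets the order-preserving condition be phrased in terms of sequences; this sketch only verifies a given candidate submatrix and does not search for biclusters, tolerate noise, or handle symmetries.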
