EBIC: an artificial intelligence-based parallel biclustering algorithm for pattern discovery

This paper introduces a novel biclustering algorithm based on artificial intelligence (AI). The method, called EBIC, aims to detect biologically meaningful, order-preserving patterns in complex data. The proposed algorithm is probably the first capable of discovering multiple complex patterns in real gene expression datasets with accuracy exceeding 50%. It is also one of the very few biclustering methods designed for parallel environments with multiple graphics processing units (GPUs). We demonstrate that EBIC outperforms state-of-the-art biclustering methods, in terms of recovery and relevance, on both synthetic and genetic datasets. EBIC also yields results over 12 times faster than the most accurate reference algorithms. We anticipate that the proposed algorithm will be added to the repertoire of unsupervised machine learning algorithms for the analysis of datasets, including those from large-scale genomic studies.
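To make the notion of an order-preserving pattern concrete, the following minimal sketch checks whether a set of rows ranks a given series of columns in the same order. It assumes the common definition in which every selected row must be strictly increasing along the ordered columns. The function name `is_order_preserving` and the toy matrix are hypothetical illustrations and are not taken from EBIC's implementation.

```python
import numpy as np

def is_order_preserving(data, rows, cols):
    """Return True if every selected row is strictly increasing along `cols`.

    `cols` is an ordered sequence of column indices; the check asks whether
    all rows in `rows` rank those columns in the same (given) order.
    This is an illustrative assumption, not EBIC's actual code.
    """
    sub = np.asarray(data)[np.ix_(rows, cols)]
    # Differences between consecutive columns must all be positive.
    return bool(np.all(np.diff(sub, axis=1) > 0))

# Toy expression matrix: 3 genes (rows) by 4 conditions (columns).
data = np.array([
    [1.0, 5.0, 3.0, 9.0],
    [0.2, 0.9, 0.4, 2.0],
    [7.0, 1.0, 4.0, 3.0],
])

# Rows 0 and 1 both satisfy col0 < col2 < col1 < col3; row 2 does not.
print(is_order_preserving(data, [0, 1], [0, 2, 1, 3]))
print(is_order_preserving(data, [0, 1, 2], [0, 2, 1, 3]))
```

Note that the rows in such a pattern can differ greatly in scale: only the relative ordering of values across the chosen columns has to agree.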
