Accelerating binary biclustering on platforms with CUDA-enabled GPUs

Abstract: Data mining is essential in many scientific fields for extracting valuable information from large input datasets and transforming it into an understandable structure. For instance, biclustering techniques identify subsets of two-dimensional data in which both rows and columns are correlated. However, some biclustering techniques become extremely time-consuming on very large datasets, which prevents their use in many areas of research and industry (such as bioinformatics) that have experienced explosive growth in the amount of available data. In this work we present CUBiBit, a tool that accelerates the search for relevant biclusters in binary data by exploiting the computational capabilities of CUDA-enabled GPUs together with the multiple CPU cores available in most current systems. The experimental evaluation shows that CUBiBit is up to 116 times faster than the fastest state-of-the-art tool, BiBit, on a system with two Intel Sandy Bridge processors (16 CPU cores) and three NVIDIA K20 GPUs. CUBiBit is publicly available for download from https://sourceforge.net/projects/cubibit.
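The abstract does not spell out how the bicluster search works, so the following is only a minimal CPU sketch of one common bit-pattern approach to binary biclustering, not CUBiBit's actual implementation. It assumes that each row is packed into a bit word, that every pair of rows is ANDed to form a candidate column pattern, and that all rows containing that pattern form the bicluster. The names `encode_rows`, `find_biclusters`, `min_rows` and `min_cols` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def encode_rows(matrix):
    """Pack each binary row into a Python int, one bit per column."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in matrix]


def find_biclusters(matrix, min_rows=2, min_cols=2):
    """Illustrative pairwise bit-pattern search (assumed scheme, see lead-in).

    Returns a list of (row_indices, column_indices) tuples.
    """
    rows = encode_rows(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    seen = set()      # patterns already reported, to avoid duplicates
    result = []
    for a, b in combinations(range(len(rows)), 2):
        pattern = rows[a] & rows[b]          # columns set in both rows
        if bin(pattern).count("1") < min_cols or pattern in seen:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it contains the whole pattern.
        members = [i for i, r in enumerate(rows) if (r & pattern) == pattern]
        if len(members) >= min_rows:
            cols = [j for j in range(n_cols) if (pattern >> j) & 1]
            result.append((members, cols))
    return result


if __name__ == "__main__":
    data = [
        [1, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 1, 0, 1],
    ]
    for members, cols in find_biclusters(data):
        print("rows", members, "columns", cols)
```

In a scheme like this the number of row pairs grows quadratically with the number of rows and each pair can be processed independently, which suggests why such a search becomes slow on large inputs and why it may suit GPU and multicore execution.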
