ScalaParBiBit: scaling the binary biclustering in distributed-memory systems

Biclustering is a data mining technique that finds groups of rows and columns that are highly correlated in a 2D dataset. Although several software applications for biclustering exist, most of them suffer from a high computational complexity that prevents their use on large datasets. In this work we present ScalaParBiBit, a parallel tool to find biclusters in binary data, which are quite common in many research fields such as text mining, marketing or bioinformatics. ScalaParBiBit exploits the special characteristics of these binary datasets, together with an efficient parallel algorithm and implementation, to accelerate the biclustering procedure in distributed-memory systems. The experimental evaluation shows that our tool is significantly faster and more scalable than the state-of-the-art tool ParBiBit on a cluster with 32 nodes and 768 cores. Our tool, together with its reference manual, is freely available at https://github.com/fraguela/ScalaParBiBit .
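
To illustrate why binary data lend themselves to fast biclustering, the following is a minimal sequential sketch of the general bit-pattern idea: rows are packed into machine words, the bitwise AND of two rows yields a candidate column pattern, and every row containing that pattern joins the bicluster. It is not the ScalaParBiBit implementation. The names `encode_row` and `find_biclusters` and the thresholds `min_rows` and `min_cols` are assumptions made for this example, and no parallelism or work distribution is shown.

```python
from itertools import combinations


def encode_row(bits):
    """Pack a list of 0/1 values into one integer, column 0 as the lowest bit."""
    word = 0
    for col, bit in enumerate(bits):
        if bit:
            word |= 1 << col
    return word


def find_biclusters(matrix, min_rows=2, min_cols=2):
    """Return (rows, cols) pairs found from pairwise row patterns.

    Hypothetical helper for illustration only: a sequential sketch of
    bit-pattern biclustering, not the algorithm used by the tool.
    """
    words = [encode_row(row) for row in matrix]
    n_cols = len(matrix[0]) if matrix else 0
    seen = set()      # patterns already expanded, to avoid duplicates
    biclusters = []
    for i, j in combinations(range(len(words)), 2):
        # Columns where both rows hold a 1.
        pattern = words[i] & words[j]
        if bin(pattern).count("1") < min_cols or pattern in seen:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it contains the whole pattern.
        rows = [r for r, w in enumerate(words) if (w & pattern) == pattern]
        if len(rows) >= min_rows:
            cols = [c for c in range(n_cols) if (pattern >> c) & 1]
            biclusters.append((rows, cols))
    return biclusters


if __name__ == "__main__":
    data = [
        [1, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 1, 0, 1],
    ]
    for rows, cols in find_biclusters(data):
        print("rows", rows, "cols", cols)
```

Because each comparison reduces to a few word-level AND operations, the cost per row pair is small, and the pairs are independent of one another, which is what makes this kind of search a natural candidate for distribution across many cores.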
