Mining a massive RNA-seq dataset with biclustering: are evolutionary approaches ready for big data?

Finding meaningful structures in data is challenging, especially when the data are both large and noisy. In this short paper, we present the results of applying six different biclustering methods to a massive human RNA-seq dataset with over 35k genes from over 125k samples. We assess which biclustering methods can handle data of this size and compare their results to those of mini-batch k-means, a popular clustering approach. Finally, we assess the importance of evolutionary-based approaches in biclustering 'big data'.
