A Scalable Biclustering Method for Heterogeneous Medical Data

We define the problem of biclustering on heterogeneous data, that is, data of various types (binary, numeric, etc.). This problem has not yet been investigated in the biclustering literature. We propose a new method, HBC (Heterogeneous BiClustering), designed to extract biclusters from large-scale, sparse, heterogeneous data matrices. The method is intended to handle medical data gathered by hospitals (on patients, stays, acts, diagnoses, prescriptions, etc.) and to provide valuable insight into such data. HBC takes advantage of data sparsity and uses a constructive greedy heuristic to build a large number of possibly overlapping biclusters. The proposed method compares favorably with a standard biclustering algorithm on small numeric data sets. Experiments on real-life data sets further confirm its scalability and efficiency.
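To make the idea of a constructive greedy heuristic on sparse data concrete, the following is a minimal sketch and not the HBC algorithm itself, whose details are not given here. It assumes the data has been reduced to binary form, with each row stored as the set of its nonzero columns, and it grows a bicluster from a seed row by repeatedly adding the row that preserves the most shared columns. The function names, the `min_rows` and `min_cols` thresholds, and the patient and code identifiers are all hypothetical.

```python
def grow_bicluster(rows, seed, min_cols=2):
    """Greedily grow an all-nonzero bicluster starting from one seed row.

    rows: dict mapping a row id to the set of column ids with nonzero
    entries (a sparse representation; zeros are never stored).
    """
    members = {seed}
    cols = set(rows[seed])
    while True:
        best_row, best_cols = None, set()
        # Sorted iteration makes tie-breaking deterministic.
        for r in sorted(rows):
            if r in members:
                continue
            shared = cols & rows[r]
            if len(shared) > len(best_cols):
                best_row, best_cols = r, shared
        # Stop when no remaining row keeps at least min_cols columns.
        if best_row is None or len(best_cols) < min_cols:
            break
        members.add(best_row)
        cols = best_cols
    return frozenset(members), frozenset(cols)


def greedy_biclusters(rows, min_rows=2, min_cols=2):
    """Run the greedy construction from every seed row, dropping duplicates."""
    found = set()
    for seed in rows:
        if len(rows[seed]) < min_cols:
            continue
        members, cols = grow_bicluster(rows, seed, min_cols)
        if len(members) >= min_rows:
            found.add((members, cols))
    return found


# Hypothetical toy data: patients mapped to the codes recorded for them.
example = {
    "p1": {"dx_A", "rx_X", "act_1"},
    "p2": {"dx_A", "rx_X"},
    "p3": {"dx_A", "rx_X", "act_2"},
    "p4": {"dx_B", "act_2"},
    "p5": {"dx_B", "act_2", "act_1"},
}

for members, cols in sorted(greedy_biclusters(example), key=lambda b: sorted(b[0])):
    print(sorted(members), sorted(cols))
```

Because only nonzero entries are stored and intersected, the cost of each step depends on the number of recorded entries rather than on the full matrix size, which is the general reason a sparse representation helps at scale. Each seed is grown independently, so nothing in this sketch prevents two biclusters from sharing rows or columns. Handling numeric or mixed-type columns would require a different coherence test than plain set intersection.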
