Deep learning-based clustering approaches for bioinformatics

Clustering is central to much data-driven bioinformatics research and serves as a powerful computational method. In particular, clustering helps in analyzing unstructured and high-dimensional data in the form of sequences, expressions, texts and images. Further, clustering is used to gain insights into biological processes at the genomics level; for example, clustering gene expression data reveals the natural structure inherent in the data and helps in understanding gene functions, cellular processes, cell subtypes and gene regulation. Accordingly, clustering approaches, including hierarchical, centroid-based, distribution-based, density-based and self-organizing maps, have long been studied and used in classical machine learning settings. In contrast, deep learning (DL)-based representation and feature learning for clustering has not been extensively reviewed or employed. Since the quality of clustering depends not only on the distribution of data points but also on the learned representation, deep neural networks can be an effective means of learning mappings from a high-dimensional data space into a lower-dimensional feature space, leading to improved clustering results. In this paper, we review state-of-the-art DL-based approaches for cluster analysis that are based on representation learning, which we hope will be useful, particularly for bioinformatics research. Further, we explore in detail the training procedures of DL-based clustering algorithms, point out different clustering quality metrics and evaluate several DL-based approaches on three bioinformatics use cases: bioimaging, cancer genomics and biomedical text mining. We believe this review and the evaluation results will provide valuable insights and serve as a starting point for researchers wanting to apply DL-based unsupervised methods to solve emerging bioinformatics research problems.
