Robust unsupervised dimensionality reduction based on feature clustering for single-cell imaging data

Abstract: Biological data, and imaging data in particular, have grown rapidly in volume and complexity over the last few years, raising new challenges for machine learning. Unsupervised problems are especially relevant, because generating labels for the data is often labor-intensive, expensive or simply not possible. At the same time, interpretability of the data and of the results is key to extracting new knowledge from the large-scale datasets under study. This calls for unsupervised dimensionality reduction techniques that lower the computational workload needed to process a dataset while also providing information on its structure. This paper describes a framework that brings together previous proposals on unsupervised feature clustering, with the goal of providing scalable, interpretable and robust dimensionality reduction for single-cell imaging data. The framework integrates four components: inter-feature dissimilarity measures, clustering algorithms, quality criteria for selecting the best feature clustering, and dimensionality reduction methods built on that clustering. For each component, several approaches proposed in previous works were tested and evaluated on three use cases drawn from two different imaging datasets, in order to identify the best-performing options. Affinity clustering is applied to feature clustering for the first time. The results were validated using statistical tests, which showed that many of the tested combinations lowered the complexity of the datasets while maintaining or improving the accuracy of the classifiers applied to them. The analysis highlighted affinity clustering as the best algorithm for feature clustering, with median differences in accuracy of up to 8.9% and 0.9% with respect to FSFS and hierarchical clustering, respectively. Representation entropy proved to be a robust unsupervised criterion for selecting the cluster set, with median differences of 13.0% and 0.8% with respect to class separability and the silhouette index, respectively. Dissimilarities based on Pearson’s correlation performed slightly better than the alternatives, with a median improvement of 2.8% with respect to the cosine distance.
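
As a rough illustration of two of the components named above, the following Python sketch computes a Pearson-based inter-feature dissimilarity and the representation entropy of a feature set. It is a minimal sketch under stated assumptions, not the framework's implementation: the dissimilarity is taken to be 1 - |r|, representation entropy is taken to be the entropy of the covariance eigenvalues normalized to sum to one, and the function names and the synthetic data are hypothetical.

```python
import numpy as np


def pearson_dissimilarity(X):
    """Feature-by-feature dissimilarity for a (samples x features) array.

    Assumption: dissimilarity is 1 - |r|, so perfectly (anti)correlated
    features are at distance 0. The abstract does not fix the exact form.
    """
    r = np.corrcoef(X, rowvar=False)
    return 1.0 - np.abs(r)


def representation_entropy(X):
    """Entropy of the normalized covariance eigenvalues of X.

    Assumption: eigenvalues are rescaled to sum to one and the natural
    logarithm is used. Low values indicate redundant features.
    """
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Clip tiny negative eigenvalues that arise from round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())


# Hypothetical toy data: feature 1 is a linear copy of feature 0,
# feature 2 is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2.0 * a + 1.0, b])

D = pearson_dissimilarity(X)
H_all = representation_entropy(X)
H_reduced = representation_entropy(X[:, [0, 2]])

print("dissimilarity(feature 0, feature 1):", round(float(D[0, 1]), 6))
print("representation entropy, all features:", H_all)
print("representation entropy, features 0 and 2:", H_reduced)
```

In this toy setting the two redundant features sit at dissimilarity close to zero, so a feature clustering built on D would be expected to group them, and a single representative could stand in for both.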
