Hypercluster: a flexible tool for parallelized unsupervised clustering optimization

Unsupervised clustering is a common and exceptionally useful tool for large biological datasets. However, clustering requires upfront selection of an algorithm and its hyperparameters, which can introduce bias into the final clustering labels. It is therefore advisable to obtain a range of clustering results from multiple models and hyperparameters, but doing so can be cumbersome and slow. To streamline this process, we present hypercluster, a Python package and Snakemake pipeline for flexible and parallelized clustering evaluation and selection. Hypercluster is available on bioconda; installation instructions, documentation and example workflows can be found at https://github.com/liliblu/hypercluster.
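The general idea of sweeping clustering hyperparameters and comparing the results can be illustrated with a minimal, standard-library-only sketch. This is not hypercluster's API: the helper names (`kmeans`, `silhouette`, `evaluate`), the toy data, the choice of k-means, and the use of the silhouette coefficient as the evaluation metric are all assumptions made for illustration. A thread pool stands in for the parallel fan-out; pure-Python threads do not give a real CPU speed-up, so a real workflow would more plausibly use separate processes or workflow jobs.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans(points, k, seed, n_iter=20):
    """Plain Lloyd's k-means with random initial centers; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means better-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; rank it last
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (hypothetical): three Gaussian blobs in 2D, 20 points each.
data_rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in blob_centers for _ in range(20)]

# Hyperparameter grid: number of clusters crossed with random initializations.
grid = [{"k": k, "seed": s} for k, s in product((2, 3, 4, 5), (0, 1, 2))]


def evaluate(params):
    """Cluster with one hyperparameter setting and attach its evaluation score."""
    labels = kmeans(points, params["k"], params["seed"])
    return {**params, "score": silhouette(points, labels)}


# Fan the grid out over workers, then pick the best-scoring setting.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best = max(results, key=lambda r: r["score"])
print(best)
```

Because every setting is evaluated independently, the same pattern extends to several algorithms or metrics by adding entries to the grid. The choice of metric then determines which labels are ultimately selected.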
