GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Abstract

Background: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of datasets with hundreds of millions of single-cell data points with >40 parameters, originating from thousands of individual samples. Analyzing that amount of high-dimensional data places heavy demands on both the hardware and the software of high-performance computational resources. Current software tools often do not scale to datasets of this size; users are thus forced to downsample the data to manageable sizes, in turn losing accuracy and the ability to detect many underlying complex phenomena.

Results: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study.

Conclusions: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on the data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.
