Automated Contamination Detection in Single-Cell Sequencing

Novel methods for sequencing single-cell DNA offer tremendous opportunities. However, many techniques are still in their infancy, and contamination of samples with foreign DNA remains a major obstacle. In this contribution, we present a pipeline for fast, automated detection of contaminated samples using modern machine learning methods. First, a vectorial representation of the genomic data is obtained from oligonucleotide signatures. Non-linear subspace projections then transform the data into a form suitable for automatic clustering, which makes it possible to distinguish samples containing one genome (a single cluster) from samples containing several. As clustering is an ill-posed problem, the pipeline relies on a careful choice of all methods and parameters involved. We give an overview of the problem and evaluate techniques suitable for this task.
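
As an illustration of the first pipeline step, the following minimal sketch computes an oligonucleotide signature as a normalised k-mer frequency vector. It is not the implementation evaluated in this contribution: the function name `kmer_signature`, the default k = 4 (tetranucleotides), and the decision to skip windows containing ambiguous bases are assumptions made for the example, and reverse-complement k-mers are not merged here.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_signature(sequence, k=4):
    """Return the normalised k-mer frequency vector of a DNA sequence.

    Hypothetical helper for illustration only. The vector has 4**k entries,
    ordered lexicographically over the alphabet A, C, G, T.
    """
    # All possible k-mers in a fixed order, so vectors are comparable.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    seq = sequence.upper()

    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Keep only windows made purely of A, C, G, T (assumption: windows
    # containing N or other ambiguity codes are ignored).
    raw = [counts.get(kmer, 0) for kmer in kmers]
    total = sum(raw)
    if total == 0:
        return [0.0] * len(kmers)

    # Normalise so the signature does not depend on sequence length.
    return [c / total for c in raw]


if __name__ == "__main__":
    signature = kmer_signature("ACGTACGTTTGACCA", k=4)
    print(len(signature))
```

One such vector per contig or read window would then serve as input to the subspace projection and clustering steps.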
