Paper information - Automated Contamination Detection in Single-Cell Sequencing

Automated Contamination Detection in Single-Cell Sequencing

Novel methods for sequencing single-cell DNA offer tremendous opportunities. However, many techniques are still in their infancy, and a major obstacle is sample contamination with foreign DNA. In this contribution, we present a pipeline for fast, automated detection of contaminated samples using modern machine learning methods. First, a vectorial representation of the genomic data is obtained using oligonucleotide signatures. Using non-linear subspace projections, the data are then transformed into a form suitable for automatic clustering. This makes it possible to distinguish samples containing one genome (a single cluster) from samples containing several. Because clustering is an ill-posed problem, the pipeline depends on a careful choice of all methods and parameters involved. We give an overview of the problem and evaluate techniques suitable for this task.
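The first pipeline step, turning sequence fragments into vectors via oligonucleotide signatures, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `kmer_signature`, the default oligonucleotide length `k=4`, and the choice to skip windows containing ambiguous bases are assumptions made for illustration. The abstract does not specify any of them, and reverse-complement merging is omitted for brevity.

```python
from itertools import product


def kmer_signature(sequence, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other character (e.g. N) are skipped.
    Hypothetical helper for illustration; k=4 is an assumed default.
    """
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = sequence.upper()
    # Slide a window of length k over the sequence and count valid k-mers.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1

    total = sum(counts)
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero vector.
        return [0.0] * len(kmers)
    return [c / total for c in counts]


if __name__ == "__main__":
    fragment = "ACGTACGTTTGACCA"
    signature = kmer_signature(fragment)
    print(len(signature), sum(signature))
```

Each fragment then corresponds to one point in a 4^k-dimensional space. In the pipeline described above, such vectors would next be passed to a non-linear projection and a clustering step, which this sketch does not cover.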
