SureTypeSC - a Random Forest and Gaussian mixture predictor of high confidence genotypes in single-cell data

MOTIVATION Accurate genotyping of DNA from a single cell is required for applications such as de novo mutation detection, linkage analysis and lineage tracing. However, achieving high-precision genotyping in the single-cell environment is challenging because of errors introduced by whole genome amplification. Two factors make genotyping from single cells using single nucleotide polymorphism (SNP) arrays challenging: the lack of a comprehensive single-cell dataset with a reference genotype, and the absence of genotyping tools specifically designed to detect noise from the whole genome amplification step. Algorithms designed for bulk DNA genotyping cause significant data loss when used for single-cell applications.

RESULTS In this study, we created a resource of 28.7 million SNPs, typed at high confidence from whole genome amplified DNA from single cells using Illumina SNP bead array technology. The resource was generated from 104 single cells from two cell lines available from the Coriell repository. We used mother-father-proband (trio) information from multiple technical replicates of bulk DNA to establish a high-quality reference genotype for the two cell lines on the SNP array. This enabled us to develop SureTypeSC, a two-stage machine learning algorithm that filters a substantial part of the noise while retaining the majority of the high-quality SNPs. SureTypeSC also uses Bayesian statistics to provide a simple statistical output indicating the confidence of a particular single-cell genotype.

AVAILABILITY The Python implementation of SureTypeSC and sample data are available in the GitHub repository: https://github.com/puko818/SureTypeSC.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
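The two-stage idea named in the title (a Random Forest followed by a Gaussian mixture, with a Bayesian confidence score) can be illustrated with a minimal sketch. This is not the SureTypeSC implementation: the synthetic two-dimensional features, the single-component mixtures, the way the stages are chained, and the 0.9 cutoff are all assumptions made for illustration. It uses scikit-learn's `RandomForestClassifier` and `GaussianMixture`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in features per SNP call (hypothetical, not real array data).
n = 500
good = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(n, 2))   # calls agreeing with the reference
noisy = rng.normal(loc=[0.5, 0.5], scale=0.3, size=(n, 2))  # calls distorted by amplification noise
X = np.vstack([good, noisy])
y = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])

# Stage 1 (assumed): Random Forest trained on calls labelled against a reference genotype.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_label = rf.predict(X)

# Stage 2 (assumed): one Gaussian model per Random Forest class.
gmm_good = GaussianMixture(n_components=1, random_state=0).fit(X[rf_label == 1])
gmm_noisy = GaussianMixture(n_components=1, random_state=0).fit(X[rf_label == 0])


def posterior_good(x):
    """Bayes' rule: P(good | x) from the two class-conditional densities."""
    prior_good = np.mean(rf_label == 1)
    log_good = gmm_good.score_samples(x) + np.log(prior_good)
    log_noisy = gmm_noisy.score_samples(x) + np.log(1.0 - prior_good)
    # Normalise in log space to avoid overflow.
    return np.exp(log_good - np.logaddexp(log_good, log_noisy))


confidence = posterior_good(X)
keep = confidence >= 0.9  # illustrative cutoff, not a recommended value
print(f"retained {keep.sum()} of {len(keep)} calls")
```

In this sketch the posterior plays the role of the per-genotype confidence described above; a real pipeline would derive features from array intensities and validate the cutoff against a reference genotype.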
