A Study of Cell-Free DNA Fragmentation Pattern and Its Application in DNA Sample Type Classification

Plasma cell-free DNA (cfDNA) has characteristic fragmentation patterns, which produce non-random base content curves in the first cycles of the sequencing data. We studied these patterns and found that we could determine whether a sample is cfDNA by examining only the first 10 cycles of its base content curves. We analyzed 3,189 FastQ files, comprising 1,442 plasma cfDNA, 1,234 genomic DNA, 507 FFPE tumour DNA, and 6 urinary cfDNA samples. In-depth analysis of these data showed that the patterns were stable enough to distinguish cfDNA from other kinds of DNA samples. Based on this finding, we built classification models that recognize cfDNA samples from their sequencing data. Pattern recognition models were trained with different classification algorithms, such as k-nearest neighbors (KNN), random forest, and support vector machine (SVM). The results of 1,000 iterations of .632+ bootstrapping showed that all of these classifiers achieved an average accuracy higher than 98 percent, indicating that the cfDNA patterns are distinctive and make the dataset highly separable. The best result was obtained with a random forest classifier, with an average accuracy of 99.89 percent ($\sigma = 0.00068$). A tool called CfdnaPattern (http://github.com/OpenGene/CfdnaPattern) has been developed to train the model and to predict whether a sample is cfDNA.
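
To make the idea of "base content curves of the first 10 cycles" concrete, the following is a minimal Python sketch of how such a feature vector might be computed from FastQ reads. It is an illustration under stated assumptions, not the CfdnaPattern implementation: the function names `read_fastq_sequences` and `base_content_features`, the A/T/C/G column order, and the choice to skip non-ACGT bases are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List

BASES = "ATCG"  # assumed column order; the tool's actual order is not stated


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FastQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def base_content_features(reads: Iterable[str], cycles: int = 10) -> List[float]:
    """Return per-cycle base fractions for the first `cycles` positions.

    The result is a flat vector of length cycles * 4, ordered cycle by
    cycle as (A, T, C, G). Bases other than A/T/C/G (for example N) are
    skipped, and reads shorter than a cycle do not contribute to it.
    """
    counts = [Counter() for _ in range(cycles)]
    for read in reads:
        for i, base in enumerate(read[:cycles].upper()):
            if base in BASES:
                counts[i][base] += 1

    features: List[float] = []
    for cycle_counts in counts:
        total = sum(cycle_counts.values())
        for base in BASES:
            # A cycle with no counted bases yields zeros (an assumption).
            features.append(cycle_counts[base] / total if total else 0.0)
    return features


if __name__ == "__main__":
    # Tiny hypothetical FastQ content, for illustration only.
    fastq_lines = ["@r1", "ACGT", "+", "IIII", "@r2", "AGGT", "+", "IIII"]
    reads = list(read_fastq_sequences(fastq_lines))
    print(base_content_features(reads, cycles=2))
```

With one such 40-element vector per FastQ file and a label per file, the vectors could then be passed to a standard classifier (for example a random forest in scikit-learn) and evaluated by bootstrapping, in the spirit of the approach described above.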
