Unsupervised detection of fragment length signatures of circulating tumor DNA using non-negative matrix factorization

Sequencing of cell-free DNA (cfDNA) is currently used to detect cancer by searching for both mutational and non-mutational alterations. Recent work has shown that the length distribution of cfDNA fragments from a cancer patient can inform tumor load and type. Here, we propose non-negative matrix factorization (NMF) of fragment length distributions as a novel and completely unsupervised method for studying fragment length patterns in cfDNA. Using shallow whole-genome sequencing (sWGS) of cfDNA from a cohort of patients with metastatic castration-resistant prostate cancer (mCRPC), we demonstrate that NMF accurately infers the true tumor fragment length distribution as one of its components, and that the sample weights of this component correlate with circulating tumor DNA (ctDNA) levels (r = 0.75). We further demonstrate that using several NMF components enables accurate cancer detection in data from various early-stage cancers (AUC = 0.96). Finally, we show that NMF, when applied across genomic regions, can be used to discover fragment length signatures associated with open chromatin.
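To make the idea concrete, the following is a minimal sketch of NMF applied to a samples-by-fragment-length matrix. It is not the authors' implementation: it uses the multiplicative updates of Lee and Seung on purely synthetic data, and the peak positions (167 bp and 145 bp), peak widths, read depth, and tumor fractions are illustrative assumptions rather than values from the study. Each row of H can be read as a candidate fragment length signature and each row of W as the per-sample weights of those signatures.

```python
import numpy as np

def nmf(V, k, n_iter=1000, seed=0, eps=1e-12):
    """Approximate V (samples x lengths) as W @ H with W, H >= 0.

    Uses Lee-Seung multiplicative updates for the Frobenius objective.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so every signature (row of H) sums to 1, like a distribution;
    # the inverse scaling is absorbed into W so W @ H is unchanged.
    scale = H.sum(axis=1)
    H /= scale[:, None]
    W *= scale[None, :]
    return W, H

# --- Synthetic data (all values below are illustrative assumptions) ---
lengths = np.arange(50, 351)  # fragment lengths in bp

def peak(mu, sigma):
    """Hypothetical bell-shaped fragment length distribution."""
    p = np.exp(-0.5 * ((lengths - mu) / sigma) ** 2)
    return p / p.sum()

normal_sig = peak(167, 15)  # assumed non-tumor signature
tumor_sig = peak(145, 20)   # assumed shorter tumor signature

rng = np.random.default_rng(1)
tumor_fraction = rng.uniform(0.0, 0.6, size=40)  # simulated ctDNA levels
mix = np.outer(1 - tumor_fraction, normal_sig) + np.outer(tumor_fraction, tumor_sig)

# Simulate finite sequencing depth, then normalize each sample to frequencies.
counts = np.array([rng.multinomial(100_000, p) for p in mix])
V = counts / counts.sum(axis=1, keepdims=True)

# --- Unsupervised factorization ---
W, H = nmf(V, k=2)

# Call the component with the shorter mean length the tumor-like one.
mean_len = H @ lengths
tumor_idx = int(np.argmin(mean_len))
weights = W[:, tumor_idx] / W.sum(axis=1)

r = np.corrcoef(weights, tumor_fraction)[0, 1]
print(f"Pearson r between component weight and simulated tumor fraction: {r:.2f}")
```

No labels are used during the factorization; the simulated tumor fraction enters only afterwards, to check whether the weights of the shorter component track it. On real data the components are not guaranteed to separate this cleanly, and the number of components is a modeling choice.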
