Factor Models for Cancer Signatures

We present a novel method for extracting cancer signatures by applying statistical risk models (http://ssrn.com/abstract=2732453) from quantitative finance to cancer genome data. Using 1389 whole-genome-sequenced samples from 14 cancers, we identify an "overall" mode of somatic mutational noise. We give a prescription for factoring out this noise and source code for fixing the number of signatures. We apply nonnegative matrix factorization (NMF) to genome data aggregated by cancer subtype and filtered using our method. The resultant signatures have substantially lower variability than those from unfiltered data, and the computational cost of signature extraction is cut by about a factor of 10. We find 3 novel cancer signatures, including a liver-cancer-dominant signature (96% contribution) and a renal cell carcinoma signature (70% contribution). Our method accelerates finding new cancer signatures and improves their overall stability. Reciprocally, the methods for extracting cancer signatures could have interesting applications in quantitative finance.
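The filter-then-factorize pipeline described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not spell out: the "overall" mode is removed here by subtracting, for each column (cancer type), the mean log-count across mutation categories, by analogy with cross-sectional demeaning of returns in statistical risk models; the NMF uses the standard multiplicative updates for the Frobenius objective; the 96 mutation categories, the random input matrix, and the names remove_overall_mode and nmf are all hypothetical illustrations, not the authors' source code.

```python
import numpy as np


def remove_overall_mode(counts):
    """Demean log-counts across mutation categories (rows) for each column.

    Assumption: this cross-sectional demeaning stands in for the paper's
    prescription for factoring out the "overall" mode.
    """
    log_counts = np.log1p(counts)  # log(1 + count) is safe for zero counts
    demeaned = log_counts - log_counts.mean(axis=0, keepdims=True)
    return np.exp(demeaned)  # map back to strictly positive values for NMF


def nmf(v, n_signatures, n_iter=500, seed=0, eps=1e-12):
    """Basic NMF via multiplicative updates, so that v is approximately w @ h."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = v.shape
    w = rng.random((n_rows, n_signatures)) + eps  # signatures (columns)
    h = rng.random((n_signatures, n_cols)) + eps  # exposures per cancer type
    for _ in range(n_iter):
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h


# Hypothetical input: 96 mutation categories by 14 cancer types.
rng = np.random.default_rng(1)
counts = rng.poisson(lam=50.0, size=(96, 14)).astype(float)

filtered = remove_overall_mode(counts)
signatures, exposures = nmf(filtered, n_signatures=3)
```

In this sketch the number of signatures is passed in by hand; the abstract states that the authors provide source code for fixing that number, which is not reproduced here.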
