*K-means and cluster models for cancer signatures

We present the *K-means clustering algorithm and its source code, expanding on the statistical clustering methods applied to quantitative finance in https://ssrn.com/abstract=2802753. *K-means is statistically deterministic and does not require specifying initial centers and the like. We apply *K-means to extract cancer signatures from genome data without using nonnegative matrix factorization (NMF). The computational cost of *K-means is a fraction of that of NMF. Using 1389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers, indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.
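To illustrate why independence from initial centers matters, the following is a minimal sketch, not the *K-means algorithm itself, whose details are not given in this abstract. It assumes a plain Lloyd-style k-means started from randomly chosen data points, repeats it many times, and reports the partition that occurs most often. The function names `kmeans`, `canonical` and `most_frequent_partition` are hypothetical and introduced here only for illustration.

```python
import numpy as np
from collections import Counter


def kmeans(x, k, rng, n_iter=100):
    """Plain Lloyd iterations started from k randomly chosen data points."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distances from every point to every center: shape (n, k).
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def canonical(labels):
    """Relabel clusters by order of first appearance so equal partitions compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(l, len(mapping)) for l in labels.tolist())


def most_frequent_partition(x, k, n_runs=200, seed=0):
    """Run k-means n_runs times and return the most common partition and its frequency.

    Illustrative assumption only: this is a simple aggregation over random
    initializations, not the aggregation scheme used by *K-means.
    """
    rng = np.random.default_rng(seed)
    counts = Counter(canonical(kmeans(x, k, rng)) for _ in range(n_runs))
    partition, hits = counts.most_common(1)[0]
    return np.array(partition), hits / n_runs


if __name__ == "__main__":
    # Four points forming two tight pairs far apart along the first axis.
    x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
    partition, frequency = most_frequent_partition(x, k=2)
    print(partition, frequency)
```

In this toy data set, a single run that happens to start from two points of the same pair settles into a poor local optimum that splits the points along the second axis, so individual runs disagree. Counting partitions over many runs makes the result depend on the data and the sampling statistics rather than on any one choice of initial centers.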
