Fast randomization of large genomic datasets while preserving alteration counts

Motivation: Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a ‘mutually exclusive’ manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimum number of them must be executed before the simulated datasets can be regarded as independent draws from the null model. This number has previously been determined empirically to be a linear function of the total number of variants, which makes the process computationally expensive.

Results: We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, which includes new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in the time required, relative to existing implementations and bounds, while obtaining equivalent P-values. We therefore propose BiRewire as a tool for studying statistical properties of genomic datasets and of other data that can be modeled as bipartite networks.

Availability and implementation: BiRewire is available on Bioconductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html

Contact: iorio@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
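To make the switching-step described under Motivation concrete, the following is a minimal Python sketch, not the BiRewire implementation. It treats a binary patient-by-gene matrix as the bipartite network, picks two edges at random, and swaps their endpoints only when the swap creates no duplicate edge, so that every row and column total is left unchanged. The function name `switching_steps` and the parameters `n_steps` and `seed` are hypothetical, and the number of steps is left to the caller rather than taken from the bound presented here.

```python
import random


def switching_steps(matrix, n_steps, seed=None):
    """Attempt n_steps switching-steps on a binary matrix (list of lists).

    Illustrative sketch only: names and parameters are assumptions,
    not the BiRewire API. Returns a new matrix; the input is not modified.
    """
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    # Edges of the bipartite network: (row, column) pairs whose entry is 1.
    edges = [(i, j) for i, row in enumerate(m) for j, v in enumerate(row) if v]
    if len(edges) < 2:
        return m
    for _ in range(n_steps):
        # Draw two distinct edges uniformly at random.
        a, b = rng.sample(range(len(edges)), 2)
        (i, j), (k, l) = edges[a], edges[b]
        # Swap endpoints (i,j),(k,l) -> (i,l),(k,j) only if both new
        # edges are absent; otherwise the attempt is rejected.
        if i != k and j != l and m[i][l] == 0 and m[k][j] == 0:
            m[i][j] = m[k][l] = 0
            m[i][l] = m[k][j] = 1
            edges[a], edges[b] = (i, l), (k, j)
    return m


if __name__ == "__main__":
    # Toy dataset: rows are patients, columns are genes.
    toy = [[1, 0, 1, 0],
           [0, 1, 0, 0],
           [1, 1, 0, 1]]
    for row in switching_steps(toy, n_steps=200, seed=1):
        print(row)
```

Rejected attempts still count as steps in this sketch, which is one common convention for such Markov chains; whether BiRewire counts them the same way is not stated in the abstract.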
