Non Parametric Methods for Genomic Inference

This paper grew out of a number of examples arising in data from the ENCODE project (Birney et al., 2007). Variations of some of the methods described here were applied at various places in that paper, as well as in Margulies et al., 2007, to assess significance and compute confidence bounds for statistics that operate along a genomic sequence. The background on these methods is described in cookbook form in the supplements to these papers, and the goal of this paper is to describe them in more detail and with more rigor. We begin in Section 1.2 with some concrete examples from the data mentioned in the papers above, as well as from other types of genomic data, and proceed with a motivated description of our model in Section 2. Our methods are discussed both qualitatively and mathematically in Sections 3 and 4. Section 5 contains results from simulation studies and real data analysis. Finally, an appendix with proofs of the theorems stated in Sections 3 and 4 completes the paper. Essentially, we will argue that, in making inference about statistics computed from “large” stretches of the genome, in the absence of real knowledge about the evolutionary path which led to the genome in question, the best

[1]  Politis,et al.  Subsampling (Springer Series in Statistics): Subsampling for Stationary Time Series , 1999 .

[2]  David R. Wolf,et al.  Base compositional structure of genomes. , 1992, Genomics.

[3]  P. Bickel,et al.  On the Choice of m in the m Out of n Bootstrap and its Application to Confidence Bounds for Extreme Percentiles , 2005 .

[4]  F. Götze,et al.  Adaptive choice of bootstrap sample sizes , 2001 .

[5]  G Bernardi,et al.  The mosaic genome of warm-blooded vertebrates. , 1985, Science.

[6]  D. A. Edwards On the existence of probability measures with given marginals , 1978 .

[7]  B. Efron Nonparametric standard errors and confidence intervals , 1981 .

[8]  Ivo Grosse,et al.  Applications of Recursive Segmentation to the Analysis of DNA Sequences , 2002, Comput. Chem..

[9]  R. Beran Prepivoting Test Statistics: A Bootstrap View of Asymptotic Refinements , 1988 .

[10]  Bradley I. Coleman,et al.  An intermediate grade of finished genomic sequence suitable for comparative analyses. , 2004, Genome research.

[11]  Arnold J Stromberg,et al.  Subsampling , 2001, Technometrics.

[12]  D. Siegmund,et al.  Tests for a change-point , 1987 .

[13]  J. Morgan,et al.  Problems in the Analysis of Survey Data, and a Proposal , 1963 .

[14]  Han Yu,et al.  Cluster Analyzer for Transcription Sites (CATS): a C++-based program for identifying clustered transcription factor binding sites , 2004, Bioinform..

[15]  B. Efron,et al.  Did Shakespeare write a newly-discovered poem? , 1987 .

[16]  M. Srivastava,et al.  On Tests for Detecting Change in Mean , 1975 .

[17]  William Stafford Noble,et al.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project , 2007, Nature.

[18]  H. Künsch The Jackknife and the Bootstrap for General Stationary Observations , 1989 .

[19]  W Li,et al.  Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes. , 1998, Genome research.

[20]  P. Doukhan,et al.  Weak Dependence: With Examples and Applications , 2007 .

[21]  Keunsoo Kang,et al.  Evolutionary Conserved Motif Finder (ECMFinder) for genome-wide identification of clustered YY1- and CTCF-binding sites , 2009, Nucleic acids research.

[22]  E. Mammen The Bootstrap and Edgeworth Expansion , 1997 .

[23]  H. Müller,et al.  Statistical methods for DNA sequence segmentation , 1998 .

[24]  Colin N. Dewey,et al.  Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. , 2007, Genome research.

[25]  Gary A. Churchill,et al.  Hidden Markov Chains and the Analysis of Genome Structure , 1992, Comput. Chem..

[26]  D. Conrad,et al.  Global variation in copy number in the human genome , 2006, Nature.

[27]  N. Carter Methods and strategies for analyzing copy number variation using DNA microarrays , 2007, Nature Genetics.

[28]  M. Wigler,et al.  Circular binary segmentation for the analysis of array-based DNA copy number data. , 2004, Biostatistics.

[29]  Joseph P. Romano,et al.  Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions , 1994 .

[30]  G Bernardi,et al.  Isochores and the evolutionary genomics of vertebrates. , 2000, Gene.

[31]  J. T. Clerc Computers in chemistry , 1982 .

[32]  David Letson,et al.  Better Confidence Intervals: The Double Bootstrap with No Pivot , 1998 .

[33]  R. Curnow,et al.  Maximum likelihood estimation of multiple change points , 1990 .