Paper information - Gene hunting with hidden Markov model knockoffs

Gene hunting with hidden Markov model knockoffs

Modern scientific studies often require the identification of a subset of explanatory variables. Several statistical methods have been developed to automate this task, and the framework of knockoffs has been proposed as a general solution for variable selection under rigorous Type I error control, without relying on strong modelling assumptions. In this paper, we extend the methodology of knockoffs to problems where the distribution of the covariates can be described by a hidden Markov model. We develop an exact and efficient algorithm to sample knockoff variables in this setting and then argue that, combined with the existing selective framework, this provides a natural and powerful tool for inference in genome‐wide association studies with guaranteed false discovery rate control. We apply our method to datasets on Crohn's disease and some continuous phenotypes.
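As a rough illustration of the selection step that the abstract calls the "existing selective framework", the sketch below applies the standard knockoff+ threshold to a vector of feature statistics. It assumes that each variable already has a statistic `W[j]` whose large positive values favor the original variable over its knockoff. How those statistics are computed, and how the hidden Markov model knockoffs themselves are sampled, is not covered here. The function name `knockoff_plus_select` and the parameter `q` are hypothetical names chosen for this example, not taken from the paper or its software.

```python
import numpy as np

def knockoff_plus_select(W, q=0.1):
    """Return indices selected by the knockoff+ threshold at target FDR level q.

    W is assumed to hold one statistic per variable (hypothetical input);
    a large positive value suggests the original variable beats its knockoff.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: the distinct nonzero magnitudes of W, in ascending order.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Conservative estimate of the false discovery proportion at threshold t.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            # The smallest threshold meeting the target: select everything at or above it.
            return np.flatnonzero(W >= t)
    # No threshold meets the target level, so nothing is selected.
    return np.array([], dtype=int)
```

Calling `knockoff_plus_select(W, q=0.1)` returns the indices of the selected variables, or an empty array when no threshold brings the estimated false discovery proportion below `q`.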
