Very large scale ReliefF for genome-wide association analysis

Most common diseases result from complex nonlinear interactions between multiple genetic and environmental components. There is thus a pressing need for new computational methods that can detect nonlinearly interacting single nucleotide polymorphisms (SNPs) associated with disease from among up to hundreds of thousands of candidate SNPs. Recently, some progress has been made using feature selection algorithms based on weights from the ReliefF data mining algorithm, on sets of up to 1500 SNPs. However, the accuracy of ReliefF does not scale up to the sizes needed for truly large genome-scale SNP association studies. We propose a population-based variant, dubbed VLSReliefF, which mitigates this performance drop by stochastically applying ReliefF to SNP subsets and then assigning each SNP the maximum ReliefF weight it achieved in any subset. A heuristic method is proposed for determining the optimal subset size as a function of heritability, sample size, and order of interactions. Aggressive iterative pruning of SNPs with low VLSReliefF weights can then be used for nonlinear feature identification in genome-scale SNP sets. The method is validated using a variety of computational experiments on synthetic datasets of up to 100,000 SNPs.
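
The subset-and-maximum scheme and the iterative pruning step can be illustrated with a minimal sketch. It assumes genotypes coded as small integers, binary case/control labels, and a simplified Relief-style scorer based on Hamming distance, not the full ReliefF algorithm. The function names, the `keep_fraction` and `target` parameters, and the pruning schedule are hypothetical choices made for illustration and are not taken from the paper.

```python
import random


def relieff_weights(X, y, k=1):
    """Simplified Relief-style weights for categorical features.

    For every sample, the k nearest hits (same class) lower a feature's
    weight when the feature differs, and the k nearest misses (other
    class) raise it. Distance is the Hamming distance over all features.
    """
    n, m = len(X), len(X[0])
    w = [0.0] * m
    for i in range(n):
        dists = sorted(
            (sum(a != b for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        hits = [j for _, j in dists if y[j] == y[i]][:k]
        misses = [j for _, j in dists if y[j] != y[i]][:k]
        for f in range(m):
            for j in hits:
                w[f] -= (X[i][f] != X[j][f]) / (n * k)
            for j in misses:
                w[f] += (X[i][f] != X[j][f]) / (n * k)
    return w


def vls_relieff(X, y, subset_size, n_subsets, k=1, seed=0):
    """Score random SNP subsets and keep each SNP's maximum weight.

    SNPs that are never sampled keep a weight of -inf.
    """
    rng = random.Random(seed)
    m = len(X[0])
    best = [float("-inf")] * m
    for _ in range(n_subsets):
        subset = rng.sample(range(m), subset_size)
        Xs = [[row[f] for f in subset] for row in X]
        for f, wf in zip(subset, relieff_weights(Xs, y, k)):
            best[f] = max(best[f], wf)
    return best


def iterative_prune(X, y, subset_size, n_subsets,
                    keep_fraction=0.5, target=2, seed=0):
    """Repeatedly drop the lowest-weighted SNPs until `target` remain.

    Assumes 0 < keep_fraction < 1 (hypothetical pruning schedule).
    """
    active = list(range(len(X[0])))
    while len(active) > target:
        Xa = [[row[f] for f in active] for row in X]
        size = min(subset_size, len(active))
        w = vls_relieff(Xa, y, size, n_subsets, seed=seed)
        n_keep = max(target, int(len(active) * keep_fraction))
        ranked = sorted(range(len(active)), key=lambda a: w[a], reverse=True)
        active = sorted(active[a] for a in ranked[:n_keep])
    return active
```

Calling `vls_relieff(X, y, subset_size, n_subsets)` on a list of genotype rows `X` and labels `y` returns one weight per SNP. In practice `n_subsets` would need to be large enough that every SNP, and ideally every interacting group of SNPs, is sampled together at least once.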
