Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

Recent advances in information technology in the biomedical sciences and other applied areas have created numerous large, diverse data sets with high dimensional feature spaces, which provide a tremendous amount of information and new opportunities for improving the quality of human life. They also create great challenges: new data arrive continuously, and researchers must convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using single nucleotide polymorphism (SNP) data have become increasingly popular in biomedical research in recent years. In this paper, we review recent statistical advances and challenges in analyzing correlated high dimensional SNP data in genomic association studies of complex diseases. The review covers both general feature reduction approaches for high dimensional correlated data and approaches specific to SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling, and machine learning methods, with an emphasis on how to identify interacting loci.
