Methods of integrating data to uncover genotype–phenotype interactions

Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data to extensive transcriptomic, methylomic and metabolomic data. A key goal in analysing these data is to identify effective models that predict phenotypic traits and outcomes, thereby elucidating important biomarkers and generating insights into the genetic underpinnings of the heritability of complex traits. Powerful analysis strategies are still needed to fully harness these comprehensive high-throughput data, identifying true associations while reducing the number of false associations. In this Review, we explore emerging approaches for data integration, including meta-dimensional and multi-staged analyses, which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. The use and further development of these approaches may improve our understanding of the relationship between genomic variation and human phenotypes.
