Big data challenges in genomics

Abstract: Recent developments in biotechnology, especially next-generation sequencing, have led to an explosion of genomic data. These data are big in both volume and diversity. They carry far more information than earlier data sets, but they also pose unprecedented challenges for data analysis. In this article, we discuss the challenges and opportunities that big data bring to genomics research, along with possible solutions that can serve as a basis for future research.
