Big data challenges and opportunities in high-throughput sequencing

The advent of high-throughput sequencing, coupled with advances in computational methods, has enabled genome-wide dissection of genetics, evolution, and disease at nucleotide resolution. The discoveries derived from genomics promise benefits to basic research, biotechnology, and medicine; however, the speed and affordability of sequencing have resulted in a flood of “big data” in the life sciences. In addition, the current heterogeneity of sequencing platforms and the diversity of applications complicate the development of analysis tools, which has slowed widespread adoption of the technology. Making sense of the data and delivering actionable insight require improved computational infrastructure, new methods for interpreting the data, and unique collaborative approaches. Here we review the role of big data in genomics, its impact on the development of tools for collaborative analysis of genomes, and the successes and ongoing challenges in coping with big data.
