Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies

Sequencing the human genome began in 1994, and 10 years of work were necessary to provide a nearly complete sequence. Nowadays, NGS technologies allow a whole human genome to be sequenced in a few days. This deluge of data challenges scientists in many ways: they face data management issues as well as analysis and visualization difficulties caused by the limitations of current bioinformatics tools. In this paper, we describe how the NGS Big Data revolution changes the way data are managed and analysed. We present how biologists are confronted with an abundance of methods, tools, and data formats. To overcome these problems, we focus on Big Data Information Technology innovations from the web and business intelligence. We underline the interest of NoSQL databases, which are much more efficient than relational databases. Since Big Data leads to a loss of interactivity with data during analysis because of high processing times, we describe solutions from Business Intelligence that allow one to regain interactivity whatever the volume of data. We illustrate this point with a focus on the Amadea platform. Finally, we discuss the visualization challenges posed by Big Data and present the latest innovations in JavaScript graphic libraries.
