Human Analysts at Superhuman Scales: What Has Friendly Software To Do?

As analysts are expected to process more information in less time, creators of big data software must deliver greater efficiency. Ray, our group's usable, scalable genome assembler, addresses big data problems by using resources efficiently and producing a single solution that is correct, conservative, and timely. Only by abstracting the size of the data away from both the computers and the humans can the real scientific question, often complex in itself, eventually be answered. To draw a curtain over the specific computational machinery of big data, we developed RayPlatform, a programming framework that lets users concentrate on their domain-specific problems. RayPlatform is a parallel message-passing software framework that runs on clouds, supercomputers, and desktops alike. Using established technologies such as C++ and MPI (Message Passing Interface), we handle the genomes of hundreds of species, from viruses to plants, on machines ranging from desktop computers to supercomputers. From this experience, we present insights on making computer time more useful and user time much more valuable.
