Efficient Gene Assembly and Identification for Many Genome Samples

The development of next-generation sequencing (NGS) technology has advanced genomics research in many application domains. Metagenomics is one such powerful approach for studying large communities of microbial species. For unknown species in metagenomic samples, gene assembly and identification without a reference genome is a very challenging problem. One way to address it is to use distributed gene assembly software that handles multiple metagenome samples. In this paper, building on our previously developed, highly scalable gene assembly software SWAP, we present a workflow called WFswap that assembles large genomic data sets drawn from many samples and identifies more genes. Our results suggest that WFswap identifies 94.2% of the benchmark genes when tested on 19 metagenomic samples that contain Bifidobacterium animalis subsp. lactis CNCM I-2494. WFswap also performed better than WFsoap, a similar workflow that uses SOAPdenovo2 for gene assembly.
