NGS-FC: A Next-Generation Sequencing Data Format Converter

With the widespread adoption of next-generation sequencing (NGS) technologies, millions of sequences have been produced. Many databases have been created to store and organize high-throughput sequencing data, and numerous analysis programs and tools have been developed in recent years. Most of them use their own formats for data representation and storage, so data interoperability has become a crucial challenge. Many tools have been developed to convert NGS data from one format to another, but most of them handle only a limited set of specific formats. Here, we present NGS-FC (Next-Generation Sequencing Format Converter), a framework that supports conversion between multiple formats. It currently supports 14 formats and provides interfaces that allow users to improve the existing converters and add new ones. Moreover, NGS-FC achieved competitive overall performance in RAM usage and running time compared with some existing converters. The software is written in Java and can be run as a standalone program. The source code and documentation are freely available at http://sysbio.suda.edu.cn/NGS-FC.
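To make the idea of a pluggable converter interface concrete, the following is a minimal sketch in Java, not the actual NGS-FC API. The interface name `FormatConverter`, its methods, and the class `FastqToFastaConverter` are hypothetical names chosen for illustration; the real interfaces are described in the project documentation. The example converts four-line FASTQ records to FASTA and streams record by record, so memory use does not grow with file size.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical plug-in interface (an assumption, not NGS-FC's real API).
interface FormatConverter {
    String sourceFormat();
    String targetFormat();
    void convert(Reader in, Writer out) throws IOException;
}

// Illustrative converter: four-line FASTQ records to FASTA.
final class FastqToFastaConverter implements FormatConverter {
    @Override public String sourceFormat() { return "fastq"; }
    @Override public String targetFormat() { return "fasta"; }

    @Override
    public void convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String header;
        while ((header = reader.readLine()) != null) {
            if (header.isEmpty()) {
                continue; // tolerate blank lines between records
            }
            String sequence = reader.readLine();
            String separator = reader.readLine();
            String quality = reader.readLine();
            if (!header.startsWith("@") || sequence == null || separator == null
                    || !separator.startsWith("+") || quality == null) {
                throw new IOException("Malformed FASTQ record near: " + header);
            }
            // FASTA keeps the read name and sequence; quality scores are dropped.
            writer.write(">" + header.substring(1));
            writer.write('\n');
            writer.write(sequence);
            writer.write('\n');
        }
        writer.flush();
    }

    // Convenience wrapper for small in-memory inputs.
    static String convertString(String fastq) {
        StringWriter out = new StringWriter();
        try {
            new FastqToFastaConverter().convert(new StringReader(fastq), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Under this assumed design, adding a new format would mean writing one more class that implements the same interface and reports its own source and target format names.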

[1]  Michael Hackenberg,et al.  NGSmethDB: a database for next-generation sequencing single-cytosine-resolution DNA methylation data , 2010, Nucleic Acids Res..

[2]  Antony V. Cox,et al.  The Ensembl Web site: mechanics of a genome browser. , 2004, Genome research.

[3]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[4]  Pavel V. Baranov,et al.  DARNED: a DAtabase of RNa EDiting in humans , 2010, Bioinform..

[5]  Borja Sotomayor,et al.  Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses , 2014, J. Biomed. Informatics.

[6]  David Galas,et al.  Complexity of the microRNA repertoire revealed by next-generation sequencing. , 2010, RNA.

[7]  Hideaki Sugawara,et al.  The Sequence Read Archive , 2010, Nucleic Acids Res..

[8]  Eugene Kindler,et al.  Object-oriented simulation of systems with sophisticated control , 2005, Int. J. Gen. Syst..

[9]  Aaron R. Quinlan,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2022 .

[10]  Michael R. Speicher,et al.  A survey of tools for variant analysis of next-generation genome sequencing data , 2013, Briefings Bioinform..

[11]  Yunlong Liu,et al.  NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets , 2013, Bioinform..

[12]  Gonçalo R. Abecasis,et al.  The variant call format and VCFtools , 2011, Bioinform..

[13]  Paul D. Shaw,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2022 .

[14]  Li Chen,et al.  hmChIP: a database and web server for exploring publicly available human and mouse ChIP-seq and ChIP-chip data , 2011, Bioinform..

[15]  Peter M. Rice,et al.  The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants , 2009, Nucleic acids research.

[16]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[17]  Galt P. Barber,et al.  BigWig and BigBed: enabling browsing of large distributed datasets , 2010, Bioinform..

[18]  Christoph Endrullat,et al.  Standardization and quality management in next-generation sequencing , 2016, Applied & translational genomics.

[19]  Jong-Il Kim,et al.  TIARA: a database for accurate analysis of multiple personal genomes based on cross-technology , 2010, Nucleic Acids Res..

[20]  Florent E. Angly,et al.  Next Generation Sequence Assembly with AMOS , 2011, Current protocols in bioinformatics.

[21]  Dennis B. Troup,et al.  NCBI GEO: archive for high-throughput functional genomic data , 2008, Nucleic Acids Res..

[22]  Hui Zhou,et al.  ncRNAimprint: a comprehensive database of mammalian imprinted noncoding RNAs. , 2010, RNA.

[23]  Alexandre G. de Brevern,et al.  Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies , 2015, BioMed research international.

[24]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[25]  Chris T. A. Evelo,et al.  The systems biology format converter , 2016, BMC Bioinformatics.

[26]  Hui Zhou,et al.  starBase: a database for exploring microRNA–mRNA interaction maps from Argonaute CLIP-Seq and Degradome-Seq data , 2010, Nucleic Acids Res..

[27]  Bairong Shen,et al.  A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies , 2011, PloS one.

[28]  Hui Zhou,et al.  deepBase: a database for deeply annotating and mining deep sequencing data , 2009, Nucleic Acids Res..

[29]  Tzy-Hwa Kathy Tzeng,et al.  Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools , 2016, PloS one.

[30]  W. J. Kent,et al.  The UCSC Genome Browser , 2003, Current protocols in bioinformatics.

[31]  Suzanne Rose Huge Data-Sharing Project Launched. , 2016, Cancer discovery.

[32]  Mohsen Khorshid,et al.  CLIPZ: a database and analysis environment for experimentally determined binding sites of RNA-binding proteins , 2010, Nucleic Acids Res..

[33]  John C. Mitchell,et al.  Concepts in programming languages , 2002 .

[34]  Karen Eilbeck,et al.  A standard variation file format for human genome sequences , 2010, Genome Biology.

[35]  Kessy Abarenkov,et al.  Towards standardization of the description and publication of next-generation sequencing datasets of fungal communities. , 2011, The New phytologist.

[36]  Gunnar Rätsch,et al.  rQuant.web: a tool for RNA-Seq-based transcript quantitation , 2010, Nucleic Acids Res..