Currently, the National Center for Biotechnology Information (NCBI) assigns an individual taxonomy identifier to each distinct influenza virus isolate submitted to GenBank. To support this practice, individual flu isolates must be manually added to the NCBI taxonomy database and unique taxonomy identifiers generated. This added layer of manual processing is unique to influenza virus and prevents automation of the flu sequence submission process. Here we outline a new NCBI policy that normalizes influenza virus taxonomy processing but maintains features supported by the previous approach. This change will reduce the amount of manual handling necessary for flu submissions and pave the way for increased automation of the submission process. While this automation may disrupt some historic practices, it will better align influenza virus data processing with that of other viruses and ultimately lower the submission burden on data providers.

Introduction

GenBank is a member of the International Nucleotide Sequence Database Collaboration (INSDC) (Cochrane et al. 2016), a group of data repositories dedicated to providing public access to biological sequence data. Viral taxonomy within INSDC databases follows the guidelines provided by the International Committee on Taxonomy of Viruses (ICTV). The scope of the ICTV mandate extends from species to higher-level taxa, and no subspecific taxa are maintained by the ICTV (Adams et al. 2017).

All viral sequences submitted to GenBank and other INSDC repositories are assigned to a species. Sequences from characterized viruses are assigned to their pre-existing species. Sequences from novel viruses are assigned to newly created, unclassified species. Typically, subspecific taxonomic ranks are not created at the time of submission, though some formally unranked subspecific taxa are made during post-submission taxonomic revisions.
Creation of new viral taxa within the NCBI taxonomy database, whether families, species, or subspecific ranks, requires manual validation and database operations.

There are currently more than 550,000 Influenzavirus A, B, and C nucleotide sequences in GenBank, nearly twenty percent of the entire viral nucleotide sequence content of this database (see Table 1). These sequences represent a coordinated effort by the international scientific community to share critical public health data (Bao et al. 2008), and it is imperative that GenBank provide efficient data distribution pathways to support this and similar efforts. Given the number of influenza virus sequences generated by the scientific community, efficient distribution to GenBank can only be sustained through increased automation of the submission process.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.3428v1 | CC BY 4.0 Open Access | rec: 22 Nov 2017, publ: 22 Nov 2017
[1] R. L. Harrison, et al. 50 years of the International Committee on Taxonomy of Viruses: progress and prospects. Archives of Virology, 2017.
[2] Tatiana A. Tatusova, et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res., 2011.
[3] G. Cochrane, et al. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res., 2011.
[4] T. Tatusova, et al. The Influenza Virus Resource at the National Center for Biotechnology Information. Journal of Virology, 2007.