Mining the NCBI Influenza Sequence Database: adaptive grouping of BLAST results using precalculated neighbor indexing

The Influenza Virus Resource and other Virus Variation Resources at NCBI provide enhanced web-based visualization tools for exploratory analysis of influenza sequence data. Despite these improvements in data analysis, the initial data retrieval remains unsophisticated and frequently produces huge, imbalanced datasets because the database contains so many identical and nearly identical sequences. We propose a data mining algorithm that organizes the reported sequences into groups based on their relatedness to the query sequence and to each other. The algorithm uses BLAST to find database sequences related to the query. Neighbor lists, precalculated from pairwise BLAST alignments between database sequences, are then used to organize the results into groups of nearly identical and strongly related sequences. We propose a non-symmetric dissimilarity measure designed to handle sequences of different lengths (fragments). The balanced and representative dataset produced by this tool can be used for further analysis, e.g., multiple sequence alignment and phylogenetic tree construction. The algorithm is implemented for protein-coding sequences and is being integrated with the NCBI Influenza Virus Resource.
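The grouping idea can be illustrated with a minimal sketch. The abstract does not define the dissimilarity measure or the grouping procedure, so everything below is an assumption made for illustration: the function names `asymmetric_dissimilarity` and `group_hits`, the choice to normalize identities by the length of the first sequence, and the greedy best-hit-first grouping are hypothetical and are not the authors' method.

```python
from typing import Dict, List, Set


def asymmetric_dissimilarity(identities: int, len_a: int) -> float:
    """Hypothetical d(a -> b): fraction of a's residues NOT matched identically in b.

    Normalizing by len(a) only makes the measure non-symmetric: a short
    fragment fully contained in a longer sequence is at distance 0 from it,
    while the longer sequence stays far from the fragment.
    """
    if len_a <= 0:
        raise ValueError("sequence length must be positive")
    return 1.0 - identities / len_a


def group_hits(hits: List[str], neighbors: Dict[str, Set[str]]) -> List[List[str]]:
    """Greedily group BLAST hits using precalculated neighbor lists.

    `hits` is assumed to be ordered best hit first; `neighbors` maps a
    sequence ID to the IDs considered nearly identical to it (assumed to
    be computed offline from pairwise alignments).
    """
    remaining = set(hits)
    groups: List[List[str]] = []
    for seq in hits:
        if seq not in remaining:
            continue  # already absorbed into an earlier group
        near = neighbors.get(seq, set())
        # The representative comes first, followed by its ungrouped neighbors.
        members = [seq] + [h for h in hits if h != seq and h in remaining and h in near]
        remaining.difference_update(members)
        groups.append(members)
    return groups
```

Under these assumptions, a 100-residue fragment that matches a 400-residue sequence at all 100 positions has dissimilarity 0 toward the full sequence, whereas the full sequence has dissimilarity 0.75 toward the fragment. Reporting one representative per group is one way a retrieval step could keep the result set balanced.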
