Querying the public databases for sequences using complex keywords contained in the feature lines

Background: High-throughput technologies often require the retrieval of large data sets of sequences. Retrieving EMBL or GenBank entries by keyword is easy with tools such as ACNUC, Entrez or SRS, but these tools have limitations, in particular when querying with complex keywords.

Results: We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and it has problems with complex queries. ACNUC works well, but does not allow precise queries in the Feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the Feature qualifiers of the EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded these scripts in a user-friendly, OS-independent interface for easy use.

Conclusion: Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries with a script is often necessary in order to retrieve exactly the data searched for. Embedding the scripts in a user-friendly interface allows biologists to use them, and bioinformaticians can easily modify them, if necessary, for unforeseen needs.
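To make the idea of retrieving a subsequence from a multi-word descriptor in the Feature qualifiers concrete, here is a minimal sketch. It is not the authors' Perl scripts and does not use BioPerl: it is written in Python for illustration only, the entry is an invented toy record in EMBL-like layout, and the names `parse_entry` and `extract` are hypothetical. It handles only simple `start..end` locations and skips anything more complex, such as `join(...)` or `complement(...)`.

```python
import re

# Toy EMBL-style entry, invented for illustration (not a real database record).
ENTRY = """\
ID   TOY00001; SV 1; linear; genomic DNA; STD; FUN; 60 BP.
FH   Key             Location/Qualifiers
FT   rRNA            1..20
FT                   /product="18S ribosomal RNA"
FT   misc_RNA        21..40
FT                   /product="internal transcribed spacer 1"
FT   rRNA            41..60
FT                   /product="5.8S ribosomal RNA"
SQ   Sequence 60 BP;
     aaaaaaaaaa cccccccccc gggggggggg tttttttttt acgtacgtac gtacgtacgt        60
//
"""


def parse_entry(text):
    """Return (features, sequence) for one EMBL-like entry (hypothetical helper)."""
    features = []   # each feature: {"key", "location", "qualifiers"}
    seq_parts = []
    in_seq = False
    for line in text.splitlines():
        if line.startswith("FT"):
            if len(line) > 5 and line[5] != " ":
                # A feature key starts right after "FT" and three spaces.
                key, location = line[2:].split(None, 1)
                features.append(
                    {"key": key, "location": location.strip(), "qualifiers": ""}
                )
            elif features:
                # Continuation line: treated here as qualifier text.
                features[-1]["qualifiers"] += " " + line[2:].strip()
        elif line.startswith("SQ"):
            in_seq = True
        elif line.startswith("//"):
            in_seq = False
        elif in_seq:
            # Drop spacing and the trailing position counter.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
    return features, "".join(seq_parts)


def extract(text, key, phrase):
    """Return (location, subsequence) pairs whose qualifiers contain the phrase."""
    features, seq = parse_entry(text)
    hits = []
    for feat in features:
        if feat["key"] != key:
            continue
        if phrase.lower() not in feat["qualifiers"].lower():
            continue
        match = re.fullmatch(r"(\d+)\.\.(\d+)", feat["location"])
        if match is None:
            continue  # complex locations are out of scope for this sketch
        start, end = int(match.group(1)), int(match.group(2))
        hits.append((feat["location"], seq[start - 1:end]))  # EMBL is 1-based
    return hits


print(extract(ENTRY, "misc_RNA", "internal transcribed spacer 1"))
```

Because the whole entry is scanned rather than a prebuilt index, the multi-word phrase is matched as a unit against the qualifier text, and only the annotated coordinates are cut from the sequence. This trades speed for precision, in line with the conclusion above.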
