Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets

MOTIVATION PCR, hybridization, DNA sequencing and other important methods in molecular diagnostics rely on both sequence-specific and sequence group-specific oligonucleotide primers and probes. Their design depends on the identification of oligonucleotide signatures in whole genome or marker gene sequences. Although genome and gene databases are generally available and regularly updated, collections of valuable signatures are rare. Even for single requests, the search for signatures becomes computationally expensive when working with large collections of target (and non-target) sequences. Moreover, with growing dataset sizes, the chance of finding exact group-matching signatures decreases, necessitating the application of relaxed search methods. The resultant substantial increase in complexity is exacerbated by the dearth of algorithms able to solve these problems efficiently.
RESULTS We have developed CaSSiS, a fast and scalable method for computing comprehensive collections of sequence- and sequence group-specific oligonucleotide signatures from large sets of hierarchically clustered nucleic acid sequence data. Based on the ARB Positional Tree (PT-)Server and a newly developed BGRT data structure, CaSSiS not only determines sequence-specific signatures and perfect group-covering signatures for every node within the cluster (i.e. target groups), but also signatures with maximal group coverage (sensitivity) within a user-defined range of non-target hits (specificity) for groups lacking a perfect common signature. An upper limit of tolerated mismatches within the target group, as well as the minimum number of mismatches with non-target sequences, can be predefined. Test runs with one of the largest phylogenetic gene sequence datasets available indicate good runtime and memory performance, and in silico spot tests have shown the usefulness of the resulting signature sequences as blueprints for group-specific oligonucleotide probes.
AVAILABILITY Software and Supplementary Material are available at http://cassis.in.tum.de/.
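
To make the relaxed search criterion concrete, the following is a minimal brute-force sketch of selecting, for each target group, the signature with maximal group coverage under a cap on non-target hits. It is not the CaSSiS implementation: it uses neither the ARB PT-Server nor the BGRT data structure, considers exact matches only (no mismatch tolerance), and the names `kmers`, `best_signatures` and `max_outgroup_hits` are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_signatures(sequences, groups, k, max_outgroup_hits=0):
    """Pick one candidate signature per target group.

    sequences: dict mapping sequence id -> nucleotide string
    groups:    dict mapping group name -> set of sequence ids (the targets)
    Returns a dict mapping group name -> (kmer, ingroup_hits, outgroup_hits),
    or None when no k-mer satisfies the non-target limit.
    """
    # Index every k-mer by the ids of the sequences that contain it exactly.
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)

    result = {}
    for name, members in groups.items():
        best = None  # (ingroup hits, -outgroup hits, kmer)
        for kmer, hits in index.items():
            ingroup = len(hits & members)   # sensitivity: targets covered
            outgroup = len(hits - members)  # specificity: non-targets hit
            if ingroup == 0 or outgroup > max_outgroup_hits:
                continue
            candidate = (ingroup, -outgroup, kmer)
            # Prefer higher coverage, then fewer non-target hits;
            # remaining ties are broken lexicographically by the k-mer.
            if best is None or candidate > best:
                best = candidate
        result[name] = None if best is None else (best[2], best[0], -best[1])
    return result


if __name__ == "__main__":
    # Toy data, invented for this example.
    toy = {"a": "ACGTAC", "b": "ACGTTT", "c": "GGGTTT"}
    toy_groups = {"ab": {"a", "b"}, "bc": {"b", "c"}, "abc": {"a", "b", "c"}}
    for group, signature in best_signatures(toy, toy_groups, k=4).items():
        print(group, signature)
```

In the toy data, the groups `ab` and `bc` each have a perfect group-covering 4-mer, whereas `abc` has none, so the sketch falls back to a signature that covers only part of the group. This exhaustive scan compares every k-mer against every group and is only practical for small inputs.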
